Import festival syllabus #44

blinklet · 2024-08-22T21:37:08Z

Need to import the local festival syllabus, which contains information from the provincial syllabus plus fees:

class number
description (which is an aggregation of class information)
fee

The document to look at are:

Music Festival Reference Material/03 Syllabus/Full-syllabus-FINAL-2.pdf
Music Festival Reference Material/03 Syllabus/QCMF-fees-syllablist-final.pdf

blinklet · 2024-08-23T00:16:25Z

To implement this, I am considering changing the schema of the classes table. I need the following fields:

Discipline (Piano, Strings, etc)
Instrument (Piano, Violin, etc)
Type (Solo, Duet, Large Group, etc)
Category (None, Masterwork Classes, Graded Solo Classes, etc)
Category Type (Sonata, "Graded Study Class", "Unaccompanied Bach", etc)
Category Details ("One Movement", "Complete", , etc)
Age Category (Junior, Senior, None, etc)
Age Detail ("10 and under", etc)
Level ("Grade 8 to 10", "First Year of Study", "Minimum RCM Grade 3", etc)

I should probably create tables for "Discipline", "Instrument", "Category", and "Age Category" so I can capture the "Information applying to all classes" information for the Instrument and "Additional information" for the Category. In addition, capture the standard age ranges for the Age Categories (which can be overridden by the Age Detail).

For PEI Queens County Music Festival, the Syllabus descriptions consist of the following fields, in order, separated by dashes:

Instrument (Piano, Violin, etc)
Type (Solo, Duet, Large Group, etc)
Category Type (Sonata, "Graded Study Class", "Unaccompanied Bach", etc)
Category Details ("One Movement", "Complete", etc)
Age Category (Junior, Senior, None, etc)
Age Detail ("10 and under", etc)
Level ("Grade 8 to 10", "First Year of Study", "Minimum RCM Grade 3", etc)
Repertoire code ("List A", "List B", etc)

Where the value of a field is "None", nothing is printed and there is still only one dash before the next field.
So, for example:

Piano Solo - Sonata - One movement - Intermediate - Grade 5 to Grade 8
Violin Solo - Unaccompanied Bach - Junior
Violin Solo - Graded Classes - Grade 3 - List B

In some cases, the syllabus breaks this rule so I will have to "enforce" a change in the syllabus and ask if the client is OK with that.

blinklet · 2024-08-23T00:27:08Z

Next, how to ingest the existing Syllabus. Could I use NLP?

blinklet · 2024-08-23T12:57:16Z

For now, just use the file, Music Festival Reference Material/03 Syllabus/QCMF-fees-syllablist-final.pdf. It is all one simple table with course number, description (which is usable as a title), and fee. This code will be "throw-away" code that is useful only during the development phase where I am trying to quickly build a test data set.

Procedure will be:

import the syllabus with fees
import the Google form results

In the future, before the first release, create a new schema for the classes and use the setup pages to manually add each class (maybe find way to add multiple classes in the same page).

blinklet · 2024-08-23T13:22:01Z

I installed Excalibur, the web-interface for Camelot.

$ apt install ghostscript python3-tk
$ python3 -m venv .venv2
$ . .venv2/bin/activate
(.venv2) $ pip install excalibur-py
(.venv2) $ excalibur initdb
(.venv2) $ excalibur webserver

When I ran the excalibur initdb command, I saw the Python error: ImportError: cannot import name 'MutableMapping' from 'collections' (/usr/lib/python3.12/collections/__init__.py). It seems that the Excalibur project is using an extremely old version of Flask and that relies on functionality that was deprecated in Python 3.10. I could probably hack around this but I want something that works "as is".

Cancelling the Excalibur effort. Next, will try Camelot command line interface.

blinklet · 2024-08-23T16:16:18Z

Installed Camelot:

$ apt install ghostscript python3-tk
$ python3 -m venv .venv2
$ . .venv2/bin/activate
(.venv2) $ pip install tk
(.venv2) $ pip install "camelot-py[base]"

Test the install by running Camelot command:

(.venv2) $ camelot --version

This raises an import error. It seems that the install documentation left out a dependency. I need to install OpenCV:

$ (.venv) pip install opencv-python-headless

Now, the camelot --version command works.

Then, I ran the command:

(.venv2) $ camelot -o test.csv -f csv stream QCMF-fees-syllablist-final.pdf

I got a deprecation error: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.

Worked around this with the following commands:

(.venv2) $ pip uninstall 'camelot-py[base]'
(.venv2) $ pip uninstall PyPDF2
(.venv2) $ pip install PyPDF2==2.12.1
(.venv2) $ pip install 'camelot-py[base]'

Now the Camelot command seems to work. Camelot seems to use the filename in the"-o" option as a basis for output filenames and adds page and table information to the filename. Opened the test-page-1-table-1.csv file and saw it only processed the first page. Looks like that's the default. I specified the "-p" option and tried again:

(.venv2) $ camelot -o test.csv -f csv -p 1-end stream QCMF-fees-syllablist-final.pdf

This time, Camelot processed all the pages, but created a separate CSV file for each page. There seems to be no option in the Camelot CLI to output a single large CSV. So, I solved the problem with the following Bash script:

for file in *.csv; do
    sed -i '1,4d' "$file"
done
cat *.csv > combined.csv

Now, I can edit the file in a spreadsheet program.

blinklet · 2024-08-23T18:23:37Z

I decided to not use the Camelot CLI and, instead, incorporate the Camelot library into my program and directly read the PDF file. This lets me include headers, split suffixes from the class numbers, add a header row, etc. I wrote a quick script to test the process and it works reliably. The script is shown below:

import camelot
import csv
import pandas as pd

input_pdf = 'QCMF-fees-syllablist-final.pdf'
output_csv = 'processed.csv'

tables = camelot.read_pdf(input_pdf, pages='all', flavor='stream', strip_text='\n')

combined_df = pd.DataFrame()

for table in tables:
    df = table.df  # turn table into a pandas dataframe
    
    # Drop the first 4 rows of each table
    df = df.drop(df.index[:4])  
    
    df.columns = ["Number","Description", "Fee", "Festival"]
    df["Suffix"] = pd.Series(dtype="string")
    
    df.Suffix = df.Number.str[4:]
    df.Number = df.Number.str[0:4]
    
    df = df[["Number", "Suffix", "Description", "Fee", "Festival"]]
    
    combined_df = pd.concat([combined_df, df], ignore_index=True)

combined_df.to_csv(output_csv, index=False, quoting=csv.QUOTE_ALL)
print(f"Processed file saved as {output_csv}")

I also found a reliable way to install Camelot: use the "cv" option instead of "base". This avoids the opencv dependency issue. I still also need to pin the version of PyPDF:

(.venv) $ pip install "camelot-py[cv]"
(.venv) $ pip install PyPDF2==2.12.1

Camelot works really well but I am concerned it is not well maintained. I will use it, for now, but need to keep an eye out for a better solution. It looks like the only other easy-to-use Python library for reading tables in PDFs is tabula-py...

blinklet · 2024-08-26T18:01:28Z

In a Flask web-app, the Camelot library does not support "file-like-objects" in memory. So, I switched to using tabula-py.

See PR for the details.

blinklet mentioned this issue Aug 26, 2024

Added syllabus file download page #49

Merged

blinklet linked a pull request Aug 26, 2024 that will close this issue

Added syllabus file download page #49

Merged

blinklet closed this as completed in #49 Aug 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Import festival syllabus #44

Import festival syllabus #44

blinklet commented Aug 22, 2024

blinklet commented Aug 23, 2024 •

edited

Loading

blinklet commented Aug 23, 2024

blinklet commented Aug 23, 2024

blinklet commented Aug 23, 2024 •

edited

Loading

blinklet commented Aug 23, 2024

blinklet commented Aug 23, 2024 •

edited

Loading

blinklet commented Aug 26, 2024

Import festival syllabus #44

Import festival syllabus #44

Comments

blinklet commented Aug 22, 2024

blinklet commented Aug 23, 2024 • edited Loading

blinklet commented Aug 23, 2024

blinklet commented Aug 23, 2024

blinklet commented Aug 23, 2024 • edited Loading

blinklet commented Aug 23, 2024

blinklet commented Aug 23, 2024 • edited Loading

blinklet commented Aug 26, 2024

blinklet commented Aug 23, 2024 •

edited

Loading

blinklet commented Aug 23, 2024 •

edited

Loading

blinklet commented Aug 23, 2024 •

edited

Loading