Import festival syllabus #44
To implement this, I am considering changing the schema of the classes table. I need the following fields:
I should probably create tables for "Discipline", "Instrument", "Category", and "Age Category". That lets me capture the "Information applying to all classes" text for each Instrument and the "Additional information" text for each Category. The Age Category table can also hold the standard age ranges, which the Age Detail can override. For the PEI Queens County Music Festival, the Syllabus descriptions consist of the following fields, in order, separated by dashes:
Where the value of a field is "None", nothing is printed and there is still only one dash before the next field.
In some cases, the syllabus breaks this rule, so I will have to "enforce" a change in the syllabus and ask whether the client is OK with that.
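The dash rule above can be sketched as a small helper. This is only an illustration: the function name `build_description`, the `" - "` separator (with surrounding spaces), and the sample field values are assumptions, not taken from the syllabus.

```python
from typing import Optional, Sequence


def build_description(fields: Sequence[Optional[str]]) -> str:
    """Join syllabus fields with single dashes, skipping empty ones.

    A field is treated as empty when it is None, blank, or the
    literal string "None" (case-insensitive), so no doubled dashes
    appear in the output.
    """
    printed = [
        field.strip()
        for field in fields
        if field is not None
        and field.strip()
        and field.strip().lower() != "none"
    ]
    # Assumed separator: a dash with one space on each side.
    return " - ".join(printed)


# Hypothetical field values, for illustration only.
print(build_description(["Piano", "None", "Solo", None, "Grade 1"]))
```

Building the description from the individual fields, rather than storing it as free text, would also make it easy to spot syllabus entries that break the rule.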
Next, I need to work out how to ingest the existing Syllabus. Could I use NLP?
For now, just use the file Music Festival Reference Material/03 Syllabus/QCMF-fees-syllablist-final.pdf. It is one simple table with course number, description (which is usable as a title), and fee. This will be "throw-away" code, useful only during the development phase while I am trying to quickly build a test data set. The procedure will be:
In the future, before the first release, create a new schema for the classes and use the setup pages to manually add each class (maybe find a way to add multiple classes on the same page).
I installed Excalibur, the web interface for Camelot.
When I ran the [...] Cancelling the Excalibur effort. Next, I will try the Camelot command-line interface.
Installed Camelot:
Test the install by running the Camelot command:
This raises an import error. It seems that the install documentation left out a dependency. I need to install OpenCV:
Now, the [...] Then, I ran the command:
I got a deprecation error: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead. Worked around this with the following commands:
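The commands themselves are not shown here. As a small guard, assuming the workaround is to stay on a PyPDF2 release below 3.0.0 (the version named in the error message), a script could check the installed version before calling Camelot. The helper names `has_legacy_reader` and `check_pypdf2` are hypothetical.

```python
from importlib import metadata


def has_legacy_reader(version: str) -> bool:
    """Return True if this PyPDF2 version predates 3.0.0.

    PdfFileReader was removed in PyPDF2 3.0.0, so any release with
    a major version below 3 should still provide it.
    """
    major = int(version.split(".")[0])
    return major < 3


def check_pypdf2() -> str:
    """Describe whether the installed PyPDF2 still has PdfFileReader."""
    try:
        installed = metadata.version("PyPDF2")
    except metadata.PackageNotFoundError:
        return "PyPDF2 is not installed"
    if has_legacy_reader(installed):
        return f"PyPDF2 {installed} still provides PdfFileReader"
    return f"PyPDF2 {installed} removed PdfFileReader; pin PyPDF2 below 3.0.0"


print(check_pypdf2())
```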
Now the Camelot command seems to work. Camelot seems to use the filename in the "-o" option as the basis for its output filenames and adds page and table information to each one. Opened the [...]
This time, Camelot processed all the pages, but created a separate CSV file for each page. There seems to be no option in the Camelot CLI to output a single large CSV. So, I solved the problem with the following Bash script:
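The Bash script itself is not shown here. As a rough Python equivalent, the sketch below concatenates the per-page files in page order. It assumes the files follow a `<name>-page-<N>-table-<M>.csv` naming pattern; the function name `merge_csvs` and the output name `syllabus-all.csv` are hypothetical.

```python
import csv
import re
from pathlib import Path

# Assumed naming pattern for the per-page CSV files.
PAGE_RE = re.compile(r"-page-(\d+)-table-(\d+)\.csv$")


def merge_csvs(directory, output_name="syllabus-all.csv"):
    """Concatenate per-page CSV files into one file, in page order."""
    directory = Path(directory)
    parts = []
    for path in directory.glob("*.csv"):
        match = PAGE_RE.search(path.name)
        if match:
            # Sort numerically so page 10 follows page 9, not page 1.
            parts.append((int(match.group(1)), int(match.group(2)), path))
    parts.sort()

    out_path = directory / output_name
    with out_path.open("w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for _, _, path in parts:
            with path.open(newline="", encoding="utf-8") as src:
                writer.writerows(csv.reader(src))
    return out_path
```

Sorting on the numeric page and table values matters because a plain filename sort would place page 10 before page 2.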
Now, I can edit the file in a spreadsheet program.
I decided to not use the Camelot CLI and, instead, incorporate the Camelot library into my program and directly read the PDF file. This lets me include headers, split suffixes from the class numbers, add a header row, etc. I wrote a quick script to test the process and it works reliably. The script is shown below:
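The original script is not reproduced here. The following is a minimal sketch of the same idea, not the author's code. It assumes a three-column table (class number, description, fee), class numbers made of digits with an optional letter suffix, and hypothetical column names. `camelot.read_pdf` and the `.df` attribute of each table are the only Camelot calls used.

```python
import re

# Hypothetical column names for the header row.
HEADER = ["class_number", "suffix", "description", "fee"]
# Assumed class-number shape: digits, then an optional letter suffix.
CLASS_RE = re.compile(r"^(\d+)\s*([A-Za-z]*)$")


def split_class_number(raw):
    """Split e.g. '1001A' into ('1001', 'A'); leave other text whole."""
    text = raw.strip()
    match = CLASS_RE.match(text)
    if not match:
        return text, ""
    return match.group(1), match.group(2)


def normalize_rows(raw_rows):
    """Add a header row, split suffixes, and drop non-class rows."""
    rows = [HEADER]
    for raw in raw_rows:
        if len(raw) < 3:
            continue  # not a full table row
        number, suffix = split_class_number(str(raw[0]))
        if not number[:1].isdigit():
            continue  # e.g. a column-title row repeated on each page
        rows.append([number, suffix, str(raw[1]).strip(), str(raw[2]).strip()])
    return rows


def read_syllabus(pdf_path):
    """Read every table in the PDF and return normalized rows."""
    import camelot  # imported here so the helpers above work without it

    tables = camelot.read_pdf(pdf_path, pages="all")
    raw_rows = []
    for table in tables:
        raw_rows.extend(table.df.values.tolist())
    return normalize_rows(raw_rows)
```

Keeping the row clean-up separate from the PDF reading makes it possible to test the clean-up on plain lists without a PDF or Camelot installed.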
I also found a reliable way to install Camelot: use the "cv" option instead of "base". This avoids the OpenCV dependency issue. I still need to pin the version of PyPDF2:
Camelot works really well, but I am concerned that it is not well maintained. I will use it for now, but I need to keep an eye out for a better solution. It looks like the only other easy-to-use Python library for reading tables in PDFs is tabula-py...
Need to import the local festival syllabus, which contains information from the provincial syllabus plus fees:
The documents to look at are: