Usecase: StAZH
This page describes the first experiments with the StAZH collection.
Contact person: JLM
We selected the StAZH collection as advised (not too difficult, and both an HTR model and an expert person are available).
See collection StAZH 3081
The labels are: 'catch-word', 'header', 'heading', 'marginalia', 'page-number'
One problem is the consistency of the segmentation. We observed that these elements may correspond to one or two TextRegion elements, or to one TextLine. This is a technical problem, because we currently annotate at a single depth of the structure, e.g. TextRegion (but in the near future we have some ideas and plan to extend the underlying ML technology; we'll see...).
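To check where the labels actually sit, one can scan the PAGE XML directly. Below is a minimal sketch, assuming the PAGE 2013-07-15 namespace and the Transkribus convention of storing the label in the custom attribute (e.g. custom='... structure {type:heading;}'); it counts, for one PAGE/mpxml file, how often each label appears on a TextRegion versus a TextLine.

```python
# Minimal sketch (not part of TranskribusDU): report whether each structure
# label sits on a TextRegion or on a TextLine in a PAGE / .mpxml file.
# Assumptions: PAGE 2013-07-15 namespace, and the Transkribus "custom"
# attribute convention, e.g. custom='readingOrder {index:0;} structure {type:heading;}'.
import re
import sys
from collections import Counter

from lxml import etree

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
STRUCT_RE = re.compile(r"structure\s*\{[^}]*type:([^;}]+)")

def label_levels(filename):
    """Count (label, element-level) pairs found in the file."""
    counts = Counter()
    tree = etree.parse(filename)
    for level in ("TextRegion", "TextLine"):
        for node in tree.iter("{%s}%s" % (PAGE_NS, level)):
            match = STRUCT_RE.search(node.get("custom", ""))
            if match:
                counts[(match.group(1).strip(), level)] += 1
    return counts

if __name__ == "__main__":
    for (label, level), n in sorted(label_levels(sys.argv[1]).items()):
        print("%-15s %-12s %3d" % (label, level, n))
```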
Annotated:
- MM_1_001
- MM_1_005
- MM_1_012
- MM_1_017
- MM_1_025
- MM_1_028
- MM_1_032
- MM_1_036
- MM_1_040
- MM_1_044
- MM_1_048
- MM_2_235
- MM_2_231
- MM_2_068
CODE in https://github.com/Transkribus/TranskribusDU
NOTE: the transcripts have been produced manually (both text and segmentation), so this dataset does not reflect what the DU module will face in the future.
- In Transkribus, or using the TranskribusPyClient commands, 3 collections were created:
  - READDU_JL_TRN : to contain the annotated training documents (001, 005, 032, 036, 040)
  - READDU_JL_TST : to contain the (annotated) test documents (012)
  - READDU_JL_PRD : to contain the documents to be annotated automatically (012, 017, 068, 231, 235)
- Using TranskribusPyClient, we download each collection to disk (preferably without images, since images are only used for qualitative human evaluation).
- A training is done (see usecases/DU_StAZH.py). It results in a model stored on disk.
- A test is done using the test collection.
- loading test graphs
  C:\Local\meunier\git\TranskribusDU\usecases\StAZH\trnskrbs_3832\col\8251.mpxml  (58 nodes, 75 edges)
  1 graphs loaded
- computing features on test set  done
- predicting on test set  done
  Line=True class, column=Prediction
  OTHER        [[21  1  2  0  1]
  header        [ 0  6  2  0  0]
  heading       [ 0  0  1  0  0]
  marginalia    [ 0  0  0 15  0]
  page-number   [ 0  0  0  0  9]]

               precision  recall  f1-score  support
  OTHER             1.00    0.84      0.91       25
  header            0.86    0.75      0.80        8
  heading           0.20    1.00      0.33        1
  marginalia        1.00    1.00      1.00       15
  page-number       0.90    1.00      0.95        9
  avg / total       0.95    0.90      0.92       58

  (A sketch of how such a report can be computed from gold and predicted labels follows after this list.)
- An automated annotation is done, resulting in a new transcript to be added onto the document in Transkribus. (To be done; the upload Python code is not working yet.)
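For reference, the confusion matrix and per-class report shown above are in the format produced by scikit-learn. The sketch below shows how such numbers are obtained once a gold and a predicted label are available for every TextRegion of the test set; it is not the TranskribusDU code itself, and the label lists are placeholders rather than the real data.

```python
# Sketch only: how a confusion matrix and a per-class report like the one
# above can be computed with scikit-learn from gold vs. predicted labels.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

LABELS = ["OTHER", "catch-word", "header", "heading", "marginalia", "page-number"]

# Placeholders: in the real experiment there is one gold and one predicted
# label per TextRegion of the test graph(s).
y_true = ["OTHER", "header", "marginalia", "page-number", "OTHER"]
y_pred = ["OTHER", "header", "marginalia", "heading",     "OTHER"]

print(confusion_matrix(y_true, y_pred, labels=LABELS))
print(classification_report(y_true, y_pred, labels=LABELS))
print("(unweighted) Accuracy score = %.2f" % accuracy_score(y_true, y_pred))
```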
Second experiment: same dataset, a few more test documents, and an end-to-end experiment from Transkribus to Transkribus, but still based on manual segmentation and transcription.
Update your PYTHONPATH variable
export PYTHONPATH=<YOURPATH>/TranskribusDU/TranskribusDU:<YOURPATH>/TranskribusPyClient/src
Making a TRAINING sandbox collection on Transkribus. Technically, the train and test collections are only read, not modified, so I could actually have worked from the original collection. Also FYI: since I'm using Cygwin on Windows, I have a python.sh script dealing with file path conversions...
> ./python.sh TranskribusCommands/do_createCollec.py READDU_JL_TRN --> 3820
Adding some annotated documents to it
#--- do_addDocToCollec.py <colId> [ <docId> | <docIdFrom>-<docIdTo> ]+
> ./python.sh TranskribusCommands/do_addDocToCollec.py 3820 7749 7750
Downloading the XML onto my machine
> ./python.sh TranskribusCommands/Transkribus_downloader.py 3820 --noimage
- Done, see in .\trnskrbs_3820
We get this on disk:
> ls trnskrbs_3820
col/  config.txt*  out/  ref/  run/  xml/
> ls trnskrbs_3820/col
7749/  7749.mpxml*  7749_max.ts*  7750/  7750.mpxml*  7750_max.ts*  trp.json*
Training! (I train a model named mdl-StAZH_a, which will be stored in the folder trn3820, based on the collection in folder trnskrbs_3820.)
You can use either a CRF model (--crf) or a neural network model (--ecn) based on TensorFlow.
#--- DU_StAZH.py <model-name> <model-directory> [--trn <col-dir>]+ [--tst <col-dir>]+ [--prd <col-dir>]+ --ecn|crf
> ./python.sh usecases/DU_StAZH.py mdl-StAZH_a trn3820 --trn trnskrbs_3820 --ecn
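For the --crf option, the test log further down mentions crf.Model_SSVM_AD3, i.e. a structured SVM over a graph CRF with AD3 inference, which is the pystruct pattern. Below is a toy-sized sketch of that pattern, assuming pystruct and the ad3 package are installed; this is not the TranskribusDU wrapper itself, and the features and labels here are random placeholders (TranskribusDU builds the real ones from the PAGE XML).

```python
# Toy sketch of the graph-CRF pattern behind the --crf option (pystruct's
# EdgeFeatureGraphCRF + one-slack SSVM with AD3 inference).
import numpy as np
from pystruct.models import EdgeFeatureGraphCRF
from pystruct.learners import OneSlackSSVM

N_STATES, N_NODE_FEAT, N_EDGE_FEAT = 6, 10, 3   # 6 classes as in this use case

def toy_graph(n_nodes=20):
    # Random placeholder node/edge features and labels on a simple chain graph.
    node_feat = np.random.rand(n_nodes, N_NODE_FEAT)
    edges = np.array([[i, i + 1] for i in range(n_nodes - 1)])
    edge_feat = np.random.rand(len(edges), N_EDGE_FEAT)
    labels = np.random.randint(N_STATES, size=n_nodes)
    return (node_feat, edges, edge_feat), labels

X, Y = zip(*(toy_graph() for _ in range(5)))
crf = EdgeFeatureGraphCRF(n_states=N_STATES, n_features=N_NODE_FEAT,
                          n_edge_features=N_EDGE_FEAT, inference_method="ad3")
ssvm = OneSlackSSVM(crf, C=0.1, max_iter=100)
ssvm.fit(list(X), list(Y))
print(ssvm.predict(list(X))[0])   # one predicted label per node of the 1st graph
```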
Again, we create a test collection and populate it with annotated documents, in order to compute a performance score for the model.
> ./python.sh TranskribusCommands/Transkribus_downloader.py 3832
- Done, see in .\trnskrbs_3832
TESTING!
> ./python.sh usecases/DU_StAZH.py mdl-StAZH_a trn3820 --tst ./trnskrbs_3832
--------------------------------------------------
Trained model 'mdl-StAZH_a' in folder 'trn3820'
Test collection(s): ['C:\\tmp_READ\\tuto\\trnskrbs_3832\\col']
--------------------------------------------------
- loading a crf.Model_SSVM_AD3.Model_SSVM_AD3 model
- loading pre-computed data from: trn3820\mdl-StAZH_a_model.pkl
  file found on disk: trn3820\mdl-StAZH_a_model.pkl
  file is fresh
- loading pre-computed data from: trn3820\mdl-StAZH_a_transf.pkl
  file found on disk: trn3820\mdl-StAZH_a_transf.pkl
  file is fresh
  done
- classes: ['OTHER', 'catch-word', 'header', 'heading', 'marginalia', 'page-number']
- loading test graphs
  C:\tmp_READ\tuto\trnskrbs_3832\col\8251.mpxml  (58 nodes, 75 edges)
  1 graphs loaded
- computing features on test set
  #features nodes=521  edges=532
  done
- predicting on test set
  done
Line=True class, column=Prediction
OTHER        [[21  1  2  0  1]
header        [ 0  6  2  0  0]
heading       [ 0  0  1  0  0]
marginalia    [ 0  0  0 15  0]
page-number   [ 0  0  0  0  9]]

             precision  recall  f1-score  support
OTHER             1.00    0.84      0.91       25
header            0.86    0.75      0.80        8
heading           0.20    1.00      0.33        1
marginalia        1.00    1.00      1.00       15
page-number       0.90    1.00      0.95        9
avg / total       0.95    0.90      0.92       58

(unweighted) Accuracy score = 0.90
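As a side note, the unweighted accuracy above is simply the trace of the confusion matrix divided by the number of test nodes; a quick check against the matrix printed above:

```python
# Quick check of the reported accuracy: correct predictions (diagonal) over
# all 58 test nodes of the matrix printed above.
import numpy as np

cm = np.array([[21,  1,  2,  0,  1],    # OTHER
               [ 0,  6,  2,  0,  0],    # header
               [ 0,  0,  1,  0,  0],    # heading
               [ 0,  0,  0, 15,  0],    # marginalia
               [ 0,  0,  0,  0,  9]])   # page-number
print(cm.trace() / float(cm.sum()))      # -> 0.896..., i.e. the 0.90 reported above
```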
Now create the collection where I'll apply the model
> ./python.sh TranskribusCommands/do_createCollec.py READDU_JL_PRD --> 3829
So here I copy the documents to the new collection (with do_copyDocToCollec.py, see the command summary at the end), because I'll upload a new transcript produced by the model; at this stage, I do not want to impact the "real" documents.
> ./python.sh TranskribusCommands/Transkribus_downloader.py 3829 --noimage
- Done, see in .\trnskrbs_3829
> ./python.sh usecases/DU_StAZH.py mdl-StAZH_a trn3820 --run ./trnskrbs_3829
- done
We produced some ..._du.mpxml files:
> ls trnskrbs_3829/col
8620/           8620_max.ts*   8621_du.mpxml*  8622.mpxml*     8623/           8623_max.ts*   8624_du.mpxml*
8620.mpxml*     8621/          8621_max.ts*    8622_du.mpxml*  8623.mpxml*     8624/          8624_max.ts*
8620_du.mpxml*  8621.mpxml*    8622/           8622_max.ts*    8623_du.mpxml*  8624.mpxml*    trp.json*
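Before uploading, a quick sanity check can compare the label counts in each original transcript with those in the corresponding _du.mpxml produced by the model. A hypothetical sketch, with the same assumption as in the earlier snippet that labels live in the Transkribus "custom" attribute:

```python
# Hypothetical sanity check (not part of TranskribusDU): compare the number of
# elements carrying each structure label in X.mpxml vs. X_du.mpxml.
import glob
import re
from collections import Counter

from lxml import etree

STRUCT_RE = re.compile(r"structure\s*\{[^}]*type:([^;}]+)")

def count_labels(filename):
    counts = Counter()
    for _, elt in etree.iterparse(filename):
        match = STRUCT_RE.search(elt.get("custom") or "")
        if match:
            counts[match.group(1).strip()] += 1
    return counts

for du_file in sorted(glob.glob("trnskrbs_3829/col/*_du.mpxml")):
    orig_file = du_file.replace("_du.mpxml", ".mpxml")
    print(du_file, dict(count_labels(du_file)))
    print(orig_file, dict(count_labels(orig_file)))
```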
Now upload to Transkribus (with TranskribusDU_transcriptUploader.py, see the command summary at the end).
Actually, this collection was also annotated, so we can compute a score on it:
> ./python.sh tasks/DU_StAZH_a.py mdl-StAZH_a trn3820 --tst trnskrbs_3829
Line=True class, column=Prediction
OTHER        [[176  11   6   8   1  13]
catch-word    [  0   0   0   0   0   0]
header        [  0   0  38   3   0   2]
heading       [  0   0   0   2   0   0]
marginalia    [  0   0   0   0  62   0]
page-number   [  0   0   0   0   0  48]]

             precision  recall  f1-score  support
OTHER             1.00    0.82      0.90      215
catch-word        0.00    0.00      0.00        0
header            0.86    0.88      0.87       43
heading           0.15    1.00      0.27        2
marginalia        0.98    1.00      0.99       62
page-number       0.76    1.00      0.86       48
avg / total       0.95    0.88      0.90      370

(unweighted) Accuracy score = 0.88
Open your document in Transkribus and go to Metadata/Structural: you should see the annotations.
Summary of the commands. Training:
./python.sh TranskribusCommands/do_createCollec.py READDU_JL_TRN
./python.sh TranskribusCommands/do_addDocToCollec.py 3820 7749 7750
./python.sh TranskribusCommands/Transkribus_downloader.py 3820 --noimage
./python.sh usecases/DU_StAZH_a.py ./mdl-StAZH_a MyModel --trn trnskrbs_3820
Testing:
./python.sh TranskribusCommands/do_createCollec.py READDU_JL_TST
./python.sh TranskribusCommands/do_addDocToCollec.py 3832 8251
./python.sh TranskribusCommands/Transkribus_downloader.py 3832
./python.sh usecases/StAZH/DU_StAZH.py ./mdl-StAZH_a MyModel --tst trnskrbs_3832
Prediction and upload:
./python.sh TranskribusCommands/do_createCollec.py READDU_JL_PRD
./python.sh TranskribusCommands/do_copyDocToCollec.py 3829 8251 8252 8564-8566
./python.sh TranskribusCommands/Transkribus_downloader.py 3829 --noimage
./python.sh usecases/StAZH/DU_StAZH_a.py ./mdl-StAZH_a MyModel --run trnskrbs_3829
./python.sh TranskribusCommands/TranskribusDU_transcriptUploader.py ./trnskrbs_3829 3829