# Annemieke's Use Case: StAB
We discuss here Annemieke's StAB use case, using NAVER LABS Europe tools.
This work was done with commit 5ab2cfa (Oct 29th, 2021), but it should work fine with more recent versions of the code.
We need to clone two GitHub repositories: the Transkribus pythonic API and toolset, and the Document Understanding code.
```sh
git clone https://github.com/Transkribus/TranskribusPyClient.git
git clone https://github.com/Transkribus/TranskribusDU.git DocumentUnderstanding-github
```
We advise you to set an environment variable, which is used in the rest of this document:
```sh
export GIT=...../DocumentUnderstanding-github
```
And the PYTHONPATH:
```sh
export PYTHONPATH=$GIT/TranskribusDU
```
Using Anaconda...
```sh
# create a separate virtual environment
conda create -n tf2
conda activate tf2
# install everything
conda install future scikit-learn pytest lxml scipy rtree shapely matplotlib
conda install pygeos --channel conda-forge
conda install tensorflow
```
Then install Python 3 and follow these instructions from the DU repo to install the required Python libraries.
Here, for example:
```
> python --version
Python 3.7.10
> python -c 'import tensorflow as tf; print(tf.__version__)'
2.0.0
```
Note: some pages are upside down, e.g. in collection 120778 (Mandatenbuch), document StAB_A_I_495_Mandaatbuch_15 (ID 443135), around page 250.
In this case, for a given image, the GT XML and the HTR XML are in two separate files, which is unfortunate. So we have to merge them, by projecting the annotations into the XML produced by the HTR tool, to obtain the ground-truth (GT) collections.
For the collection StAB_Policeybuch_LA, it seems we have the following mapping:
| 120777 (HTR) | 111796 (GT) |
|---|---|
| 449495 | 744848 |
| 449493 | 744837 |
| 443905 | 744836 |
| 444284 | 744835 |
| 449477 | 744833 |
| 442595 | 744832 |
For the collection StAB_Mandatenbuch_LA, it seems we have the following mapping:
| 120778 (HTR) | 111801 (GT) | name |
|---|---|---|
| 439969 | 744860 | LA_StaB_A_I_481_Mandaatbuch_3 |
| 443189 | 744859 | LA_StAB_A_I_497 Mandaatbuch_17 |
| 441731 | 744858 | LA_StaB_A_I_484_Mandaatbuch_6 |
| 444208 | 744857 | LA_StAB_A_I_512_Mandaatbuch_32 |
| 444203 | 744856 | LA_StAB_A_I_514_Mandaatbuch_34 |
| 443765 | 744855 | LA_StAB_A_I_503_Mandaatbuch_23 |
| 744850 | 744850 | StAB_A_I_507 Mandaatbuch_27 |
| 742265 | 742265 | StAB_A_I_508_Mandaatbuch_28 |
| 742263 | 742263 | StAB_A_I_506_Mandaatbuch_26 |
| 441746 | 630254 | StaB_A_I_490_Mandaatbuch_10_LA |
| 439940 | 629338 | StaB_A_I_480_Mandaatbuch_2_LA |
```sh
TRPYC=~/Documents/git/TranskribusPyClient/src/TranskribusCommands
CREDENTIALS="--login [email protected] --pwd XXX"
```
The LA tool was applied on 21/10/2021; the GT pages are the last IN_PROGRESS pages before 20/10/2021. We use do_transcript.py and store the selected pages in DOCID.trp:
```sh
# Policeybuch
for i in 744848 744837 744836 744835 744833 744832
do
    python $TRPYC/do_transcript.py 111796 $i $CREDENTIALS --before 2021-10-20 --last_fil --status IN_PROGRESS --trp $i.trp
done

# Mandatenbuch
for i in 744860 744859 744858 744857 744856 744855 744850 742265 742263 630254 629338
do
    python $TRPYC/do_transcript.py 111801 $i $CREDENTIALS --before 2021-10-20 --last_fil --status IN_PROGRESS --trp $i.trp
done
```
We first download the annotated collections and the corresponding HTR collections.
```sh
# download the annotated collection into an automatically created folder,
# without the images (which we do not use)
# StAB_Policeybuch_LA
for i in 744848 744837 744836 744835 744833 744832
do
    python $TRPYC/Transkribus_downloader.py --noImage 111796 --trp $i.trp $CREDENTIALS --force
done
```
```
> ls trnskrbs_111796/col
744832         744832_max.ts  744833.mpxml   744835         744835_max.ts  744836.mpxml
744837         744837_max.ts  744848.mpxml   trp.json       744832.mpxml   744833
744833_max.ts  744835.mpxml   744836         744836_max.ts  744837.mpxml   744848
744848_max.ts
```
```sh
# download the corresponding non-annotated books (and only them)
# NOTE: WARNING: 2 documents share the same name: StAB_A_I_468_Policeybuch_14
# (I'm taking 449495)
#for docid in 449495 449493 443905 444284 449477 442595
#do
#    python $TRPYC/Transkribus_downloader.py $CREDENTIALS --noImage --docid $docid 120777
#done
# ACTUALLY, it is better to download the whole collection
python $TRPYC/Transkribus_downloader.py $CREDENTIALS --noImage 120777
```
In the GT pages, we have things like this (Policeybuch_14, p. 5):
```xml
<PcGts ...>
  <Metadata>
    <Creator>Transkribus</Creator>
    <Created>2020-08-11T10:25:28.732+02:00</Created>
    <LastChange>2021-10-05T19:53:45.113+02:00</LastChange>
    <TranskribusMetadata docId="744848" pageId="28255428" pageNr="5"/>
  </Metadata>
  <Page imageFilename="IMG_0495.JPG" imageWidth="3888" imageHeight="2592">
    <TextRegion type="page-number" id="region_1633456404101_3562"
                custom="readingOrder {index:0;} structure {type:page-number;}">
      <Coords points="1008,240 907,240 907,159 1008,159"/>
    </TextRegion>
    .....
```
In the HTRed pages, we have:
```xml
<PcGts ...>
  <Metadata>
    <Creator>prov=...</Creator>
    <Created>2020-08-11T10:25:28.732+02:00</Created>
    <LastChange>2020-10-16T13:21:28.894+02:00</LastChange>
  </Metadata>
  <Page imageFilename="IMG_0495.JPG" imageWidth="3888" imageHeight="2592">
    <TextRegion orientation="0.0" id="r1" custom="readingOrder {index:0;}">
      <Coords points="927,113 927,784 1049,784 1049,113"/>
      <TextLine id="r1l1" custom="readingOrder {index:0;}">
        <Coords points="922,239 943,239 958,229 970,229 977,220 983,223 994,219 991,164 978,165 972,162 966,169 955,171 941,183 919,183"/>
        <Baseline points="927,216 957,216 987,213"/>
        <TextEquiv>
          <Unicode>4.</Unicode>
        </TextEquiv>
      </TextLine>
    .....
```
Let's project. We have a problem: the IDs of the HTR and GT documents do not match... Luckily, the image file name of each page is preserved.
So we issue this sort of command, one per document:
```sh
mkdir gt.120777.449495
python $SRC/tasks/project_GT_by_location.py trnskrbs_120777/col/449495 trnskrbs_111796/col/744848 gt.120777.449495 --xpArea2=.//pg:TextRegion --pxml
```
Alternatively, a shell script can do them all:
```sh
SRC=~/Documents/git/DocumentUnderstanding-github/TranskribusDU/
HTR=120777
GT=111796
for ab in "449495 744848" "449493 744837" "443905 744836" "444284 744835" "449477 744833" "442595 744832"
do
    set $ab
    a=$1
    b=$2
    echo "htr=$a gt=$b"
    outdir=gt.$HTR.$a
    mkdir $outdir
    python $SRC/tasks/project_GT_by_location.py trnskrbs_$HTR/col/$a trnskrbs_$GT/col/$b $outdir --xpArea2=.//pg:TextRegion --pxml
done
```
We get one folder per document, with a sub-folder 'col' that contains the annotated pages (with text from the HTR and annotations from the ground truth).
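Conceptually, the projection matches pages by image file name, then transfers each GT region label to the HTR region that overlaps it most. Here is a hypothetical sketch of the geometric matching step, using shapely (installed above); it is not the actual project_GT_by_location.py implementation:

```python
from shapely.geometry import Polygon

def parse_points(points):
    # e.g. "1008,240 907,240 907,159 1008,159" -> Polygon
    return Polygon([tuple(map(int, xy.split(","))) for xy in points.split()])

def best_overlapping_region(gt_poly, htr_regions):
    # htr_regions: {region_id: Polygon} parsed from the HTR page;
    # return the id of the HTR region with the largest intersection area
    best_id, best_area = None, 0.0
    for rid, poly in htr_regions.items():
        area = gt_poly.intersection(poly).area
        if area > best_area:
            best_id, best_area = rid, area
    return best_id

# usage: the GT label would then be copied onto the matched HTR region
gt = parse_points("1008,240 907,240 907,159 1008,159")
htr = {"r1": parse_points("927,113 927,784 1049,784 1049,113")}
print(best_overlapping_region(gt, htr))  # -> r1
```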
Let's do the same for the Mandatenbuch collection:
```sh
python $TRPYC/Transkribus_downloader.py $CREDENTIALS --noImage 111801
python $TRPYC/Transkribus_downloader.py $CREDENTIALS --noImage 120778

HTR=120778
GT=111801
for ab in "439969 744860" "443189 744859" "441731 744858" "444208 744857" "444203 744856" "443765 744855" "744850 744850" "742265 742265" "742263 742263" "441746 630254" "439940 629338"
do
    set $ab
    a=$1
    b=$2
    echo "htr=$a gt=$b"
    outdir=gt.$HTR.$a
    mkdir $outdir
    python $SRC/tasks/project_GT_by_location.py trnskrbs_$HTR/col/$a trnskrbs_$GT/col/$b $outdir --xpArea2=.//pg:TextRegion --pxml
done
```
For the collection StAB_Policeybuch_LA, we got these HTR files with projected annotations (most pages have no annotation):
| 120777 (HTR) | 111796 (GT) | Number of annotated pages |
|---|---|---|
| 449495 | 744848 | 17 |
| 449493 | 744837 | 24 |
| 443905 | 744836 | 20 |
| 444284 | 744835 | 26 |
| 449477 | 744833 | 26 |
| 442595 | 744832 | 42 |
```
> python $SRC/tasks/DU_analyze_collection.py 'gt.120777.*/col' '*.pxml'
------------------------------------------------------------
----- 155 documents, 155 pages
[...]
----- Label frequency for ALL 5 objects of interest: ['TextRegion', 'GraphicRegion', 'CharRegion', 'RelationType', 'TextLine']
-- TextRegion     1083 occurences   889 labelled
   - caption          1 occurences  ( 0.1%) ( 0.1%)
   - heading        189 occurences  (  17%) (  21%)
   - page-number    289 occurences  (  27%) (  33%)
   - paragraph      410 occurences  (  38%) (  46%)
   - <unlabeled>    194 occurences  (  18%)
-- GraphicRegion     0 occurences     0 labelled
   - <unlabeled>      0 occurences  ( n/a)
-- CharRegion        0 occurences     0 labelled
   - <unlabeled>      0 occurences  ( n/a)
-- RelationType      0 occurences     0 labelled
   - <unlabeled>      0 occurences  ( n/a)
-- TextLine       8172 occurences     0 labelled
   - <unlabeled>   8172 occurences  ( 100%)
------------------------------------------------------------
```
We get XML like:
<TextRegion type="heading" id="region_1633456402500_3559" custom="readingOrder {index:1;} structure {type:heading;}"> <Coords points="2122,553 1034,553 1034,175 2122,175"/> <TextLine id="r2l1" custom="readingOrder {index:0;}"> <Coords points="1433,324 1492,326 1537,299 1628,293 1686,298 1743,331 1766,331 1761,228 1673,222 1612,256 1561,219 1541,221 1457,287 1431,292"/> <Baseline points="1439,305 1465,301 1492,297 1518,295 1545,293 1571,292 1598,291 1624,291 1651,290 1677,290 1704,289 1730,287 1757,285"/> <TextEquiv> <Unicode>Irdonnanz</unicode> </textequiv> </textline> <TextLine id="r2l2" custom="readingOrder {index:1;}"> <Coords points="1103,438 1822,388 2123,406 2121,337 1101,376"/> <Baseline points="1109,417 1159,417 1209,417 1260,416 1310,414 1361,413 1411,411 1461,408 1512,405 1562,403 1613,401 1663,399 1713,396 1764,395 1814,393 1865,392 1915,392 1965,392 2016,393 2066,395 2117,398"/> <TextEquiv> <Unicode>Wagen den Bewehren und deren Qualibres and</unicode> </textequiv> </textline>
```
> python $SRC/tasks/DU_analyze_collection.py 'gt.120778.*/col' '*.pxml'
199 files
------------------------------------------------------------
----- 199 documents, 199 pages
[...]
----- Label frequency for ALL 5 objects of interest: ['TextRegion', 'GraphicRegion', 'CharRegion', 'RelationType', 'TextLine']
-- TextRegion     1336 occurences  1015 labelled
   - caption          5 occurences  ( 0.4%) ( 0.5%)
   - heading        201 occurences  (  15%) (  20%)
   - marginalia       1 occurences  ( 0.1%) ( 0.1%)
   - page-number    335 occurences  (  25%) (  33%)
   - paragraph      473 occurences  (  35%) (  47%)
   - <unlabeled>    321 occurences  (  24%)
-- GraphicRegion     0 occurences     0 labelled
   - <unlabeled>      0 occurences  ( n/a)
-- CharRegion        0 occurences     0 labelled
   - <unlabeled>      0 occurences  ( n/a)
-- RelationType      0 occurences     0 labelled
   - <unlabeled>      0 occurences  ( n/a)
-- TextLine       8089 occurences     0 labelled
   - <unlabeled>   8089 occurences  ( 100%)
```
We merge all annotated HTR files into one main folder, renaming the page files to make their names globally unique.
```sh
mkdir gt.120x.all
mkdir gt.120x.all/col
for i in `/bin/ls -d gt.12077?.*`
do
    echo $i
    cp $i/col/IMG_*.pxml gt.120x.all/col
    rename IMG_ $i.IMG_ gt.120x.all/col/IMG_*.pxml
done
```
We now have 354 .pxml files in the folder gt.120x.all/col. They have names like `gt.120778.439940.IMG_20200724_101422.pxml`.
Then we partition the set of pages into 3 parts: 75% for training, 15% for validation, and 10% for testing.
```sh
python $SRC/tasks/DU_split_collection.py gt.120x.all 75,15,10

# rename the parts, for convenience. Note that Part_3 is the 75% (reverse order...)
mv gt.120x.all_part_3 gt120x_trn
mv gt.120x.all_part_2 gt120x_vld
mv gt.120x.all_part_1 gt120x_tst

# counting files in each part
for i in `ls -d gt120x_???`; do echo $i; ls $i/col | wc -l; done
gt120x_trn/
264
gt120x_tst/
36
gt120x_vld/
54
```
Now we are ready for training!
We will tag using these categories: heading, paragraph, page-number.
Everything else will be 'other'. Actually, some labels are missing, so we will see "None" labels in the evaluation.
One problem is that, at the TextLine level, the label statistics are unbalanced:
```
Labels count: [  451   886 10268   365]  (237 graphs)
Labels      : ['tag_OTHER', 'tag_heading', 'tag_paragraph', 'tag_page-number']
```
We are not sure how negatively this will affect the models. (Always predicting 'paragraph' would already yield 86% accuracy... :-)
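As a quick check of that baseline figure, from the label counts above:

```python
# majority-class baseline computed from the label counts above
counts = {"tag_OTHER": 451, "tag_heading": 886,
          "tag_paragraph": 10268, "tag_page-number": 365}
total = sum(counts.values())                     # 11970 TextLines
print(f"{counts['tag_paragraph'] / total:.1%}")  # -> 85.8%
```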
The text is difficult to exploit, since some documents do not have any text.
The command is:
```sh
python $GIT/usecases/StAB/DU_Tagger.py models 452398 --trn gt120x_trn --vld gt120x_vld --tst gt120x_tst --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml
```

What it means:
- we create a model "452398", stored in the sub-directory "models", using the dataset created in the previous sections of this document.
- the (arbitrary) graph we create is of the "g1o" style.
- we use the EdgeConvolutionNetwork (ECN) neural model, in the given configuration.
- the extension of the files to process is .pxml.
The result we got is:
```
TEST REPORT FOR: 452398
Line=True class, column=Prediction
tag_OTHER        [[  11    2   17    2]
tag_heading       [   0   69   26    4]
tag_paragraph     [  22   17 1602    4]
tag_page-number   [   3    0    2   62]]
(unweighted) Accuracy score = 94.63 %    trace=1744  sum=1843

                 precision  recall  f1-score  support
tag_OTHER            0.306   0.344     0.324       32
tag_heading          0.784   0.697     0.738       99
tag_paragraph        0.973   0.974     0.973     1645
tag_page-number      0.861   0.925     0.892       67
avg / total          0.947   0.946     0.946     1843
```
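To read this report: rows are the true classes, columns the predictions, and the accuracy is the trace divided by the total. A quick cross-check (numpy is pulled in by the environment installed above):

```python
import numpy as np

# confusion matrix from the report above (rows = truth, columns = prediction)
cm = np.array([[  11,   2,   17,  2],   # tag_OTHER
               [   0,  69,   26,  4],   # tag_heading
               [  22,  17, 1602,  4],   # tag_paragraph
               [   3,   0,    2, 62]])  # tag_page-number

print(np.trace(cm) / cm.sum())       # 1744/1843 -> 0.9463
print(np.diag(cm) / cm.sum(axis=0))  # per-class precision (column sums)
print(np.diag(cm) / cm.sum(axis=1))  # per-class recall (row sums)
```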
NOTE: it is worth doing ~3 trainings and keeping the best model.
Now let's tag at the TextRegion level; here we do not use the text.
The command is:
```sh
python $GIT/usecases/StAB/DU_Tagger.py models 411597 --trn gt120x_trn --vld gt120x_vld --tst gt120x_tst --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml --TextRegion --no_text
```

What we changed:
- the model name!
- added the options --TextRegion --no_text
The result we got is:
```
TEST REPORT FOR: 411597
Line=True class, column=Prediction
TR_tag_OTHER        [[33  0  0  1]
TR_tag_heading       [ 1 28  0  0]
TR_tag_paragraph     [ 1  3 82  0]
TR_tag_page-number   [ 1  0  0 69]]
(unweighted) Accuracy score = 96.80 %    trace=212  sum=219

                    precision  recall  f1-score  support
TR_tag_OTHER            0.917   0.971     0.943       34
TR_tag_heading          0.903   0.966     0.933       29
TR_tag_paragraph        1.000   0.953     0.976       86
TR_tag_page-number      0.986   0.986     0.986       70
avg / total             0.970   0.968     0.968      219
```
This is great! But let's see how good the segmentation is... (because on "production" files, we will have to predict the TextRegion elements in order to tag them).
The command is:
```sh
# NOTE: to evaluate the clustering, we need to produce some output,
# using --run instead of --tst, together with the option --eval_region
python $GIT/usecases/StAB/DU_Seg.py models 452773 --trn gt120x_trn --vld gt120x_vld --run gt120x_tst --eval_region --g1o --ecn --ext=.pxml --ecn_config=ecn_8Lay4Conv128.json
```
The outcome is:
```
ALL_region @simil 0.66   P 78.30  R 80.58  F1 79.43   ok=166  err=46  miss=40
ALL_region @simil 0.80   P 74.06  R 76.21  F1 75.12   ok=157  err=55  miss=49
ALL_region @simil 1.00   P 60.85  R 62.62  F1 61.72   ok=129  err=83  miss=77
```
How to read this metric?
- given a similarity threshold, e.g. 0.80, we consider a predicted cluster to be valid if its IoU (intersection over union) with a ground-truth cluster exceeds 0.80. Roughly speaking, the IoU indicates the proportion of the cluster content that is correct. More precisely, the IoU is the ratio of the size of the intersection of the predicted cluster with a GT cluster to the size of their union.
- given a similarity threshold, we determine which predicted clusters are valid, and we can then compute the usual precision, recall, and F1 scores (see the sketch below).
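A minimal sketch of this metric, treating clusters as sets of TextLine IDs (illustrative only: this is not the TranskribusDU evaluation code, and the greedy matching used here is just one possible strategy):

```python
def iou(pred, gt):
    # intersection over union of two clusters (sets of TextLine ids)
    return len(pred & gt) / len(pred | gt)

def cluster_prf(predicted, groundtruth, threshold):
    # greedily match each predicted cluster to the most similar unused
    # GT cluster; it counts as 'ok' if the IoU reaches the threshold
    remaining, ok = list(groundtruth), 0
    for p in predicted:
        best = max(remaining, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= threshold:
            ok += 1
            remaining.remove(best)
    precision = ok / len(predicted) if predicted else 0.0
    recall = ok / len(groundtruth) if groundtruth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if ok else 0.0
    return precision, recall, f1

# at threshold 0.80: the first prediction matches its GT cluster exactly,
# the second covers only half of its GT cluster -> P=R=F1=0.5
pred = [{"l1", "l2", "l3"}, {"l4"}]
gt   = [{"l1", "l2", "l3"}, {"l4", "l5"}]
print(cluster_prf(pred, gt, 0.80))
```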
Here we continue to use the same test set: gt120x_tst.
Here we will use the --run option of the tools to produce the output XML. This is particularly important for the segmentation task because, in --run mode, we destroy the old TextRegion elements and create new, predicted ones.
First, let's tag the TextLine elements of the XML data and copy the result into another folder, where we will segment it.
```sh
python $GIT/usecases/StAB/DU_Tagger.py models 452398 --run gt120x_tst --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml
mkdir tag_out_pxml tag_out_pxml/col
cp gt120x_tst/col/*_du.mpxml tag_out_pxml/col
# rename it as if it were an input file
rename _du.mpxml .pxml tag_out_pxml/col/*
```
Second, we segment it:
```sh
python $GIT/usecases/StAB/DU_Seg.py models 452773 --run tag_out_pxml --g1o --ecn --ext=.pxml --ecn_config=ecn_8Lay4Conv128.json
```
Finally, let's measure the quality at the TextLine level (which should be similar to what we got before):
```
> python $GIT/TranskribusDU/tasks/eval_classif.py tag_out_pxml/col .mpxml //pc:TextLine ./@type_gt ./@type
--- Fri Oct 29 13:44:06 2021---------------------------------
TEST REPORT FOR: Classification
Line=True class, column=Prediction
None          [[  11    0    0    2   16]
heading        [   0   69    0    4   26]
marginalia     [   0    2    0    0    1]
page-number    [   3    0    0   62    2]
paragraph      [  22   17    0    4 1602]]
(unweighted) Accuracy score = 94.63 %    trace=1744  sum=1843

              precision  recall  f1-score  support
None              0.306   0.379     0.338       29
heading           0.784   0.697     0.738       99
marginalia        0.000   0.000     0.000        3
page-number       0.861   0.925     0.892       67
paragraph         0.973   0.974     0.973     1645
avg / total       0.946   0.946     0.946     1843
--------------------------------------------------------------
```
We see the presence of "None" labels. This is due to some missing annotations.
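For reference, the comparison that eval_classif.py performs can be pictured as below. This is a hypothetical sketch (assuming the standard 2013-07-15 PAGE namespace), not the real tool, which also prints the confusion matrix and per-class scores:

```python
from glob import glob
from lxml import etree

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

y_true, y_pred = [], []
for f in glob("tag_out_pxml/col/*.mpxml"):
    doc = etree.parse(f)
    for line in doc.xpath("//pc:TextLine", namespaces=NS):
        y_true.append(line.get("type_gt"))  # projected GT label (may be None)
        y_pred.append(line.get("type"))     # predicted label

correct = sum(t == p for t, p in zip(y_true, y_pred))
print(f"Accuracy: {correct / len(y_true):.2%}")
```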
OK, let's segment and put the result in a folder as if it were some input data:
```sh
python $GIT/usecases/StAB/DU_Seg.py models 452773 --run gt120x_tst --g1o --ecn --ext=.pxml --ecn_config=ecn_8Lay4Conv128.json
mkdir seg_out_pxml seg_out_pxml/col
cp gt120x_tst/col/*_du.mpxml seg_out_pxml/col
rename _du.mpxml .pxml seg_out_pxml/col/*
```
Now, tag at the **TextRegion** level.
NOTE: since the annotation is in the TextRegion attributes, and since we destroy those TextRegion elements to create new ones (not necessarily matching the original ones :-( ), we copy those attributes onto the inner TextLine elements before the TextRegion destruction.
So the annotation is now on the TextLine elements; a sketch of this propagation follows.
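A minimal sketch of that propagation (hypothetical code, assuming the GT label sits in the TextRegion type attribute and is kept as type_gt on each TextLine, matching the XPaths used in the evaluations):

```python
from lxml import etree

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def push_region_type_to_lines(page_file):
    # copy each labelled TextRegion's type onto its TextLines as type_gt,
    # so the label survives when the TextRegion itself is destroyed
    doc = etree.parse(page_file)
    for region in doc.xpath("//pc:TextRegion[@type]", namespaces=NS):
        for line in region.xpath(".//pc:TextLine", namespaces=NS):
            line.set("type_gt", region.get("type"))
    doc.write(page_file, xml_declaration=True, encoding="utf-8")
```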
```sh
python $GIT/usecases/StAB/DU_Tagger.py models 411597 --run seg_out_pxml --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml --TextRegion --no_text
```
Evaluation:
```
> python $GIT/TranskribusDU/tasks/eval_classif.py seg_out_pxml/col .mpxml //pc:TextLine/@type_gt //pc:TextLine/@type
--- Fri Oct 29 09:36:16 2021---------------------------------
TEST REPORT FOR: Classification
Line=True class, column=Prediction
None          [[  22    1    0    4    2]
heading        [   1   56    0    4   38]
marginalia     [   3    0    0    0    0]
page-number    [   5    4    0   57    1]
paragraph      [  83  122    0   11 1429]]
(unweighted) Accuracy score = 84.86 %    trace=1564  sum=1843

              precision  recall  f1-score  support
None              0.193   0.759     0.308       29
heading           0.306   0.566     0.397       99
marginalia        0.000   0.000     0.000        3
page-number       0.750   0.851     0.797       67
paragraph         0.972   0.869     0.917     1645
avg / total       0.914   0.849     0.874     1843
```
On this use case, given that headings seem critical for exploiting the data, tagging at the TextLine level would allow us to catch 69 headings out of 99, while 4 are predicted as page-number and 26 as paragraph.
We will use almost all the available data for training.
For instance, we will train on both the train set and a large part of the test set (keeping the validation set for model selection). We keep only a few test files, to make sure we do not produce something completely broken by a (big) mistake.
Creating a minimal test set:
```
cp -r gt120x_tst gt120x_prod_tst_trn
mkdir -p gt120x_prod_tst_tst/col
pushd gt120x_prod_tst_trn/col
rm *.mpxml    # old output files, if any
# move a few files to the test folder, for sanity check
mv gt.120777.442595.IMG_1092.pxml gt.120777.444284.IMG_9205.pxml ../../gt120x_prod_tst_tst/col
mv gt.120778.439940.IMG_20200724_101311.pxml gt.120778.443189.IMG_4149.pxml ../../gt120x_prod_tst_tst/col
popd

# count
ls gt120x_prod_tst_trn/col | wc -l
ls gt120x_prod_tst_tst/col | wc -l

# run the current model on this new test collection to see its performance
python $GIT/usecases/StAB/DU_Tagger.py models 452398 --tst gt120x_prod_tst_tst --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml

--- Thu Dec 16 10:03:47 2021---------------------------------
TEST REPORT FOR: 452398
Line=True class, column=Prediction
tag_OTHER        [[ 46   0   0   0]
tag_heading       [  0  11   3   1]
tag_paragraph     [  1   1 115   0]
tag_page-number   [  0   0   1   4]]
(unweighted) Accuracy score = 96.17 %    trace=176  sum=183

                 precision  recall  f1-score  support
tag_OTHER            0.979   1.000     0.989       46
tag_heading          0.917   0.733     0.815       15
tag_paragraph        0.966   0.983     0.975      117
tag_page-number      0.800   0.800     0.800        5
avg / total          0.961   0.962     0.960      183

python $GIT/usecases/StAB/DU_Seg.py models 708931 --run gt120x_prod_tst_tst --g1o --ecn --ext=.pxml --ecn_config=ecn_8Lay4Conv128.json --eval_region

(unweighted) Accuracy score = 81.04 %    trace=218  sum=269
...
ALL_region @simil 0.66   P 72.22  R 20.63  F1 32.10   ok=13  err=5  miss=50
ALL_region @simil 0.80   P 66.67  R 19.05  F1 29.63   ok=12  err=6  miss=51
ALL_region @simil 1.00   P 61.11  R 17.46  F1 27.16   ok=11  err=7  miss=52

# NOTE 1: we were not lucky in our file selection, because 2 out of 4 files fail completely...
# NOTE 2: due to a disk crash, we lost the model 452773 (mentioned above) and used 708931 instead
# NOTE 3: this model reached, during training:
#         (unweighted) BEST PERFORMANCE (acc= 95.91 %) on valid set at Epoch 240
```
Now we train a few models, keeping only the best one (selected by looking at the training loss on the validation set and/or the performance on the mini test set).
```sh
python $GIT/usecases/StAB/DU_Tagger.py models prod_tag_model --trn gt120x_trn --trn gt120x_prod_tst_trn --vld gt120x_vld --tst gt120x_prod_tst_tst --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml --no_text

python $GIT/usecases/StAB/DU_Seg.py models prod_seg_model --trn gt120x_trn --trn gt120x_prod_tst_trn --vld gt120x_vld --tst gt120x_prod_tst_tst --run gt120x_prod_tst_tst --eval_region --g1o --ecn --ext=.pxml --ecn_config=ecn_8Lay4Conv128.json
```
A trick to select the model, assuming you stored the logs of the trainings in some folder:
```sh
# for the tagging model
egrep -e '(BEST PERFORMANCE|Accuracy score|tag_heading)' tag_log_files/*
# for the segmentation model
egrep -e '(BEST PERFORMANCE|Accuracy score|ALL_region)' seg_log_files/*
```
Strangely, there was no way to beat the best tagging model so far, 452398: it is best on OTHER and on par on the other tags. Let's keep it.
For the segmentation, we got a slightly better model, 1274574:
```
BEST PERFORMANCE (acc= 96.24 %) on valid set at Epoch 170
log/1274574:(unweighted) Accuracy score = 81.04 %    trace=218  sum=269
log/1274574:ALL_region @simil 0.66   P 62.07  R 28.57  F1 39.13   ok=18  err=11  miss=45
log/1274574:ALL_region @simil 0.80   P 58.62  R 26.98  F1 36.96   ok=17  err=12  miss=46
log/1274574:ALL_region @simil 1.00   P 55.17  R 25.40  F1 34.78   ok=16  err=13  miss=47
```
We will use model 452398 for tagging and model 1274574 for segmenting.
NEWS: segmentation is done after tagging, hence the segmentation step can use the labels of the TextLine elements, and textual features are no longer needed. Final models: 720137 for tagging, 1281545 for segmentation.
## How to apply the pipeline on your full collection
```sh
python /tmp-network/user/hdejean/git/TranskribusPyClient/src/TranskribusCommands/Transkribus_downloader.py 129687 --noImage $CREDENTIALS
python /tmp-network/user/hdejean/git/TranskribusPyClient/src/TranskribusCommands/Transkribus_downloader.py 129686 --noImage $CREDENTIALS

python $GIT/usecases/StAB/DU_Tagger.py models 452398 --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --run PROD/trnskrbs_129686 --run PROD/trnskrbs_129687

mkdir -p {trnskrbs_129686,trnskrbs_129687}_out/col
# for upload: need to copy <COL>/col/<DOCID>/.*$.trp.json
# otherwise: Warning: cannot set Parent-ID because file not found: /trnskrbs_129686_out/col/865313/trp.json

cp trnskrbs_129687/col/*_du.mpxml trnskrbs_129687_out/col/
cp trnskrbs_129686/col/*_du.mpxml trnskrbs_129686_out/col/

rename _du.mpxml .mpxml trnskrbs_129686_out/col/*.mpxml
rename _du.mpxml .mpxml trnskrbs_129687_out/col/*.mpxml

python $GIT/usecases/StAB/DU_Seg.py models 708931 --g1o --ecn --run PROD/trnskrbs_129686_out --run PROD/trnskrbs_129687_out

cd PROD
python /tmp-network/user/hdejean/git/TranskribusPyClient/src/TranskribusCommands/TranskribusDU_transcriptUploader.py trnskrbs_129686_out/ 129686 $CREDENTIALS
python /tmp-network/user/hdejean/git/TranskribusPyClient/src/TranskribusCommands/TranskribusDU_transcriptUploader.py trnskrbs_129687_out/ 129687 $CREDENTIALS
```
We have shown:

1. how to tag TextLine or TextRegion elements;
2. how to segment them, i.e. form TextRegion elements;
3. how to pipe the two processes.

On this use case, it turns out that tagging TextLine elements and then forming new TextRegion elements works better than the alternative, which was to first create TextRegion elements and then tag those TextRegion elements.