
Annemieke's Usecase: StAB


We discuss here the StAB use case of Annemieke, using NAVER LABS Europe tools.


SW Installation

Git

This work was done with commit 5ab2cfa (Oct 29th, 2021), but it should work fine with more recent versions of the code.

We need to clone the GitHub repositories for the Transkribus pythonic API and toolset, as well as for the Document Understanding code.

  git clone https://github.com/Transkribus/TranskribusPyClient.git
  git clone https://github.com/Transkribus/TranskribusDU.git DocumentUnderstanding-github

We advise you to set an environment variable, which is used in the rest of this document:

  export GIT=...../DocumentUnderstanding-github

And the PYTHONPATH:

  export PYTHONPATH=$GIT/TranskribusDU

Python and the required modules

Using Anaconda...

# creating a separate virtual environment
conda create -n tf2
conda activate tf2

# installing everything
conda install future scikit-learn pytest lxml scipy rtree shapely matplotlib
conda install pygeos --channel conda-forge
conda install tensorflow

Then install Python3 and follow these instructions from the DU repo to install the required Python libraries.

Here, for example:

  > python --version
  Python 3.7.10
  > python -c 'import tensorflow as tf;print(tf.__version__)'
  2.0.0
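
Optionally, check that the PYTHONPATH set above is picked up (a throwaway one-liner, not from the repos):

  > python -c "import sys; print([p for p in sys.path if 'TranskribusDU' in p])"

It should print a list containing the $GIT/TranskribusDU path.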

Collections

Notes

Some upside-down pages

collection 120778 Mandatenbuch, doc StAB_A_I_495_Mandaatbuch_15 ID:443135, around page 250

Pre-processing for DU Training

In this case, given an image, the GT XML and the HTR XML are in two separate files, which is unfortunate. So we will have to merge them by projecting the annotations into the XML produced by the HTR tool, to obtain the ground-truth (GT) collections.

For the collection StAB_Policeybuch_LA it seems we have the following mapping

120777 (HTR) 111796 (GT)
449495 744848
449493 744837
443905 744836
444284 744835
449477 744833
442595 744832

For the collection StAB_Mandatenbuch_LA it seems we have the following mapping

120778 (HTR) 111801 (GT) name
439969 744860 LA_StaB_A_I_481_Mandaatbuch_3
443189 744859 LA_StAB_A_I_497 Mandaatbuch_17
441731 744858 LA_StaB_A_I_484_Mandaatbuch_6
444208 744857 LA_StAB_A_I_512_Mandaatbuch_32
444203 744856 LA_StAB_A_I_514_Mandaatbuch_34
443765 744855 LA_StAB_A_I_503_Mandaatbuch_23
744850 744850 StAB_A_I_507 Mandaatbuch_27
742265 742265 StAB_A_I_508_Mandaatbuch_28
742263 742263 StAB_A_I_506_Mandaatbuch_26
441746 630254 StaB_A_I_490_Mandaatbuch_10_LA
439940 629338 StaB_A_I_480_Mandaatbuch_2_LA

Select the GT pages

 TRPYC=~/Documents/git/TranskribusPyClient/src/TranskribusCommands
 CREDENTIALS="--login [email protected] --pwd XXX"

The LA tool was applied on 21/10/2021. The GT pages are the last "in progress" pages before 20/10/2021. Use do_transcript.py and store the selected pages in DOCID.trp:

 for i in 744848 744837  744836 744835 744833 744832 ; 
 do python $TRPYC/do_transcript.py 111796  $i   $CREDENTIALS --before 2021-10-20 --last_fil --status IN_PROGRESS --trp $i.trp
 done
 #Mandatenbuch 
 for i in 744860 744859 744858 744857 744856 744855 744850 742265 742263 630254 629338 
 do python $TRPYC/do_transcript.py 111801  $i   $CREDENTIALS --before 2021-10-20 --last_fil --status IN_PROGRESS --trp $i.trp
 done

Download

We first download the annotated and corresponding HTR collections.

 # download the annotated collection into an automatically created folder, without the images (which we do not use)
 # StAB_Policeybuch_LA
 for i in 744848 744837  744836 744835 744833 744832 
 do  python $TRPYC/Transkribus_downloader.py --noImage 111796 --trp $i.trp $CREDENTIALS --force
 done
 > ls trnskrbs_111796/col
 744832          744832_max.ts   744833.mpxml    744835          744835_max.ts   744836.mpxml    744837          744837_max.ts   744848.mpxml    trp.json
 744832.mpxml    744833          744833_max.ts   744835.mpxml    744836          744836_max.ts   744837.mpxml    744848          744848_max.ts
  # download the corresponding non-annotated books (and only them)
  # NOTE/WARNING: 2 documents share the same name: StAB_A_I_468_Policeybuch_14 (I'm taking 449495)
  #for docid in 449495 449493 443905 444284 449477 442595
  #do
  #python $TRPYC/Transkribus_downloader.py $CREDENTIALS --noImage --docid $docid 120777
  #done
  # ACTUALLY, it is better to download the whole collection
  python $TRPYC/Transkribus_downloader.py $CREDENTIALS --noImage 120777

Project the annotations to the HTRed documents

In the GT pages, we have things like this (Policeybuch_14, p. 5):

 <PcGts ... >
    <Metadata>
        <Creator>Transkribus</Creator>
        <Created>2020-08-11T10:25:28.732+02:00</Created>
        <LastChange>2021-10-05T19:53:45.113+02:00</LastChange>
        <TranskribusMetadata docId="744848" pageId="28255428" pageNr="5"/>
    </Metadata>
    <Page imageFilename="IMG_0495.JPG" imageWidth="3888" imageHeight="2592">
      <TextRegion type="page-number" id="region_1633456404101_3562" custom="readingOrder {index:0;} structure {type:page-number;}">
            <Coords points="1008,240 907,240 907,159 1008,159"/>
        </TextRegion>
  .....

In the HTRed pages, we have:

 <PcGts ... >
    <Metadata>
        <Creator>prov=...</Creator>
        <Created>2020-08-11T10:25:28.732+02:00</Created>
        <LastChange>2020-10-16T13:21:28.894+02:00</LastChange>
    </Metadata>
    <Page imageFilename="IMG_0495.JPG" imageWidth="3888" imageHeight="2592">
        <TextRegion orientation="0.0" id="r1" custom="readingOrder {index:0;}">
            <Coords points="927,113 927,784 1049,784 1049,113"/>
            <TextLine id="r1l1" custom="readingOrder {index:0;}">
                <Coords points="922,239 943,239 958,229 970,229 977,220 983,223 994,219 991,164 978,165 972,162 966,169 955,171 941,183 919,183"/>
                <Baseline points="927,216 957,216 987,213"/>
                <TextEquiv>
                    <Unicode>4.</Unicode>
                </TextEquiv>
            </TextLine>
  .....

Let's project. We have a problem: the IDs of the HTR and GT documents do not match... Luckily, the per-page file name is preserved, so pages can be paired by their image file name.
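
To give an intuition of what project_GT_by_location.py does, here is a minimal sketch, assuming PAGE XML files like the ones above and the 2013-07-15 PAGE namespace: each labelled GT TextRegion gives its type to the HTR region that overlaps it the most. The function and variable names are ours, and the real tool is more robust (and processes whole folders, as in the commands below).

from lxml import etree
from shapely.geometry import Polygon

NS = {"pg": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def regions_with_polygons(tree):
    # (TextRegion element, shapely polygon) pairs for one PAGE XML tree
    pairs = []
    for region in tree.xpath("//pg:TextRegion", namespaces=NS):
        points = region.xpath("./pg:Coords/@points", namespaces=NS)[0]
        pairs.append((region, Polygon([tuple(map(int, xy.split(","))) for xy in points.split()])))
    return pairs

def project(htr_file, gt_file, out_file):
    htr, gt = etree.parse(htr_file), etree.parse(gt_file)
    htr_pairs = regions_with_polygons(htr)
    for gt_region, gt_poly in regions_with_polygons(gt):
        label = gt_region.get("type")
        if label is None or not htr_pairs:
            continue
        # give the GT label to the HTR region overlapping the GT region the most
        best_region, best_poly = max(htr_pairs, key=lambda rp: rp[1].intersection(gt_poly).area)
        if best_poly.intersection(gt_poly).area > 0:
            best_region.set("type", label)
    htr.write(out_file, xml_declaration=True, encoding="UTF-8")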

Policeybuch

So we issue this sort of command, one per document (SRC points to the TranskribusDU folder of the repository, as set in the script below):

 mkdir gt.120777.449495
 python $SRC/tasks/project_GT_by_location.py trnskrbs_120777/col/449495 trnskrbs_111796/col/744848 gt.120777.449495 --xpArea2=.//pg:TextRegion --pxml

Alternatively, a shell script can do them all:

 SRC=~/Documents/git/DocumentUnderstanding-github/TranskribusDU/
 HTR=120777
 GT=111796
 for ab in "449495 744848"  "449493 744837"  "443905 744836"  "444284 744835"  "449477 744833"  "442595 744832"
 do
  set $ab
  a=$1
  b=$2
  echo "htr=$a  gt=$b"
 
  outdir=gt.$HTR.$a
  mkdir $outdir
  python $SRC/tasks/project_GT_by_location.py trnskrbs_$HTR/col/$a trnskrbs_$GT/col/$b $outdir --xpArea2=.//pg:TextRegion --pxml
 done

=> we get one folder per document, with a sub-folder 'col' that contains the annotated pages (with text from the HTR and annotations from the ground truth).

Mandatenbuch

Let's do similarly for the Mandatenbuch collection:

python $TRPYC/Transkribus_downloader.py $CREDENTIALS --noImage  111801
python $TRPYC/Transkribus_downloader.py $CREDENTIALS --noImage  120778
HTR=120778
GT=111801
for ab in "439969 744860"  "443189 744859"  "441731 744858"  "444208 744857"  "444203 744856"  "443765 744855"  "744850 744850"  "742265 742265"  "742263 742263"  "441746 630254"  "439940 629338"
do
  set $ab
  a=$1
  b=$2
  echo "htr=$a  gt=$b"
 
  outdir=gt.$HTR.$a
  mkdir $outdir
  python $SRC/tasks/project_GT_by_location.py trnskrbs_$HTR/col/$a trnskrbs_$GT/col/$b $outdir --xpArea2=.//pg:TextRegion --pxml
done

Some statistics

StAB_Policeybuch_LA

For the collection StAB_Policeybuch_LA we got these HTR files with projected annotations (most documents have no annotation):

120777 (HTR) 111796 (GT) Number of annotated pages
449495 744848 17
449493 744837 24
443905 744836 20
444284 744835 26
449477 744833 26
442595 744832 42
> python $SRC/tasks/DU_analyze_collection.py 'gt.120777.*/col' '*.pxml'
------------------------------------------------------------
 ----- 155 documents, 155 pages
[...]
 ----- Label frequency for ALL 5 objects of interest: ['TextRegion', 'GraphicRegion', 'CharRegion', 'RelationType', 'TextLine']
        -- TextRegion    1083 occurences  889 labelled
                -              caption       1 occurences       ( 0.1%)  ( 0.1%)
                -              heading     189 occurences       (  17%)  (  21%)
                -          page-number     289 occurences       (  27%)  (  33%)
                -            paragraph     410 occurences       (  38%)  (  46%)
                -          <unlabeled>     194 occurences       (  18%)
        -- GraphicRegion       0 occurences  0 labelled
                -          <unlabeled>       0 occurences       (  n/a)
        -- CharRegion       0 occurences  0 labelled
                -          <unlabeled>       0 occurences       (  n/a)
        -- RelationType       0 occurences  0 labelled
                -          <unlabeled>       0 occurences       (  n/a)
        -- TextLine    8172 occurences  0 labelled
                -          <unlabeled>    8172 occurences       ( 100%)
------------------------------------------------------------
 We get XML like:
    <TextRegion type="heading" id="region_1633456402500_3559" custom="readingOrder {index:1;} structure {type:heading;}">
      <Coords points="2122,553 1034,553 1034,175 2122,175"/>
      <TextLine id="r2l1" custom="readingOrder {index:0;}">
        <Coords points="1433,324 1492,326 1537,299 1628,293 1686,298 1743,331 1766,331 1761,228 1673,222 1612,256 1561,219 1541,221 1457,287 1431,292"/>
        <Baseline points="1439,305 1465,301 1492,297 1518,295 1545,293 1571,292 1598,291 1624,291 1651,290 1677,290 1704,289 1730,287 1757,285"/>
        <TextEquiv>
          <Unicode>Irdonnanz</Unicode>
        </TextEquiv>
      </TextLine>
      <TextLine id="r2l2" custom="readingOrder {index:1;}">
        <Coords points="1103,438 1822,388 2123,406 2121,337 1101,376"/>
        <Baseline points="1109,417 1159,417 1209,417 1260,416 1310,414 1361,413 1411,411 1461,408 1512,405 1562,403 1613,401 1663,399 1713,396 1764,395 1814,393 1865,392 1915,392 1965,392 2016,393 2066,395 2117,398"/>
        <TextEquiv>
          <Unicode>Wagen den Bewehren und deren Qualibres and</Unicode>
        </TextEquiv>
      </TextLine>

StAB_Mandatenbuch_LA

> python $SRC/tasks/DU_analyze_collection.py 'gt.120778.*/col' '*.pxml'
199 files
------------------------------------------------------------
 ----- 199 documents, 199 pages
[...]
----- Label frequency for ALL 5 objects of interest: ['TextRegion', 'GraphicRegion', 'CharRegion', 'RelationType', 'TextLine']
        -- TextRegion    1336 occurences  1015 labelled
                -              caption       5 occurences       ( 0.4%)  ( 0.5%)
                -              heading     201 occurences       (  15%)  (  20%)
                -           marginalia       1 occurences       ( 0.1%)  ( 0.1%)
                -          page-number     335 occurences       (  25%)  (  33%)
                -            paragraph     473 occurences       (  35%)  (  47%)
                -          <unlabeled>     321 occurences       (  24%)
        -- GraphicRegion       0 occurences  0 labelled
                -          <unlabeled>       0 occurences       (  n/a)
        -- CharRegion       0 occurences  0 labelled
                -          <unlabeled>       0 occurences       (  n/a)
        -- RelationType       0 occurences  0 labelled
                -          <unlabeled>       0 occurences       (  n/a)
        -- TextLine    8089 occurences  0 labelled
                -          <unlabeled>    8089 occurences       ( 100%)

Split in train / validation / test

We merge all annotated HTR files into one main folder, renaming the page files to make them globally unique.

mkdir gt.120x.all
mkdir gt.120x.all/col
for i in `/bin/ls -d gt.12077?.*`
do 
 echo $i
 cp $i/col/IMG_*.pxml gt.120x.all/col
 rename IMG_ $i.IMG_ gt.120x.all/col/IMG_*.pxml
done

We now have 354 .pxml files in the folder gt.120x.all/col. They have names like: gt.120778.439940.IMG_20200724_101422.pxml

Then we partition the set of pages into 3 parts: 75% for training, 15% for validation, and 10% for testing.

python $SRC/tasks/DU_split_collection.py  gt.120x.all 75,15,10

# rename the parts, for convenience. Note that Part_3 is the 75% (reverse order...)
mv gt.120x.all_part_3 gt120x_trn
mv gt.120x.all_part_2 gt120x_vld
mv gt.120x.all_part_1 gt120x_tst

# counting files in each part
for i in `ls -d gt120x_???`; do echo $i;ls $i/col|wc -l; done
gt120x_trn/
264
gt120x_tst/
36
gt120x_vld/
54

Now we are ready for training!

Training

Tagging

We will tag using these categories: heading, paragraph, page-number.

Everything else will be 'other'. Actually, some labels are missing and we will see "None" labels in the evaluation.

One problem is that, at the TextLine level, the label distribution is unbalanced:

   Labels count: [  451   886 10268   365] (237 graphs)
   Labels      : ['tag_OTHER', 'tag_heading', 'tag_paragraph', 'tag_page-number']

Not sure how negatively it will affect the models. (Always predicting 'paragraph' would already yield 86% accuracy... :-)
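
As a quick check of that 86% figure using the counts above (a throwaway snippet, not part of the toolkit):

counts = {"tag_OTHER": 451, "tag_heading": 886, "tag_paragraph": 10268, "tag_page-number": 365}
print(max(counts.values()) / sum(counts.values()))   # 0.8578..., i.e. ~86% when always predicting 'paragraph'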

The text is difficult to exploit, since some documents do not have any text.

Tagging TextLine

The command is:

python $GIT/usecases/StAB/DU_Tagger.py models 452398 --trn gt120x_trn --vld gt120x_vld --tst gt120x_tst --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml

What it means:

- we create a model "452398" stored in sub-directory "models" using the dataset created in previous sections of this document.

- the (arbitrary) graph we create is of "g1o" style.

- we use the EdgeConvolutionNetwork (ECN) neural model, in the given configuration

- the extension of the files to process is .pxml

The result we got is:

TEST REPORT FOR: 452398

  Line=True class, column=Prediction
      tag_OTHER  [[  11    2   17    2]
    tag_heading   [   0   69   26    4]
  tag_paragraph   [  22   17 1602    4]
tag_page-number   [   3    0    2   62]]

(unweighted) Accuracy score = 94.63 %    trace=1744  sum=1843

                 precision    recall  f1-score   support

      tag_OTHER      0.306     0.344     0.324        32
    tag_heading      0.784     0.697     0.738        99
  tag_paragraph      0.973     0.974     0.973      1645
tag_page-number      0.861     0.925     0.892        67

    avg / total      0.947     0.946     0.946      1843

NOTE: it is worth doing ~3 trainings and keeping only the best model.

Tagging TextRegion

Here we do not use the text.

The command is:

python $GIT/usecases/StAB/DU_Tagger.py models 411597 --trn gt120x_trn --vld gt120x_vld --tst gt120x_tst --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml --TextRegion --no_text

What we changed:

- the model name!

- added the options --TextRegion --no_text

The result we got is:

TEST REPORT FOR: 411597

  Line=True class, column=Prediction
      TR_tag_OTHER  [[33  0  0  1]
    TR_tag_heading   [ 1 28  0  0]
  TR_tag_paragraph   [ 1  3 82  0]
TR_tag_page-number   [ 1  0  0 69]]

(unweighted) Accuracy score = 96.80 %    trace=212  sum=219

                    precision    recall  f1-score   support

      TR_tag_OTHER      0.917     0.971     0.943        34
    TR_tag_heading      0.903     0.966     0.933        29
  TR_tag_paragraph      1.000     0.953     0.976        86
TR_tag_page-number      0.986     0.986     0.986        70

       avg / total      0.970     0.968     0.968       219

This is great! But let's see how good the segmentation is... (Because on "production" files, we will have to predict the TextRegion in order to tag them...)

Segmentation of TextLine nodes into TextRegions (aka Clustering)

The command is:

# NOTE: to evaluate the clustering, we need to produce some output, using --run instead of --tst, and with the option --eval_region
python $GIT/usecases/StAB/DU_Seg.py models 452773 --trn gt120x_trn --vld gt120x_vld --run gt120x_tst --eval_region --g1o --ecn --ext=.pxml --ecn_config=ecn_8Lay4Conv128.json

The outcome is:

ALL_region  @simil 0.66   P 78.30  R 80.58  F1 79.43         ok=166  err=46  miss=40
ALL_region  @simil 0.80   P 74.06  R 76.21  F1 75.12         ok=157  err=55  miss=49
ALL_region  @simil 1.00   P 60.85  R 62.62  F1 61.72         ok=129  err=83  miss=77

How to read this metric?

- given a similarity threshold, e.g. 0.80, we consider a predicted cluster to be valid if its IoU (intersection over union) with a GT cluster exceeds 0.8. Roughly speaking, the IoU indicates the proportion of the cluster content that is correct. More precisely, the IoU is the ratio of the size of the intersection between the predicted cluster and a GT cluster to the size of their union.

- given a similarity threshold, we determine which predicted clusters are valid, and we can then compute the usual precision, recall and F1 scores (see the sketch just below).
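
Here is a minimal sketch of how such figures can be computed, treating each region/cluster as the set of TextLine ids it contains. It is our own approximation of the metric (notably in how multiple matches are counted), not the tool's exact code:

def iou(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cluster_prf(pred_clusters, gt_clusters, threshold=0.8):
    # a predicted cluster is "ok" if some GT cluster matches it with IoU >= threshold
    ok = sum(1 for p in pred_clusters if any(iou(p, g) >= threshold for g in gt_clusters))
    err = len(pred_clusters) - ok   # predicted clusters without a good match
    miss = len(gt_clusters) - ok    # GT clusters left unmatched (approximation)
    P = ok / (ok + err) if pred_clusters else 0.0
    R = ok / (ok + miss) if gt_clusters else 0.0
    F1 = 2 * P * R / (P + R) if (P + R) else 0.0
    return P, R, F1

# e.g. the "@simil 0.80" line above: ok=157, err=55, miss=49
# P = 157/212 = 74.06 %, R = 157/206 = 76.21 %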

Testing the whole pipeline

Here we continue to use the same test set: gt120x_tst

Here we will use the option --run of the tools to produce the output XML. This is particularly important for the segmentation task, because in --run mode we destroy the old TextRegions and create new, predicted ones.

Tag then Segment

First, let's tag the TextLines of the XML data and copy the result into another folder, where we will segment it.

python $GIT/usecases/StAB/DU_Tagger.py models 452398 --run gt120x_tst --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml
mkdir tag_out_pxml tag_out_pxml/col
cp gt120x_tst/col/*_du.mpxml tag_out_pxml/col
# rename them as if they were input files
rename _du.mpxml .pxml tag_out_pxml/col/*

Secondly, we segment it:

python $GIT/usecases/StAB/DU_Seg.py models 452773 --run tag_out_pxml --g1o --ecn --ext=.pxml --ecn_config=ecn_8Lay4Conv128.json

Finally, let's measure the quality at TextLine level (which should be similar to what we got before):

python $GIT/TranskribusDU/tasks/eval_classif.py tag_out_pxml/col .mpxml //pc:TextLine ./@type_gt ./@type
--- Fri Oct 29 13:44:06 2021--------------------------------- 
TEST REPORT FOR: Classification

  Line=True class, column=Prediction
       None  [[  11    0    0    2   16]
    heading   [   0   69    0    4   26]
 marginalia   [   0    2    0    0    1]
page-number   [   3    0    0   62    2]
  paragraph   [  22   17    0    4 1602]]

(unweighted) Accuracy score = 94.63 %    trace=1744  sum=1843

             precision    recall  f1-score   support

       None      0.306     0.379     0.338        29
    heading      0.784     0.697     0.738        99
 marginalia      0.000     0.000     0.000         3
page-number      0.861     0.925     0.892        67
  paragraph      0.973     0.974     0.973      1645

avg / total      0.946     0.946     0.946      1843
--------------------------------------------------------------

We see the presence of "None" labels. This is due to some missing annotations.

Segment then Tag

OK, let's segment and put the result in a folder as if it were input data:

python $GIT/usecases/StAB/DU_Seg.py models 452773 --run gt120x_tst --g1o --ecn --ext=.pxml --ecn_config=ecn_8Lay4Conv128.json
mkdir seg_out_pxml seg_out_pxml/col
cp gt120x_tst/col/*_du.mpxml seg_out_pxml/col
rename _du.mpxml .pxml seg_out_pxml/col/*

Now, tag **at TextRegion** level.

NOTE: since the annotation is in the TextRegion attributes, and since we destroy those TextRegions to create new ones (not necessarily matching the original ones :-( ), we copy those attributes onto the inner TextLines before destroying the TextRegions.

So the annotation is now on the TextLines.
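
The attribute copy described in the NOTE amounts to something like the following sketch (our own function name, assuming the 2013-07-15 PAGE namespace); it only illustrates the idea behind the @type_gt attribute used in the evaluation below:

from lxml import etree

NS = {"pg": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def push_region_type_to_lines(page_xml_in, page_xml_out):
    doc = etree.parse(page_xml_in)
    for region in doc.xpath("//pg:TextRegion[@type]", namespaces=NS):
        for line in region.xpath("./pg:TextLine", namespaces=NS):
            # keep the GT label on the line; the evaluation reads it back as @type_gt
            line.set("type_gt", region.get("type"))
    doc.write(page_xml_out, xml_declaration=True, encoding="UTF-8")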

python $GIT/usecases/StAB/DU_Tagger.py models 411597 --run seg_out_pxml --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml --TextRegion --no_text

Evaluation:

python $GIT/TranskribusDU/tasks/eval_classif.py seg_out_pxml/col .mpxml //pc:TextLine/@type_gt //pc:TextLine/@type
--- Fri Oct 29 09:36:16 2021--------------------------------- 
TEST REPORT FOR: Classification

  Line=True class, column=Prediction
       None  [[  22    1    0    4    2]
    heading   [   1   56    0    4   38]
 marginalia   [   3    0    0    0    0]
page-number   [   5    4    0   57    1]
  paragraph   [  83  122    0   11 1429]]

(unweighted) Accuracy score = 84.86 %    trace=1564  sum=1843

             precision    recall  f1-score   support

       None      0.193     0.759     0.308        29
    heading      0.306     0.566     0.397        99
 marginalia      0.000     0.000     0.000         3
page-number      0.750     0.851     0.797        67
  paragraph      0.972     0.869     0.917      1645

avg / total      0.914     0.849     0.874      1843

It is better to tag at TextLine level!

On this use case, given that headings seem critical for exploiting the data, tagging at TextLine level catches 69 headings out of 99 (4 are predicted as page-number and 26 as paragraph), whereas the segment-then-tag pipeline only catches 56 of them.

Train production models

We will use almost all available data to train.

For instance, we will train using both the train set and a large part of the test set, keeping the validation set for validation. We keep only a few test files to make sure we do not produce something badly broken by (big) mistake.

Creating a minimal test set:

cp -r gt120x_tst gt120x_prod_tst_trn
mkdir -p  gt120x_prod_tst_tst/col
pushd gt120x_prod_tst_trn/col
rm *.mpxml # old output files, if any
#move a few files to the test folder, for sanity check
mv gt.120777.442595.IMG_1092.pxml gt.120777.444284.IMG_9205.pxml ../../gt120x_prod_tst_tst/col
mv gt.120778.439940.IMG_20200724_101311.pxml  gt.120778.443189.IMG_4149.pxml ../../gt120x_prod_tst_tst/col
popd
# count
ls gt120x_prod_tst_trn/col|wc -l 
ls gt120x_prod_tst_tst/col|wc -l

# run the current model on this new test collection to see its performance
python $GIT/usecases/StAB/DU_Tagger.py models 452398 --tst gt120x_prod_tst_tst --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml
--- Thu Dec 16 10:03:47 2021--------------------------------- 
TEST REPORT FOR: 452398

  Line=True class, column=Prediction
      tag_OTHER  [[ 46   0   0   0]
    tag_heading   [  0  11   3   1]
  tag_paragraph   [  1   1 115   0]
tag_page-number   [  0   0   1   4]]

(unweighted) Accuracy score = 96.17 %    trace=176  sum=183

                 precision    recall  f1-score   support

      tag_OTHER      0.979     1.000     0.989        46
    tag_heading      0.917     0.733     0.815        15
  tag_paragraph      0.966     0.983     0.975       117
tag_page-number      0.800     0.800     0.800         5

    avg / total      0.961     0.962     0.960       183

python $GIT/usecases/StAB/DU_Seg.py models 708931 --run gt120x_prod_tst_tst --g1o --ecn --ext=.pxml --ecn_config=ecn_8Lay4Conv128.json --eval_region
(unweighted) Accuracy score = 81.04 %    trace=218  sum=269
...
ALL_region  @simil 0.66   P 72.22  R 20.63  F1 32.10         ok=13  err=5  miss=50
ALL_region  @simil 0.80   P 66.67  R 19.05  F1 29.63         ok=12  err=6  miss=51
ALL_region  @simil 1.00   P 61.11  R 17.46  F1 27.16         ok=11  err=7  miss=52
#NOTE 1: we were not lucky in our file selection, because 2 out of 4 files fail completely...
#NOTE 2: due to a disk crash, we lost the model 452773 (mentioned above) and used 708931 instead.
#NOTE 3: during training, this model reported: (unweighted) BEST PERFORMANCE (acc= 95.91 %) on valid set at Epoch 240


Now we train a few models, keeping only the best one (selected by looking at the loss on the validation set and/or the performance on the mini test set).

python $GIT/usecases/StAB/DU_Tagger.py models prod_tag_model --trn gt120x_trn --trn gt120x_prod_tst_trn --vld gt120x_vld --tst gt120x_prod_tst_tst --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --ext=.pxml --no_text

python $GIT/usecases/StAB/DU_Seg.py models prod_seg_model --trn gt120x_trn --trn gt120x_prod_tst_trn --vld gt120x_vld --tst gt120x_prod_tst_tst --run gt120x_prod_tst_tst --eval_region --g1o --ecn --ext=.pxml --ecn_config=ecn_8Lay4Conv128.json

A trick to select the model, assuming you stored the training logs in some folder:

#for the tagging model
egrep -e '(BEST PERFORMANCE|Accuracy score|tag_heading)' tag_log_files/*

#for the segmentation model
egrep -e '(BEST PERFORMANCE|Accuracy score|ALL_region)' seg_log_files/*

Strangely, there is no way to beat the best tagging model so far: 452398. It is best on OTHER and on par for the other tags. Let's keep it.

For the segmentation we got a slightly better model: 1274574

BEST PERFORMANCE (acc= 96.24 %) on valid set at Epoch 170

log/1274574:(unweighted) Accuracy score = 81.04 %    trace=218  sum=269
log/1274574:ALL_region  @simil 0.66   P 62.07  R 28.57  F1 39.13         ok=18  err=11  miss=45
log/1274574:ALL_region  @simil 0.80   P 58.62  R 26.98  F1 36.96         ok=17  err=12  miss=46
log/1274574:ALL_region  @simil 1.00   P 55.17  R 25.40  F1 34.78         ok=16  err=13  miss=47

We will use model 452398 for tagging and 1274574 for segmenting.

NEWS: segmentation is now done after tagging, so the segmentation step can use the labels of the TextLines and textual features are no longer needed. Final models: 720137 for tagging, 1281545 for segmentation.

Production

How to apply the pipeline on your full collection

Download the collections

 python /tmp-network/user/hdejean/git/TranskribusPyClient/src/TranskribusCommands/Transkribus_downloader.py 129687 --noImage $CREDENTIALS
 python /tmp-network/user/hdejean/git/TranskribusPyClient/src/TranskribusCommands/Transkribus_downloader.py 129686 --noImage $CREDENTIALS

Process the collections

 python $GIT/usecases/StAB/DU_Tagger.py models 452398 --g1o --ecn --ecn_config=ecn_8Lay4Conv128.json --run PROD/trnskrbs_129686 --run PROD/trnskrbs_129687
 mkdir -p {trnskrbs_129686,trnskrbs_129687}_out/col
 # for upload: need to copy <COL>/col/<DOCID>/.*$.trp.json
 # otherwise Warning: cannot set Parent-ID because file not found: /trnskrbs_129686_out/col/865313/trp.json
 cp  trnskrbs_129687/col/*_du.mpxml  trnskrbs_129687_out/col/
 cp  trnskrbs_129686/col/*_du.mpxml  trnskrbs_129686_out/col/
 rename _du.mpxml .mpxml trnskrbs_129686_out/col/*.mpxml
 rename _du.mpxml .mpxml trnskrbs_129687_out/col/*.mpxml
 python $GIT/usecases/StAB/DU_Seg.py models 708931 --g1o --ecn --run PROD/trnskrbs_129686_out --run PROD/trnskrbs_129687_out

Upload the collections

 cd PROD 
 python /tmp-network/user/hdejean/git/TranskribusPyClient/src/TranskribusCommands/TranskribusDU_transcriptUploader.py trnskrbs_129686_out/ 129686 $CREDENTIALS
 python /tmp-network/user/hdejean/git/TranskribusPyClient/src/TranskribusCommands/TranskribusDU_transcriptUploader.py trnskrbs_129687_out/ 129687 $CREDENTIALS

Conclusion

We have shown:

1. how to tag TextLines or TextRegions

2. how to segment them, i.e. form TextRegions

3. how to pipe the two processing steps

On this use case, it turns out that tagging TextLines and then forming new TextRegions works better.

The alternative was to create TextRegions and then tag those TextRegions.
