Option for squishing abbreviations into single tokens?
import nltk

# Merge the pieces of common abbreviations into single tokens, e.g. 'P', 'E' -> 'P.E'.
mwes = [
    ('P', 'E'),  # Politiets Efterretninger ("Police Gazette")
    ('d', 'M'),  # denne Maaned ("this month")
    ('s', 'M'),  # samme Maaned ("same month")
    ('f', 'M'),  # forrige Maaned ("last month")
]
mwe = nltk.tokenize.MWETokenizer(mwes, separator='.')
# `words` is the token list coming out of the OCR/tokenisation step.
words = mwe.tokenize(words)
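Quick check of the merging, assuming the abbreviation letters arrive as separate tokens with the periods already stripped; if the periods survive tokenisation, the MWEs would instead need to include them (e.g. ('d', '.', 'M', '.') with separator=''):

print(mwe.tokenize(['den', 'd', 'M', 'indsendte', 'Sag']))
# -> ['den', 'd.M', 'indsendte', 'Sag']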
Possibly use special tokens whose gold form is the expanded abbreviation.
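A rough sketch of that idea: map the merged tokens to their expansions when building the gold/target side. The keys assume the separator='.' output from above; the names are illustrative only.

expansions = {
    'P.E': 'Politiets Efterretninger',
    'd.M': 'denne Maaned',
    's.M': 'samme Maaned',
    'f.M': 'forrige Maaned',
}
gold_tokens = [expansions.get(tok, tok) for tok in words]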
- improve page segmentation/layout analysis (OpenCV sketch after this list)
- https://www.slideshare.net/MarkHollow/pycon-apac-2017-page-layout-analysis-of-19th-century-siamese-newspapers-using-python-and-opencv
- https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html
- https://github.com/glazzara/olena
- https://github.com/phatn/lapdftext
- --build_pdf --images 1,2,3 --pdfname etc (rough argparse sketch after this list)
- check the Tesseract version before applying the locale trick (sketch after this list)
- consider word frequency and weight lookups accordingly? (scoring sketch after this list)
- https://info.clarin.dk/
- http://sprogtek2018.dk/
- https://alf.hum.ku.dk/korp/?mode=da1800
- https://clarin.dk/clarindk/find.jsp
- Look into integrating with the Python client
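For the layout-analysis item, a minimal sketch along the lines of the danvk.org post: binarise the page, dilate so characters bleed together, and take bounding boxes of the remaining contours as candidate text blocks. Kernel size, iteration count and area threshold are guesses that would need tuning on the actual scans.

import cv2

def find_text_blocks(image_path, min_area=5000):
    """Return (x, y, w, h) boxes for likely text blocks on a page scan."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Otsu threshold, inverted so ink becomes white foreground for the morphology step.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # A wide, short kernel makes letters on a line (and neighbouring lines) merge.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 5))
    dilated = cv2.dilate(binary, kernel, iterations=3)
    # OpenCV 4.x return signature; 3.x returns (image, contours, hierarchy).
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    # Drop specks; what remains should roughly correspond to columns/paragraphs.
    return [box for box in boxes if box[2] * box[3] >= min_area]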
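For the --build_pdf item, a rough argparse sketch using only the flags named above; types and defaults are assumptions, and "etc" covers whatever other options turn out to be needed.

import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--build_pdf', action='store_true',
                        help='assemble the selected page images into a PDF')
    parser.add_argument('--images', type=lambda s: [int(i) for i in s.split(',')],
                        default=[], help='comma-separated page numbers, e.g. 1,2,3')
    parser.add_argument('--pdfname', default='out.pdf',
                        help='name of the PDF to write')
    return parser.parse_args(argv)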
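For the Tesseract item, a sketch of the version check, assuming the locale trick (forcing LC_ALL=C) is only needed for the 4.0.x releases that abort under a non-C locale:

import os
import re
import subprocess

def tesseract_version():
    """Parse (major, minor, patch) from `tesseract --version` (stdout or stderr)."""
    try:
        out = subprocess.run(['tesseract', '--version'],
                             capture_output=True, text=True)
    except FileNotFoundError:
        return None
    m = re.search(r'tesseract\s+v?(\d+)\.(\d+)\.(\d+)', out.stdout + out.stderr)
    return tuple(int(g) for g in m.groups()) if m else None

def env_for_tesseract():
    """Copy of the environment, with LC_ALL=C only when a 4.0.x build needs it."""
    env = dict(os.environ)
    version = tesseract_version()
    if version is not None and (4, 0, 0) <= version < (4, 1, 0):
        env['LC_ALL'] = 'C'
    return env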
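For the word-frequency item, one way to weight lookups: combine the existing edit distance with a smoothed log frequency from a background corpus (e.g. the da1800 material above), so common words win between otherwise equally close candidates. Names and the scoring formula are illustrative only.

from collections import Counter
import math

def best_correction(candidates, corpus_freqs: Counter, alpha=0.5):
    """candidates: (word, edit_distance) pairs from the existing lookup."""
    total = sum(corpus_freqs.values())
    vocab = len(corpus_freqs) or 1

    def score(item):
        word, dist = item
        # Add-one smoothed log-probability of the word in the background corpus.
        log_prob = math.log((corpus_freqs[word] + 1) / (total + vocab))
        return -dist + alpha * log_prob

    best = max(candidates, key=score, default=None)
    return best[0] if best else None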