Option for squishing abbreviations into single tokens?
import nltk

# Merge the pieces of common abbreviations into single tokens, e.g. 'P', 'E' -> 'P.E'.
mwes = [
    ('P', 'E'),  # Politiets Efterretninger ("Police Gazette")
    ('d', 'M'),  # denne Maaned ("this month")
    ('s', 'M'),  # samme Maaned ("same month")
    ('f', 'M'),  # forrige Maaned ("last month")
]
mwe = nltk.tokenize.MWETokenizer(mwes, separator='.')
# `words` is the token list coming out of the OCR/tokenisation step.
words = mwe.tokenize(words)
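Quick check of the merging, assuming the abbreviation letters arrive as separate tokens with the periods already stripped; if the periods survive tokenisation, the MWEs would instead need to include them (e.g. ('d', '.', 'M', '.') with separator=''):

print(mwe.tokenize(['den', 'd', 'M', 'indsendte', 'Sag']))
# -> ['den', 'd.M', 'indsendte', 'Sag']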
Possibly use special tokens whose gold form is the expanded abbreviation.
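A rough sketch of that idea: map the merged tokens to their expansions when building the gold/target side. The keys assume the separator='.' output from above; the names are illustrative only.

expansions = {
    'P.E': 'Politiets Efterretninger',
    'd.M': 'denne Maaned',
    's.M': 'samme Maaned',
    'f.M': 'forrige Maaned',
}
gold_tokens = [expansions.get(tok, tok) for tok in words]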
- improve page segmentation/layout analysis (OpenCV sketch after this list)
- https://www.slideshare.net/MarkHollow/pycon-apac-2017-page-layout-analysis-of-19th-century-siamese-newspapers-using-python-and-opencv
- https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html
- https://github.com/glazzara/olena
- https://github.com/phatn/lapdftext
- --build_pdf --images 1,2,3 --pdfname etc (rough argparse sketch after this list)
- check the Tesseract version before applying the locale trick (sketch after this list)
- consider word frequency and weight lookups accordingly? (scoring sketch after this list)
- https://info.clarin.dk/
- http://sprogtek2018.dk/
- https://alf.hum.ku.dk/korp/?mode=da1800
- https://clarin.dk/clarindk/find.jsp
- Look into integrating with the Python client
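For the layout-analysis item, a minimal sketch along the lines of the danvk.org post: binarise the page, dilate so characters bleed together, and take bounding boxes of the remaining contours as candidate text blocks. Kernel size, iteration count and area threshold are guesses that would need tuning on the actual scans.

import cv2

def find_text_blocks(image_path, min_area=5000):
    """Return (x, y, w, h) boxes for likely text blocks on a page scan."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Otsu threshold, inverted so ink becomes white foreground for the morphology step.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # A wide, short kernel makes letters on a line (and neighbouring lines) merge.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 5))
    dilated = cv2.dilate(binary, kernel, iterations=3)
    # OpenCV 4.x return signature; 3.x returns (image, contours, hierarchy).
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    # Drop specks; what remains should roughly correspond to columns/paragraphs.
    return [box for box in boxes if box[2] * box[3] >= min_area]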
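For the --build_pdf item, a rough argparse sketch using only the flags named above; types and defaults are assumptions, and "etc" covers whatever other options turn out to be needed.

import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--build_pdf', action='store_true',
                        help='assemble the selected page images into a PDF')
    parser.add_argument('--images', type=lambda s: [int(i) for i in s.split(',')],
                        default=[], help='comma-separated page numbers, e.g. 1,2,3')
    parser.add_argument('--pdfname', default='out.pdf',
                        help='name of the PDF to write')
    return parser.parse_args(argv)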
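For the Tesseract item, a sketch of the version check, assuming the locale trick (forcing LC_ALL=C) is only needed for the 4.0.x releases that abort under a non-C locale:

import os
import re
import subprocess

def tesseract_version():
    """Parse (major, minor, patch) from `tesseract --version` (stdout or stderr)."""
    try:
        out = subprocess.run(['tesseract', '--version'],
                             capture_output=True, text=True)
    except FileNotFoundError:
        return None
    m = re.search(r'tesseract\s+v?(\d+)\.(\d+)\.(\d+)', out.stdout + out.stderr)
    return tuple(int(g) for g in m.groups()) if m else None

def env_for_tesseract():
    """Copy of the environment, with LC_ALL=C only when a 4.0.x build needs it."""
    env = dict(os.environ)
    version = tesseract_version()
    if version is not None and (4, 0, 0) <= version < (4, 1, 0):
        env['LC_ALL'] = 'C'
    return env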
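For the word-frequency item, one way to weight lookups: combine the existing edit distance with a smoothed log frequency from a background corpus (e.g. the da1800 material above), so common words win between otherwise equally close candidates. Names and the scoring formula are illustrative only.

from collections import Counter
import math

def best_correction(candidates, corpus_freqs: Counter, alpha=0.5):
    """candidates: (word, edit_distance) pairs from the existing lookup."""
    total = sum(corpus_freqs.values())
    vocab = len(corpus_freqs) or 1

    def score(item):
        word, dist = item
        # Add-one smoothed log-probability of the word in the background corpus.
        log_prob = math.log((corpus_freqs[word] + 1) / (total + vocab))
        return -dist + alpha * log_prob

    best = max(candidates, key=score, default=None)
    return best[0] if best else None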