
FUTURE.md


Ideas for future features

tokenizer

Option for merging multi-part abbreviations into single tokens?

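import nltk

# Abbreviations to merge, each given as a tuple of its component tokens
mwes = [
	('P', 'E'), # Politiets Efterretninger
	('d', 'M'), # denne Maaned (this month)
	('s', 'M'), # samme Maaned (same month)
	('f', 'M'), # forrige Maaned (previous month)
]
# Matching token sequences are joined with '.', e.g. ['P', 'E'] -> 'P.E'
mwe = nltk.tokenize.MWETokenizer(mwes, separator='.')
words = mwe.tokenize(words) # words: an already tokenized list of strings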

Possibly use special Tokens whose gold value is the expanded abbreviation.
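A minimal sketch of what such a special token could look like, written in plain Python so it does not depend on nltk. `AbbreviationToken`, `EXPANSIONS`, and `merge_abbreviations` are hypothetical names used only for illustration and are not part of the existing code. The expansions are taken from the comments in the list above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class AbbreviationToken:
    """Hypothetical token type: `original` is the merged text, `gold` the expansion."""
    original: str
    gold: str


# Expansions copied from the comments in the abbreviation list (assumed mapping).
EXPANSIONS: Dict[Tuple[str, ...], str] = {
    ('P', 'E'): 'Politiets Efterretninger',
    ('d', 'M'): 'denne Maaned',
    ('s', 'M'): 'samme Maaned',
    ('f', 'M'): 'forrige Maaned',
}


def merge_abbreviations(
    words: List[str], separator: str = '.'
) -> List[Union[str, AbbreviationToken]]:
    """Replace known two-part abbreviations with AbbreviationToken objects."""
    result: List[Union[str, AbbreviationToken]] = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in EXPANSIONS:
            # Merge the two parts and attach the expansion as the gold value.
            result.append(AbbreviationToken(separator.join(pair), EXPANSIONS[pair]))
            i += 2
        else:
            # Ordinary word: pass through unchanged.
            result.append(words[i])
            i += 1
    return result
```

Whether the gold value should be the full expansion or the abbreviation itself is still an open design question. This sketch only shows that the expansion can travel alongside the merged token.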

hocr

dictionary

  • consider word frequency and weight lookups accordingly?
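One way the frequency idea could be sketched, assuming the dictionary is built from a word list in which repeated words indicate higher frequency. `FrequencyDictionary`, `weight`, and `rank` are hypothetical names for illustration, not existing project API.

```python
from collections import Counter
from typing import Iterable, List


class FrequencyDictionary:
    """Hypothetical dictionary that weights lookups by relative word frequency."""

    def __init__(self, words: Iterable[str]):
        self.counts = Counter(words)
        self.total = sum(self.counts.values())

    def weight(self, word: str) -> float:
        """Relative frequency in [0, 1]; 0.0 for unknown words or an empty dictionary."""
        if self.total == 0:
            return 0.0
        return self.counts[word] / self.total

    def rank(self, candidates: Iterable[str]) -> List[str]:
        """Order candidate words from most to least frequent."""
        return sorted(candidates, key=self.weight, reverse=True)
```

A plain membership test treats every known word as equally likely. A weight like this would let a correction step prefer the more common of two candidate words.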

misc

Transkribus