In this repository we publish adapted dictionaries and artifacts resulting from our use of data protected by licenses that require derived works to be published. The license type is specified in the subdirectory of each data source.
Karakun AG
Elisabethenanlage 25
4051 BASEL, Switzerland
email: hibu_at_karakun.com
The following kinds of data are published for each data source (for the 4 languages currently considered):
- .input File
  - Adapted text input file generated from the original data source.
  - Each line of the input file has the format citation-form ; inflected-form ; POS
  - The ixa-pipe-pos multilingual part-of-speech tagger and lemmatizer uses this file as input to create its lemmatizer dictionary, binarized as a Finite State Automaton (FSA) in a corresponding .dict file.
- .dict File
  - The published FSA binary file containing the data compiled from the .input file.
  - The file is used by the lemmatizer.
  - ixa-pipe-pos reads the automata using the morfologik-stemming project.
- .info File
  - A property file listing some dictionary metadata.
  - This file must be present both when the FSA is generated and when the FSA is used as a lemmatizer.
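As an illustration of the .input line format described above, the following is a minimal Python sketch that splits one line into its three fields. It is not part of ixa-pipe-pos or morfologik-stemming; the names `Entry` and `parse_input_line` are hypothetical, the sample lines are invented, and the handling of whitespace around the `;` separator is an assumption that should be checked against the published files.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One dictionary entry (hypothetical helper type)."""
    citation_form: str
    inflected_form: str
    pos: str


def parse_input_line(line: str) -> Optional[Entry]:
    """Parse a 'citation-form ; inflected-form ; POS' line.

    Returns None for blank lines. Assumes ';' separates the fields and
    that surrounding whitespace is not significant.
    """
    stripped = line.strip()
    if not stripped:
        return None
    fields = [field.strip() for field in stripped.split(";")]
    if len(fields) != 3 or not all(fields):
        raise ValueError(f"expected 3 non-empty fields, got: {line!r}")
    return Entry(*fields)


# Invented sample lines, not taken from the published dictionaries.
sample = ["go ; went ; VBD", "house;houses;NNS", ""]
entries = [e for e in map(parse_input_line, sample) if e is not None]
```

A parser like this can serve as a quick sanity check on an .input file before it is compiled into a .dict file, since malformed lines raise an error instead of being silently skipped.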
The data are described on the Wiktionary page of each specific language.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.