
karakun/Public-Dictionaries


Public Dictionaries

In this repository we publish adapted dictionaries and artifacts resulting from our use of data protected by licenses that require derived works to be published. The license type is specified in the subdirectory of each data source.

Contact

Karakun AG
Elisabethenanlage 25
4051 BASEL, Switzerland

email: hibu_at_karakun.com

Published Data

The following kinds of files are published for every data source (for the 4 languages currently considered):

  • .input File
    • Adapted text input file generated from the original data source.
    • Each line of the input file has the format citation-form ; inflected-form ; POS
    • The ixa-pipe-pos multilingual part-of-speech tagger and lemmatizer takes this file as input to build its lemmatizer dictionary, which is binarized as a Finite State Automaton (FSA) in the corresponding .dict file.
  • .dict File
    • This is the published FSA binary file containing the data compiled from the .input file.
    • The file is used by the lemmatizer.
    • ixa-pipe-pos reads the automata using the morfologik-stemming project.
  • .info File
    • This is a properties file listing some dictionary metadata.
    • This file must be present both when the FSA is generated and when the FSA is used by the lemmatizer.
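
As an illustration of the .input line format described above, the following is a minimal Python sketch of a reader for such lines. It is not part of this repository or of ixa-pipe-pos. The names `Entry` and `parse_input_lines` are hypothetical, the sample lines and their POS tags are invented for the example, and the sketch assumes that fields are separated by a semicolon with optional surrounding whitespace and that no field contains a semicolon itself.

```python
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    """One dictionary line: citation-form ; inflected-form ; POS."""
    citation_form: str
    inflected_form: str
    pos: str


def parse_input_lines(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per non-empty line (hypothetical helper).

    Assumption: fields are separated by ';' and never contain ';' themselves.
    """
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        yield Entry(*fields)


# Invented sample data, only to show the shape of the format.
sample = ["go ; went ; VBD", "house ; houses ; NNS"]
entries = list(parse_input_lines(sample))
```

To read a real file, an open file handle can be passed in place of `sample`, for example `parse_input_lines(open(path, encoding="utf-8"))`; the UTF-8 encoding is likewise an assumption and should be checked against the accompanying .info file.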

Current Data Sources

Wiktionary (CC-BY-SA)

The data are described on the Wiktionary page of each specific language.


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

