
Contains many hangeul terms in notcore_lex.csv #36

Open
hanya opened this issue Sep 30, 2021 · 1 comment

@hanya

hanya commented Sep 30, 2021

Some hangeul terms can be found in the notcore_lex.csv file, for example:

전범국,4785,4785,22000,전범국,名詞,固有名詞,一般,*,*,*,センパンコク,戦犯国,*,A,*,*,*,*
전지충이,4785,4785,22000,전지충이,名詞,固有名詞,一般,*,*,*,チョンジチュンイ,デンヂムシ,*,A,*,*,*,*
전툴라,4785,4785,22000,전툴라,名詞,固有名詞,一般,*,*,*,チョントゥラ,チョントゥラ,*,A,*,*,*,*

Are they included intentionally?

@sakamoto-mi
Collaborator

Thank you for your inquiry.

Three types of words are registered in the Sudachi dictionary:
・words from UniDic
・words from NEologd
・words we collected
The Hangeul terms come from NEologd.
So far, we have not closely reviewed the words taken from UniDic and NEologd.
Looking at the registered Hangeul terms, most of them are Pokemon names.
Since Korean terms are written in katakana in Japanese text, we are considering removing these entries.
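
For anyone who wants to check how many entries are affected, here is a minimal sketch of one way to flag them. It assumes the first CSV column holds the surface form, as in the rows quoted above; the function name `split_hangul_rows` and the second sample row (東京) are made up for illustration, and this is not the procedure the maintainers use.

```python
import csv
import io
import re

# Hangul Syllables (U+AC00-U+D7A3), Hangul Jamo (U+1100-U+11FF),
# and Hangul Compatibility Jamo (U+3130-U+318F).
HANGUL = re.compile(r"[\uac00-\ud7a3\u1100-\u11ff\u3130-\u318f]")


def split_hangul_rows(lines):
    """Split CSV rows into (kept, flagged) by whether column 0 contains Hangul.

    Assumption: column 0 is the surface form of the entry.
    """
    kept, flagged = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if HANGUL.search(row[0]):
            flagged.append(row)
        else:
            kept.append(row)
    return kept, flagged


# First row is quoted from this issue; the second row is a hypothetical
# non-Hangul entry added only to show that it is kept.
sample = io.StringIO(
    "전범국,4785,4785,22000,전범국,名詞,固有名詞,一般,*,*,*,センパンコク,戦犯国,*,A,*,*,*,*\n"
    "東京,4785,4785,22000,東京,名詞,固有名詞,一般,*,*,*,トウキョウ,東京,*,A,*,*,*,*\n"
)
kept, flagged = split_hangul_rows(sample)
print([row[0] for row in flagged])
```

To run it against the real file, pass an open file handle instead of `sample`, for example `open("notcore_lex.csv", encoding="utf-8", newline="")`, assuming the file is UTF-8 encoded.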
