Add information on the writing system or script used to the header? #1

Open
christofs opened this issue Jan 8, 2020 · 1 comment

@christofs

At the moment, we have collections with texts in the Latin alphabet and others in Cyrillic script, and possibly some mixed collections or even mixed texts. In some cases the script cannot be deduced from the language of the text alone, yet it probably has a significant impact on analyses that use words as tokens. So it might be worthwhile to record this somewhere.
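
To illustrate why the script matters for token-based analyses, here is a minimal sketch that counts the letters of a text per script. It relies only on Python's standard `unicodedata` module and on the convention that Unicode character names begin with the script name; the function names `script_counts` and `dominant_script` are made up for this example and are not part of any project tooling.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters per script.

    The script is approximated by the first word of the Unicode
    character name, e.g. "LATIN SMALL LETTER A" -> "LATIN".
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def dominant_script(text: str):
    """Return the most frequent script label, or None if there are no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None


# The same Serbian sentence in both scripts yields disjoint token sets.
print(dominant_script("Београд је главни град"))
print(dominant_script("Beograd je glavni grad"))
```

A check like this could help spot mixed collections, but it is only a heuristic and does not replace recording the script explicitly in the header.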

TEI recommends using "@xml:lang" to record the language and, optionally, the writing system of a text. For the values, it recommends the codes from the IANA language subtag registry, see here: http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry and, more accessibly, here: https://en.wikipedia.org/wiki/IETF_language_tag#Syntax_of_language_tags

Bottom line: it is probably sufficient to update the language codes used as values of "@xml:lang" so that they also state the script used, e.g. "de-Latn" (German in Latin script) or "bg-Cyrl" (Bulgarian in Cyrillic script).
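
As a rough sketch of what such an update could look like, the following adds a script subtag to existing "@xml:lang" values using Python's standard `xml.etree.ElementTree`. The `DEFAULT_SCRIPT` mapping and the function names are assumptions made for illustration; the real mapping would have to be decided per collection. The sketch also assumes simple tags such as "de" or "de-AT" and does not handle extended language subtags.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical mapping from language to script; to be set per collection.
DEFAULT_SCRIPT = {"de": "Latn", "bg": "Cyrl"}


def with_script(tag: str, script: str) -> str:
    """Insert a script subtag after the primary language subtag if none is present."""
    parts = tag.split("-")
    # A script subtag is four letters and directly follows the language subtag.
    if len(parts) > 1 and len(parts[1]) == 4 and parts[1].isalpha():
        return tag
    return "-".join([parts[0], script] + parts[1:])


def add_script_subtags(root: ET.Element) -> None:
    """Update every xml:lang value whose language has a known default script."""
    for el in root.iter():
        lang = el.get(XML_LANG)
        if lang:
            primary = lang.split("-")[0]
            if primary in DEFAULT_SCRIPT:
                el.set(XML_LANG, with_script(lang, DEFAULT_SCRIPT[primary]))


root = ET.fromstring('<TEI xml:lang="bg"><text xml:lang="sr-Cyrl"/></TEI>')
add_script_subtags(root)
print(root.get(XML_LANG))
```

Values that already carry a script subtag, such as "sr-Cyrl", are left unchanged.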

Note that the Serbian texts already do this, with an "@xml:lang" value of "sr-Cyrl", but it might be good to generalize the practice, at least for those languages where more than one script is in use.

@CarolinOdebrecht

That is a good idea. Thanks!
If the schema already allows the modification of "@xml:lang", the only thing we need to do is to update the encoding documentation.
As for the metadata update in the document headers: in the next WG meeting, I could add this to the open tasks for the members responsible for the language collections.
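
A quick way to sanity-check candidate values is sketched below. It assumes that the schema types "@xml:lang" as `xsd:language`, whose lexical pattern in XML Schema is `[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*`; whether the project schema restricts the attribute further would need to be verified against the schema itself. The function name `is_xsd_language` is made up for this example.

```python
import re

# Lexical pattern of xsd:language (an assumption about how the schema
# constrains xml:lang; the project schema may be stricter).
XSD_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def is_xsd_language(value: str) -> bool:
    """Return True if the whole value matches the xsd:language pattern."""
    return XSD_LANGUAGE.fullmatch(value) is not None


for candidate in ["de-Latn", "bg-Cyrl", "sr-Cyrl", "de_Latn"]:
    print(candidate, is_xsd_language(candidate))
```

This only checks the syntactic shape of a value, not whether the subtags are actually registered with IANA.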
