Consistency in orthography listing #32

NeilSureshPatel · 2022-10-30T16:39:17Z

As we prepare to implement shaperglot for testing African language support, I am noticing variation in the way language orthographies are incorporated in gflang. Here are a few examples:

bas_Latn

exemplar_chars {
  base: "a á à â ǎ ā {a᷆}{a᷇} b ɓ c d e é è ê ě ē {e᷆}{e᷇} ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆}{ɛ᷇} f g h i í ì î ǐ ī {i᷆}{i᷇} j k l m n ń ǹ ŋ o ó ò ô ǒ ō {o᷆}{o᷇} ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆}{ɔ᷇} p r s t u ú ù û ǔ ū {u᷆}{u᷇} v w y z {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} 
{ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇}"
  auxiliary: "q x"
  marks: "◌̀ ◌́ ◌̂ ◌̄ ◌̌ ◌᷆ ◌᷇"
  numerals: "  - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
  index: "A B Ɓ C D E Ɛ F G H I J K L M N Ŋ O Ɔ P R S T U V W Y Z"
}

bin_Latn

exemplar_chars {
  base: "A B D E F G H I K L M N O P R S T U V W Y Z Á É È Ẹ Í Ó Ò Ọ Ú a b d e f g h i k l m n o p r s t u v w y z á é è ẹ í ó ò ọ ú \'"
  marks: "◌̀ ◌́ ◌̣"
}

af_Latn

exemplar_chars {
  base: "a á â b c d e é è ê ë f g h i î ï j k l m n o ô ö p q r s t u û v w x y z"
  auxiliary: "à å ä ã æ ç í ì ó ò ú ù ü ý"
  marks: "◌̀ ◌̂ ◌̈"
  numerals: "  - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
  punctuation: "- ‐ ‑ – — , ; : ! ? . … \' ‘ ’ \" “ ” ( ) [ ] § @ * / & # † ‡ ′ ″"
  index: "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z"
}

The first inconsistency is that not all language profiles contain auxiliary bases when they should. When auxiliary bases include a mark, the marks list doesn't always include that mark.

The second big inconsistency is the inclusion of non-precomposed base/mark pairs in the base list. Sometimes these pairs are in the base list and sometimes they are not.
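To make "non-precomposed" concrete, here is a minimal Python sketch that tests whether a base/mark cluster has a single precomposed codepoint. The helper name `has_precomposed_form` is hypothetical and not part of gflang or shaperglot. It relies on NFC composition, so it will report `False` for the few sequences Unicode excludes from composition.

```python
import unicodedata


def has_precomposed_form(cluster: str) -> bool:
    """Return True if NFC composes the cluster into a single codepoint."""
    return len(unicodedata.normalize("NFC", cluster)) == 1


# "a" + combining acute composes to U+00E1, so it needs no braces.
print(has_precomposed_form("a\u0301"))
# "ɛ" + combining acute has no precomposed codepoint, so it stays a cluster.
print(has_precomposed_form("\u025b\u0301"))
```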

In order for shaperglot to properly parse gflang to run its orthography tests, we need some consistency in how the exemplar character lists are constructed. For the purposes of shaperglot, it would be good for gflang to contain all necessary base/mark pairs, regardless of whether they can be precomposed. The variation appears to be caused by the incoming source data. (The bas_Latn entry reflects the data in CLDR, including the lack of spaces between certain bases.) Should we have a guideline that specifically spells out what needs to be included in bases, auxiliary, and marks?

Perhaps something like:
-bases: all primary characters of a language including precomposed base/mark pairs and non-composed base/mark pairs, when a precomposed character is not encoded
-auxiliary: all secondary characters of a language including precomposed base/mark pairs and non-composed base/mark pairs, when a precomposed character is not encoded
-marks: all standalone marks whether they are primary or auxiliary
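A guideline like this could be checked mechanically. The following is a rough Python sketch, not shaperglot's actual implementation: the function names are hypothetical, and it assumes the exemplar fields are plain strings in the format shown above. It reports marks that appear in base or auxiliary entries but are absent from the marks list, using the af_Latn data quoted earlier.

```python
import re
import unicodedata


def split_entries(field: str) -> list[str]:
    """Split an exemplar string into entries, unwrapping {...} clusters.

    A regex is used instead of str.split() so that adjacent clusters
    with no separating space, such as "{x}{y}", are still separated.
    """
    entries = []
    for m in re.finditer(r"\{([^}]*)\}|([^\s{}]+)", field):
        entries.append(m.group(1) if m.group(1) is not None else m.group(2))
    return entries


def used_marks(field: str) -> set[str]:
    """Collect every combining mark found after NFD-decomposing each entry."""
    marks = set()
    for entry in split_entries(field):
        for ch in unicodedata.normalize("NFD", entry):
            if unicodedata.category(ch).startswith("M"):
                marks.add(ch)
    return marks


def missing_marks(base: str, auxiliary: str, marks: str) -> set[str]:
    """Marks used by base or auxiliary entries but not declared in marks."""
    # Only mark-category characters count; dotted circles and spaces are skipped.
    declared = {ch for ch in marks if unicodedata.category(ch).startswith("M")}
    return (used_marks(base) | used_marks(auxiliary)) - declared


# af_Latn fields as quoted in this issue.
af_base = "a á â b c d e é è ê ë f g h i î ï j k l m n o ô ö p q r s t u û v w x y z"
af_auxiliary = "à å ä ã æ ç í ì ó ò ú ù ü ý"
af_marks = "\u25cc\u0300 \u25cc\u0302 \u25cc\u0308"

missing = missing_marks(af_base, af_auxiliary, af_marks)
print(sorted(f"U+{ord(m):04X}" for m in missing))
```

Whether such a gap should count as an error depends on which interpretation of marks the guideline settles on.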


simoncozens · 2024-11-27T13:22:54Z

An overlapping issue related by Dan Burzo:

I was looking at Romanian, wondering about the necessity of combining marks as independent codepoints to declare it supported. As long as “Ă” and “ă” exist, a combining breve is just a nice to have (maybe to form the historical ĕ, ĭ, ŭ)?

Our marks entry is underspecified: does it mean the marks you need to form characters, or marks which can attach to a variety of bases? For Romanian, it's the former: we ask for '◌̂', '◌̆', '◌̦', '◌̧' but only because we have base characters which already contain those marks. And so this field is redundant data: just decompose the base characters into NFD, and there your marks are. But for Arabic it's the latter: we ask for '◌ٰ', '◌ٓ', '◌ٔ', '◌ٕ', '◌ً', '◌ٌ', '◌ٍ', '◌َ', '◌ُ', '◌ِ', '◌ّ', '◌ْ' which can sit on top of any base consonant. This is new data since it can't be derived from the bases.
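The "decompose into NFD" step can be sketched in a few lines of Python. This is only an illustration: `marks_from_bases` is a hypothetical helper, and the two base strings are small hand-written samples rather than the actual gflang entries.

```python
import unicodedata


def marks_from_bases(base: str) -> set[str]:
    """Collect the combining marks found by NFD-decomposing each base entry."""
    marks = set()
    for entry in base.split():
        for ch in unicodedata.normalize("NFD", entry.strip("{}")):
            if unicodedata.category(ch).startswith("M"):
                marks.add(ch)
    return marks


# Illustrative Romanian sample: the marks fall out of the precomposed letters.
romanian_sample = "a ă â i î s ș t ț"
print(sorted(f"U+{ord(m):04X}" for m in marks_from_bases(romanian_sample)))

# Illustrative Arabic consonants: nothing can be derived, so the vowel marks
# would have to be listed separately.
arabic_sample = "ب ت ث"
print(marks_from_bases(arabic_sample))
```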

I think we probably want to move towards the latter interpretation: "marks" are any independent combining marks that you need to support the language.

NeilSureshPatel mentioned this issue Oct 31, 2022

Remove duplicates from languages exemplar_chars #18

Merged

vv-monsalve mentioned this issue Sep 13, 2024

Shaperglot reporting missing punctuation? googlefonts/shaperglot#66

Open
