This project is a benchmark for evaluating Cantonese Grapheme-to-Phoneme (G2P) systems. The benchmark is based on the Jyutping Romanization System. The data is obtained from word.hk.
The dataset includes character pairs along with their corresponding ground truth phonemes for G2P model evaluation. Each text file line contains a word, an underscore (_), and the target character for phoneme prediction. This format focuses on single-character phoneme prediction within words, making it suitable for assessing G2P models on specific characters.
The benchmark uses the following metrics:
- Accuracy: The percentage of words that are correctly converted the specified character from graphemes to phonemes.
- Levenshtein Distance: The average Levenshtein distance between the predicted phonemes and the ground truth phonemes.
# pull submodules
git submodule update --init --recursive
# install dependencies
pip install -r requirements.txt
# install g2pW-Cantonese dependencies
pip install -r g2pW-Cantonese/requirements.txt
python run.py
To submit your G2P system, please subclass the G2PModel
class in models
and implement the _predict
method. Then, add your model to the models
list in run.py
. Finally, run run.py
to generate the results.