This project is the successor to canto-filter. canto-filter has only 4 output labels and a simpler, faster classification logic, which makes it better suited to online filtering and other use cases that require low latency or high classification speed. This package has 6 output labels and uses a more sophisticated classification logic for finer-grained classification. It is more accurate than canto-filter, but slower.
The definitions and boundaries of the labels depend on the user's language ideology. When discussing the criteria adopted by this tool, please cite:
Chaak-ming Lau, Mingfei Lau, and Ann Wai Huen To. 2024. The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection. In Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024, pages 24–29, Torino, Italia. ELRA and ICCL.
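The filter is based on Regex rules that detect lexical features specific to Cantonese or Standard Written Chinese. The classifier has two main parameters, --split and --quotes.
--split: splits the input into segments. For the purpose of filtering pure Cantonese text, enabling this parameter increases precision but decreases recall.
--quotes: off by default, in which case the classifier only sorts inputs into 4 categories. When enabled, two more categories are added, for 6 labels in total: the classifier extracts the text inside quotation marks and judges it separately from the text outside the quotes.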
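The quote extraction and segment splitting described above can be pictured with a small sketch. This is not the package's own implementation: the quote pairs in `QUOTE_RE`, the delimiters in `SEGMENT_RE`, and the helper names `split_quotes` and `split_segments` are all assumptions made for illustration.

```python
import re

# Assumed quote pairs (corner brackets and curly double quotes); the real tool may use others.
QUOTE_RE = re.compile(r"「([^」]*)」|“([^”]*)”")
# Assumed segment delimiters (sentence-final punctuation and newlines).
SEGMENT_RE = re.compile(r"[。!?!?;;\n]+")


def split_quotes(text):
    """Return (matrix_text, quoted_parts): text outside quotes and the quoted strings."""
    quoted = [a or b for a, b in QUOTE_RE.findall(text)]
    matrix = QUOTE_RE.sub("", text)
    return matrix, quoted


def split_segments(text):
    """Split text on the assumed delimiters, dropping empty segments."""
    return [s for s in SEGMENT_RE.split(text) if s.strip()]


matrix, quoted = split_quotes("他説:「我哋今晚食飯。你想去邊度食?」")
print(matrix)                      # 他説:
print(quoted)                      # ['我哋今晚食飯。你想去邊度食?']
print(split_segments(quoted[0]))   # ['我哋今晚食飯', '你想去邊度食']
```

Once the matrix text and the quoted parts are separated, each can be judged on its own, which is what makes labels such as "SWC text with Cantonese quotes" possible.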
The classifiers output four (coarse) or six (fine-grained) categories. The labels are:
Cantonese: Pure Cantonese text, containing only Cantonese-featured words. E.g. 你喺邊度
SWC: Pure Standard Written Chinese (SWC) text, containing only Mandarin-featured words. E.g. 你在哪裏
Mixed: Mixed Cantonese-Mandarin text, containing both Cantonese and Mandarin-featured words. E.g. 是咁的
Neutral: Chinese text with no distinguishing features, containing neither Cantonese nor Mandarin-featured words. Such sentences can be used in both Cantonese and Mandarin text corpora. E.g. 去學校讀書
MixedQuotesInSWC: SWC text in which the quoted content is Mixed.
CantoneseQuotesInSWC: SWC text in which the quoted content is pure Cantonese.
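As a rough illustration of how Regex feature detection can map onto the four coarse labels, here is a minimal sketch. The two feature patterns are tiny, hypothetical lists chosen only to cover the examples above; the package's real rules are far more extensive and handle exceptions that this sketch ignores (for instance, a character such as 的 can also occur in Cantonese words).

```python
import re

# Hypothetical, deliberately tiny feature lists (NOT the package's actual rules).
CANTO_RE = re.compile(r"[喺嘅咁哋嚟唔]|邊度")
SWC_RE = re.compile(r"[是的在這哪]|他們")


def classify(text):
    """Return one of the four coarse labels based on which feature sets match."""
    has_canto = bool(CANTO_RE.search(text))
    has_swc = bool(SWC_RE.search(text))
    if has_canto and has_swc:
        return "mixed"
    if has_canto:
        return "cantonese"
    if has_swc:
        return "swc"
    return "neutral"


for sentence in ["你喺邊度", "你在哪裏", "是咁的", "去學校讀書"]:
    print(sentence, classify(sentence))
# 你喺邊度 cantonese
# 你在哪裏 swc
# 是咁的 mixed
# 去學校讀書 neutral
```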
Python >= 3.11
pip install cantonesedetect
You can call the Python API of this library, or run it directly from the CLI.
Initialize a detector and call its judge() function on inputs to get the classification output.
from cantonesedetect import CantoneseDetector
# By default, use_quotes=False, split_seg=False, get_analysis=False
detector = CantoneseDetector()
detector.judge('你喺邊度') # cantonese
detector.judge('你在哪裏') # swc
detector.judge('是咁的') # mixed
detector.judge('去學校讀書') # neutral
detector.judge('他説:“係噉嘅。”') # cantonese_quotes_in_swc
detector.judge('那就「是咁的」') # mixed_quotes_in_swc
To judge inputs with matrix-quote splitting or segment splitting, enable the corresponding options:
from cantonesedetect import CantoneseDetector
detector = CantoneseDetector(use_quotes=True, split_seg=True, get_analysis=True)
judgement, document_features = detector.judge("他説:「我哋今晚食飯。你想去邊度食?」")
# Print the analysis results.
# `document_features` contains `document_segments_features`, a list of per-segment features,
# and `document_segments_judgements`, a list of per-segment judgements.
print([j.value for j in document_features.document_segments_judgements])
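When a document has many segments, it can be handy to summarise the per-segment judgements rather than print them all. The sketch below counts how often each label occurs. It relies only on each judgement exposing a `.value` attribute, as in the print statement above; `StubJudgement` is a hypothetical stand-in used so the example runs without the library, and `tally` is an illustrative helper, not part of the package.

```python
from collections import Counter
from enum import Enum


class StubJudgement(Enum):
    """Hypothetical stand-in for the library's judgement type."""
    CANTONESE = "cantonese"
    SWC = "swc"
    MIXED = "mixed"
    NEUTRAL = "neutral"


def tally(judgements):
    """Count how many segments received each label."""
    return Counter(j.value for j in judgements)


counts = tally([StubJudgement.CANTONESE, StubJudgement.CANTONESE, StubJudgement.NEUTRAL])
print(counts.most_common())  # [('cantonese', 2), ('neutral', 1)]
```

With the real library, the same helper could presumably be called as `tally(document_features.document_segments_judgements)`.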
If you run it directly from the CLI, simply specify --input. The optional arguments --quotes, --split, and --print_analysis are all False by default, and you can turn them on by specifying them.
cantonesedetect --input input.txt
# Enable matrix-quote splitting, segment splitting, and printing of the analysis.
cantonesedetect --input input.txt --quotes --split --print_analysis