Chinese Mandarin Grapheme-to-Phoneme Converter. Converts Chinese text to Zhuyin or Pinyin (INTERSPEECH 2022)
GitYCC/g2pW
Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang, and Yi-Ren Yeh
This is the official repository of our paper g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin (INTERSPEECH 2022).
- g2pW is included in PaddlePaddle/PaddleSpeech
- g2pW is included in mozillazg/pypinyin-g2pW
(This work was tested with PyTorch 1.7.0, CUDA 10.1, Python 3.6 and Ubuntu 16.04.)
Install PyTorch
$ pip install g2pw
>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter()
>>> sentence = '上校請技術人員校正FN儀器'
>>> conv(sentence)
[['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2', 'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']]
>>> sentences = ['銀行', '行動']
>>> conv(sentences)
[['ㄧㄣ2', 'ㄏㄤ2'], ['ㄒㄧㄥ2', 'ㄉㄨㄥ4']]
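The sample output above suggests that the converter returns one entry per input character, with `None` where no reading is predicted (here the Latin letters `F` and `N`). Assuming that one-to-one alignment holds in general, a minimal sketch for pairing characters with their readings might look like this. The helper `pair_chars_with_readings` is hypothetical and not part of g2pw. The readings are copied from the sample output so the snippet runs without loading the model.

```python
from typing import List, Optional, Tuple


def pair_chars_with_readings(
    sentence: str, readings: List[Optional[str]]
) -> List[Tuple[str, Optional[str]]]:
    """Zip each character with its predicted reading (hypothetical helper).

    Assumes one reading per character, as the sample output suggests.
    """
    if len(sentence) != len(readings):
        raise ValueError("expected exactly one reading per character")
    return list(zip(sentence, readings))


sentence = '上校請技術人員校正FN儀器'
# Copied from the sample output; with g2pw installed this would be conv(sentence)[0].
readings = ['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2',
            'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']

pairs = pair_chars_with_readings(sentence, readings)
for char, reading in pairs:
    # Characters without a predicted reading are flagged explicitly.
    print(char, reading if reading is not None else '(no reading)')
```

In this sample the two occurrences of 校 receive different readings (ㄒㄧㄠ4 and ㄐㄧㄠ4), which is the polyphone disambiguation the model is designed for.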
conv = G2PWConverter(model_dir='./G2PWModel-v2-onnx/', model_source='./path-to/bert-base-chinese/')
>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter(style='pinyin', enable_non_tradional_chinese=True)
>>> conv('然而,他红了20年以后,他竟退出了大家的视线。')
[['ran2', 'er2', None, 'ta1', 'hong2', 'le5', None, None, 'nian2', 'yi3', 'hou4', None, 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', None]]
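Because punctuation and digits come back as `None` in the Pinyin output above, downstream code usually has to decide what to do with those positions. One option, sketched here under the assumption that the output holds exactly one entry per input character, is to fall back to the original character. The function `render_with_passthrough` is a hypothetical helper, not part of g2pw, and the readings are copied from the sample output so the snippet runs without the model.

```python
from typing import List, Optional


def render_with_passthrough(sentence: str, readings: List[Optional[str]]) -> str:
    """Join readings with spaces, keeping the original character where the
    reading is None (hypothetical helper, assumes one reading per character)."""
    if len(sentence) != len(readings):
        raise ValueError("expected exactly one reading per character")
    tokens = [r if r is not None else ch for ch, r in zip(sentence, readings)]
    return ' '.join(tokens)


sentence = '然而,他红了20年以后,他竟退出了大家的视线。'
# Copied from the sample output; with g2pw installed this would be conv(sentence)[0].
readings = ['ran2', 'er2', None, 'ta1', 'hong2', 'le5', None, None, 'nian2',
            'yi3', 'hou4', None, 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4',
            'jia1', 'de5', 'shi4', 'xian4', None]

rendered = render_with_passthrough(sentence, readings)
print(rendered)
```

Falling back to the source character keeps digits and punctuation visible in the rendered string. A pipeline that needs spoken forms for numbers would have to normalize them before conversion.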
$ git clone https://github.com/GitYCC/g2pW.git
For example, we train models on CPP dataset as follows:
$ bash cpp_dataset/download.sh
$ python scripts/train_g2p_bert.py --config configs/config_cpp.py
$ python scripts/test_g2p_bert.py \
    --config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
    --checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
    --sent_path cpp_dataset/test.sent \
    --output_path output_pred.txt
$ python scripts/predict_g2p_bert.py \
    --config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
    --checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
    --sent_path cpp_dataset/test.sent \
    --lb_path cpp_dataset/test.lb
To cite the code, data, or paper, please use this BibTeX entry:
@inproceedings{chen22d_interspeech,
  title = {g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin},
  author = {Yi-Chang Chen and Yu-Chuan Steven and Yen-Cheng Chang and Yi-Ren Yeh},
  year = {2022},
  booktitle = {Interspeech 2022},
  pages = {1926--1930},
  doi = {10.21437/Interspeech.2022-216},
  issn = {2958-1796},
}