speechio/BigCiDian
This project is an attempt to create a pronunciation lexicon covering both English and Chinese words in a unified phoneset for ASR applications.
P.S. "CiDian" means "lexicon" in Chinese.
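Typical use cases in Chinese ASR applications:

- 你手机上都装了什么 APP ? (What apps have you installed on your phone?)
- APPLE 的新 MACBOOK PRO 真漂亮 (Apple's new MacBook Pro is really beautiful)
- 上个月 PRADA 出了款新包包 (Prada released a new bag last month)
- 手机开了 GPRS 导航 (The phone is navigating with GPRS turned on)
- 世界杯 H 组小组赛 (World Cup Group H group-stage matches)

The unified phoneset should be simple and precise while covering both languages. Note that the mappings listed below are heavily based on IPA.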
English entries are derived from CMUDict 0.7b, hence we need a mapping from ARPA phoneset to target phoneset.
| ARPA | IPA | CMUDict example entries |
|---|---|---|
| AA0 | a | icon: AY1 K AA0 N |
| AA1 | a | heart: HH AA1 R T |
| AA2 | a | kmart: K EY1 M AA2 R T |
| AE0 | æ | romance: R OW1 M AE0 N S |
| AE1 | æ | lambda: L AE1 M D AH0 |
| AE2 | æ | setback: S EH1 T B AE2 K |
| AH0 | ə | station: S T EY1 SH AH0 N |
| AH1 | ʌ | bug: B AH1 G |
| AH2 | ʌ | haircut: HH EH1 R K AH2 T |
| AO0 | ɔ | hongkong: HH AO1 NG K AO0 NG |
| AO1 | ɔ | law: L AO1 |
| AO2 | ɔ | layoff: L EY1 AO2 F |
| AW0 | au | foundation: F AW0 N D EY1 SH AH0 N |
| AW1 | au | founder: F AW1 N D ER0 |
| AW2 | au | hometown: HH OW1 M T AW2 N |
| AY0 | ai | hypothese: HH AY0 P AA1 TH AH0 S IY2 Z |
| AY1 | ai | ice: AY1 S |
| AY2 | ai | iceland: AY1 S L AH0 N D |
| B | b | bike: B AY1 K |
| CH | ch | chase: CH EY1 S |
| D | d | desk: D EH1 S K |
| DH | ð | those: DH OW1 Z |
| EH0 | e | princess: P R IH1 N S EH0 S |
| EH1 | e | professor: P R AH0 F EH1 S ER0 |
| EH2 | e | progress: P R AA1 G R EH2 S |
| ER0 | ə r | programmer: P R OW1 G R AE2 M ER0 |
| ER1 | ə r | purge: P ER1 JH |
| ER2 | ə r | showgirl: SH OW1 G ER2 L |
| EY0 | ei | eighteen: EY0 T IY1 N |
| EY1 | ei | email: IY0 M EY1 L |
| EY2 | ei | thursday: TH ER1 Z D EY2 |
| F | f | face: F EY1 S |
| G | g | give: G IH1 V |
| HH | h | hey: HH EY1 |
| IH0 | i | facing: F EY1 S IH0 NG |
| IH1 | i | fear: F IH1 R |
| IH2 | i | fellowship: F EH1 L OW0 SH IH2 P |
| IY0 | ii | email: IY0 M EY1 L |
| IY1 | ii | prefix: P R IY1 F IH0 K S |
| IY2 | ii | increase: IH1 N K R IY2 S |
| JH | zh | gesture: JH EH1 S CH ER0 |
| K | k | cat: K AE1 T |
| L | l | lack: L AE1 K |
| M | m | may: M EY1 |
| N | n | no: N OW1 |
| NG | ŋ | thing: TH IH1 NG |
| OW0 | əu | crypto: K R IH1 P T OW0 |
| OW1 | əu | token: T OW1 K AH0 N |
| OW2 | əu | earphone: IH1 R F OW2 N |
| OY0 | ɔi | invoice: IH1 N V OY0 S |
| OY1 | ɔi | floyd: F L OY1 D |
| OY2 | ɔi | episode: EH1 P IH0 S OW2 D |
| P | p | pat: P AE1 T |
| R | r | risk: R IH1 S K |
| S | s | sing: S IH1 NG |
| SH | sh | shake: SH EY1 K |
| T | t | test: T EH1 S T |
| TH | θ | think: TH IH1 NG K |
| UH0 | u | fulfill: F UH0 L F IH1 L |
| UH1 | u | full: F UH1 L |
| UH2 | u | goodbye: G UH2 D B AY1 |
| UW0 | uu | rescue: R EH1 S K Y UW0 |
| UW1 | uu | fool: F UW1 L |
| UW2 | uu | restroom: R EH1 S T R UW2 M |
| V | v | very: V EH1 R IY0 |
| W | w | west: W EH1 S T |
| Y | y | yes: Y EH1 S |
| Z | z | zero: Z IY1 R OW0 |
| ZH | ʒ | illusion: IH2 L UW1 ZH AH0 N |
Note: if you find anything in the mapping table that doesn't make sense, please let me know. Thanks.
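The table above can be applied mechanically to a CMUDict pronunciation. The following is a minimal sketch, not the project's actual conversion script: the names `ARPA_TO_TARGET` and `convert_arpa` are hypothetical, and the code assumes that stress digits only matter for `AH` (where `AH0` maps to ə and `AH1`/`AH2` map to ʌ), which is what the table shows.

```python
import re

# ARPA base symbol (stress digit removed) -> target phone(s), taken from the table.
# AH is handled separately because its mapping depends on stress.
ARPA_TO_TARGET = {
    "AA": "a", "AE": "æ", "AO": "ɔ", "AW": "au", "AY": "ai",
    "EH": "e", "ER": "ə r", "EY": "ei", "IH": "i", "IY": "ii",
    "OW": "əu", "OY": "ɔi", "UH": "u", "UW": "uu",
    "B": "b", "CH": "ch", "D": "d", "DH": "ð", "F": "f", "G": "g",
    "HH": "h", "JH": "zh", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "r", "S": "s", "SH": "sh", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "y", "Z": "z", "ZH": "ʒ",
}

ARPA_SYMBOL = re.compile(r"([A-Z]+)([0-2]?)")


def convert_arpa(pron: str) -> str:
    """Convert a space-separated ARPA pronunciation to the target phoneset."""
    out = []
    for sym in pron.split():
        m = ARPA_SYMBOL.fullmatch(sym)
        if m is None:
            raise ValueError(f"not an ARPA symbol: {sym!r}")
        base, stress = m.groups()
        if base == "AH":
            # Unstressed AH0 is schwa; stressed AH1/AH2 map to ʌ.
            out.append("ə" if stress == "0" else "ʌ")
        else:
            out.append(ARPA_TO_TARGET[base])
    return " ".join(out)
```

For example, `convert_arpa("JH EH1 S CH ER0")` yields `zh e s ch ə r`, matching the GESTURE entry in the unified phoneset table. Note that `ER` expands to two target phones.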
Chinese entries are extracted from the DaCiDian project.
Here is a PinYin to IPA mapping from an educational perspective: https://resources.allsetlearning.com/chinese/pronunciation/Pinyin_chart
With a few mapping modifications and symbolic adaptations, here is the final PinYin to target phoneset mapping
There are normally 5 tones in the Chinese PinYin system, numbered 0 to 4. English, however, has no tones. In BigCiDian, Chinese tonal information is retained and merged with untoned English, so a phoneme in the resulting phoneset may have up to 6 tonal variations (1 from English and 5 from Chinese):

e.g. for phoneme *ai*:

1. HI -> h ai
2. 哎 -> ai_0
3. 掰 -> b ai_1
4. 还 -> h ai_2
5. 凯 -> k ai_3
6. 外 -> w ai_4

The final unified bi-lingual phoneset details are listed below:
| phoneme | CN example | EN example |
|---|---|---|
| a | 把: b a_3 | AACHEN: a k ə n |
| æ | | CAT: k æ t |
| ai | 爱: ai_4 | KITE: k ai t |
| an | 安: an_1 | |
| aŋ | 羊: y aŋ_2 | |
| au | 老: l au_3 | LOUD: l au d |
| b | 白: b ai_2 | BUT: b ʌ t |
| ch | 陈: ch ən_2 | CHEST: ch e s t |
| d | 大: d a_4 | DAY: d ei |
| ð | | THIS: ð i s |
| e | | BED: b e d |
| ei | 累: l ei_4 | LAKE: l ei k |
| ə | 鹅: ə_2 | COCA-COLA: k əu k ə k əu l a |
| ən | 陈: ch ən_2 | |
| əŋ | 横: h əŋ_2 | |
| ər | 二: ər_4 | |
| əu | 欧: əu_1 | BOAT: b əu t |
| f | 房: f aŋ_2 | FACE: f ei s |
| g | 刚: g aŋ_1 | GIVE: g i v |
| h | 海: h ai_3 | HUG: h ʌ g |
| i | 天: t i an_1 | HIT: h i t |
| ie | 别: b ie_2 | |
| ii | 比: b ii_3 | BEAT: b ii t |
| iii | 吃: ch iii_1 | |
| in | 音: y in_1 | |
| iŋ | 听: t iŋ_1 | |
| j | 九: j i əu_3 | |
| k | 看: k an_4 | CAKE: k ei k |
| l | 来: l ai_2 | LAKE: l ei k |
| m | 马: m a_3 | MAKE: m ei k |
| n | 那: n a_1 | NIKE: n ai k ii |
| ŋ | | INTERESTING: i n t ə r e s t i ŋ |
| ɔ | | OFF: ɔ f |
| ɔi | | JOY: zh ɔi |
| p | 胖: p aŋ_4 | PACE: p ei s |
| q | 钱: q i an_2 | |
| r | 让: ʒ aŋ_4 | RISK: r i s k |
| s | 丝: s iii_1 | SING: s i ŋ |
| sh | 上: sh aŋ_4 | SHAKE: sh ei k |
| t | 团: t u an_2 | TIME: t ai m |
| ts | 才: ts ai_2 | |
| u | | BOOK: b u k |
| uŋ | 从: ts uŋ_2 | |
| uɔ | 桌: zh uɔ_1 | |
| uu | 不: b uu_4 | TWO: t uu |
| v | | VICTORY: v i k t ə r ii |
| ʌ | | CUT: k ʌ t |
| w | 王: w aŋ_2 | WEST: w e s t |
| x | 西: x ii_1 | |
| y | 言: y an_2 | YES: y e s |
| yu | 去: q yu_4 | |
| yue | 缺: q yue_1 | |
| z | 赞: z an_4 | ZOO: z uu |
| zh | 中: zh uŋ_1 | GESTURE: zh e s ch ə r |
| ʒ | 让: ʒ aŋ_4 | LEISURE: l e ʒ ə r |
| θ | | THINK: θ i ŋ k |
So overall there are 56 phonemes in the unified phoneset (regardless of tones).
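To see how toned variants collapse onto base phonemes, the sketch below strips the `_N` tone suffix used in the examples above and groups variants by base phoneme. It is only an illustration: the helper names `split_tone` and `group_by_base` are hypothetical, and the assumption that a phone list holds one symbol per entry with tones written as a trailing `_0` to `_4` is inferred from the examples, not confirmed against the generated `phones.list`.

```python
import re
from collections import defaultdict

# A tone is assumed to be a trailing "_0" .. "_4"; untoned (English) phones have none.
TONE_SUFFIX = re.compile(r"_([0-4])$")


def split_tone(phone):
    """Return (base_phoneme, tone), where tone is None for untoned phones."""
    m = TONE_SUFFIX.search(phone)
    if m:
        return phone[: m.start()], int(m.group(1))
    return phone, None


def group_by_base(phones):
    """Map each base phoneme to the set of tonal variants seen for it."""
    groups = defaultdict(set)
    for phone in phones:
        base, tone = split_tone(phone)
        groups[base].add(tone)
    return dict(groups)
```

Applied to the six *ai* variants listed earlier (`ai`, `ai_0`, ..., `ai_4`), `group_by_base` returns a single key `ai` with six entries; counting the keys over a full phone list gives the tone-independent phoneset size.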
Theoretically, some phonemes could be split at a finer granularity (e.g. au -> a u, ɔi -> ɔ i, an -> a n ...), making the phoneset even more compact. However, it is common practice to use larger acoustic modeling units because they benefit Chinese ASR accuracy, and decision-tree based state-tying makes the size of the base phoneset less relevant to the ASR problem.
I may or may not change the unified phoneset in the future. Currently it seems to be sufficient for my purpose.
Running `sh run.sh` should give you a ready-to-use bi-lingual ASR lexicon (`lexicon.txt`) and a phoneset list (`phones.list`) in the project root directory.
To extend the final lexicon with entries of your own interest (say "IPHONE", "华为P30"), you can either:
- add those entries to the underlying sources (CMUDict and DaCiDian)
or:
- maintain a separate extension lexicon, and merge it with the main lexicon automatically generated above.
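The second option can be sketched as a small merge step. This is an illustration under stated assumptions, not part of the project: the function names are hypothetical, and the code assumes each lexicon line has the form `WORD phone phone ...` with the word as the first whitespace-delimited token, which should be checked against the generated `lexicon.txt`.

```python
from pathlib import Path


def read_lexicon(path):
    """Read (word, pronunciation) pairs; lines without a pronunciation are skipped."""
    entries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        entries.append((parts[0], " ".join(parts[1:])))
    return entries


def merge_lexicons(main, extension):
    """Union of both lexicons with exact duplicate entries removed, sorted."""
    # A word may keep several distinct pronunciations; only identical pairs collapse.
    return sorted(set(main) | set(extension))


def write_lexicon(entries, path):
    """Write one 'WORD phone phone ...' entry per line."""
    text = "".join(f"{word} {pron}\n" for word, pron in entries)
    Path(path).write_text(text, encoding="utf-8")


# Usage sketch (the extension file name is hypothetical):
#   merged = merge_lexicons(read_lexicon("lexicon.txt"),
#                           read_lexicon("extension_lexicon.txt"))
#   write_lexicon(merged, "lexicon_merged.txt")
```

Keeping the extension in its own file means rerunning `run.sh` never overwrites your additions. Any phone used in the extension should already appear in `phones.list`.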
In the AISHELL-2 Mandarin ASR task, the Chinese lexicon (DaCiDian) was replaced with the multilingual CN-EN lexicon (BigCiDian). Details are shown below:
For DaCiDian, system performance (test set):

- `%WER 44.39 [ 21986 / 49532, 338 ins, 2085 del, 19563 sub ] exp/mono/decode_test/cer_9_0.0`
- `%WER 24.25 [ 12011 / 49532, 393 ins, 792 del, 10826 sub ] exp/tri1/decode_test/cer_12_0.0`
- `%WER 22.13 [ 10963 / 49532, 396 ins, 644 del, 9923 sub ] exp/tri2/decode_test/cer_12_0.0`
- `%WER 19.29 [ 9555 / 49532, 263 ins, 640 del, 8652 sub ] exp/tri3/decode_test/cer_13_0.5`
- `%WER 8.33 [ 4125 / 49532, 84 ins, 192 del, 3849 sub ] exp/chain/tdnn_1a/decode_test/cer_8_0.5`

For BigCiDian, system performance:

- `%WER 43.92 [ 21754 / 49532, 405 ins, 1574 del, 19775 sub ] exp/mono/decode_test/cer_7_0.0`
- `%WER 22.54 [ 11163 / 49532, 406 ins, 652 del, 10105 sub ] exp/tri1/decode_test/cer_11_0.0`
- `%WER 21.09 [ 10445 / 49532, 377 ins, 609 del, 9459 sub ] exp/tri2/decode_test/cer_12_0.0`
- `%WER 18.47 [ 9148 / 49532, 265 ins, 621 del, 8262 sub ] exp/tri3/decode_test/cer_13_0.5`
- `%WER 8.22 [ 4072 / 49532, 68 ins, 260 del, 3744 sub ] exp/chain/tdnn_1a/decode_test/cer_9_0.5`

Conclusion
- BigCiDian gives only slightly better results than DaCiDian.
- More importantly, BigCiDian turns a pure Chinese ASR system into a multilingual one, which is pretty much the norm in today's Chinese ASR applications.
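To quantify "slightly better", the relative error reduction can be computed directly from the scoring lines reported above. The sketch below is a hypothetical helper, not part of the project; it parses the error and reference counts from a `%WER` line and compares two systems scored on the same test set. For the chain models, 4125 versus 4072 errors out of 49532 works out to roughly a 1.3% relative reduction.

```python
import re

# Matches e.g. "%WER 8.33 [ 4125 / 49532, ..." and captures errors and total.
WER_LINE = re.compile(r"%WER\s+([\d.]+)\s+\[\s*(\d+)\s*/\s*(\d+),")


def parse_errors(line):
    """Return (error_count, reference_count) from a %WER scoring line."""
    m = WER_LINE.search(line)
    if m is None:
        raise ValueError(f"not a %WER line: {line!r}")
    return int(m.group(2)), int(m.group(3))


def relative_reduction(baseline_line, new_line):
    """Fraction of baseline errors removed by the new system."""
    base_err, base_total = parse_errors(baseline_line)
    new_err, new_total = parse_errors(new_line)
    if base_total != new_total:
        raise ValueError("systems were scored on different reference sizes")
    return (base_err - new_err) / base_err
```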
THE END