In this paper, we report our submitted system for the ZeroSpeech 2020 challenge on Track 2019. The main theme of this challenge is to build a speech synthesizer without any textual information or phonetic labels. To tackle this, we build a system that must address two major components: 1) given speech audio, extract subword units in an unsupervised way, and 2) re-synthesize the audio in the voices of novel speakers. The system also needs to balance codebook performance between the ABX error rate and the bitrate compression rate. Our main contribution is a Transformer-based VQ-VAE for unsupervised unit discovery and a Transformer-based inverter for speech synthesis given the extracted codebook. Additionally, we also explored several regularization methods to improve performance even further.
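To make the unit-discovery step concrete, below is a minimal sketch of the VQ-VAE quantization idea the abstract refers to: each encoder frame is snapped to its nearest learned codebook vector, and the resulting index sequence serves as the discovered discrete subword units. This is an illustrative NumPy sketch, not the authors' implementation; the `quantize` function and all shapes here are assumptions for illustration.

```python
import numpy as np

def quantize(encodings, codebook):
    """Illustrative VQ-VAE quantization step (not the authors' code).

    encodings: (T, D) continuous encoder outputs, one row per frame.
    codebook:  (K, D) learned codebook of K discrete codes.
    Returns (indices, quantized): the discrete unit sequence and the
    codebook vectors substituted for each frame.
    """
    # Squared Euclidean distance from every frame to every codebook entry.
    dists = ((encodings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # (T,) discovered unit IDs
    quantized = codebook[indices]    # (T, D) quantized representation
    return indices, quantized

# Toy example: 5 frames of 8-dim features, a 16-entry codebook.
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))
codes = rng.normal(size=(16, 8))
idx, q = quantize(enc, codes)
```

The bitrate mentioned in the abstract is governed by the codebook size K and how often the unit index changes, while the ABX error rate measures how well the discrete units separate phonetic categories, hence the trade-off the system must balance.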
@inproceedings{tjandra20_interspeech,
  title     = {Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis: ZeroSpeech 2020 Challenge},
  author    = {Andros Tjandra and Sakriani Sakti and Satoshi Nakamura},
  year      = {2020},
  booktitle = {Interspeech 2020},
  pages     = {4851--4855},
  doi       = {10.21437/Interspeech.2020-3033},
  issn      = {2958-1796},
}
Cite as: Tjandra, A., Sakti, S., Nakamura, S. (2020) Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis: ZeroSpeech 2020 Challenge. Proc. Interspeech 2020, 4851-4855, doi: 10.21437/Interspeech.2020-3033