ISCA Archive: Interspeech 2021

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion

Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and thus degrades VC performance. To alleviate this issue, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations, by reducing their inter-dependencies in an unsupervised manner. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations for retaining source linguistic content and intonation variations, while capturing target speaker characteristics. In doing so, the proposed approach achieves higher speech naturalness and speaker similarity than current state-of-the-art one-shot VC systems. Our code, pre-trained models and demo are publicly available.
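The content branch described above rests on vector quantization: continuous frame-level encoder outputs are snapped to their nearest codebook entry, discarding speaker-dependent detail. Below is a minimal NumPy sketch of that nearest-neighbour lookup only, with illustrative names; the actual VQMIVC model learns the codebook jointly with the encoders and additionally minimizes an MI estimate between content, speaker, and pitch representations, neither of which is shown here.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each frame-level content vector to its nearest codebook entry.

    z:        (T, D) continuous encoder outputs, one row per frame
    codebook: (K, D) learned discrete content codes
    Returns the quantized frames (T, D) and the chosen code indices (T,).
    """
    # Squared Euclidean distance between every frame and every code: (T, K)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)          # nearest code per frame
    return codebook[idx], idx

# Toy example: two codes in 2-D, two frames each near one code.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
frames = np.array([[0.1, 0.1], [0.9, 1.2]])
quantized, codes = vector_quantize(frames, codebook)
```

In training, the quantization step is non-differentiable, so gradients are typically passed through with a straight-through estimator while a commitment loss pulls encoder outputs toward their assigned codes.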

@inproceedings{wang21n_interspeech,
  title     = {VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion},
  author    = {Disong Wang and Liqun Deng and Yu Ting Yeung and Xiao Chen and Xunying Liu and Helen Meng},
  year      = {2021},
  booktitle = {Interspeech 2021},
  pages     = {1344--1348},
  doi       = {10.21437/Interspeech.2021-283},
  issn      = {2958-1796},
}

Cite as: Wang, D., Deng, L., Yeung, Y.T., Chen, X., Liu, X., Meng, H. (2021) VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion. Proc. Interspeech 2021, 1344-1348, doi: 10.21437/Interspeech.2021-283

doi:10.21437/Interspeech.2021-283
