As synthesized speech technology becomes more widely used, the quality of synthesized speech must be assessed to ensure that it is acceptable. Subjective evaluation metrics, such as the mean opinion score (MOS), provide only an overall impression without any further detailed information about the speech. Therefore, this study proposes predicting speech quality using electroencephalography (EEG), which is more objective and has high temporal resolution. In this paper, we use one natural speech recording and four types of synthesized speech, each lasting two to six seconds. First, to obtain ground-truth MOS values, we gathered ten subjects to give opinion scores on a scale of one to five for each recording. Second, another nine subjects were asked to rate how close to natural speech each synthesized speech sounded. The subjects' EEG was recorded while they listened to and evaluated the speech. The best classification accuracy achieved was 96.61% using a support vector machine (SVM), 80.36% using linear discriminant analysis (LDA), and 59.9% using logistic regression. For regression, we achieved a root mean squared error (RMSE) as low as 1.133 using support vector regression (SVR) and 1.353 using linear regression. This study demonstrates that EEG could be used to evaluate perceived speech quality objectively.
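The evaluation setup described above (discrete opinion-score classification with SVM, LDA, and logistic regression, plus continuous MOS regression with SVR and linear regression, scored by RMSE) can be sketched as below. This is a minimal illustration only: the paper's actual EEG features, preprocessing, and hyperparameters are not given here, so synthetic feature vectors stand in for the recorded EEG data.

```python
# Illustrative sketch of the classifier/regressor comparison described in the
# abstract. The EEG feature extraction is NOT from the paper; we use random
# synthetic vectors as placeholders for per-trial EEG features.
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

rng = np.random.default_rng(0)
# Hypothetical trials: 200 samples of 32-dimensional EEG features, with
# MOS labels 1-5. Shapes and labels are placeholders, not the paper's data.
y = rng.integers(1, 6, size=200)
X = rng.normal(size=(200, 32)) + 0.5 * y[:, None]  # weakly label-dependent

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Classification: predict the discrete opinion score from EEG features.
for clf in (SVC(), LinearDiscriminantAnalysis(),
            LogisticRegression(max_iter=1000)):
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"{type(clf).__name__}: accuracy = {acc:.3f}")

# Regression: predict MOS as a continuous value, reported via RMSE.
for reg in (SVR(), LinearRegression()):
    reg.fit(X_tr, y_tr.astype(float))
    rmse = mean_squared_error(y_te, reg.predict(X_te)) ** 0.5
    print(f"{type(reg).__name__}: RMSE = {rmse:.3f}")
```

In practice, EEG-based pipelines typically derive such feature vectors from band-power or time-frequency representations of the recorded signals before fitting models like these.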
@inproceedings{parmonangan19_interspeech,
  title     = {Speech Quality Evaluation of Synthesized Japanese Speech Using EEG},
  author    = {Ivan Halim Parmonangan and Hiroki Tanaka and Sakriani Sakti and Shinnosuke Takamichi and Satoshi Nakamura},
  year      = {2019},
  booktitle = {Interspeech 2019},
  pages     = {1228--1232},
  doi       = {10.21437/Interspeech.2019-2059},
  issn      = {2958-1796},
}
Cite as: Parmonangan, I.H., Tanaka, H., Sakti, S., Takamichi, S., Nakamura, S. (2019) Speech Quality Evaluation of Synthesized Japanese Speech Using EEG. Proc. Interspeech 2019, 1228-1232, doi: 10.21437/Interspeech.2019-2059