In 1779, Christian Kratzenstein built a machine that could produce vowel sounds[4]. This line of work was continued by the Austrian inventor Wolfgang von Kempelen, who built a bellows-operated mechanical speech synthesizer. In 1791 he published the paper Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine ("The Mechanism of Human Speech, with a Description of His Speaking Machine"), in which he described the device. The machine modeled the tongue and lips, and could produce consonants as well as vowels. In 1837, Charles Wheatstone built a speaking machine based on von Kempelen's design, and in 1857 M. Faber built the Euphonia. Wheatstone's machine was reconstructed by Paget in 1923[5].
Speech synthesis methods are often trained on datasets chosen individually by each research group and evaluated on each group's own tasks, which can make fair comparisons of performance difficult. To address this, the Speech Synthesis Special Interest Group (SynSIG) of the International Speech Communication Association (ISCA), an international academic society for speech research, has held an annual competition called the Blizzard Challenge[60] since 2005. By training synthesis systems on a common dataset and evaluating them on common tasks, the challenge makes fair performance comparisons possible.
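Blizzard Challenge entries are compared through shared listening tests, and the results of such tests are conventionally summarized as mean opinion scores (MOS). The following Python sketch shows that kind of aggregation in outline only; the system names, ratings, and 5-point scale are assumptions made for this example, not data from any actual challenge.

```python
# Minimal sketch: summarizing hypothetical listening-test ratings as
# mean opinion scores (MOS) with a rough 95% confidence interval.
# All systems and scores below are invented for illustration.
import math

ratings = {
    "system_A": [4, 5, 4, 3, 4, 5, 4, 4],   # hypothetical 5-point ratings
    "system_B": [3, 3, 4, 2, 3, 3, 4, 3],   # collected on shared sentences
}

for system, scores in ratings.items():
    n = len(scores)
    mos = sum(scores) / n                                # mean opinion score
    var = sum((s - mos) ** 2 for s in scores) / (n - 1)  # sample variance
    ci95 = 1.96 * math.sqrt(var / n)                     # normal-approx. 95% CI
    print(f"{system}: MOS = {mos:.2f} +/- {ci95:.2f}")
```

Because every system is rated on the same material, differences in such scores can be attributed to the systems rather than to the data, which is the point of the shared-task design.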
^ab"with desired characteristics, including but not limited to textual content ..., speaker identity ..., and speaking styles" Wang, et al. (2021).FAIRSEQ S2 : A Scalable and Integrable Speech Synthesis Toolkit.
^ Mattingly, Ignatius G. (1974). "Speech synthesis for phonetic and phonological models". In Thomas A. Sebeok (ed.), Current Trends in Linguistics, Volume 12. The Hague: Mouton. pp. 2451–2487.
^"Aformant synthesizer is a source-filter model in which the source models the glottal pulse train and the filter models the formant resonances of the vocal tract." Smith. (2010).Formant Synthesis Models. Physical Audio Signal Processing.ISBN 978-0-9745607-2-4
^"Constrained linear prediction can be used to estimate the parameters ... more generally ... directly from the short-time spectrum" Smith. (2010).Formant Synthesis Models. Physical Audio Signal Processing.ISBN 978-0-9745607-2-4
^ a b "concatenation-based synthesis systems ... the synthesis stage generally involves ... a concatenation process: the sequence of acoustical units must be concatenated after an appropriate modification of their intrinsic prosody." (Moulines 1990, p. 454)
^ a b "PSOLA ... a family of methods for modifying the prosody ... These methods are used to improve the voice quality of text-to-speech systems based on the concatenation of elementary speech units." (Moulines 1990, p. 453)
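To make the quoted idea concrete, here is a heavily simplified TD-PSOLA-style sketch in Python (NumPy assumed): windowed two-period grains centered on pitch marks are re-spaced and overlap-added to scale the pitch. Uniform, known pitch marks are assumed; a real system must estimate them, and must duplicate or drop grains to preserve duration, which this sketch does not do.

```python
# Simplified TD-PSOLA-style pitch modification by re-spacing
# pitch-synchronous grains. Assumes x is voiced and periodic at f0.
import numpy as np

def psola_pitch_shift(x, fs, f0, factor):
    """Scale the pitch of x by `factor` (>1 raises it); duration also changes."""
    period = int(fs / f0)                 # analysis pitch period (samples)
    new_period = int(period / factor)     # synthesis pitch period
    win = np.hanning(2 * period)          # two-period analysis window

    out = np.zeros(int(len(x) / factor) + 2 * period)
    pos = period                          # first synthesis pitch mark
    for m in range(period, len(x) - period, period):   # analysis pitch marks
        if pos + period >= len(out):
            break
        out[pos - period : pos + period] += x[m - period : m + period] * win
        pos += new_period
    return out[: int(len(x) / factor)]

# Example: raise the pitch of a 120 Hz voiced signal x by a factor of 1.5.
# y = psola_pitch_shift(x, 16000, 120, 1.5)
```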
^ Hunt, Andrew J.; Black, Alan W. (1996). "Unit selection in a concatenative speech synthesis system using a large speech database". 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (IEEE): 373–376. doi:10.1109/ICASSP.1996.541110. ISBN 0-7803-3192-3. ISSN 1520-6149.
^ a b "Unit selection synthesis is also referred as corpus based synthesis." Quoted from: Kayte, Sangramsing (2015). "A Review of Unit Selection Speech Synthesis". International Journal of Advanced Research in Computer Science and Software Engineering. 5 (10): 475–479.
^ "concatenation-based synthesis systems require the use of rather large databases of acoustical units" (Moulines 1990, p. 454)
^ Masuko, Takashi; Tokuda, Keiichi; Kobayashi, Takao; Imai, Satoshi (1996). "Speech synthesis using HMMs with dynamic features". 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (IEEE): 389–392. doi:10.1109/ICASSP.1996.541114. ISBN 0-7803-3192-3. ISSN 1520-6149.
^ Zen, Heiga; Senior, Andrew; Schuster, Mike (2013). "Statistical parametric speech synthesis using deep neural networks". 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE): 7962–7966. ISBN 978-1-4799-0356-6. ISSN 1520-6149.
^ van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew et al. (2016). "WaveNet: A Generative Model for Raw Audio". arXiv preprint arXiv:1609.03499.
^ Shen, J.; Pang, R.; Weiss, R. J. et al. (2017). "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions". arXiv preprint arXiv:1712.05884.
^ Ping, W.; Peng, K.; Chen, J. (2018). "ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech". arXiv preprint arXiv:1807.07281.
^ Prenger, R.; Valle, R.; Catanzaro, B. (2018). "WaveGlow: A Flow-Based Generative Network for Speech Synthesis". arXiv preprint arXiv:1811.00002.
^ Kalchbrenner, N.; Elsen, E.; Simonyan, K. et al. (2018). "Efficient Neural Audio Synthesis". arXiv preprint arXiv:1802.08435.
^ Lorenzo-Trueba, Jaime; Drugman, Thomas; Latorre, Javier; Merritt, Thomas; Putrycz, Bartosz; Barra-Chicote, Roberto; Moinet, Alexis; Aggarwal, Vatsal (2019). "Towards Achieving Robust Universal Neural Vocoding". Interspeech 2019.
^ Anumanchipalli, Gopala K.; Chartier, Josh; Chang, Edward F. (2019). "Speech synthesis from neural decoding of spoken sentences". Nature. 568: 493–498.
^"Singing voice synthesis (SVS) aims to generate humanlike singing voices from musical scores with lyrics" Wu. (2022).DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation.
^"Voice conversion (VC) refers to a technique that converts a certain aspect of speech from a source to that of a target without changing the linguistic content" Huang, et al. (2021).S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations. p.1.
^"speaker conversion, which is the most widely investigated type of VC." Huang, et al. (2021).S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations. p.1.
^"Bandwidth extension ... Frequency bandwidth extension ... can be viewed as a realistic increase of signal sampling frequency." Andreev. (2023).HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement.
^"Bandwidth extension ... also known as audio super-resolution" Andreev. (2023).HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement.
^"The applications of conditional speech generation include ... bandwidth extension (BWE)" Andreev. (2023).HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement.
Moulines, Eric; Charpentier, Francis (1990). "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones". Speech Communication. 9 (5–6): 453–467. doi:10.1016/0167-6393(90)90021-Z.