
Speech coding


From Wikipedia, the free encyclopedia


Lossy audio compression applied to human speech


Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation, based on audio signal processing techniques, to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.^[1]

Common applications of speech coding are mobile telephony and voice over IP (VoIP).^[2] The most widely used speech coding technique in mobile telephony is linear predictive coding (LPC), while the most widely used in VoIP applications are the LPC and modified discrete cosine transform (MDCT) techniques.^{[citation needed]}

The techniques employed in speech coding are similar to those used in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only data that is relevant to the human auditory system. For example, in voiceband speech coding, only information in the frequency band 400 to 3500 Hz is transmitted, but the reconstructed signal retains adequate intelligibility.

Speech coding differs from other forms of audio coding in that speech is a simpler signal than other audio signals, and statistical information is available about the properties of speech. As a result, some auditory information that is relevant in general audio coding can be unnecessary in the speech coding context. Speech coding stresses the preservation of intelligibility and pleasantness of speech while using a constrained amount of transmitted data.^[3] In addition, most speech applications require low coding delay, as latency interferes with speech interaction.^[4]

Sample companding viewed as a form of speech coding


The A-law and μ-law algorithms used in G.711 PCM digital telephony can be seen as an early precursor of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution.^[7] Logarithmic companding is consistent with human hearing perception in that low-amplitude noise is heard alongside a low-amplitude speech signal but is masked by a high-amplitude one. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a periodic waveform having a single fundamental frequency with occasional added noise bursts, makes these very simple instantaneous compression algorithms acceptable for speech.^{[citation needed]}^{[dubious]}

A wide variety of other algorithms were tried at the time, mostly delta modulation variants, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction (8 bits per sample instead of 12) at very low complexity made an excellent engineering compromise. Their audio performance remains acceptable, and there was no need to replace them in the stationary phone network.^{[citation needed]}

In 2008, the G.711.1 codec, which has a scalable structure, was standardized by ITU-T. Its input sampling rate is 16 kHz.^[8]
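The companding idea can be illustrated with a minimal sketch of the continuous μ-law curve. It assumes μ = 255 and a plain uniform quantizer to a signed 8-bit code; it is not the bit-exact segmented encoding that G.711 specifies, and the function names are illustrative only.

```python
import math

MU = 255  # assumption: the mu value commonly used with 8-bit mu-law


def mu_law_compress(x: float, mu: int = MU) -> float:
    """Map a sample in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: int = MU) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


def encode_8bit(x: float) -> int:
    """Compand, then quantize uniformly to a signed code in [-127, 127]."""
    clipped = max(-1.0, min(1.0, x))
    return round(mu_law_compress(clipped) * 127)


def decode_8bit(code: int) -> float:
    """Undo the uniform quantization step, then expand."""
    return mu_law_expand(code / 127)
```

Because the quantization steps are uniform in the companded domain, the error after expansion scales roughly with the sample's amplitude, so quiet and loud passages see a similar relative error. That is the property the paragraph above attributes to logarithmic companding.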

Modern speech compression


Much of the later work in speech compression was motivated by military research into digital communications for secure military radios, where very low data rates were used to achieve effective operation in a hostile radio environment. At the same time, far more processing power was available, in the form of VLSI circuits, than was available for earlier compression techniques. As a result, modern speech compression algorithms could use far more complex techniques than were available in the 1960s to achieve far higher compression ratios.

The most widely used speech coding algorithms are based on linear predictive coding (LPC).^[9] In particular, the most common speech coding scheme is the LPC-based code-excited linear prediction (CELP) coding, which is used, for example, in the GSM standard. In CELP, the modeling is divided into two stages: a linear predictive stage that models the spectral envelope, and a codebook-based model of the residual of the linear predictive model. The linear prediction coefficients (LPC) are computed and quantized, usually as line spectral pairs (LSPs). In addition to the actual speech coding of the signal, it is often necessary to use channel coding for transmission, to avoid losses due to transmission errors. To get the best overall coding results, speech coding and channel coding methods are chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding.

The modified discrete cosine transform (MDCT) underlies the LD-MDCT technique used by the AAC-LD format introduced in 1999.^[10] MDCT has since been widely adopted in voice-over-IP (VoIP) applications, such as the G.729.1 wideband audio codec introduced in 2006,^[11] Apple's FaceTime (using AAC-LD) introduced in 2010,^[12] and the CELT codec introduced in 2011.^[13]

Opus is a free software audio coder. It combines the speech-oriented LPC-based SILK algorithm and the lower-latency MDCT-based CELT algorithm, switching between or combining them as needed for maximal efficiency.^[14]^[15] It is widely used for VoIP calls in WhatsApp.^[16]^[17]^[18] The PlayStation 4 video game console also uses Opus for its PlayStation Network system party chat.^[19]

A number of codecs with even lower bit rates have been demonstrated. Codec 2, which operates at bit rates as low as 450 bit/s, sees use in amateur radio.^[20] NATO currently uses MELPe, offering intelligible speech at 600 bit/s and below.^[21] Neural vocoder approaches have also emerged: Lyra by Google gives an "almost eerie" quality at 3 kbit/s.^[22] Microsoft's Satin also uses machine learning, but uses a higher tunable bitrate and is wideband.^[23]
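The linear predictive stage can be sketched as follows. This is a minimal illustration that assumes the autocorrelation method with the Levinson-Durbin recursion, one common way to obtain the coefficients; the text does not say which method any particular codec uses. The function names are hypothetical, and quantization, LSP conversion and the codebook search for the residual are all omitted.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a, where
    x_hat[n] = sum(a[k] * x[n - 1 - k]). Returns (a, prediction_error)."""
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        # Reflection coefficient for predictor order i + 1.
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        prev = a[:]
        a[i] = k
        for j in range(i):
            a[j] = prev[j] - k * prev[i - 1 - j]
        err *= 1.0 - k * k
    return a, err


def residual(frame, a):
    """Prediction residual: what a CELP-style coder would go on to encode."""
    out = []
    for n, x in enumerate(frame):
        pred = sum(a[k] * frame[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        out.append(x - pred)
    return out


# Toy frame: a decaying exponential, which a first-order predictor fits well.
frame = [0.9 ** n for n in range(200)]
coeffs, err = levinson_durbin(autocorrelation(frame, 2), 2)
res = residual(frame, coeffs)
```

For this toy frame the first coefficient comes out close to 0.9 and the second close to zero, and the residual carries far less energy than the frame itself. That energy reduction is what makes the two-stage split worthwhile.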
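As a rough illustration of the transform itself, the sketch below implements a direct (slow) MDCT and inverse MDCT with a sine window and 50% overlap-add. The sine window, the block size, and the helper names are assumptions made for the example; real codecs use fast algorithms and their own window shapes.

```python
import math


def sine_window(length):
    """Sine window; satisfies w[t]**2 + w[t + N]**2 == 1 for length 2N."""
    return [math.sin(math.pi * (t + 0.5) / length) for t in range(length)]


def mdct(block):
    """Transform 2N windowed samples into N coefficients."""
    n = len(block) // 2
    return [sum(x * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t, x in enumerate(block))
            for k in range(n)]


def imdct(coeffs):
    """Transform N coefficients back into 2N time-aliased samples."""
    n = len(coeffs)
    return [(2.0 / n) * sum(c * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                            for k, c in enumerate(coeffs))
            for t in range(2 * n)]


def analyze_synthesize(signal, n):
    """Round-trip a signal whose length is a multiple of n through
    windowed MDCT blocks that overlap by n samples."""
    w = sine_window(2 * n)
    padded = [0.0] * n + list(signal) + [0.0] * n
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n + 1, n):
        block = [padded[start + t] * w[t] for t in range(2 * n)]
        y = imdct(mdct(block))
        for t in range(2 * n):
            out[start + t] += y[t] * w[t]  # overlap-add cancels the aliasing
    return out[n:n + len(signal)]
```

Each block of 2N samples yields only N coefficients, so the overlapped transform is critically sampled. The aliasing that each block introduces on its own cancels when neighbouring blocks are added together.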
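To put these bit rates in perspective, the arithmetic below compares them with companded PCM. It assumes the usual 8 kHz narrowband sampling rate for G.711, which this article does not state explicitly; the helper names are illustrative.

```python
G711_BITS_PER_SAMPLE = 8    # from the companding section above
G711_SAMPLE_RATE_HZ = 8000  # assumption: standard narrowband telephony rate


def bitrate_bps(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate of uncompressed fixed-width samples."""
    return sample_rate_hz * bits_per_sample


def compression_ratio(reference_bps: float, codec_bps: float) -> float:
    """How many times smaller the codec's stream is than the reference."""
    return reference_bps / codec_bps


reference = bitrate_bps(G711_SAMPLE_RATE_HZ, G711_BITS_PER_SAMPLE)
ratios = {
    "Codec 2 (450 bit/s)": compression_ratio(reference, 450),
    "MELPe (600 bit/s)": compression_ratio(reference, 600),
    "Lyra (3 kbit/s)": compression_ratio(reference, 3000),
}
```

Under that assumption the reference stream is 64 kbit/s, so a 450 bit/s codec uses well under one hundredth of the bandwidth.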

Sub-fields


Wideband audio coding

Linear predictive coding (LPC)
- AMR-WB for WCDMA networks
- VMR-WB for CDMA2000 networks
- Speex, IP-MR, SILK (part of Opus), and USAC/xHE-AAC for VoIP and videoconferencing
Modified discrete cosine transform (MDCT)
- AAC-LD, G.722.1, G.729.1, CELT and Opus for VoIP and videoconferencing
Adaptive differential pulse-code modulation (ADPCM)
- G.722 for VoIP
Neural speech coding
- Lyra (Google): V1 uses neural network reconstruction of a log-mel spectrogram; V2 is an end-to-end autoencoder.
- Satin (Microsoft)
- LPCNet (Mozilla, Xiph): neural network reconstruction of LPC features^[24]

Narrowband audio coding

LPC
- FNBDT for military applications
- SMV for CDMA networks
- Full Rate, Half Rate, EFR and AMR for GSM networks
- G.723.1, G.728, G.729, G.729.1 and iLBC for VoIP or videoconferencing
ADPCM
- G.726 for VoIP
Multi-Band Excitation (MBE)
- AMBE+ for digital mobile radio and satellite phone
- Codec 2

References


[1] Arjona Ramírez, M.; Minami, M. (2003). "Low bit rate speech coding". Wiley Encyclopedia of Telecommunications, J. G. Proakis, Ed. 3. New York: Wiley: 1299–1308.
[2] M. Arjona Ramírez and M. Minami, "Technology and standards for low-bit-rate vocoding methods," in The Handbook of Computer Networks, H. Bidgoli, Ed., New York: Wiley, 2011, vol. 2, pp. 447–467.
[3] P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis, W. Bastiaan Kleijn and K. K. Paliwal, Ed., Amsterdam: Elsevier Science, 1995, pp. 467–494.
[4] J. H. Chen, R. V. Cox, Y.-C. Lin, N. S. Jayant, and M. J. Melchner, "A low-delay CELP coder for the CCITT 16 kb/s speech coding standard". IEEE J. Select. Areas Commun. 10(5): 830–849, June 1992.
[5] "Soo Hyun Bae, ECE 8873 Data Compression & Modeling, Georgia Institute of Technology, 2004". Archived from the original on 7 September 2006.
[6] Zeghidour, Neil; Luebs, Alejandro; Omran, Ahmed; Skoglund, Jan; Tagliasacchi, Marco (2022). "SoundStream: An End-to-End Neural Audio Codec". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 30: 495–507. arXiv:2107.03312. doi:10.1109/TASLP.2021.3129994. S2CID 236149944.
[7] Jayant, N. S.; Noll, P. (1984). Digital coding of waveforms. Englewood Cliffs: Prentice-Hall.
[8] G.711.1: Wideband embedded extension for G.711 pulse code modulation, ITU-T, 2012, retrieved 2022-12-24.
[9] Gupta, Shipra (May 2016). "Application of MFCC in Text Independent Speaker Recognition" (PDF). International Journal of Advanced Research in Computer Science and Software Engineering. 6 (5): 805–810 (806). ISSN 2277-128X. S2CID 212485331. Archived from the original (PDF) on 2019-10-18. Retrieved 18 October 2019.
[10] Schnell, Markus; Schmidt, Markus; Jander, Manuel; Albert, Tobias; Geiger, Ralf; Ruoppila, Vesa; Ekstrand, Per; Bernhard, Grill (October 2008). MPEG-4 Enhanced Low Delay AAC - A New Standard for High Quality Communication (PDF). 125th AES Convention. Fraunhofer IIS. Audio Engineering Society. Retrieved 20 October 2019.
[11] Nagireddi, Sivannarayana (2008). VoIP Voice and Fax Signal Processing. John Wiley & Sons. p. 69. ISBN 9780470377864.
[12] Daniel Eran Dilger (June 8, 2010). "Inside iPhone 4: FaceTime video calling". AppleInsider. Retrieved June 9, 2010.
[13] Presentation of the CELT codec. Archived 2011-08-07 at the Wayback Machine. By Timothy B. Terriberry (65 minutes of video; see also presentation slides in PDF).
[14] "Opus Codec". Opus (Home page). Xiph.org Foundation. Retrieved July 31, 2012.
[15] Valin, Jean-Marc; Maxwell, Gregory; Terriberry, Timothy B.; Vos, Koen (October 2013). High-Quality, Low-Delay Music Coding in the Opus Codec. 135th AES Convention. Audio Engineering Society. arXiv:1602.04845.
[16] Leyden, John (27 October 2015). "WhatsApp laid bare: Info-sucking app's innards probed". The Register. Retrieved 19 October 2019.
[17] Hazra, Sudip; Mateti, Prabhaker (September 13–16, 2017). "Challenges in Android Forensics". In Thampi, Sabu M.; Pérez, Gregorio Martínez; Westphall, Carlos Becker; Hu, Jiankun; Fan, Chun I.; Mármol, Félix Gómez (eds.). Security in Computing and Communications: 5th International Symposium, SSCC 2017. Springer. pp. 286–299 (290). doi:10.1007/978-981-10-6898-0_24. ISBN 9789811068980.
[18] Srivastava, Saurabh Ranjan; Dube, Sachin; Shrivastaya, Gulshan; Sharma, Kavita (2019). "Smartphone Triggered Security Challenges: Issues, Case Studies and Prevention". In Le, Dac-Nhuong; Kumar, Raghvendra; Mishra, Brojo Kishore; Chatterjee, Jyotir Moy; Khari, Manju (eds.). Cyber Security in Parallel and Distributed Computing: Concepts, Techniques, Applications and Case Studies. John Wiley & Sons. pp. 187–206 (200). doi:10.1002/9781119488330.ch12. ISBN 9781119488057. S2CID 214034702.
[19] "Open Source Software used in PlayStation4". Sony Interactive Entertainment Inc. Retrieved 2017-12-11.^{[failed verification]}
[20] "GitHub - Codec2". GitHub. November 2019.
[21] Alan McCree, "A scalable phonetic vocoder framework using joint predictive vector quantization of MELP parameters," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2006, pp. I 705–708, Toulouse, France.
[22] Buckley, Ian (2021-04-08). "Google Makes Its Lyra Low Bitrate Speech Codec Public". MakeUseOf. Retrieved 2022-07-21.
[23] Levent-Levi, Tsahi (2021-04-19). "Lyra, Satin and the future of voice codecs in WebRTC". BlogGeek.me. Retrieved 2022-07-21.
[24] "LPCNet: Efficient neural speech synthesis". Xiph.Org Foundation. 8 August 2023.
