Movatterモバイル変換

[0]ホーム

Jump to content

Modified discrete cosine transform

Edit links

From Wikipedia, the free encyclopedia

Mathematical transform using in signal processing

"MDCT" redirects here. For the form ofmedical imaging, seeMultidetector computed tomography.

Themodified discrete cosine transform (MDCT) is a transform based on the type-IVdiscrete cosine transform (DCT-IV), with the additional property of beinglapped: it is designed to be performed on consecutive blocks of a largerdataset, where subsequent blocks are overlapped so that the last half of one block coincides with the first half of the next block. This overlapping, in addition to the energy-compaction qualities of the DCT, makes the MDCT especially attractive for signal compression applications, since it helps to avoidartifacts stemming from the block boundaries. As a result of these advantages, the MDCT is the most widely usedlossy compression technique inaudio data compression. It is employed in most modernaudio coding standards, includingMP3,Dolby Digital (AC-3),Vorbis (Ogg),Windows Media Audio (WMA),ATRAC,Cook,Advanced Audio Coding (AAC),^[1]High-Definition Coding (HDC),^[2]LDAC,Dolby AC-4,^[3] andMPEG-H 3D Audio,^[4] as well asspeech coding standards such asAAC-LD (LD-MDCT),^[5]G.722.1,^[6]G.729.1,^[7]CELT,^[8] andOpus.^[9]^[10]

Thediscrete cosine transform (DCT) was first proposed byNasir Ahmed in 1972,^[11] and demonstrated by Ahmed with T. Natarajan andK. R. Rao in 1974.^[12] The MDCT was later proposed by John P. Princen, A.W. Johnson and Alan B. Bradley at theUniversity of Surrey in 1987,^[13] following earlier work by Princen and Bradley (1986)^[14] to develop the MDCT's underlying principle oftime-domain aliasing cancellation (TDAC), described below. (There also exists an analogous transform, the MDST, based on thediscrete sine transform, as well as other, rarely used, forms of the MDCT based on different types of DCT or DCT/DST combinations.)

In MP3, the MDCT is not applied to the audio signal directly, but rather to the output of a 32-bandpolyphase quadrature filter (PQF) bank. The output of this MDCT is postprocessed by an alias reduction formula to reduce the typical aliasing of the PQF filter bank. Such a combination of a filter bank with an MDCT is called ahybrid filter bank or asubband MDCT. AAC, on the other hand, normally uses a pure MDCT; only the (rarely used)MPEG-4 AAC-SSR variant (bySony) uses a four-band PQF bank followed by an MDCT. Similar to MP3,ATRAC uses stackedquadrature mirror filters (QMF) followed by an MDCT.

Definition

[edit]

As a lapped transform, the MDCT is somewhat unusual compared to other Fourier-related transforms in that it has half as many outputs as inputs (instead of the same number). In particular, it is alinear function $F\colon \mathbf {R} ^{2N}\to \mathbf {R} ^{N}$ (whereR denotes the set ofreal numbers). The 2N real numbersx₀, ...,x_2N−1 are transformed into theN real numbersX₀, ...,X_N−1 according to the formula

X_{k}=\sum _{n=0}^{2N-1}x_{n}\cos \left[{\frac {\pi }{N}}\left(n+{\frac {1}{2}}+{\frac {N}{2}}\right)\left(k+{\frac {1}{2}}\right)\right].

The normalization coefficient in front of this transform, here unity, is an arbitrary convention and differs between treatments. Only the product of the normalizations of the MDCT and the IMDCT, below, is constrained.

Inverse transform

[edit]

The inverse MDCT is known as theIMDCT. Because there are different numbers of inputs and outputs, at first glance it might seem that the MDCT should not be invertible. However, perfect invertibility is achieved byadding the overlapped IMDCTs of subsequent overlapping blocks, causing the errors tocancel and the original data to be retrieved; this technique is known astime-domain aliasing cancellation (TDAC).

The IMDCT transformsN real numbersX₀, ...,X_N−1 into 2N real numbersy₀, ...,y_2N−1 according to the formula

y_{n}={\frac {1}{N}}\sum _{k=0}^{N-1}X_{k}\cos \left[{\frac {\pi }{N}}\left(n+{\frac {1}{2}}+{\frac {N}{2}}\right)\left(k+{\frac {1}{2}}\right)\right].

Like for theDCT-IV, an orthogonal transform, the inverse has the same form as the forward transform.

In the case of a windowed MDCT with the usual window normalization (see below), the normalization coefficient in front of the IMDCT should be multiplied by 2 (i.e., becoming 2/N).

Computation

[edit]

Although the direct application of the MDCT formula would require O(N²) operations, it is possible to compute the same thing with only O(N logN) complexity by recursively factorizing the computation, as in thefast Fourier transform (FFT). One can also compute MDCTs via other transforms, typically a DFT (FFT) or a DCT, combined with O(N) pre- and post-processing steps. Also, as described below, any algorithm for the DCT-IV immediately provides a method to compute the MDCT and IMDCT of even size.

Window functions

[edit]

In typical signal-compression applications, the transform properties are further improved by using awindow functionw_n (n = 0, ..., 2N − 1) that is multiplied withx_n in the MDCT and withy_n in the IMDCT formulas above, in order to avoid discontinuities at then = 0 and 2N boundaries by making the function go smoothly to zero at those points. (That is, the window function is applied to the databefore the MDCT orafter the IMDCT.) In principle,x andy could have different window functions, and the window function could also change from one block to the next (especially for the case where data blocks of different sizes are combined), but for simplicity we consider the common case of identical window functions for equal-sized blocks.

The transform remains invertible (that is, TDAC works), for a symmetric windoww_n =w_2N−1−n, as long asw satisfies the Princen–Bradley condition:

w_{n}^{2}+w_{n+N}^{2}=1.

Various window functions are used. A window that produces a form known as a modulated lapped transform (MLT)^[15]^[16] is given by

w_{n}=\sin \left[{\frac {\pi }{2N}}\left(n+{\frac {1}{2}}\right)\right]

and is used for MP3 and MPEG-2 AAC, and

w_{n}=\sin \left({\frac {\pi }{2}}\sin ^{2}\left[{\frac {\pi }{2N}}\left(n+{\frac {1}{2}}\right)\right]\right)

for Vorbis. AC-3 uses aKaiser–Bessel derived (KBD) window, and MPEG-4 AAC can also use a KBD window.

Note that windows applied to the MDCT are different from windows used for some other types of signal analysis, since they must fulfill the Princen–Bradley condition. One of the reasons for this difference is that MDCT windows are applied twice, for both the MDCT (analysis) and the IMDCT (synthesis).

Relationship to DCT-IV and origin of TDAC

[edit]

As can be seen by inspection of the definitions, forevenN the MDCT is essentially equivalent to a DCT-IV, where the input is shifted byN/2 and twoN-blocks of data are transformed at once. By examining this equivalence more carefully, important properties like TDAC can be easily derived.

In order to define the precise relationship to the DCT-IV, one must realize that the DCT-IV corresponds to alternating even/odd boundary conditions: even at its left boundary (aroundn = −1/2), odd at its right boundary (aroundn = N − 1/2), and so on (instead of periodic boundaries as for aDFT). This follows from the identities

\cos \left[{\frac {\pi }{N}}\left(-n-1+{\frac {1}{2}}\right)\left(k+{\frac {1}{2}}\right)\right]=\cos \left[{\frac {\pi }{N}}\left(n+{\frac {1}{2}}\right)\left(k+{\frac {1}{2}}\right)\right]

and

\cos \left[{\frac {\pi }{N}}\left(2N-n-1+{\frac {1}{2}}\right)\left(k+{\frac {1}{2}}\right)\right]=-\cos \left[{\frac {\pi }{N}}\left(n+{\frac {1}{2}}\right)\left(k+{\frac {1}{2}}\right)\right].

Thus, if its inputs are an arrayx of lengthN, we can imagine extending this array to (x, −x_R, −x,x_R, ...) and so on, wherex_R denotesx in reverse order.

Consider an MDCT with 2N inputs andN outputs, where we divide the inputs into four blocks (a,b,c,d) each of sizeN/2. If we shift these to the right byN/2 (from the +N/2 term in the MDCT definition), then (b,c,d) extend past the end of theN DCT-IV inputs, so we must "fold" them back according to the boundary conditions described above.

Thus, the MDCT of 2N inputs (a,b,c,d) isexactly equivalent to a DCT-IV of theN inputs: (−c_R − d,a − b_R), whereR denotes reversal as above.

In this way, any algorithm to compute the DCT-IV can be trivially applied to the MDCT.

Similarly, the IMDCT formula above is precisely 1/2 of the DCT-IV (which is its own inverse), where the output is extended (via the boundary conditions) to a length 2N and shifted back to the left byN/2. The inverse DCT-IV would simply give back the inputs (−c_R − d,a − b_R) from above. When this is extended via the boundary conditions and shifted, one obtains

IMDCT(MDCT(a,b,c,d)) = (a −b_R,b −a_R,c +d_R,d +c_R)/2.

Half of the IMDCT outputs are thus redundant, asb − a_R = −(a − b_R)_R, and likewise for the last two terms. If we group the input into bigger blocksA,B of sizeN, whereA = (a,b) andB = (c,d), we can write this result in a simpler way:

IMDCT(MDCT(A,B)) = (A −A_R,B +B_R)/2.

One can now understand how TDAC works. Suppose that one computes the MDCT of the subsequent, 50% overlapped, 2N block (B,C). The IMDCT will then yield, analogous to the above: (B − B_R,C + C_R)/2. When this is added with the previous IMDCT result in the overlapping half, the reversed terms cancel and one obtains simplyB, recovering the original data.

Origin of TDAC

[edit]

The origin of the term "time-domain aliasing cancellation" is now clear. The use of input data that extend beyond the boundaries of the logical DCT-IV causes the data to bealiased in the same way that frequencies beyond theNyquist frequency arealiased to lower frequencies, except that this aliasing occurs in the time domain instead of the frequency domain: we cannot distinguish the contributions ofa and ofb_R to the MDCT of (a,b,c,d), or equivalently, tothe result of

IMDCT(MDCT(a,b,c,d))= (a −b_R,b −a_R,c +d_R,d +c_R)/2.

The combinationsc −d_R and so on have precisely the right signs for the combinations to cancel when they are added.

ForoddN (which are rarely used in practice),N/2 is not an integer, so the MDCT is not simply a shift permutation of a DCT-IV. In this case, the additional shift by half a sample means that the MDCT/IMDCT becomes equivalent to the DCT-III/II, and the analysis is analogous to the above.

Smoothness and discontinuities

[edit]

We have seen above that the MDCT of 2N inputs (a,b,c,d) is equivalent to a DCT-IV of theN inputs (−c_R −d,a −b_R).The DCT-IV is designed for the case where the function at the right boundary is odd, and therefore the values near the right boundary are close to 0. If the input signal is smooth, this is the case: the rightmost components ofa andb_R are consecutive in the input sequence (a,b,c,d), and therefore their difference is small.Let us look at the middle of the interval: if we rewrite the above expression as (−c_R −d,a −b_R) = (−d,a) − (b,c)_R, the second term, (b,c)_R, gives a smooth transition in the middle.However, in the first term, (−d,a), there is a potential discontinuity where the right end of −d meets the left end ofa.This is the reason for using a window function that reduces the components near the boundaries of the input sequence (a,b,c,d) towards 0.

TDAC for the windowed MDCT

[edit]

Above, the TDAC property was proved for the ordinary MDCT, showing that adding IMDCTs of subsequent blocks in their overlapping half recovers the original data. The derivation of this inverse property for the windowed MDCT is only slightly more complicated.

Consider two overlapping consecutive sets of 2N inputs (A,B) and (B,C), for blocksA,B,C of sizeN.Recall from above that when $(A,B)$ and $(B,C)$ are MDCTed, IMDCTed, and added in their overlapping half, we obtain $(B+B_{R})/2+(B-B_{R})/2=B$ , the original data.

Now we suppose that we multiplyboth the MDCT inputsand the IMDCT outputs by a window function of length 2N. As above, we assume a symmetric window function, which is therefore of the form $(W,W_{R})$ whereW is a length-N vector andR denotes reversal as before. Then the Princen-Bradley condition can be written as $W^{2}+W_{R}^{2}=(1,1,\ldots )$ , with the squares and additions performed elementwise.

Therefore, instead of MDCTing $(A,B)$ , we now MDCT $(WA,W_{R}B)$ (with all multiplications performed elementwise). When this is IMDCTed and multiplied again (elementwise) by the window function, the last-N half becomes:

W_{R}\cdot (W_{R}B+(W_{R}B)_{R})=W_{R}\cdot (W_{R}B+WB_{R})=W_{R}^{2}B+WW_{R}B_{R}

(Note that we no longer have the multiplication by 1/2, because the IMDCT normalization differs by a factor of 2 in the windowed case.)

Similarly, the windowed MDCT and IMDCT of $(B,C)$ yields, in its first-N half:

W\cdot (WB-W_{R}B_{R})=W^{2}B-WW_{R}B_{R}

When we add these two halves together, we obtain:

(W_{R}^{2}B+WW_{R}B_{R})+(W^{2}B-WW_{R}B_{R})=\left(W_{R}^{2}+W^{2}\right)B=B,

recovering the original data.

References

[edit]

^Luo, Fa-Long (2008).Mobile Multimedia Broadcasting Standards: Technology and Practice.Springer Science & Business Media. p. 590.ISBN 9780387782638.
^Jones, Graham A.; Layer, David H.; Osenkowsky, Thomas G. (2013).National Association of Broadcasters Engineering Handbook: NAB Engineering Handbook.Taylor & Francis. pp. 558–9.ISBN 978-1-136-03410-7.
^"Dolby AC-4: Audio Delivery for Next-Generation Entertainment Services"(PDF).Dolby Laboratories. June 2015. Retrieved11 November 2019.
^Bleidt, R. L.; Sen, D.; Niedermeier, A.; Czelhan, B.; Füg, S.; et al. (2017)."Development of the MPEG-H TV Audio System for ATSC 3.0"(PDF).IEEE Transactions on Broadcasting.63 (1):202–236.doi:10.1109/TBC.2017.2661258.S2CID 30821673.
^Schnell, Markus; Schmidt, Markus; Jander, Manuel; Albert, Tobias; Geiger, Ralf; Ruoppila, Vesa; Ekstrand, Per; Bernhard, Grill (October 2008).MPEG-4 Enhanced Low Delay AAC - A New Standard for High Quality Communication(PDF). 125th AES Convention.Fraunhofer IIS.Audio Engineering Society. Retrieved20 October 2019.
^Lutzky, Manfred; Schuller, Gerald; Gayer, Marc; Krämer, Ulrich; Wabnik, Stefan (May 2004).A guideline to audio codec delay(PDF). 116th AES Convention.Fraunhofer IIS.Audio Engineering Society. Retrieved24 October 2019.
^Nagireddi, Sivannarayana (2008).VoIP Voice and Fax Signal Processing.John Wiley & Sons. p. 69.ISBN 9780470377864.
^Presentation of the CELT codec Archived 2011-08-07 at theWayback Machine by Timothy B. Terriberry (65 minutes of video, see alsopresentation slides Archived 2023-11-16 at theWayback Machine in PDF)
^"Opus Codec".Opus (Home page). Xiph.org Foundation. RetrievedJuly 31, 2012.
^Bright, Peter (2012-09-12)."Newly standardized Opus audio codec fills every role from online chat to music".Ars Technica. Retrieved2014-05-28.
^Ahmed, Nasir (January 1991)."How I Came Up With the Discrete Cosine Transform"(PDF).Digital Signal Processing.1 (1):4–5.Bibcode:1991DSP.....1....4A.doi:10.1016/1051-2004(91)90086-Z.
^Ahmed, Nasir; Natarajan, T.; Rao, K. R. (January 1974), "Discrete Cosine Transform",IEEE Transactions on Computers,C-23 (1):90–93,doi:10.1109/T-C.1974.223784,S2CID 149806273
^Princen, John P.; Johnson, A.W.; Bradley, Alan B. (1987). "Subband/Transform coding using filter bank designs based on time domain aliasing cancellation".ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 12. pp. 2161–2164.doi:10.1109/ICASSP.1987.1169405.S2CID 58446992.
^John P. Princen, Alan B. Bradley:Analysis/synthesis filter bank design based on time domain aliasing cancellation, IEEE Trans. Acoust. Speech Signal Processing,ASSP-34 (5), 1153–1161, 1986. Described a precursor to the MDCT using a combination of discrete cosine and sine transforms.
^H. S. Malvar, "Lapped Transforms for Efficient Transform/Subband Coding",IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 38, no. 6, pp. 969–978 (Equation 22), June 1990.
^H. S. Malvar, "Modulated QMF Filter Banks with Perfect Reconstruction",Electronics Letters, vol. 26, no. 13, pp. 906–907 (Equation 13), June 1990.

Bibliography

[edit]

Henrique S. Malvar,Signal Processing with Lapped Transforms (Artech House: Norwood MA, 1992).
A. W. Johnson and A. B. Bradley, "Adaptive transform coding incorporating time domain aliasing cancellation,"Speech Comm.6, 299-308 (1987).
For algorithms, see examples:
- Chi-Min Liu and Wen-Chieh Lee, "A unified fast algorithm for cosine modulated filterbanks in current audio standards^{[permanent dead link]}",J. Audio Engineering47 (12), 1061-1075 (1999).
- V. Britanak and K. R. Rao, "A new fast algorithm for the unified forward and inverse MDCT/MDST computation,"Signal Processing82, 433-459 (2002)
- Vladimir Nikolajevic and Gerhard Fettweis, "Computation of forward and inverse MDCT using Clenshaw's recurrence formula,"IEEE Trans. Sig. Proc.51 (5), 1439-1444 (2003)
- Che-Hong Chen, Bin-Da Liu, and Jar-Ferr Yang, "Recursive architectures for realizing modified discrete cosine transform and its inverse,"IEEE Trans. Circuits Syst. II: Analog Dig. Sig. Proc.50 (1), 38-45 (2003)
- J.S. Wu, H.Z. Shu, L. Senhadji, and L.M. Luo, "Mixed-radix algorithm for the computation of forward and inverse MDCTs,"IEEE Trans. Circuits Syst. I: Reg. Papers56 (4), 784-794 (2009)
- V. Britanak, "A survey of efficient MDCT implementations in MP3 audio coding standard: retrospective and state-of-the-art,"Signal. Process.91 (4), 624-672(2011)

Data compression methods

Lossless
type

Entropy	Adaptive coding Arithmetic Asymmetric numeral systems Golomb Huffman Adaptive Canonical Modified Range Shannon Shannon–Fano Shannon–Fano–Elias Tunstall Unary Universal Exp-Golomb Fibonacci Gamma Levenshtein
Dictionary	Byte-pair encoding Lempel–Ziv 842 LZ4 LZJB LZO LZRW LZSS LZW LZWL Snappy
Other	BWT CTW CM Delta Incremental DMC DPCM Grammar Re-Pair Sequitur LDCT MTF PAQ PPM RLE
Hybrid	LZ77 + Huffman Deflate LZX LZS LZ77 + ANS LZFSE LZ77 + Huffman + ANS Zstandard LZ77 + Huffman + context Brotli LZSS + Huffman LHA/LZH LZ77 + Range LZMA LZHAM RLE + BWT + MTF + Huffman bzip2

Lossy
type

Transform	Discrete cosine transform DCT MDCT DST FFT Wavelet Daubechies DWT SPIHT
Predictive	DPCM ADPCM LPC ACELP CELP LAR LSP WLPC Motion Compensation Estimation Vector Psychoacoustic

Audio

Concepts	Bit rate ABR CBR VBR Companding Convolution Dynamic range Latency Nyquist–Shannon theorem Sampling Silence compression Sound quality Speech coding Sub-band coding
Codec parts	A-law μ-law DPCM ADPCM DM FT FFT LPC ACELP CELP LAR LSP WLPC MDCT Psychoacoustic model

Image

Concepts	Chroma subsampling Coding tree unit Color space Compression artifact Image resolution Macroblock Pixel PSNR Quantization Standard test image Texture compression
Methods	Chain code DCT Deflate Fractal KLT LP RLE Wavelet Daubechies DWT EZW SPIHT

Video

Concepts	Bit rate ABR CBR VBR Display resolution Frame Frame rate Frame types Interlace Video characteristics Video quality
Codec parts	DCT DPCM Deblocking filter Lapped transform Motion Compensation Estimation Vector Wavelet Daubechies DWT