Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Where a convention analogous to "A, B or at least one of C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
The embodiment of the disclosure provides an audio noise reduction method, which includes obtaining first audio data including an audio frame sequence, extracting a first amplitude spectrum of each audio frame, sequentially processing the first amplitude spectrum of each audio frame to obtain a noise-reduced second amplitude spectrum, and constructing the noise-reduced second audio data based on the second amplitude spectrum of each audio frame, wherein the processing of the first amplitude spectrum of each audio frame includes inputting the first amplitude spectrum of the audio frame and the second amplitude spectrum of at least one audio frame before the audio frame into a noise reduction model together to obtain the second amplitude spectrum of the audio frame.
Fig. 1 schematically illustrates an application scenario of an audio noise reduction method according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of an application scenario in which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, but does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the method of the embodiment of the present disclosure processes to-be-processed first audio data 101 through a noise reduction model 110 to obtain noise-reduced second audio data 111. The noise reduction model 110 may be, for example, a neural network comprising an input layer, hidden layers, and an output layer. An activation function is arranged between adjacent layers, so that the neural network adapts well to scenarios governed by complex rules.
Fig. 2 schematically shows a flow chart of an audio noise reduction method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S210 to S240.
In operation S210, first audio data including a sequence of audio frames is acquired.
In operation S220, a first magnitude spectrum of each audio frame is extracted. Each magnitude spectrum used by the method according to embodiments of the present disclosure may be a log magnitude spectrum.
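As an illustrative sketch only (the disclosure does not specify these details), the first magnitude spectrum of a frame may be extracted by windowing the frame, applying an FFT, and taking the logarithm of the magnitude; the 512-sample frame length, Hann window, and epsilon below are assumptions made for the example:

```python
import numpy as np

def log_magnitude_spectrum(frame, eps=1e-8):
    """Compute a log-magnitude spectrum of one audio frame.

    Window -> real FFT -> magnitude -> log. The window and frame
    length are illustrative choices, not mandated by the method.
    """
    windowed = frame * np.hanning(len(frame))
    spectrum = np.fft.rfft(windowed)
    return np.log(np.abs(spectrum) + eps)  # eps avoids log(0)

frame = np.random.randn(512)
spec = log_magnitude_spectrum(frame)
# a 512-sample frame yields 512/2 + 1 = 257 frequency bins
assert spec.shape == (257,)
```

The real-input FFT (`rfft`) is used because audio frames are real-valued, which halves the spectrum size relative to a full complex FFT.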
In operation S230, the first amplitude spectrum of each audio frame is sequentially processed to obtain a second amplitude spectrum after noise reduction, where the processing of the first amplitude spectrum of each audio frame includes inputting the first amplitude spectrum of the audio frame and the second amplitude spectrum of at least one audio frame before the audio frame into a noise reduction model together to obtain the second amplitude spectrum of the audio frame.
Operation S230 of the embodiment of the present disclosure is explained below with reference to fig. 3.
Fig. 3 schematically shows a flowchart for processing a first magnitude spectrum of each audio frame to obtain a noise-reduced second magnitude spectrum according to an embodiment of the present disclosure.
As shown in fig. 3, the method includes operations S310 to S330.
In operation S310, a second magnitude spectrum of a predetermined number of consecutive audio frames preceding and adjacent to the audio frame is determined.
In operation S320, the first magnitude spectrum of the audio frame and the second magnitude spectra of the predetermined number of audio frames are combined to obtain input data.
In operation S330, the input data is input into a noise reduction model, and a second magnitude spectrum of the audio frame is obtained.
For example, for the 10th audio frame, the 5 audio frames before it, i.e., the 5th to 9th audio frames, may be determined, along with their second amplitude spectra, where the second amplitude spectrum is the amplitude spectrum of an audio frame after the noise reduction processing. The second magnitude spectra of the 5th to 9th audio frames may be combined with the first magnitude spectrum of the current (10th) audio frame. Specifically, when the noise reduction model takes its input in vector form, the feature vectors corresponding to the magnitude spectra of these audio frames may be concatenated into one vector. The combined data is then input into the noise reduction model to obtain the second amplitude spectrum of the 10th audio frame, i.e., the noise reduction result for the 10th audio frame. By analogy, when the 11th audio frame is processed, the second magnitude spectra of the 6th to 10th audio frames and the first magnitude spectrum of the 11th audio frame may be input into the noise reduction model together to obtain the second magnitude spectrum of the 11th audio frame, and so on.
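The sliding-window procedure described above may be sketched as follows; the context length of 5 frames, the zero-padding at the start of the sequence, and the toy stand-in model are illustrative assumptions, not requirements of the method:

```python
import numpy as np

def denoise_sequence(first_spectra, model, context=5):
    """Sequentially denoise a list of first (noisy) magnitude spectra.

    For each frame, the `context` most recent *denoised* spectra are
    concatenated with the current noisy spectrum and fed to `model`.
    Frames near the start are left-padded with zero spectra (a
    simplifying assumption; the disclosure does not specify this).
    """
    n_bins = first_spectra[0].shape[0]
    second_spectra = []
    for t, first in enumerate(first_spectra):
        prev = second_spectra[max(0, t - context):t]
        pad = [np.zeros(n_bins)] * (context - len(prev))
        x = np.concatenate(pad + prev + [first])  # (context+1) * n_bins values
        second_spectra.append(model(x))           # denoised spectrum, fed back
    return second_spectra

# toy stand-in "model": simply echoes the current-frame portion of the input
model = lambda x: x[-4:]
out = denoise_sequence([np.ones(4)] * 10, model, context=5)
assert len(out) == 10 and out[0].shape == (4,)
```

The key point the sketch captures is the feedback: each output is appended to `second_spectra` and reused as context for later frames, exactly as in the 10th/11th-frame example above.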
Referring back to fig. 2, in operation S240, second audio data after noise reduction processing is constructed based on the second magnitude spectrum of each audio frame. The noise-reduced second audio data may be reconstructed by combining the second amplitude spectrum of each audio frame with the original phase of that audio frame.
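A minimal sketch of this per-frame reconstruction, assuming log-magnitude spectra and an FFT-based analysis (overlap-add across frames is omitted for brevity and would be needed in a full implementation):

```python
import numpy as np

def reconstruct_frame(second_log_magnitude, original_phase):
    """Rebuild one time-domain frame from the denoised log-magnitude
    spectrum and the phase of the original (noisy) frame."""
    magnitude = np.exp(second_log_magnitude)            # undo the log
    spectrum = magnitude * np.exp(1j * original_phase)  # reattach original phase
    return np.fft.irfft(spectrum)                       # back to time domain

# sanity check: an "undenoised" spectrum round-trips to the original frame
frame = np.random.randn(512)
spec = np.fft.rfft(frame)
rebuilt = reconstruct_frame(np.log(np.abs(spec) + 1e-12), np.angle(spec))
assert np.allclose(rebuilt, frame, atol=1e-4)
```

Reusing the noisy phase is a common design choice in magnitude-domain noise reduction, since the ear is far less sensitive to phase distortion than to magnitude distortion.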
The method inputs the noise reduction results of a plurality of previous audio frames and the amplitude spectrum of the current audio frame into the noise reduction model together, and can improve the noise reduction effect of the current frame.
According to an embodiment of the present disclosure, the noise reduction model is a neural network that uses a linear rectification function (ReLU) as an activation function. In the prior art, a nonlinear activation function such as sigmoid or tanh is generally used for processing audio, because nonlinear activation functions are generally considered to perform better on more complex features. The inventor has found that, in audio noise reduction, using the linear rectification function simplifies the training process of the model without degrading the noise reduction effect.
According to an embodiment of the present disclosure, the neural network may include, for example, one fully-connected input layer, two fully-connected hidden layers, and one fully-connected output layer. The fully-connected input layer may have 1024 nodes and an input dimension of M × N, where N denotes the dimension of a single-frame amplitude spectrum and M denotes the total number of frames of amplitude spectra contained in the input vector. Each fully-connected hidden layer may have 1024 nodes, and the fully-connected output layer may have N nodes. The fully-connected input layer and the fully-connected hidden layers use the ReLU activation function, and the fully-connected output layer has no activation function.
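The architecture described above may be sketched with a plain NumPy forward pass; the weight initialization and the concrete values N = 257 and M = 6 are illustrative assumptions, not specified by the disclosure:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def make_layer(n_in, n_out, rng):
    # He-style initialization (an assumption; any scheme would do here)
    w = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
    return w, np.zeros(n_out)

def forward(x, layers):
    """ReLU on the input and hidden layers; the output layer is linear."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:   # no activation on the output layer
            x = relu(x)
    return x

N, M = 257, 6                     # bins per frame, frames per input (assumed)
rng = np.random.default_rng(0)
layers = [make_layer(M * N, 1024, rng),   # fully-connected input layer
          make_layer(1024, 1024, rng),    # fully-connected hidden layer 1
          make_layer(1024, 1024, rng),    # fully-connected hidden layer 2
          make_layer(1024, N, rng)]       # fully-connected output layer, N nodes
y = forward(rng.standard_normal(M * N), layers)
assert y.shape == (N,)            # one denoised single-frame spectrum
```

The linear output layer matches the text: the network regresses a log-magnitude spectrum directly, so no squashing activation is applied to the final result.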
Before the noise reduction model is used, it needs to be trained on training data. Traditionally, training data is obtained by superimposing a piece of noise data of the same duration onto a piece of original audio data; the superimposed audio data serves as the training data, and the original audio data serves as the noise reduction target. The inventor has found that a noise reduction model trained in this way performs poorly in practical scenarios. The embodiment of the present disclosure provides a method for training the noise reduction model that is expected to improve the noise reduction effect in practical scenarios.
FIG. 4 schematically shows a flow diagram for training a noise reduction model according to an embodiment of the disclosure.
As shown in fig. 4, the method includes operations S410 to S430.
At least one time period is randomly determined from the clean audio data in operation S410. For example, from a 5-minute piece of audio data, the time periods 00:38-00:46, 01:59-02:17, 02:26-02:31, and 03:30-04:52 may be determined. It should be understood that the number, length, and location of the time periods may all be random, or at least one of the number, length, and location may be predetermined or determined according to a preset rule while the remaining parameters are random.
In operation S420, the noise data is added to each of the time segments according to the randomly determined signal-to-noise ratio, so as to obtain the audio data containing noise. The signal-to-noise ratio may be determined randomly within a certain range, for example, or from a number of alternatives, for example from-5 dB, 0dB, 5dB, 10dB, 15dB, 20 dB. The signal-to-noise ratio used at different time periods may be different.
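Operations S410 and S420 may be sketched as follows; the segment count, segment lengths, and power-based SNR scaling are illustrative assumptions, with each segment drawing its own SNR from the alternatives listed above:

```python
import numpy as np

def add_noise_segments(clean, noise, rng, n_segments=3,
                       snrs_db=(-5, 0, 5, 10, 15, 20)):
    """Overlay noise on randomly chosen segments of clean audio.

    Each segment gets a random position, length, and SNR (drawn from
    `snrs_db`); the rest of the clean signal is left untouched.
    """
    noisy = clean.copy()
    for _ in range(n_segments):
        length = rng.integers(1, len(clean) // 4)       # random segment length
        start = rng.integers(0, len(clean) - length)    # random position
        seg = clean[start:start + length]
        nse = noise[rng.integers(0, len(noise) - length):][:length]
        snr_db = rng.choice(snrs_db)
        # scale noise so that 10*log10(P_signal / P_noise) equals snr_db
        p_sig = np.mean(seg ** 2)
        p_nse = np.mean(nse ** 2) + 1e-12
        scale = np.sqrt(p_sig / (p_nse * 10 ** (snr_db / 10)))
        noisy[start:start + length] = seg + scale * nse
    return noisy

rng = np.random.default_rng(1)
clean = np.sin(np.linspace(0, 100, 16000))      # stand-in "clean" audio
noise = rng.standard_normal(16000)              # stand-in noise recording
noisy = add_noise_segments(clean, noise, rng)
assert noisy.shape == clean.shape and not np.allclose(noisy, clean)
```

Because segments may differ in position, length, and SNR, the resulting training pairs expose the model to noise that switches on and off abruptly, which is the point of the training scheme in S410-S430.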
In operation S430, the noise reduction model is trained with the noisy audio data.
According to the method, time periods are randomly selected in the clean audio and noise is added to them, so that the trained noise reduction model gains a certain adaptability to transient, non-stationary noise, and the noise reduction performance of the model in a real noise environment can be improved.
Based on the same inventive concept, the present disclosure also provides an audio noise reduction device, and the audio noise reduction device according to the embodiment of the present disclosure is described below with reference to fig. 5 to 7.
Fig. 5 schematically shows a block diagram of an audio noise reduction apparatus 500 according to an embodiment of the present disclosure.
As shown in fig. 5, the audio noise reduction apparatus 500 includes an obtaining module 510, an extracting module 520, a processing module 530, and a constructing module 540. The audio noise reduction apparatus 500 may perform the various methods described above.
The obtaining module 510, for example performing operation S210 described above with reference to fig. 2, is configured to obtain first audio data comprising a sequence of audio frames.
The extracting module 520, for example performing operation S220 described above with reference to fig. 2, is configured to extract a first magnitude spectrum of each audio frame.
The processing module 530, for example performing operation S230 described above with reference to fig. 2, is configured to sequentially process the first magnitude spectrum of each audio frame to obtain a noise-reduced second magnitude spectrum.
The constructing module 540, for example performing operation S240 described above with reference to fig. 2, is configured to construct noise-reduced second audio data based on the second magnitude spectrum of each audio frame.
Wherein the processing the first amplitude spectrum of each audio frame includes inputting the first amplitude spectrum of the audio frame and the second amplitude spectrum of at least one audio frame before the audio frame into the noise reduction model together to obtain the second amplitude spectrum of the audio frame.
Fig. 6 schematically illustrates a block diagram of the processing module 530 according to an embodiment of the disclosure.
As shown in fig. 6, the processing module 530 includes a determining sub-module 610, a combining sub-module 620, and a processing sub-module 630.
The determining sub-module 610, for example performing operation S310 described above with reference to fig. 3, is configured to determine a second magnitude spectrum of a predetermined number of consecutive audio frames preceding and adjacent to the audio frame.
The combining sub-module 620, for example performing operation S320 described above with reference to fig. 3, is configured to combine the first magnitude spectrum of the audio frame and the second magnitude spectra of the predetermined number of audio frames to obtain input data.
The processing sub-module 630, for example performing operation S330 described above with reference to fig. 3, is configured to input the input data into a noise reduction model, resulting in a second magnitude spectrum of the audio frame.
Fig. 7 schematically shows a block diagram of an audio noise reduction apparatus 700 according to another embodiment of the present disclosure.
As shown in fig. 7, the audio noise reduction apparatus 700 further includes, on the basis of the foregoing embodiments, a determining module 710, a preparing module 720, and a training module 730.
The determining module 710, for example performing operation S410 described above with reference to fig. 4, is configured to randomly determine at least one time period from the clean audio data.
The preparing module 720, for example performing operation S420 described above with reference to fig. 4, is configured to add the noise data to each of the time periods according to a randomly determined signal-to-noise ratio, so as to obtain noise-containing audio data.
The training module 730, for example performing operation S430 described above with reference to fig. 4, is configured to train the noise reduction model using the noise-containing audio data.
According to an embodiment of the present disclosure, the noise reduction model is a neural network that uses a linear rectification function as an activation function.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any of the obtaining module 510, the extracting module 520, the processing module 530, the constructing module 540, the determining sub-module 610, the combining sub-module 620, the processing sub-module 630, the determining module 710, the preparing module 720, and the training module 730 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 510, the extracting module 520, the processing module 530, the constructing module 540, the determining sub-module 610, the combining sub-module 620, the processing sub-module 630, the determining module 710, the preparing module 720, and the training module 730 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or by any one of, or a suitable combination of, software, hardware, and firmware implementations. Alternatively, at least one of the obtaining module 510, the extracting module 520, the processing module 530, the constructing module 540, the determining sub-module 610, the combining sub-module 620, the processing sub-module 630, the determining module 710, the preparing module 720, and the training module 730 may be at least partially implemented as a computer program module which, when executed, may perform a corresponding function.
Fig. 8 schematically illustrates a block diagram of a computer system 800 suitable for implementing the above-described method according to an embodiment of the present disclosure. The computer system illustrated in fig. 8 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the disclosure.
As shown in fig. 8, a computer system 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. The processor 801 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), among others. The processor 801 may also include onboard memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.
In the RAM 803, various programs and data necessary for the operation of the system 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or the RAM 803. Note that the programs may also be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the system 800 may also include an input/output (I/O) interface 805, which is also connected to the bus 804. The system 800 may also include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read therefrom can be installed into the storage section 808 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program, when executed by the processor 801, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 802 and/or the RAM 803 described above and/or one or more memories other than the ROM 802 and the RAM 803.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations of the features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations are not expressly recited in the present disclosure. In particular, various combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.