Disclosure of Invention
In order to solve, or at least partially solve, the technical problems described above, the present application provides a voice noise reduction method, apparatus, and storage medium.
In a first aspect, the present application provides a method for noise reduction in speech, the method comprising:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to noise sample sets in various scenes;
and selecting a preset noise reduction model corresponding to the voice scene, and carrying out noise reduction on the voice data.
In one embodiment of the first aspect, before the step of acquiring the voice data, the method further includes:
collecting noise sample sets in each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set;
dividing the classified voice set into a training voice set and a test voice set, constructing a scene recognition model by using the training voice set, and performing test adjustment on the scene recognition model by using the test voice set to obtain a standard scene recognition model.
In one embodiment of the first aspect, after the step of segmenting the classified speech set into a training speech set and a test speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model, the method further includes:
and establishing, according to the collected noise sample sets in the respective scenes, a noise reduction model corresponding to each scene for subsequent invocation.
In one embodiment of the first aspect, the constructing a scene recognition model by using the training speech set includes:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein the feature labels are the category labels under which the corresponding audio features are extracted from the noise sample sets in the respective scenes;
sorting the Gini index set in descending order, and selecting the label corresponding to the smallest Gini index in the Gini index set as the splitting point;
taking the splitting point as the root node of an initial decision tree, generating child nodes from the splitting point, and assigning the training voice set to the child nodes until all of the feature labels have been traversed, thereby generating the initial decision tree;
pruning the initial decision tree to obtain the scene recognition model.
In one embodiment of the first aspect, pruning the initial decision tree to obtain a scene recognition model includes:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
pruning the non-leaf nodes with the surface error gain values smaller than a preset gain threshold value to obtain a scene recognition model.
In one embodiment of the first aspect, the performing test adjustment on the scene recognition model by using the test voice set to obtain a standard scene recognition model includes:
performing scene recognition processing on the test voice set by using the scene recognition model to obtain a recognition result corresponding to the test voice set;
and when the recognition result corresponding to the test voice set is inconsistent with the feature labels corresponding to the test voice set, retraining the scene recognition model with the training voice set until the recognition result corresponding to the test voice set is consistent with those feature labels, thereby obtaining a standard scene recognition model.
In one embodiment of the first aspect, the performing cluster analysis on the noise sample set based on the audio feature to obtain a classified speech set includes:
acquiring preset standard features, and calculating a conditional probability value between the audio features and the standard features;
and sorting each noise sample in the noise sample set according to the size of the conditional probability value, and dividing the sorted noise sample set by taking a preset audio interval as a dividing point to obtain a classified voice set.
In one embodiment of the first aspect, collecting a set of noise samples in each scene, and extracting audio features from each noise sample includes:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transformation are carried out on the noise sample set, so that a short-time frequency spectrum of the noise sample set is obtained;
taking the modulus square of the short-time frequency spectrum to obtain the power spectrum of the noise sample set;
and calculating logarithmic energy from the power spectrum by using a preset Mel-scale triangular filter bank, and performing a discrete cosine transform on the logarithmic energy to obtain the audio features corresponding to each noise sample.
In a second aspect, the present application provides a voice noise reduction apparatus, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data;
the voice scene recognition module is used for inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to noise sample sets in various scenes;
the noise reduction module is used for selecting a preset noise reduction model corresponding to the voice scene and reducing noise of the voice data.
In a third aspect, a voice noise reduction device is provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the steps of the voice noise reduction method according to any embodiment of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the voice noise reduction method according to any one of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
According to the embodiments of the application, the acquired voice data is input into the preset standard scene recognition model, and the voice scene corresponding to the voice data is recognized by the standard scene recognition model. Recognizing the voice scene determines the voice environment in which the voice data was captured, so that the preset noise reduction model corresponding to that scene can be selected to denoise the voice data. Because the noise reduction model is matched to the scene, the noise reduction operation is performed more accurately, thereby improving the accuracy of voice noise reduction.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments will be described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
Fig. 1 is a flow chart of a voice noise reduction method according to an embodiment of the present application. In this embodiment, the voice noise reduction method includes:
S1, acquiring voice data.
In the embodiment of the application, the voice data is noisy audio data that is to be denoised before subsequent audio processing such as speech recognition. Specifically, the voice data may be audio data collected in any voice scene.
Further, before the step of acquiring the voice data, the method includes:
collecting noise sample sets in each scene, and extracting audio features from each noise sample;
performing cluster analysis on the noise sample set based on the audio features to obtain a classified speech set;
dividing the classified voice set into a training voice set and a test voice set, constructing a scene recognition model by using the training voice set, and performing test adjustment on the scene recognition model by using the test voice set to obtain a standard scene recognition model.
In detail, in the embodiment of the present application, the noise sample set includes noise audio data in each voice scene, for example, noise audio data in a park, on a roadside, or in an office. In the embodiment of the present application, the noise sample set may further include feature labels corresponding to each noise sample, where the feature labels are used to label each noise sample and to extract the corresponding audio features. The audio features may include the zero-crossing rate, Mel-frequency cepstral coefficients, spectral centroid, spectral spread, spectral entropy, spectral flux, and the like; in the embodiment of the application, Mel-frequency cepstral coefficients are preferred.
Specifically, the collecting the noise sample set under each scene, extracting the audio feature from each noise sample includes:
pre-emphasis processing, framing processing, windowing processing and fast Fourier transformation are carried out on the noise sample set, so that a short-time frequency spectrum of the noise sample set is obtained;
taking the modulus square of the short-time frequency spectrum to obtain the power spectrum of the noise sample set;
and calculating logarithmic energy from the power spectrum by using a preset Mel-scale triangular filter bank, and performing a discrete cosine transform on the logarithmic energy to obtain the audio features corresponding to each noise sample.
In an alternative embodiment of the present application, the noise sample set is subjected to pre-emphasis processing by a preset high-pass filter, so as to obtain a high-frequency noise sample set, where the pre-emphasis processing can enhance the high-frequency part of the voice signal in the noise sample set.
The embodiment of the application carries out pre-emphasis processing on the noise sample set, and can highlight the formants of the high-frequency part in the noise sample.
In an optional embodiment of the present application, the high-frequency noise sample set is segmented into frames of a preset number of sampling points, so as to obtain a framed data set;
preferably, in the embodiment of the present application, the frame length is 512 or 256 sampling points.
In an optional embodiment of the present application, windowing is performed on each frame in the framed data set according to a preset window function, so as to obtain a windowed signal.
In detail, the preset window function is:
S′(n) = S(n) × W(n)
where S′(n) is the windowed signal, S(n) is the framed signal, W(n) is the window function, n is the index of a sampling point within the frame, and N is the frame length, with 0 ≤ n ≤ N−1.
Preferably, in the embodiment of the present application, the preset window function may be a Hamming window, in which case W(n) = 0.54 − 0.46·cos(2πn/(N−1)).
According to the embodiment of the application, windowing the framed data set increases the continuity between the left and right ends of each frame and reduces spectral leakage.
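By way of illustration only, the pre-emphasis, framing and windowing steps described above may be sketched as follows. This is a minimal sketch assuming the sample is at least one frame long; the pre-emphasis coefficient, frame length and frame shift shown are assumed values rather than values prescribed by this application:

```python
import numpy as np

def preprocess(signal, frame_len=512, frame_shift=256, alpha=0.97):
    """Pre-emphasize, frame and window one noise sample (illustrative sketch)."""
    # Pre-emphasis: high-pass filtering y[n] = x[n] - alpha * x[n-1]
    # enhances the high-frequency part of the signal.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing: segment into overlapping frames of frame_len sampling points.
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len]
        for i in range(num_frames)
    ])

    # Windowing: S'(n) = S(n) * W(n) with a Hamming window W(n), which
    # increases continuity at the frame ends and reduces spectral leakage.
    return frames * np.hamming(frame_len)
```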
Further, the embodiment of the application performs the fast Fourier transform using the following formula:
S(k) = Σ_{n=0}^{N−1} S′(n)·e^(−j2πnk/N), 0 ≤ k ≤ N−1
and takes the modulus square of the short-time spectrum using the following formula:
P(k) = |S(k)|²
where S(k) is the short-time spectrum, P(k) is the power spectrum, S′(n) is the windowed signal, N is the frame length, n is the index of a sampling point within the frame, and k is the frequency bin index on the short-time spectrum.
Since the characteristics of a signal are often difficult to observe from its time-domain waveform, the embodiment of the present application transforms the noise sample set into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different voices.
Further, in an embodiment of the present application, the Mel-scale triangular filter bank computes the logarithmic energy as:
T(m) = ln( Σ_{k=0}^{N−1} P(k)·H_m(k) ), 0 ≤ m < M
where T(m) is the logarithmic energy, P(k) is the power spectrum, H_m(k) is the frequency response of the m-th triangular filter, M is the number of filters, N is the frame length, and k is the frequency bin index on the short-time spectrum.
According to the embodiment of the application, calculating the logarithmic energy of the power spectrum with the triangular filters smooths the short-time spectrum, eliminates harmonics, and highlights the formants in the voice information.
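The remaining feature extraction steps — fast Fourier transform, modulus square, Mel-scale triangular filtering, logarithm and discrete cosine transform — might then look as follows, taking the windowed frames from the previous sketch as input. This is only a sketch; the sample rate, number of filters and number of retained coefficients are assumed values:

```python
import numpy as np
from scipy.fft import dct

def mfcc(frames, sample_rate=16000, n_filters=26, n_coeffs=13):
    """Compute MFCC-style audio features from windowed frames (sketch)."""
    n = frames.shape[1]
    # Short-time spectrum S(k) via FFT, then power spectrum P(k) = |S(k)|^2.
    spectrum = np.fft.rfft(frames, n=n)
    power = np.abs(spectrum) ** 2

    # Mel-scale triangular filter bank H_m(k): equally spaced on the Mel
    # scale, mapped back to FFT bin indices.
    high_mel = 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_points = np.linspace(0.0, high_mel, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n + 1) * hz_points / sample_rate).astype(int)

    fbank = np.zeros((n_filters, power.shape[1]))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # Logarithmic energy T(m), then DCT to decorrelate -> MFCC features.
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_coeffs]
```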
Specifically, the performing cluster analysis on the noise sample set based on the audio feature to obtain a classified speech set includes:
acquiring preset standard features, and calculating correlation coefficients between the audio features and the standard features;
and sorting each noise sample in the noise sample set according to the size of the correlation coefficient, and dividing the sorted noise sample set by taking a preset audio interval as a dividing point to obtain a classified voice set.
Wherein the classified speech set includes speech in different scenes, for example, speech in a road scene, speech in a park scene, and the like.
In detail, the correlation coefficient between the audio feature corresponding to each noise sample in the noise sample set and the standard feature is calculated using the following formula:
q_ij = exp(−‖y_i − y_j‖²) / Σ_{k≠l} exp(−‖y_k − y_l‖²)
where q_ij is the correlation coefficient, y_i is the audio feature corresponding to the noise sample, y_j is the standard feature, exp is the exponential function, and y_k and y_l are the feature vectors over which the normalizing sum in the denominator runs.
Specifically, the cluster analysis of the original noise sample set embeds noise samples distributed in a high-dimensional space into a low-dimensional subspace, keeping the data in the low-dimensional space as consistent as possible with its characteristics in the high-dimensional space. The cluster analysis preserves the global clustering structure of the high-dimensional data in the low-dimensional space, so the clustering relations of the various noise samples can be analyzed visually and noise samples with similar time-frequency characteristics can be grouped into one class for classification and recognition, which improves recognition accuracy.
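As an illustration of the similarity computation used in the cluster analysis, the normalized pairwise similarity q_ij given above, and the subsequent sorting and division, may be sketched as follows. This is a minimal numpy sketch assuming each row of `features` is one extracted audio feature vector; the interval boundaries are assumed values:

```python
import numpy as np

def pairwise_similarity(features):
    """q_ij = exp(-||y_i - y_j||^2) / sum_{k != l} exp(-||y_k - y_l||^2)."""
    diff = features[:, None, :] - features[None, :, :]
    affinity = np.exp(-np.sum(diff ** 2, axis=-1))
    np.fill_diagonal(affinity, 0.0)  # exclude the k == l pairs
    return affinity / affinity.sum()

def classify(features, standard, boundaries=(0.3, 0.6)):
    """Sort samples by similarity to the standard feature and split at
    preset interval boundaries, yielding the classified speech sets."""
    score = np.exp(-np.sum((features - standard) ** 2, axis=1))
    order = np.argsort(score)                      # ascending similarity
    cuts = [int(b * len(order)) for b in boundaries]
    return np.split(order, cuts)                   # sample indices per class
```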
Further, the classified speech set is segmented according to a preset division ratio into a training speech set and a test speech set; the training speech set is used to construct the scene recognition model, and the test speech set is used to test and adjust the scene recognition model to obtain the standard scene recognition model.
Preferably, the division ratio of training speech set to test speech set is 7:3.
The training speech set is used for subsequent model training and provides the samples for model fitting; the test speech set is used to adjust the hyperparameters of the model and to make a preliminary assessment of its capability, in particular its generalization ability, as in the split sketched below.
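A 7:3 division of this kind can be expressed, for example, with scikit-learn; the data below are random placeholders standing in for the classified speech set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples of 13-dimensional audio features
# carrying one of 3 scene labels.
X = np.random.randn(100, 13)
y = np.random.randint(0, 3, size=100)

# 70% training speech set for model fitting, 30% test speech set for
# hyperparameter adjustment and a preliminary check of generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```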
Specifically, the constructing a scene recognition model by using the training voice set includes:
calculating a Gini index between each feature label and the corresponding training voice set to obtain a Gini index set corresponding to the feature labels, wherein the feature labels are the category labels under which the corresponding audio features are extracted from the noise sample sets in the respective scenes;
sorting the Gini index set in descending order, and selecting the label corresponding to the smallest Gini index in the Gini index set as the splitting point;
taking the splitting point as the root node of an initial decision tree, generating child nodes from the splitting point, and assigning the training voice set to the child nodes until all of the feature labels have been traversed, thereby generating the initial decision tree;
pruning the initial decision tree to obtain the scene recognition model.
Specifically, the calculating a Gini index between each feature label and the corresponding training voice set includes:
calculating the Gini index between each feature label and the training voice set corresponding to the feature label by using the following formula:
Gini(p) = Σ_{k=1}^{K} p_k·(1 − p_k) = 1 − Σ_{k=1}^{K} p_k²
where Gini(p) is the Gini index, p_k is the proportion accounted for by the k-th frame of data in the training voice set, k indexes the frames, and K is the number of frames in the training voice set.
In detail, the Gini index represents the impurity of the model: the smaller the Gini index, the lower the impurity and the better the feature, as in the sketch below.
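A direct implementation of the Gini index above, and of the choice of the smallest value as the splitting point, might read as follows. This is a sketch; `partitions` is assumed to map each feature label to the class labels of the frames it covers:

```python
import numpy as np

def gini_index(labels):
    """Gini(p) = 1 - sum_k p_k^2 over the class proportions of the frames."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(partitions):
    """Return the feature label with the smallest Gini index (the
    splitting point) together with the whole Gini index set."""
    ginis = {label: gini_index(frames) for label, frames in partitions.items()}
    return min(ginis, key=ginis.get), ginis
```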
Further, the pruning processing is performed on the initial decision tree to obtain a scene recognition model, which includes:
calculating surface error gain values of all non-leaf nodes on the initial decision tree;
pruning the non-leaf nodes with the surface error gain values smaller than a preset gain threshold value to obtain a scene recognition model.
In this embodiment of the present application, the preset gain threshold is 0.5.
Further, the calculating the surface error gain values of all non-leaf nodes on the initial decision tree includes:
calculating the surface error gain values of all non-leaf nodes on the initial decision tree by using the following gain formula:
α = (R(t) − R(T)) / (N(T) − 1)
R(t) = r(t) × p(t)
where α represents the surface error gain value, R(t) represents the error cost of the non-leaf node t, R(T) represents the error cost of the subtree T rooted at t (the sum of the error costs of its leaf nodes), N(T) represents the number of leaf nodes in the subtree, r(t) represents the error rate of the node, and p(t) represents the ratio of the number of samples at the node to the number of all samples.
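The surface error gain computation may be sketched directly from the formula above; the node statistics passed in are assumed to come from the constructed initial decision tree:

```python
def surface_error_gain(r_t, p_t, leaf_error_costs):
    """alpha = (R(t) - R(T)) / (N(T) - 1) for a non-leaf node t."""
    R_t = r_t * p_t                # R(t) = r(t) * p(t), node's own error cost
    R_T = sum(leaf_error_costs)    # error cost of the subtree's leaves
    N_T = len(leaf_error_costs)    # number of leaves in the subtree
    return (R_t - R_T) / (N_T - 1)

# Prune the node when its gain falls below the preset threshold of 0.5.
GAIN_THRESHOLD = 0.5
alpha = surface_error_gain(r_t=0.4, p_t=0.2, leaf_error_costs=[0.02, 0.03, 0.01])
should_prune = alpha < GAIN_THRESHOLD   # (0.08 - 0.06) / 2 = 0.01 -> prune
```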
Specifically, referring to fig. 2, the performing test adjustment on the scene recognition model by using the test voice set to obtain a standard scene recognition model includes:
S101, performing scene recognition processing on the test voice set by using the scene recognition model to obtain a recognition result corresponding to the test voice set;
S102, when the recognition result corresponding to the test voice set is inconsistent with the feature labels corresponding to the test voice set, retraining the scene recognition model with the training voice set until the recognition result is consistent with those feature labels, thereby obtaining a standard scene recognition model; a loop of this kind is sketched below.
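Steps S101–S102 amount to a retrain-until-consistent loop of the following shape. The sketch assumes a scikit-learn-style classifier; a real implementation would cap the rounds and reshuffle or augment the training speech set between them rather than simply refit:

```python
from sklearn.tree import DecisionTreeClassifier

def test_adjust(X_train, y_train, X_test, y_test, max_rounds=10):
    """Retrain until the test-set recognition results match the labels."""
    model = DecisionTreeClassifier()
    for _ in range(max_rounds):
        model.fit(X_train, y_train)
        if (model.predict(X_test) == y_test).all():
            break                         # results consistent: done
        model = DecisionTreeClassifier()  # re-initialize and train again
    return model                          # standard scene recognition model
```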
Further, after the step of segmenting the classified speech set into a training speech set and a test speech set, constructing the scene recognition model by using the training speech set, and performing test adjustment on the scene recognition model by using the test speech set to obtain a standard scene recognition model, the method further includes:
and establishing, according to the collected noise sample sets in the respective scenes, a noise reduction model corresponding to each scene for subsequent invocation.
S2, inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to noise sample sets in various scenes.
In the embodiment of the application, the acquired voice data is input into the preset standard scene recognition model, the preset standard scene recognition model carries out scene recognition processing on the voice data, and a voice scene corresponding to the voice data is output.
S3, selecting a preset noise reduction model corresponding to the voice scene, and carrying out noise reduction on the voice data.
In the embodiment of the application, the noise reduction models include a dynamic time warping model, a vector quantization model, a hidden Markov model, and the like. According to the voice scene corresponding to the voice data and the characteristics of each noise reduction model, the corresponding noise reduction model is selected to perform the noise reduction operation on the voice data and obtain the noise reduction result, for example as sketched below.
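The scene-conditioned selection in step S3 reduces to a lookup from the recognized scene to a pre-built noise reduction model. The sketch below uses placeholder scene names and a placeholder `reduce_noise` interface; neither is an API defined by this application:

```python
class PlaceholderDenoiser:
    """Stand-in for a scene-specific noise reduction model (DTW / VQ / HMM)."""
    def __init__(self, scene):
        self.scene = scene

    def reduce_noise(self, voice_data):
        # A real model would suppress this scene's noise profile here.
        return voice_data

# Hypothetical registry: one model per scene, each built from that
# scene's collected noise sample set.
NOISE_REDUCTION_MODELS = {
    scene: PlaceholderDenoiser(scene) for scene in ("park", "road", "office")
}

def denoise(voice_data, scene):
    # S3: select the preset noise reduction model matching the recognized scene.
    return NOISE_REDUCTION_MODELS[scene].reduce_noise(voice_data)
```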
According to the embodiment of the application, the acquired voice data is input into the preset standard scene recognition model, and the voice scene corresponding to the voice data is recognized by the standard scene recognition model. Recognizing the voice scene determines the voice environment in which the voice data was captured, so the preset noise reduction model corresponding to that scene can be selected to denoise the voice data, improving the accuracy of voice noise reduction.
As shown in fig. 3, an embodiment of the present application provides a schematic block diagram of a speech noise reduction device 10, where the speech noise reduction device 10 includes: a voice data acquisition module 11, a voice scene recognition module 12 and a noise reduction module 13.
The voice data acquisition module 11 is configured to acquire voice data;
the voice scene recognition module 12 is configured to input the voice data into a preset standard scene recognition model, and determine a voice scene corresponding to the voice data, where the standard scene recognition model is obtained by training according to a noise sample set in each scene;
the noise reduction module 13 is configured to select a preset noise reduction model corresponding to the voice scene, and perform noise reduction on the voice data.
In detail, each module of the voice noise reduction device 10 in the embodiment of the present application adopts the same technical means as the voice noise reduction method described with reference to fig. 1 and produces the same technical effects, which are not repeated here.
As shown in fig. 4, an embodiment of the present application provides a voice noise reduction apparatus, which includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 communicate with one another through the communication bus 114,
a memory 113 for storing a computer program;
in one embodiment of the present application, the processor 111 is configured to implement the voice noise reduction method provided in any one of the foregoing method embodiments when executing the program stored in the memory 113, and includes:
acquiring voice data;
inputting the voice data into a preset standard scene recognition model, and determining a voice scene corresponding to the voice data, wherein the standard scene recognition model is obtained by training according to noise sample sets in various scenes;
and selecting a preset noise reduction model corresponding to the voice scene, and carrying out noise reduction on the voice data.
The communication bus 114 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface 112 is used for communication between the above-described electronic device and other devices.
The memory 113 may include a random access memory (RAM) or a non-volatile memory, such as at least one disk memory. Alternatively, the memory 113 may be at least one storage device located remotely from the processor 111.
The processor 111 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the steps of the voice noise reduction method provided in any one of the method embodiments described above.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.

The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., a Solid State Disk (SSD)), among others.

It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.