CN114694678B

Movatterモバイル変換

Info

Publication number: CN114694678B
Application number: CN202210333127.XA
Authority: CN
Inventors: 陈洲旋
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date: 2022-03-31
Filing date: 2022-03-31
Publication date: 2025-07-15
Anticipated expiration: 2042-03-31
Also published as: CN114694678A

Abstract

The application discloses a tone quality detection model training method, a tone quality detection method, equipment and a computer readable storage medium, wherein the tone quality detection model training method comprises the steps of obtaining initial training audio and a corresponding average opinion score label; the method comprises the steps of performing interference filtering processing based on voice endpoint detection on initial training audio to obtain training audio, performing feature extraction processing on the training audio to obtain training features, inputting the training features into an initial model to obtain corresponding training tone quality detection results, calculating a loss value by using the training tone quality detection results and a mean opinion score label, adjusting model parameters of the initial model by using the loss value, determining the adjusted initial model as a tone quality detection model when the condition that training is completed is detected to be met, and accurately evaluating the quality of the audio to be tested under the condition that the obtained tone quality detection model does not have pure clean audio.

Description

Tone quality detection model training method, tone quality detection method, electronic device, and medium

Technical Field

The present application relates to the field of audio processing technologies, and in particular, to a method for training a sound quality detection model, a method for detecting sound quality, an electronic device, and a computer-readable storage medium.

Background

Currently, a PESQ (i.e., perceptual evaluation of speech quality, speech quality perception evaluation) method is generally adopted to detect the audio quality, so as to obtain a detection result for representing the quality of the audio quality. The PESQ method is generally aimed at audio of VOIP (Voice over Internet Protocol, voice over IP) network communication, and can evaluate problems of audio time misalignment, spectrum distortion and the like caused by frame loss, jitter and the like in the process of network transmission of audio signals. Calculating PESQ scores requires preparing clean audio corresponding to noisy frequencies, and this method of timbre assessment is called reference timbre assessment. In applications, pure clean audio is difficult to obtain, so that tone quality detection of most of the audio is difficult to perform.

Disclosure of Invention

Accordingly, the present application is directed to a training method for a tone quality detection model, a tone quality detection method, an electronic device, and a computer-readable storage medium for accurately evaluating the quality of audio to be detected.

In order to solve the above technical problems, in a first aspect, the present application provides a training method for a sound quality detection model, including:

The method comprises the steps of obtaining initial training audio and a corresponding average opinion score label, wherein the average opinion score label is used for representing average tone quality evaluation parameters obtained after tone quality evaluation of the initial training audio by a plurality of evaluation objects;

Performing non-human voice filtering processing based on voice endpoint detection on the initial training audio to obtain training audio;

performing feature extraction processing on the training audio to obtain training features;

inputting the training characteristics into an initial model to obtain a corresponding training tone quality detection result;

Calculating a loss value by using the training tone quality detection result and the average opinion score label, and adjusting model parameters of the initial model by using the loss value;

And when the condition that the training completion condition is met is detected, determining the adjusted initial model as a tone quality detection model.

Optionally, the obtaining process of the average opinion score label includes:

Playing the initial training audio to each evaluation object;

receiving initial tone quality data obtained after tone quality evaluation of the initial training audio by each evaluation object;

and generating the average opinion score label by using each initial tone quality data.

Optionally, the generating the mean opinion score label using each of the initial timbre data includes:

carrying out average processing on each initial tone quality data to obtain a first score label;

inputting the initial training audio into an audio defect detection model to obtain a defect detection result, wherein the audio defect detection model is used for detecting audio defects which can influence auditory perception;

Generating a second score tag based on the defect detection result;

the mean opinion score label is generated using the first score label and the second score label.

Optionally, the performing feature extraction processing on the training audio to obtain training features includes:

resampling the training audio based on the maximum sampling rate perceivable by human ears to obtain intermediate data;

carrying out sliding window framing based on a preset window length on the intermediate data to obtain a plurality of audio frames;

and carrying out feature extraction processing on each audio frame to obtain the training features.

Optionally, the performing non-human voice filtering processing based on voice endpoint detection on the initial training audio to obtain training audio includes:

performing voice endpoint detection on the initial training audio to obtain voice endpoint time;

Segmenting the initial training audio according to the voice endpoint moment to obtain a plurality of audio segments, and removing non-human voice audio segments in the audio segments to obtain human voice segments;

And splicing the voice audio segments to obtain the training audio.

Optionally, the initial model comprises a convolutional neural network, a long-term and short-term memory network, a full-connection layer and an average pooling layer;

inputting the training characteristics into an initial model to obtain a corresponding training tone quality detection result, wherein the method comprises the following steps:

inputting the training characteristics into the convolutional neural network to obtain training intermediate characteristics;

Inputting the training intermediate features into the long-period and short-period memory network to obtain a training initial detection result;

Inputting the training initial detection result into the full-connection layer to obtain a training intermediate detection result;

And inputting the training intermediate detection result into the average pooling layer to obtain the training tone quality detection result.

Optionally, the acquiring the initial training audio and the corresponding average opinion score label includes:

acquiring a plurality of initial training audios and the average opinion score labels of a batch from a training data set according to a preset batch size;

Correspondingly, the calculating the loss value by using the training tone quality detection result and the average opinion score label includes:

And when the training tone quality detection results corresponding to all the initial training tones corresponding to one batch are obtained, obtaining the loss value by using the training tone quality detection results, the average opinion score label and the training intermediate detection results in the batch.

Optionally, the obtaining the loss value by using the training tone quality detection result, the average opinion score label and the training intermediate detection result in the batch includes:

According to

Obtaining the loss value;

Wherein the loss is the loss value, S is the preset batch size, T_S is the number of frames corresponding to the training feature, M_S is the average opinion score label, andFor the training tone quality detection result, theAnd for a value corresponding to the t frame in the training characteristics in the training intermediate detection result, wherein alpha is a preset weight.

Optionally, the detecting that the training completion condition is satisfied includes:

judging whether the loss value is smaller than a preset threshold value or not;

if yes, determining that the training completion condition is met.

In a second aspect, the present application further provides a sound quality detection method, including:

Acquiring an initial audio to be detected;

performing non-human voice filtering processing based on voice endpoint detection on the initial audio to be detected to obtain the audio to be detected;

Performing feature extraction processing on the audio to be detected to obtain features to be detected;

And inputting the feature to be detected into a tone quality detection model to obtain a tone quality detection result corresponding to the initial audio to be detected.

Optionally, the tone quality detection model comprises a convolutional neural network, a long-term and short-term memory network, a full-connection layer and an average pooling layer;

inputting the feature to be detected into a tone quality detection model to obtain a tone quality detection result corresponding to the initial audio to be detected, wherein the tone quality detection result comprises:

inputting the feature to be detected into the convolutional neural network to obtain an intermediate feature to be detected;

Inputting the intermediate features to be detected into the long-short-period memory network to obtain an initial detection result;

inputting the initial detection result into the full-connection layer to obtain an intermediate detection result;

and inputting the intermediate detection result into the average pooling layer to obtain the tone quality detection result.

In a third aspect, the present application also provides an electronic device comprising a memory and a processor, wherein:

the memory is used for storing a computer program;

the processor is configured to execute the computer program to implement the above-mentioned training method for the tone quality detection model and/or the above-mentioned tone quality detection method.

In a fourth aspect, the present application further provides a computer readable storage medium storing a computer program, where the computer program when executed by a processor implements the above-mentioned training method for a sound quality detection model, and/or the above-mentioned method for sound quality detection.

The training method for the tone quality detection model comprises the steps of obtaining initial training audio and a corresponding average opinion score label, performing interference filtering processing based on voice endpoint detection on the initial training audio to obtain training audio, performing feature extraction processing on the training audio to obtain training features, inputting the training features into the initial model to obtain a corresponding training tone quality detection result, calculating a loss value by using the training tone quality detection result and the average opinion score label, adjusting model parameters of the initial model by using the loss value, and determining the adjusted initial model as the tone quality detection model when the condition that training is completed is detected.

It can be seen that the method utilizes the mean opinion score (mean opinion score, MOS) as an initial training audio and a label for the training audio. The mean opinion score is evaluated by a large audience to assess the quality of audio of sentences that are transmitted through the communication circuit and aloud by a male or female speaker. Listeners score each sentence by (1) very poor (2) poor (3) generally (4) good (5) good, MOS is an arithmetic method of personal scoring for all listeners, ranging from 1 (worst) to 5 (best). The obtained MOS tag can accurately represent the quality of sentence audio in initial training audio. By voice endpoint detection, the part of the non-human voice audio can be used as interference and filtered, and only the part of the human voice is reserved as training audio. After training features are obtained through feature extraction, quality detection is carried out by using an initial model, a loss value is calculated by using the obtained training tone quality detection result and an average opinion score label, and the initial model is adjusted by using the loss value, so that the initial model learns a mode of correctly evaluating audio quality, and the obtained training tone quality detection result is as close to an MOS label as possible. After training is completed, the adjusted initial model may then be determined as a tone quality detection model. The obtained tone quality detection model can accurately evaluate the quality of the audio to be detected under the condition of not having pure clean audio.

In addition, the application also provides a tone quality detection method, electronic equipment and a computer readable storage medium, and the method has the same beneficial effects.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic diagram of a hardware composition framework to which a training method for a tone quality detection model according to an embodiment of the present application is applicable;

FIG. 2 is a schematic diagram of a hardware framework to which another training method for a tone quality detection model according to an embodiment of the present application is applicable;

FIG. 3 is a flowchart of a training method for a tone quality detection model according to an embodiment of the present application;

fig. 4 is a flowchart of a sound quality detection method according to an embodiment of the present application;

Fig. 5 is a flowchart of another sound quality detection method according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

In order to facilitate understanding, the training method of the tone quality detection model provided by the embodiment of the present application and/or a hardware composition framework used by a scheme corresponding to the tone quality detection method are introduced. Referring to fig. 1, fig. 1 is a schematic diagram of a hardware composition frame to which a training method for a sound quality detection model according to an embodiment of the present application is applicable. Wherein the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.

Wherein the processor 101 is configured to control the overall operation of the electronic device 100 to perform all or part of the voice quality detection model training method and/or voice quality detection method steps, and the memory 102 is configured to store various types of data to support operation on the electronic device 100, such as instructions for any application or method operating on the electronic device 100, and application-related data. The Memory 102 may be implemented by any type or combination of volatile or non-volatile Memory devices, such as one or more of static random access Memory (Static Random Access Memory, SRAM), electrically erasable programmable Read-Only Memory (ELECTRICALLY ERASABLE PROGRAMMABLE READ-Only Memory, EEPROM), erasable programmable Read-Only Memory (Erasable Programmable Read-Only Memory, EPROM), programmable Read-Only Memory (Programmable Read-Only Memory, PROM), read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk, or optical disk. In the present embodiment, at least programs and/or data for realizing the following functions are stored in the memory 102:

acquiring initial training audio and a corresponding average opinion score label;

Inputting training characteristics into an initial model to obtain a corresponding training tone quality detection result;

And/or the number of the groups of groups,

Acquiring an initial audio to be detected;

carrying out non-human voice filtering processing based on voice endpoint detection on the initial audio to be detected to obtain the audio to be detected;

The multimedia component 103 may include a screen and an audio component. Wherein the screen may be, for example, a touch screen, the audio component being for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 102 or transmitted through the communication component 105. The audio assembly further comprises at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, which may be a keyboard, mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. Wireless Communication, such as Wi-Fi, bluetooth, near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more thereof, so that the corresponding Communication component 105 may include Wi-Fi components, bluetooth components, NFC components.

The electronic device 100 may be implemented by one or more Application Specific Integrated Circuits (ASIC), digital signal processors (DIGITAL SIGNAL Processor DSP), digital signal processing devices (DIGITAL SIGNAL Processing Device DSPD), programmable logic devices (Programmable Logic Device PLD), field programmable gate arrays (Field Programmable GATE ARRAY FPGA), controllers, microcontrollers, microprocessors, or other electronic components for performing the tone quality detection model training method.

Of course, the structure of the electronic device 100 shown in fig. 1 is not limited to the electronic device in the embodiment of the present application, and the electronic device 100 may include more or less components than those shown in fig. 1 or may combine some components in practical applications.

It may be understood that the number of the electronic devices is not limited in the embodiment of the present application, and may be a method for cooperatively completing training of a tone quality detection model by a plurality of electronic devices and/or a tone quality detection method. In a possible implementation manner, please refer to fig. 2, fig. 2 is a schematic diagram of a hardware composition frame to which another training method for a sound quality detection model according to an embodiment of the present application is applicable. As can be seen from fig. 2, the hardware component framework may include a first electronic device 11 and a second electronic device 12, which are connected via a network 13.

In the embodiment of the present application, the hardware structures of the first electronic device 11 and the second electronic device 12 may refer to the electronic device 100 in fig. 1. I.e. it can be understood that in this embodiment there are two electronic devices 100, which interact with each other. Further, the form of the network 13 is not limited in the embodiment of the present application, that is, the network 13 may be a wireless network (such as WIFI, bluetooth, etc.), or may be a wired network.

Or the tone quality detection model is deployed on the server, and the smart phone can interact with the user to acquire the initial audio to be detected and send the initial audio to the server. The server detects the initial audio to be detected by using the tone quality detection model to obtain a corresponding tone quality detection result, and sends the tone quality detection result to the smart phone to be output to the user.

Referring to fig. 3, fig. 3 is a flow chart of a training method for a sound quality detection model according to an embodiment of the application. The method in this embodiment comprises:

s101, acquiring initial training audio and a corresponding average opinion score label.

The mean opinion score, mean Opinion Score, MOS, is a quality of audio of sentences transmitted through the communication circuit that are aloud by a male or female speaker, evaluated by a large audience. Listeners score each sentence by (1) very poor (2) poor (3) generally (4) good (5) good, MOS is an arithmetic method of personal scoring for all listeners, ranging from 1 (worst) to 5 (best). This method of evaluation is widely used in subjective evaluation of audio quality, however, the subjective evaluation process is time-consuming and laborious. Accordingly, PESQ (perceptual evaluation of speech quality, speech quality perception assessment) methods are widely used with automatic detection of audio quality (otherwise known as objective assessment) to improve the efficiency of audio quality detection. PESQ is a reference sound quality assessment, however, where the audio to be detected can only be detected when it is pure clean (can be regarded as lossless) with the audio to be detected, so that the detection range is limited.

In the application, the training data is formed by the existing initial training audio and the corresponding MOS label, and the tone quality detection model is obtained by training, so that the automatic tone quality detection can be carried out under the condition of no pure clean audio, the detection efficiency is improved, and the detection range is enlarged. Specifically, the initial training audio refers to training data directly acquired, and may have a basis or multiple sentences of speech. The mean opinion score label is an MOS score label obtained by manually performing MOS scoring on a voice part in initial training audio, and is a mean tone quality evaluation parameter obtained by characterizing multiple evaluation objects to perform tone quality evaluation on the initial training audio. The initial training audio and the corresponding MOS tag may be prepared in advance, or when the tone quality detection model needs to be trained, the initial training audio may be temporarily selected and MOS scored by the user, so as to obtain the corresponding MOS tag.

In one embodiment, the mean opinion score label may be generated at the time of acquisition. Specifically, the initial training audio may be played to each evaluation object, for example, the initial training audio may be sent to the electronic device used by each evaluation person and a play control instruction may be sent. And receiving initial tone quality data obtained after tone quality evaluation of the initial training audio by each evaluation object, wherein the initial tone quality data can be generated and transmitted by electronic equipment used by each evaluation person. A mean opinion score label is generated using each of the initial timbre data.

The present embodiment is not limited to a specific manner of generating the mean opinion score, and for example, average calculation may be performed on each piece of initial sound quality data. In another embodiment, in order to improve the reliability of the mean opinion score label, the first score label may be obtained by first performing an averaging process on each initial sound quality data. In addition, the initial training audio may be input into an audio defect detection model to obtain a defect detection result, where the audio defect detection model is used to detect an audio defect capable of affecting auditory perception, and the type of the audio defect may be set according to needs, and may include, for example, a microphone (generated due to a mouth being too close to a microphone, and represented by po, hu, snore, etc.), a hissing/tooth sound (represented by hissing, eating chi), a galvanic sound (noise caused by abnormality of a hardware circuit and represented by a soft/buzzing sound), a click (represented by a thorn, click, katon), a pop (called clip, generated due to sound pop, clip, etc., and also easily generated after mixing of a human sound and an accompaniment), an environmental noise, a katon (represented by a short break of a human sound, poor front-back engagement, or an obvious network transmission swallow). The audio defect detection model can identify audio defects in the initial training audio, and specifically can be the type number of the audio defects, the occurrence frequency of each type of the audio defects and the like. Based on the defect detection result, a second score label may be generated, and in one embodiment, a full score of 5 may be set, and the second score label may be obtained by scoring as appropriate according to the defect detection result, and the mean opinion score label may be generated using the first score label and the second score label.

S102, performing non-human voice filtering processing based on voice endpoint detection on the initial training audio to obtain the training audio.

It will be appreciated that a speakerphone will typically speak multiple sentences in the initial training audio when recording, with a time interval or other non-human audio typically having a speech blank between different sentences. In MOS scoring, the user only evaluates the speech portion, ignoring the non-human voice audio. Therefore, in the training process of the tone quality detection model, the non-human voice audio in the initial training audio should be removed, so that the interference to the training of the tone quality detection model is avoided. In this embodiment, the starting time position and the ending time position of the voice can be identified by adopting a voice endpoint detection mode, so that the non-voice part in the initial training audio is filtered out, and the training audio is obtained. Voice endpoint detection, voice Activity Detection, VAD, is able to identify periods of silence in the audio signal.

Specifically, in one embodiment, the training audio generation process includes:

and 11, detecting a voice endpoint of the initial training audio to obtain a voice endpoint moment.

And 12, segmenting the initial training audio according to the voice endpoint time to obtain a plurality of audio segments, and removing the non-human voice audio segment in the audio segments to obtain the human voice segment.

And 13, splicing the voice audio segments to obtain training audio.

In this embodiment, the voice endpoint detection may recognize a start time and a stop time of voice audio (i.e., human voice audio), which are voice endpoint times. And segmenting the initial training audio according to the voice endpoint time to obtain audio segments, wherein the audio segments comprise human voice segments and non-human voice audio segments, and the two audio segments are alternately arranged in sequence. And removing the non-human voice frequency band and reserving the human voice frequency band. Illustratively, the human voice frequency band and the non-human voice audio frequency band alternate, so that the start time and the end time in the voice endpoint time also alternate. Along the time sequence, the audio frequency segments from the adjacent starting time to the adjacent ending time are determined to be human voice frequency segments, and the audio frequency segments from the adjacent ending time to the starting time are determined to be non-human voice frequency segments. After the classification of the audio frequency segments is completed, the non-human voice audio frequency segments are removed, the human voice frequency segments are reserved, and then the non-human voice audio frequency segments are spliced to obtain the final training audio frequency. The non-human sound band may be a blank sound band or a sound band in which non-human sound such as background sound is recorded.

And S103, performing feature extraction processing on the training audio to obtain training features.

In order to enable the initial model to learn how to detect the tone quality of the audio more efficiently, feature extraction is performed on the training audio to obtain corresponding training features so as to better characterize the tone quality characteristics of the training audio. The specific manner of feature extraction is not limited, and in one embodiment, the training features may be in the form of images, such as a spectrogram (Spectrogram), mel-spectrogram (Mel-Frequency Spectrum), or other spectrogram form. In another embodiment, in order to evaluate the signal in a wider frequency band, the training audio may be resampled to make the signal frequency better.

Specifically, the training feature generation process may include:

and step 21, resampling the training audio based on the maximum sampling rate perceivable by human ears to obtain intermediate data.

And 22, carrying out sliding window framing based on the preset window length on the intermediate data to obtain a plurality of audio frames.

And step 23, carrying out feature extraction processing on each audio frame to obtain training features.

The frequency range of the signal that can be heard by the human ear is relatively fixed, typically in the range of 20Hz-20000 Hz. The sampling frequency is typically 2 times the signal frequency, so the maximum sampling rate perceivable by the human ear can be determined. It will be appreciated that, because different hearing ranges of different people, the upper limit of the frequency range of the human ear of some people can reach 22000Hz, so the maximum sampling rate of the human ear perception can be higher than the upper limit of the frequency range of the ordinary human ear, and the specific size is not limited. By way of example, 48kHz may be chosen as the maximum sampling rate perceivable by the human ear. Through resampling, the frequency domain range of the audio can be widened, and corresponding intermediate data can be obtained.

The sliding window framing refers to a process of sampling an audio frame in time sequence by using a preset analysis window, and the analysis window may specifically be a hanning window (hann), a hamming window (hamming), a blackman-harris window (blackman-harris), etc., and the preset window length is the width of the analysis window, which is not specifically limited, for example, when the sampling frequency is 48kHz, the preset window length may be 21.3ms. After each sampling of the analysis window is completed, the analysis window slides backwards for a certain distance so as to resample, and the sliding distance is the frame shift. The specific size of the frame shift is not limited, and may be, for example, half the window length.

After the audio frames are obtained, carrying out feature extraction processing such as Mel frequency spectrum extraction and spectrogram extraction on the audio signals to obtain audio frame features corresponding to single audio frames, and splicing the audio frame features according to time sequence to obtain corresponding training features.

S104, inputting training characteristics into the initial model to obtain a corresponding training tone quality detection result.

S105, calculating a loss value by using the training tone quality detection result and the average opinion score label, and adjusting model parameters of the initial model by using the loss value.

The above two steps are described in combination.

The initial model is an incompletely trained model, and can be used as a tone quality detection model after sufficient training and parameter adjustment. The embodiment is not limited to the specific form and class of the initial model, and may be, for example, a convolutional neural network model, or may be a combination of a convolutional neural network and a recurrent neural network. After the processing model processes the training characteristics, a training tone quality detection result obtained by tone quality detection of training audio based on the current learning and parameter adjustment conditions can be obtained.

When the training is not enough, the training tone quality detection result obtained by the initial model has a certain gap from a truly correct result (namely, MOS label), and the model parameters of the initial model are adjusted based on the loss value by calculating the loss value, so that the model can learn how to correctly perform tone quality detection and give out a correct tone quality detection result.

Specifically, in one embodiment, the initial model includes a convolutional neural network, a long and short term memory network, a fully connected layer, and an average pooling layer. The convolution neural network is used for carrying out convolution calculation on the input training features and extracting effective audio local features. The Long Short-Term Memory network (LSTM) may be specifically a Bi-directional Long-Short Term Memory (BLSTM) for extracting a time sequence relationship between local audio features, and learning a correlation between a frame and a frame. And the full-connection layer is used for predicting the tone quality detection result corresponding to each frame by taking the frame as a unit. And the average pooling layer is used for synthesizing the tone quality detection result of each frame to obtain a final training tone quality detection result.

Accordingly, the process of inputting training features into the initial model to obtain corresponding training sound quality detection results may include:

And step 31, inputting the training characteristics into a convolutional neural network to obtain training intermediate characteristics.

And step 32, inputting training intermediate features into a long-term and short-term memory network to obtain a training initial detection result.

And step 33, inputting the training initial detection result into the full-connection layer to obtain a training intermediate detection result.

And step 34, inputting the training intermediate detection result into an average pooling layer to obtain a training tone quality detection result.

The training intermediate features are features obtained after convolution calculation, and the training initial detection result is data obtained after the long-term memory network extracts the time sequence relation. The training intermediate detection result is the tone quality detection result corresponding to each frame in the training characteristics.

The specific calculation method of the loss value is not limited, and may be calculated according to the initial model type, the expression form of the training tone quality detection result, the training emphasis direction, and the like, and may be, for example, a square loss function, an exponential loss function, a cross entropy loss function, and the like. In a specific embodiment, the loss value may be calculated by integrating training tone quality detection results corresponding to each initial training audio in a training batch. In this case, when the initial training audio and the corresponding MOS tag are acquired, a plurality of initial training audio and average opinion score tags of one lot may be acquired from the training dataset according to a preset lot size. The preset batch size refers to the number of initial training audio acquired by each training batch. Since the frequency of parameter adjustment is the same as the frequency of loss value generation, in this embodiment, after each batch of initial training audio is processed, a loss value calculation and parameter adjustment are performed once.

Accordingly, the process of calculating the loss value using the training sound quality detection result and the mean opinion score label may include:

and step 41, when the training tone quality detection results corresponding to all the initial training tones corresponding to one batch are obtained, obtaining a loss value by using the training tone quality detection results, the average opinion score label and the training intermediate detection results in the batch.

In this embodiment, loss value calculation is performed by integrating all training tone quality detection results and average opinion score labels in a batch, and parameter adjustment can be performed by integrating the training conditions of the whole batch. Further, the loss value calculation is performed using the MOS tag and the training intermediate detection result, and the sound quality evaluation condition of each frame can be reflected in the loss value.

In particular, it can be based on

A loss value is obtained.

Wherein loss is a loss value, S is a preset batch size, T_S is a frame number corresponding to training features, namely audio frames divided when the training features are generated, M_S is a mean opinion score label,In order to train the result of the tone quality detection,For the value corresponding to the t frame in the training feature in the training intermediate detection result, α is a preset weight, and the specific size of α is not limited, for example, may be 1.

And S106, when the condition that the training completion condition is met is detected, determining the adjusted initial model as a tone quality detection model.

And circularly executing the steps, and judging whether the training completion condition is met after executing one training round and performing parameter adjustment. Training completion conditions refer to conditions that indicate that the initial model has been sufficiently trained, which may be specifically conditions defining the training process or conditions defining the performance of the initial model. For example, the number of training rounds condition may be used, or the loss value interval condition may be used. For example, it may be determined whether the loss value is smaller than a preset threshold, and if so, it is determined that the training completion condition is satisfied. The specific size of the preset threshold is not limited, and can be set according to the needs.

By applying the training method of the tone quality detection model provided by the embodiment of the application, the average opinion score (mean opinion score, MOS) is used as the label of the initial training audio and the training audio. The mean opinion score is evaluated by a large audience to assess the quality of audio of sentences that are transmitted through the communication circuit and aloud by a male or female speaker. Listeners score each sentence by (1) very poor (2) poor (3) generally (4) good (5) good, MOS is an arithmetic method of personal scoring for all listeners, ranging from 1 (worst) to 5 (best). The obtained MOS tag can accurately represent the quality of sentence audio in initial training audio. By voice endpoint detection, the part of the non-human voice audio can be used as interference and filtered, and only the part of the human voice is reserved as training audio. After training features are obtained through feature extraction, quality detection is carried out by using an initial model, a loss value is calculated by using the obtained training tone quality detection result and an average opinion score label, and the initial model is adjusted by using the loss value, so that the initial model learns a mode of correctly evaluating audio quality, and the obtained training tone quality detection result is as close to an MOS label as possible. After training is completed, the adjusted initial model may then be determined as a tone quality detection model. The obtained tone quality detection model can accurately evaluate the quality of the audio to be detected under the condition of not having pure clean audio.

Based on the above embodiment, after the tone quality detection model is obtained, tone quality detection can be performed using the initial audio to be detected without the MOS tag. Referring to fig. 4, fig. 4 is a flowchart of a sound quality detection method according to an embodiment of the present application, which specifically includes the following steps:

s201, obtaining initial audio to be tested.

S202, performing non-human voice filtering processing based on voice endpoint detection on the initial audio to be detected to obtain the audio to be detected.

And S203, carrying out feature extraction processing on the audio to be detected to obtain features to be detected.

S204, inputting the feature to be detected into a tone quality detection model to obtain a tone quality detection result corresponding to the initial audio to be detected.

The tone quality detection model is obtained by adopting the tone quality detection model training method, and the non-human voice filtering processing and the characteristic extraction processing are the same as the tone quality detection model training process. Specifically, in one embodiment, the tone quality detection model includes a convolutional neural network, a long-short-term memory network, a full-connection layer, and an average pooling layer.

Correspondingly, inputting the feature to be detected into a tone quality detection model to obtain a tone quality detection result corresponding to the initial audio to be detected, wherein the method comprises the following steps:

And step 51, inputting the feature to be measured into a convolutional neural network to obtain the intermediate feature to be measured.

And step 52, inputting the intermediate features to be detected into a long-term and short-term memory network to obtain an initial detection result.

And step 53, inputting the initial detection result into the full connection layer to obtain an intermediate detection result.

And 54, inputting the intermediate detection result into an average pooling layer to obtain a tone quality detection result.

The processes of steps 51 to 54 may refer to the processes of steps 31 to 34, except that the processed data are different.

Further, referring to fig. 5, fig. 5 is a flowchart of another sound quality detection method according to an embodiment of the present application. After the audio signal (i.e. the initial audio to be detected) is acquired, the audio signal is subjected to voice detection by using a VAD algorithm, and a voice signal (i.e. the audio to be detected) is output, wherein the VAD algorithm can be specifically a webrtc-VAD algorithm. The human voice detection filters out portions of the audio signal that are mute and non-human voice. And then resampled to 48kHz. Extracting audio features (namely features to be detected) in the Mel frequency spectrum form, and inputting the audio features into a tone quality detection model. The audio feature extraction uses a Blackman-Harris (blackman-harris) window, and the frame shift size is 10.7ms.

The audio detection model comprises a 3-layer CNN network, a 2-layer BLSTM network, a full connection layer and an average pooling layer. The convolution layer in the CNN network is a 2D convolution layer, the convolution kernel is 3*3, and the lengths of the output filters of the three convolution layers are 16, 32 and 64 in sequence. Normalization is performed after each CNN layer and a ReLU activation function is used. Through 3 layers of CNNs, the audio local characteristics can be effectively learned. Then the local characteristic is sent to a 2-layer bidirectional LSTM network, the hidden units are all set to 256, and the main purpose is to extract the time sequence relation of the local characteristic and learn the correlation of the front frame and the rear frame. After 2 layers of bidirectional LSTM cascading, the audio is sent to a full-connection layer, the tone quality value (namely, the intermediate detection result) of each frame level is predicted, and finally, the final objective evaluation value (namely, the tone quality detection result) corresponding to the initial audio to be detected is output through an average pooling layer.

The following describes a computer readable storage medium provided in an embodiment of the present application, where the computer readable storage medium described below and the training method for a tone quality detection model described above may be referred to correspondingly.

The application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps of the tone quality detection model training method when being executed by a processor.

The computer readable storage medium may include a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk or an optical disk, etc. various media that can store program codes.

In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it is further noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms include, comprise, or any other variation is intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

While the principles and embodiments of the present application have been described in detail in this application, the foregoing embodiments are provided to facilitate understanding of the principles and concepts of the application and are further provided by one of ordinary skill in the art to which the application pertains.

Claims

1. The training method for the tone quality detection model is characterized by comprising the following steps of:

When the condition that the training completion condition is met is detected, the adjusted initial model is determined to be a tone quality detection model;

the obtaining process of the average opinion score label comprises the following steps:

The method comprises the steps of carrying out average processing on initial tone quality data to obtain a first score label, wherein the initial tone quality data is obtained after tone quality evaluation on initial training audio by each evaluation object;

Generating a second score tag based on the defect detection result;

2. The method of claim 1, wherein the process of obtaining the mean opinion score label further comprises:

Playing the initial training audio to each evaluation object;

and receiving initial tone quality data obtained after tone quality evaluation of the initial training audio by each evaluation object.

3. The method for training a sound quality detection model according to claim 1, wherein the performing feature extraction processing on the training audio to obtain training features includes:

4. The method for training a voice quality detection model according to claim 1, wherein the performing non-human voice filtering processing based on voice endpoint detection on the initial training audio to obtain training audio comprises:

And splicing the voice audio segments to obtain the training audio.

5. The method for training a sound quality detection model according to claim 1, wherein the initial model comprises a convolutional neural network, a long-term and short-term memory network, a full-connection layer and an average pooling layer;

6. The method of claim 5, wherein the obtaining initial training audio and corresponding mean opinion score label comprises:

7. The method of training a timbre detection model according to claim 6, wherein said deriving the loss value using the training timbre detection result, the mean opinion score label and the training intermediate detection result within the batch comprises:

According to

Obtaining the loss value;

8. The method of training a tone quality detection model according to claim 1, wherein the detecting that the training completion condition is satisfied comprises:

judging whether the loss value is smaller than a preset threshold value or not;

if yes, determining that the training completion condition is met.

9. A sound quality detection method, comprising:

Acquiring an initial audio to be detected;

Inputting the feature to be detected into a tone quality detection model to obtain a tone quality detection result corresponding to the initial audio to be detected, wherein the tone quality detection model is obtained by adopting the tone quality detection model training method according to any one of claims 1 to 7.

10. The method of claim 9, wherein the tone quality detection model comprises a convolutional neural network, a long-short-term memory network, a full-connection layer, and an average pooling layer;

11. An electronic device comprising a memory and a processor, wherein:

the memory is used for storing a computer program;

the processor is configured to execute the computer program to implement the timbre detection model training method according to any one of claims 1 to 8, and/or the timbre detection method according to any one of claims 9 to 10.

12. A computer readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements a timbre detection model training method according to any one of claims 1 to 8 and/or a timbre detection method according to any one of claims 9 to 10.