Disclosure of Invention
In view of the foregoing, it is desirable to provide a training method of an audio recognition model, an audio recognition method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product that can improve the accuracy of audio recognition.
In a first aspect, the present application provides a training method of an audio recognition model, the audio recognition model including a singing voice recognition model and a melody recognition model, the method comprising:
obtaining training sample data, wherein the training sample data comprises various sample song audios and lyric texts corresponding to the sample song audios;
inputting the voice audio in the sample song audio to the singing voice recognition model to obtain a predicted phoneme sequence, and inputting the sample song audio to the melody recognition model to obtain a melody vector representation corresponding to the sample song audio;
determining an actual phoneme sequence of the sample song audio based on the lyric text, carrying out iterative training on the singing voice recognition model according to the difference between the actual phoneme sequence and the predicted phoneme sequence, and carrying out iterative training on the melody recognition model according to the difference between the melody vector representation and the prototype vector representation until both iterative trainings meet a preset training end condition, to obtain a trained audio recognition model, wherein different prototype vector representations are used for representing different melody categories, and the trained audio recognition model is used for outputting a song recognition result corresponding to song audio to be recognized.
In one embodiment, the iteratively training the melody recognition model according to the difference between the melody vector representation and the prototype vector representation includes:
clustering each melody vector representation to obtain clustered vector representations of at least one category;
determining a prototype vector representation corresponding to each category based on the distance relation between the clustered vector representations of the same category, wherein the distance between each prototype vector representation and the clustered vector representations of its corresponding category meets a preset condition;
and performing iterative training on the melody recognition model based on the distances between the melody vector representations and the prototype vector representations corresponding to the categories.
In one embodiment, the inputting the voice audio in the sample song audio to the singing voice recognition model to obtain the predicted phoneme sequence includes:
inputting the sample song audio to a pre-trained voice separation model to obtain voice audio in the sample song audio;
and extracting a first spectral feature from the voice audio, and inputting the first spectral feature into the singing voice recognition model to obtain a predicted phoneme sequence corresponding to the voice audio.
In one embodiment, the singing voice recognition model includes a first convolution layer, a first full-connection layer and a first classification layer, and the inputting the first spectral feature into the singing voice recognition model to obtain a predicted phoneme sequence corresponding to the voice audio includes:
inputting the first spectral feature into the first convolution layer, so that the first convolution layer extracts a first spectral convolution feature corresponding to the first spectral feature;
inputting the first spectral convolution feature into the first full-connection layer, so that the first full-connection layer transforms the dimension type of the first spectral convolution feature from a spatial feature to a time-series feature to obtain a reduced-dimension spectral convolution feature;
and inputting the reduced-dimension spectral convolution feature into the first classification layer, so that the first classification layer classifies the reduced-dimension spectral convolution feature to obtain a predicted phoneme sequence corresponding to the voice audio.
In one embodiment, the melody recognition model includes a second convolution layer and a second full connection layer, and the inputting the sample song audio to the melody recognition model to obtain the melody vector representation corresponding to the sample song audio includes:
extracting a second spectral feature from the sample song audio, and inputting the second spectral feature into the second convolution layer, so that the second convolution layer extracts a second spectral convolution feature corresponding to the second spectral feature;
and inputting the second spectral convolution feature to the second full connection layer, so that the second full connection layer transforms the dimension type of the second spectral convolution feature from a spatial feature to a time-series feature to obtain the melody vector representation corresponding to the sample song audio.
In a second aspect, the present application also provides an audio recognition method, the method comprising:
obtaining a trained audio recognition model, wherein the trained audio recognition model is obtained by training according to the above training method and comprises a trained singing voice recognition model and a trained melody recognition model;
inputting the voice audio in song audio to be recognized into the trained singing voice recognition model to obtain a predicted phoneme sequence corresponding to the song audio to be recognized, and inputting the song audio to be recognized into the trained melody recognition model to obtain a melody vector representation corresponding to the song audio to be recognized;
and outputting a song recognition result corresponding to the song audio to be recognized according to the predicted phoneme sequence and the melody vector representation.
In one embodiment, the outputting, according to the predicted phoneme sequence and the melody vector representation, a song recognition result corresponding to the song audio to be recognized includes:
obtaining a phoneme sequence vector representation corresponding to the predicted phoneme sequence;
fusing the phoneme sequence vector representation and the melody vector representation to obtain a fused vector representation;
and generating a song recognition result corresponding to the song audio to be recognized according to the fused vector representation.
In one embodiment, the fusing the phoneme sequence vector representation and the melody vector representation to obtain a fused vector representation includes:
respectively obtaining a phoneme weight corresponding to the phoneme sequence vector representation and a melody weight corresponding to the melody vector representation;
and performing weighted summation on the phoneme sequence vector representation and the melody vector representation according to the phoneme weight and the melody weight to obtain the fused vector representation.
In one embodiment, the fusing the phoneme sequence vector representation and the melody vector representation to obtain a fused vector representation includes:
splicing the phoneme sequence vector representation and the melody vector representation to obtain a spliced vector representation;
and taking the spliced vector representation as the fused vector representation.
In one embodiment, the generating, according to the fused vector representation, a song recognition result corresponding to the song audio to be recognized includes:
obtaining candidate song vector representations corresponding to the candidate song audios;
determining the similarity between the song vector representation corresponding to the song audio to be recognized and each candidate song vector representation;
and taking the audio identification information of the candidate song audio corresponding to the candidate song vector representation with the highest similarity as the song recognition result corresponding to the song audio to be recognized.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the steps of the method described above.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprising a computer program which, when executed by a processor, implements the steps of the method described above.
According to the above training of the audio recognition model, the audio recognition method, the apparatus, the computer device, the storage medium and the computer program product, training sample data is obtained, the training sample data comprising various sample song audios and corresponding lyric texts; the human voice audio in the sample song audio is input into the singing voice recognition model to obtain a predicted phoneme sequence, and the sample song audio is input into the melody recognition model to obtain a melody vector representation corresponding to the sample song audio; an actual phoneme sequence of the sample song audio is determined based on the lyric text, the singing voice recognition model is iteratively trained according to the difference between the actual phoneme sequence and the predicted phoneme sequence, and the melody recognition model is iteratively trained according to the difference between the melody vector representation and the prototype vector representation until a preset training end condition is met, to obtain a trained audio recognition model. The prototype vector representations are obtained by clustering the melody vector representations and are determined from the distance relations between the clustered melody vector representations, different prototype vector representations are used for representing different melody categories, and the trained audio recognition model is used for outputting a song recognition result corresponding to song audio to be recognized. In this way, the singing voice recognition model and the melody recognition model in the trained audio recognition model can be jointly optimized with the same sample song audios, so that the singing voice recognition model can effectively recognize the semantic (phoneme) information in a song and the melody recognition model can accurately characterize the melody features of a song. Since the song information corresponding to the song audio is recognized by using the fused vector representation of the melody vector representation and the phoneme sequence corresponding to the song audio, the audio recognition model can extract both the semantic information and the melody information in the song, prevent information loss during prediction, more fully mine similar information between songs, and thus achieve a better song recognition effect.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, there is provided a training method of an audio recognition model, the audio recognition model including a singing voice recognition model and a melody recognition model. The method is described by taking its application to an electronic device as an example, and includes the following steps:
step S102, training sample data are obtained, wherein the training sample data comprise song audios of all samples and lyric texts corresponding to the song audios of all samples.
The training sample data may refer to sample data used for training the audio recognition model.
In practice, the training sample data may include a plurality of sample song audio. The sample song audio may be music audio corresponding to various songs. Wherein each sample song audio has a corresponding lyric text.
In a specific implementation, in the process of training the audio recognition model, the electronic device can acquire music audio corresponding to each song and lyric text of each song as training sample data for the audio recognition model.
Step S104, inputting the human voice audio in the sample song audio to the singing voice recognition model to obtain a predicted phoneme sequence, and inputting the sample song audio to the melody recognition model to obtain a melody vector representation corresponding to the sample song audio.
In a specific implementation, after the electronic device acquires each sample song audio for training the audio recognition model, the electronic device may input the human voice audio in the sample song audio to the singing voice recognition model to obtain the predicted phoneme sequence.
In particular, the electronic device may input the sample song audio to a pre-trained voice separation model (e.g., a Spleeter separation network) so that the pre-trained voice separation model separates the human voice audio from the sample song audio. Then, the electronic device inputs spectral features corresponding to the human voice audio in the sample song audio to the singing voice recognition model, so that the singing voice recognition model outputs a predicted phoneme sequence corresponding to the human voice audio.
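By way of non-limiting illustration, the separation step can be sketched in Python with the open-source Spleeter package; the file and directory names are assumptions:

```python
# A minimal sketch of the vocal-separation step, assuming the open-source
# Spleeter package; file and directory names are illustrative only.
from spleeter.separator import Separator

# The '2stems' configuration splits audio into vocals and accompaniment.
separator = Separator('spleeter:2stems')

# Writes output_dir/sample_song/vocals.wav and output_dir/sample_song/accompaniment.wav.
separator.separate_to_file('sample_song.wav', 'output_dir')
```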
The predicted phoneme sequence may refer to the prediction result for the pronunciation phoneme corresponding to each pronunciation time period in the voice audio.
The electronic device may input the sample song audio to the melody recognition model to obtain a melody vector representation corresponding to the sample song audio. Specifically, the electronic device may input spectral features corresponding to accompaniment audio in the sample song audio to the melody recognition model so that the melody recognition model outputs a melody vector representation corresponding to the sample song audio.
Step S106, determining an actual phoneme sequence of the sample song audio based on the lyric text, performing iterative training on the singing voice recognition model according to the difference between the actual phoneme sequence and the predicted phoneme sequence, and performing iterative training on the melody recognition model according to the difference between the melody vector representation and the prototype vector representation until a preset training end condition is met, to obtain a trained audio recognition model, wherein the trained audio recognition model is used for generating a song vector representation corresponding to song audio to be recognized.
The prototype vector representations are obtained by clustering the melody vector representations and determining the distance relations among the clustered melody vector representations, and different prototype vector representations are used for representing different melody categories.
In a specific implementation, in a training stage, after the electronic device obtains a predicted phoneme sequence corresponding to a voice audio in a sample song audio, the electronic device can determine an actual phoneme sequence of the sample song audio according to the lyric text, and the actual phoneme sequence is adopted to represent a labeling result of a corresponding pronunciation phoneme in each pronunciation time period in the voice audio and serve as a classification training target of the singing voice recognition model. The electronic device may use the actual phoneme sequence as a supervisory signal for training the singing voice recognition model, i.e. based on the difference between the actual phoneme sequence and the predicted phoneme sequence, perform iterative training on the singing voice recognition model.
Specifically, the electronic device may determine a loss function value of the singing voice recognition model by using a preset loss function (e.g., a CTC (Connectionist Temporal Classification) loss function) with the actual phoneme sequence and the predicted phoneme sequence, and update the network parameters (e.g., weights and biases) in the singing voice recognition model by using the loss function value until a preset training end condition is met, thereby obtaining a trained singing voice recognition model.
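A minimal Python/PyTorch sketch of this CTC training step follows; the tensor shapes, the phoneme-set size, and the random stand-in data are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: T frames, N songs per batch, C phonemes (index 0 = blank).
T, N, C = 200, 8, 64
# Stand-in for the model's per-frame phoneme log-probabilities.
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
# Actual phoneme sequences derived from the lyric texts (ids 1..C-1).
targets = torch.randint(1, C, (N, 30), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 30, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients would then update weights and biases via an optimizer
```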
For example, taking the lyric text "hello" (Chinese "你好") as an example, the actual phoneme sequence corresponding to the lyric text may be expressed as n i h a o, and the predicted phoneme sequence output by the singing voice recognition model (i.e., the phoneme posterior probability matrix in the training stage, as shown in fig. 2) characterizes the probability, calculated by the singing voice recognition model for each frame, over the whole phoneme set. The electronic device takes the actual phoneme sequence as the supervision signal for training the singing voice recognition model, so that the trained singing voice recognition model, in response to input voice audio, finds the optimal decoding path corresponding to the phonemes contained in the voice audio, and thus obtains a target phoneme sequence that effectively characterizes the phoneme information in the voice audio.
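As an illustration of deriving the actual phoneme sequence from lyric text, the following Python sketch uses the pypinyin package together with a hypothetical syllable-to-phoneme table; the table and function name are assumptions, not part of the present application:

```python
from pypinyin import lazy_pinyin

# Hypothetical syllable-to-phoneme table; a full system would cover all syllables.
SYLLABLE_TO_PHONEMES = {'ni': ['n', 'i'], 'hao': ['h', 'a', 'o']}

def lyrics_to_phonemes(lyrics):
    """Convert lyric text to a phoneme sequence, e.g. '你好' -> n i h a o."""
    phonemes = []
    for syllable in lazy_pinyin(lyrics):
        phonemes.extend(SYLLABLE_TO_PHONEMES.get(syllable, [syllable]))
    return phonemes

print(lyrics_to_phonemes('你好'))  # ['n', 'i', 'h', 'a', 'o']
```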
After obtaining the melody vector representations corresponding to the sample song audios, the electronic device clusters the melody vector representations corresponding to each sample song audio, and determines the prototype vector representations based on the distance relations among the clustered melody vector representations, wherein different prototype vector representations correspond to different melody categories.
The electronic device may perform iterative training on the melody recognition model according to the difference (e.g., the vector distance) between the melody vector representations and the prototype vector representations until a preset training target is met, to obtain a trained melody recognition model, so that the trained melody recognition model learns target melody vector representations corresponding to the various sample song audios.
In practical applications, the training target may be that each melody vector is as close as possible to the center of its own class and as far as possible from the centers of the classes to which it does not belong. The training process of the melody recognition model will be described in detail below and is not detailed here.
The electronic device can splice the output ends of the two branch modules, namely the trained singing voice recognition model and the trained melody recognition model, into a fully-connected fusion layer to obtain the trained audio recognition model, and the trained audio recognition model can be used for outputting a song recognition result corresponding to song audio to be recognized.
Specifically, the electronic device may input the human voice audio of the song audio to be recognized (e.g., cover audio) to the singing voice recognition model in the trained audio recognition model to generate a predicted phoneme sequence of the song audio to be recognized, and input the song audio to be recognized to the melody recognition model in the trained audio recognition model to generate a melody vector representation corresponding to the song audio to be recognized; then, the vector representation corresponding to the predicted phoneme sequence (the phoneme sequence embedding) is fused with the melody vector representation corresponding to the song audio to be recognized (the melody embedding), and a song recognition result corresponding to the song audio to be recognized is output based on the fused vector representation.
Taking a song-listening recognition scenario as an example: a music library contains a large number of cover versions of songs, and new cover versions are added every day, so a fingerprint library for song-listening recognition cannot cover them all. Performing cover song recognition with the audio recognition model trained by the above training method can fill this gap and improve the recall rate of song-listening recognition.
In addition, the audio recognition model trained by the above training method can also be used to generate groups of identical songs, so that music library information can be effectively organized and songs can be efficiently classified. A large number of user cover recordings carry no song labels; by combining song-listening recognition with cover recognition technology, user works can be labeled, which facilitates subsequent analysis of the works.
According to the above training method of the audio recognition model, training sample data is obtained, the training sample data comprising various sample song audios and corresponding lyric texts; the human voice audio in the sample song audio is input into the singing voice recognition model to obtain a predicted phoneme sequence, and the sample song audio is input into the melody recognition model to obtain a melody vector representation corresponding to the sample song audio; an actual phoneme sequence of the sample song audio is determined based on the lyric text, the singing voice recognition model is iteratively trained according to the difference between the actual phoneme sequence and the predicted phoneme sequence, and the melody recognition model is iteratively trained according to the difference between the melody vector representation and the prototype vector representation until a preset training end condition is met, to obtain a trained audio recognition model. The prototype vector representations are obtained by clustering the melody vector representations and are determined based on the distance relations among the clustered melody vector representations, different prototype vector representations are used for representing different melody categories, and the trained audio recognition model is used for outputting a song recognition result corresponding to song audio to be recognized. In this way, the singing voice recognition model and the melody recognition model in the trained audio recognition model can be jointly optimized with the same sample song audios, so that the singing voice recognition model can effectively recognize the semantic (phoneme) information in a song and the melody recognition model can accurately characterize the melody features of a song. Since the song information corresponding to the song audio is recognized by using the fused vector representation of the melody vector representation and the phoneme sequence corresponding to the song audio, the audio recognition model can extract both the semantic information and the melody information in the song, prevent information loss during prediction, more fully mine similar information between songs, and thus achieve a better song recognition effect.
In another embodiment, the iteratively training the melody recognition model according to the difference between the melody vector representation and the prototype vector representation includes: clustering the melody vector representations to obtain clustered vector representations of at least one category; determining a prototype vector representation corresponding to each category based on the distance relation between the clustered vector representations of the same category, wherein the distance between each prototype vector representation and the clustered vector representations of its corresponding category meets a preset condition; and performing iterative training on the melody recognition model based on the distances between the melody vector representations and the prototype vector representations corresponding to the categories.
In a specific implementation, in the process of iteratively training the melody recognition model according to the difference between the melody vector representations and the prototype vector representations, the electronic device can cluster the melody vector representations to obtain clustered vector representations (samples) of at least one category, and then determine the prototype vector representation corresponding to each category based on the distance relation between the clustered vector representations of the same category. Specifically, for the clustered vector representations of any category, the electronic device may calculate the mean of the clustered vector representations of that category, and determine the prototype vector representation (i.e., the class center) corresponding to that category according to the mean.
The electronic device may perform iterative training on the melody recognition model according to the difference (e.g., the vector distance) between the melody vector representations and the prototype vector representations until a preset training target is met, to obtain a trained melody recognition model, so that the trained melody recognition model learns target melody vector representations corresponding to the various sample song audios.
In practical applications, the training target may be that each melody vector is as close as possible to the center of its own class and as far as possible from the centers of the classes to which it does not belong.
For example, referring to fig. 3, fig. 3 schematically illustrates clustered vector representations of three categories. The three circle centers represent three class centers, each of which is determined by the mean of the sample embeddings (clustered vector representations) within the corresponding class.
The class center can be expressed as:

$$c_j = \frac{1}{K}\sum_{k=1}^{K} x_{j,k}$$

where $x_{j,k}$ represents the $k$-th sample within class $j$, $K$ is the number of samples in class $j$, and $c_j$ is the center of sample class $j$, i.e., the prototype center.
The electronic device obtains the Euclidean distance $d_{k,j}$ from each sample $x_k$ to each class center $c_j$, and characterizes the difference between each melody vector representation and each prototype vector representation by this Euclidean distance.
The Euclidean distance $d_{k,j}$ from sample $x_k$ to each class center $c_j$ can be expressed as:

$$d_{k,j} = \lVert x_k - c_j \rVert_2, \quad x_k \in Q$$
In order to make the melody vector representation output by the trained melody recognition model approach the prototype vector representation of its own melody category in the vector space while staying far from the prototype vector representations of different melody categories, a loss function of the following prototypical form may be employed:

$$L = -\log\frac{\exp(-d_{k,j})}{\sum_{j'}\exp(-d_{k,j'})}$$

where $j$ is the category to which sample $x_k$ belongs and $j'$ ranges over all categories.
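A minimal Python/PyTorch sketch of the class centers, distances, and loss above follows; the softmax-over-negative-distances form is assumed from the stated training target, and all shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def prototype_loss(embeddings, labels):
    """Pull each melody vector toward its own class center c_j and away from
    the other centers (a sketch of the objective described above)."""
    classes = labels.unique()
    # c_j: mean of the clustered vector representations of class j (the prototype).
    centers = torch.stack([embeddings[labels == j].mean(dim=0) for j in classes])
    # d_{k,j}: Euclidean distance from each sample x_k to each center c_j.
    dists = torch.cdist(embeddings, centers)  # shape: (num_samples, num_classes)
    # Index of each sample's own class within `classes`.
    target = torch.stack([(classes == y).nonzero().squeeze() for y in labels])
    # Cross-entropy over negative distances: minimizing it moves x_k close to
    # its own center and far from the others.
    return F.cross_entropy(-dists, target)
```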
According to the above technical solution, each melody vector representation is clustered to obtain clustered vector representations of at least one category; a prototype vector representation corresponding to each category is determined based on the distance relation between the clustered vector representations of the same category, the distance between each prototype vector representation and the clustered vector representations of its corresponding category meeting a preset condition; and the melody recognition model is then iteratively trained based on the distances between the melody vector representations and the prototype vector representations corresponding to the categories. In this way, the melody vector representation output by the trained melody recognition model is close, in the vector space, to the prototype vector representation of its own melody category and far from the prototype vector representations of the other melody categories, so that the trained melody recognition model can effectively learn melody vector representations that characterize the melody features of each sample song.
In another embodiment, the inputting the voice audio in the sample song audio to the singing voice recognition model to obtain the predicted phoneme sequence includes: inputting the sample song audio to a pre-trained voice separation model to obtain the voice audio in the sample song audio; and extracting a first spectral feature from the voice audio, and inputting the first spectral feature to the singing voice recognition model to obtain the predicted phoneme sequence corresponding to the voice audio.
In a specific implementation, in the process of inputting the voice audio in the sample song audio to the singing voice recognition model to obtain the predicted phoneme sequence, the electronic device may input the sample song audio to a pre-trained voice separation model (e.g., a Spleeter separation network) so that the pre-trained voice separation model separates the voice audio from the sample song audio.
After the electronic device separates the human voice audio from the sample song audio, the electronic device can extract Mel-frequency cepstral coefficients (MFCCs) corresponding to the human voice audio, thereby extracting the first spectral feature (i.e., the acoustic feature) corresponding to the human voice audio. The electronic device then inputs the extracted first spectral feature into the singing voice recognition model to obtain the predicted phoneme sequence corresponding to the voice audio.
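For illustration, the MFCC extraction can be sketched in Python with the librosa library; the file name, sample rate, and number of coefficients are assumptions:

```python
import librosa

# Load the separated vocal track and extract MFCCs as the first spectral feature.
y, sr = librosa.load('vocals.wav', sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, num_frames)
```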
In another embodiment, the singing voice recognition model comprises a first convolution layer, a first full connection layer and a first classification layer, and the inputting the first spectral feature into the singing voice recognition model to obtain the predicted phoneme sequence corresponding to the voice audio includes: inputting the first spectral feature into the first convolution layer, so that the first convolution layer extracts a first spectral convolution feature corresponding to the first spectral feature; inputting the first spectral convolution feature into the first full connection layer, so that the first full connection layer transforms the dimension type of the first spectral convolution feature from a spatial feature to a time-series feature to obtain a reduced-dimension spectral convolution feature; and inputting the reduced-dimension spectral convolution feature into the first classification layer, so that the first classification layer classifies the reduced-dimension spectral convolution feature to obtain the predicted phoneme sequence corresponding to the voice audio.
Wherein, the first convolution layer may be constructed using a convolutional neural network.
According to the above technical solution, the first spectral feature is input into the first convolution layer so that the first convolution layer extracts the first spectral convolution feature corresponding to the first spectral feature; the first spectral convolution feature is input into the first full connection layer so that the first full connection layer transforms the dimension type of the first spectral convolution feature from a spatial feature to a time-series feature, obtaining the reduced-dimension spectral convolution feature; and the reduced-dimension spectral convolution feature is then input into the first classification layer so that the first classification layer classifies it, obtaining the predicted phoneme sequence corresponding to the voice audio.
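A minimal PyTorch sketch of such a singing voice recognition model follows; the layer sizes and phoneme-set size are illustrative assumptions rather than limitations of the present application:

```python
import torch
import torch.nn as nn

class SingingVoiceRecognitionModel(nn.Module):
    """Sketch of the described structure: a convolution layer, a full connection
    layer turning spatial features into per-frame (time-series) features, and a
    classification layer; all sizes are assumptions."""
    def __init__(self, n_mels=80, hidden=256, num_phonemes=64):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)  # first convolution layer
        self.fc = nn.Linear(32 * n_mels, hidden)                # first full connection layer
        self.classifier = nn.Linear(hidden, num_phonemes)       # first classification layer

    def forward(self, spec):                  # spec: (batch, 1, n_mels, frames)
        x = torch.relu(self.conv(spec))       # (batch, 32, n_mels, frames)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, frames, 32 * n_mels): time-major
        x = torch.relu(self.fc(x))            # reduced-dimension per-frame feature
        return self.classifier(x).log_softmax(dim=-1)  # per-frame phoneme posteriors
```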
In another embodiment, the melody recognition model comprises a second convolution layer and a second full connection layer, and the inputting the sample song audio to the melody recognition model to obtain the melody vector representation corresponding to the sample song audio includes: extracting a second spectral feature from the sample song audio, and inputting the second spectral feature to the second convolution layer, so that the second convolution layer extracts a second spectral convolution feature corresponding to the second spectral feature; and inputting the second spectral convolution feature to the second full connection layer, so that the second full connection layer transforms the dimension type of the second spectral convolution feature from a spatial feature to a time-series feature to obtain the melody vector representation corresponding to the sample song audio.
Wherein the second convolutional layer may be constructed using a convolutional neural network.
According to the above technical solution, the second spectral feature in the sample song audio is extracted and input to the second convolution layer, so that the second convolution layer extracts the second spectral convolution feature corresponding to the second spectral feature; the second spectral convolution feature is then input to the second full connection layer, which transforms its dimension type from a spatial feature to a time-series feature, obtaining the melody vector representation corresponding to the sample song audio. In this way, a melody vector representation characterizing the melody information of the sample song audio can be extracted efficiently and accurately. Meanwhile, transforming the dimension type of the second spectral convolution feature from a spatial feature to a time-series feature gives the melody vector representation the same dimension type as the predicted phoneme sequence corresponding to the human voice audio, which facilitates the subsequent fusion of the melody vector representation with the predicted phoneme sequence and effectively reduces the subsequent data processing load of the audio recognition model.
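A corresponding PyTorch sketch of the melody recognition model follows; the mean pooling over frames and all layer sizes are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class MelodyRecognitionModel(nn.Module):
    """Sketch of the melody branch: a convolution layer followed by a full
    connection layer that yields the melody vector representation."""
    def __init__(self, n_mels=80, embed_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)  # second convolution layer
        self.fc = nn.Linear(32 * n_mels, embed_dim)             # second full connection layer

    def forward(self, spec):                  # spec: (batch, 1, n_mels, frames)
        x = torch.relu(self.conv(spec))
        x = x.permute(0, 3, 1, 2).flatten(2)  # spatial -> time-series features
        x = self.fc(x)                        # (batch, frames, embed_dim)
        return x.mean(dim=1)                  # pooled melody vector representation
```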
In another embodiment, as shown in fig. 4, there is provided a training method of an audio recognition model including a singing voice recognition model including a first convolution layer, a first full connection layer, and a first classification layer, and a melody recognition model including a second convolution layer and a second full connection layer, comprising the steps of:
step S410, training sample data is obtained, wherein the training sample data comprises song audios of various samples and corresponding lyric texts.
Step S420, inputting the sample song audio to the pre-trained voice separation model to obtain voice audio in the sample song audio.
Step S430, extracting a first spectral feature from the voice audio, and inputting the first spectral feature into the first convolution layer, so that the first convolution layer extracts a first spectral convolution feature corresponding to the first spectral feature.
Step S440, inputting the first spectral convolution feature to the first full connection layer, so that the first full connection layer transforms the dimension type of the first spectral convolution feature from a spatial feature to a time-series feature to obtain a reduced-dimension spectral convolution feature.
Step S450, inputting the reduced-dimension spectral convolution feature into the first classification layer, so that the first classification layer classifies the reduced-dimension spectral convolution feature to obtain a predicted phoneme sequence corresponding to the voice audio.
Step S460, extracting a second spectral feature from the sample song audio, and inputting the second spectral feature into the second convolution layer, so that the second convolution layer extracts a second spectral convolution feature corresponding to the second spectral feature.
Step S470, inputting the second spectral convolution feature to the second full connection layer, so that the second full connection layer transforms the dimension type of the second spectral convolution feature from a spatial feature to a time-series feature to obtain the melody vector representation corresponding to the sample song audio.
Step S480, determining an actual phoneme sequence of the sample song audio based on the lyric text, performing iterative training on the singing voice recognition model according to the difference between the actual phoneme sequence and the predicted phoneme sequence, and performing iterative training on the melody recognition model according to the difference between the melody vector representation and the prototype vector representation until a preset training end condition is met, to obtain a trained audio recognition model.
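Tying steps S410 to S480 together, one joint training step can be sketched as follows in Python/PyTorch; the model interfaces, the prototype loss, and the equal weighting of the two losses are assumptions carried over from the sketches above:

```python
def train_step(singing_model, melody_model, optimizer, vocal_spec, song_spec,
               targets, input_lengths, target_lengths, labels,
               ctc_loss, prototype_loss):
    """One joint step: the CTC loss trains the singing voice branch on the
    actual phoneme sequences, and the prototype loss trains the melody branch."""
    optimizer.zero_grad()
    log_probs = singing_model(vocal_spec).permute(1, 0, 2)  # (frames, batch, phonemes)
    loss_ctc = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    embeddings = melody_model(song_spec)      # melody vector representations
    loss = loss_ctc + prototype_loss(embeddings, labels)  # equal weights assumed
    loss.backward()
    optimizer.step()
    return loss.item()
```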
It should be noted that, for specific limitations on the above steps, reference may be made to the specific limitations of the training method of an audio recognition model described above.
The audio recognition method provided by the embodiment of the application can be applied to the application environment shown in fig. 5. The terminal 502 communicates with the electronic device 504 via a network. The data storage system may store data that the electronic device 504 needs to process; the data storage system may be integrated on the electronic device 504, or may be located on a cloud or other network server. In practical applications, the song-listening recognition function of the terminal 502 may record the received sound signal in real time and generate the music audio to be recognized. The terminal 502 then transmits the music audio to be recognized to the electronic device 504. The electronic device 504 obtains a trained audio recognition model, the trained audio recognition model being obtained by training according to the above training method and comprising a trained singing voice recognition model and a trained melody recognition model; the electronic device 504 inputs the human voice audio in the song audio to be recognized to the trained singing voice recognition model to obtain a predicted phoneme sequence corresponding to the song audio to be recognized, and inputs the song audio to be recognized to the trained melody recognition model to obtain a melody vector representation corresponding to the song audio to be recognized; and the electronic device 504 outputs a song recognition result corresponding to the song audio to be recognized according to the predicted phoneme sequence and the melody vector representation. The terminal 502 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, Internet-of-Things devices and portable wearable devices, where the Internet-of-Things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle-mounted devices, and the like. The portable wearable device may be a smart watch, a smart bracelet, a headset, or the like. The electronic device 504 may be implemented as a stand-alone server or as a server cluster.
In one embodiment, as shown in fig. 6, an audio recognition method is provided, and the method is applied to the electronic device in fig. 5 for illustration, and includes the following steps:
Step S602, a trained audio recognition model is obtained.
Step S604, inputting the voice audio in the song audio to be identified into the trained singing voice identification model to obtain a predicted phoneme sequence corresponding to the song audio to be identified, and inputting the song audio to be identified into the trained melody identification model to obtain a melody vector representation corresponding to the song audio to be identified.
In a specific implementation, the electronic device obtains a trained audio recognition model, where the trained audio recognition model includes a trained singing voice recognition model and a trained melody recognition model. It should be noted that, for the training method, reference may be made to the specific limitations of the training method of the audio recognition model above, which are not repeated here.
Specifically, the electronic device may input the human voice audio in the song audio to be recognized (e.g., cover audio) to the singing voice recognition model in the trained audio recognition model to generate a predicted phoneme sequence of the song audio to be recognized, and input the song audio to be recognized to the melody recognition model in the trained audio recognition model to generate a melody vector representation corresponding to the song audio to be recognized.
Step S606, outputting a song recognition result corresponding to the song audio to be recognized according to the predicted phoneme sequence and the melody vector representation.
In a specific implementation, the electronic device can fuse the predicted phoneme sequence and the melody vector representation to obtain a fused vector representation, and determine a song recognition result corresponding to the song audio to be recognized from the fused vector representation. Specifically, the electronic device may input the fused vector representation to a pre-trained song recognition network, classify the fused vector representation by the pre-trained song recognition network, and determine the song recognition result corresponding to the song audio to be recognized based on the classification result.
Of course, the electronic device may also determine the song recognition result corresponding to the song audio to be recognized by adopting a method of performing similarity matching between the fused vector and the song vector representation corresponding to each candidate song.
According to the audio recognition method, the trained audio recognition model is obtained, the voice audio in the song audio to be recognized is input into the trained singing voice recognition model to obtain the predicted phoneme sequence corresponding to the song audio to be recognized, the song audio to be recognized is input into the trained melody recognition model to obtain the melody vector representation corresponding to the song audio to be recognized, and then the song recognition result corresponding to the song audio to be recognized is output according to the predicted phoneme sequence and the melody vector representation, so that the audio recognition model can understand semantic information in songs and melody information in songs, more fully mine similar information in the songs, and further achieve a better recognition effect.
In another embodiment, the outputting the song recognition result corresponding to the song audio to be recognized according to the predicted phoneme sequence and the melody vector representation includes: obtaining a phoneme sequence vector representation corresponding to the predicted phoneme sequence; fusing the phoneme sequence vector representation and the melody vector representation to obtain a fused vector representation; and generating the song recognition result corresponding to the song audio to be recognized according to the fused vector representation.
In a specific implementation, the electronic device fuses the vector representation corresponding to the predicted phoneme sequence (the phoneme sequence embedding) and the melody vector representation corresponding to the song audio to be recognized (the melody embedding) through the fully-connected fusion layer in the trained audio recognition model, and takes the fused feature vector as the song vector representation corresponding to the song audio to be recognized.
The electronic device can respectively acquire a phoneme weight corresponding to the phoneme sequence vector representation and a melody weight corresponding to the melody vector representation in the process of fusing the phoneme sequence vector representation and the melody vector representation to obtain the fused vector representation, and performs weighted summation on the phoneme sequence vector representation and the melody vector representation according to the phoneme weight and the melody weight to obtain the fused vector representation.
Of course, in the process of fusing the phoneme sequence vector representation and the melody vector representation to obtain the fused vector representation, the electronic device may instead splice the phoneme sequence vector representation and the melody vector representation to obtain a spliced vector representation, and take the spliced vector representation as the fused vector representation.
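Both fusion options can be sketched in Python as follows; the function name and default weights are illustrative assumptions:

```python
import torch

def fuse(phoneme_vec, melody_vec, w_phoneme=0.5, w_melody=0.5, mode='weighted'):
    """Fuse the phoneme sequence vector representation and the melody vector
    representation by weighted summation or by splicing (concatenation)."""
    if mode == 'weighted':
        return w_phoneme * phoneme_vec + w_melody * melody_vec
    return torch.cat([phoneme_vec, melody_vec], dim=-1)  # spliced representation
```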
Then, the electronic device can generate a song recognition result corresponding to the song audio to be recognized according to the fused vector representation.
Specifically, the electronic device may calculate the similarity between the song vector representation corresponding to the song audio to be identified and each candidate song vector representation, and use the song corresponding to the candidate song vector representation with the highest similarity as the song identification result corresponding to the song audio to be identified.
According to the above technical solution, the phoneme sequence vector representation corresponding to the predicted phoneme sequence is obtained, the phoneme sequence vector representation and the melody vector representation are fused to obtain the fused vector representation, and the song recognition result corresponding to the song audio to be recognized is generated according to the fused vector representation, so that the audio recognition model can understand both the semantic information and the melody information in songs, more fully mine similar information in songs, and thus achieve a better recognition effect.
In another embodiment, the generating the song recognition result corresponding to the song audio to be recognized according to the fused vector representation includes: obtaining candidate song vector representations corresponding to the candidate song audios; determining the similarity between the song vector representation corresponding to the song audio to be recognized and each candidate song vector representation; and taking the audio identification information of the candidate song audio corresponding to the candidate song vector representation with the highest similarity as the song recognition result corresponding to the song audio to be recognized.
Wherein the audio identification information may refer to song titles.
In a specific implementation, after determining the song vector representation corresponding to the song audio to be recognized, the electronic device obtains the candidate song vector representations corresponding to the candidate song audios. It should be noted that the electronic device may obtain the candidate song vector representation corresponding to each candidate song audio by the same method as determining the song vector representation corresponding to the song audio to be recognized above, which is not repeated here. The electronic device then calculates the similarity (e.g., cosine similarity or vector distance) between the song vector representation corresponding to the song audio to be recognized and each candidate song vector representation. Then, the electronic device takes the audio identification information of the candidate song audio corresponding to the candidate song vector representation with the highest similarity as the song recognition result for the song audio to be recognized.
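A minimal Python/PyTorch sketch of this retrieval step follows, taking cosine similarity as the similarity measure; the function and variable names are assumptions:

```python
import torch.nn.functional as F

def recognize(song_vec, candidate_vecs, candidate_ids):
    """Return the identification information of the candidate song whose vector
    representation is most similar to that of the song audio to be recognized."""
    # song_vec: (dim,); candidate_vecs: (num_candidates, dim)
    sims = F.cosine_similarity(song_vec.unsqueeze(0), candidate_vecs, dim=-1)
    return candidate_ids[int(sims.argmax())]
```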
For example, given that the similarity between the song vector representation corresponding to the cover audio and the song vector representation of song a is 80%, and the similarity between the song vector representation corresponding to the cover audio and the song vector representation of song b is 60%, the song title of song a is taken as the song recognition result of the cover audio.
According to the above technical solution, the candidate song vector representations corresponding to the candidate song audios are obtained, the similarity between the song vector representation corresponding to the song audio to be recognized and each candidate song vector representation is determined, and the audio identification information of the candidate song audio corresponding to the candidate song vector representation with the highest similarity is taken as the song recognition result of the song audio to be recognized, thereby improving the song recognition accuracy for the song audio to be recognized and facilitating the subsequent labeling of the song audio to be recognized with the audio identification information.
In another embodiment, as shown in fig. 7, an audio recognition method is provided, and the method is applied to the terminal in fig. 5 for illustration, and includes the following steps:
step S710, a trained audio recognition model is obtained, wherein the trained audio recognition model comprises a trained singing voice recognition model and a trained melody recognition model.
Step S720, inputting the voice audio in the song audio to be identified into the trained singing voice identification model to obtain a predicted phoneme sequence corresponding to the song audio to be identified, and inputting the song audio to be identified into the trained melody identification model to obtain a melody vector representation corresponding to the song audio to be identified.
Step S730, obtaining a phoneme sequence vector representation corresponding to the predicted phoneme sequence.
Step S740, fusing the phoneme sequence vector representation and the melody vector representation to obtain a fused vector representation.
Step S750, candidate song vector representations corresponding to each candidate song audio are obtained.
Step S760, determining a similarity between the song vector representation corresponding to the song audio to be identified and each candidate song vector representation.
Step S770, taking the audio identification information of the candidate song audio corresponding to the candidate song vector representation with the highest similarity as the song recognition result corresponding to the song audio to be recognized.
It should be noted that, for specific limitations on the above steps, reference may be made to the specific limitations of the audio recognition method described above.
For ease of understanding by those skilled in the art, fig. 8 illustratively provides a model block diagram of a method of training an audio recognition model. As shown in fig. 8, in the process of training the audio recognition model, the electronic device may acquire music audio corresponding to each song and lyric text of each song as training sample data for the audio recognition model.
The electronic device may then extract a first spectral feature of the human voice audio in the sample song audio and input the first spectral feature to the singing voice recognition model. Specifically, the electronic device may input the first spectral feature to the first convolution network 810 in the singing voice recognition model to obtain a first spectral convolution feature, then input the first spectral convolution feature to the first full connection layer 820 in the singing voice recognition model to obtain a reduced-dimension spectral convolution feature, and then input the reduced-dimension spectral convolution feature to the classification layer 830 (softmax layer) to obtain a predicted phoneme sequence corresponding to the voice audio. Then, the electronic device may determine a loss function value of the singing voice recognition model by using a preset loss function (e.g., a CTC (Connectionist Temporal Classification) loss function) with the actual phoneme sequence and the predicted phoneme sequence, and update the network parameters (e.g., weights and biases) in the singing voice recognition model by using the loss function value until a preset training end condition is met, thereby obtaining a trained singing voice recognition model.
In addition, the electronic device extracts a second spectral feature from the sample song audio and inputs the second spectral feature to the second convolution network 840 to obtain a second spectral convolution feature; then, the electronic device inputs the second spectral convolution feature to the second full connection layer 850 to obtain a melody vector representation (embedding) corresponding to the sample song audio. Then, the electronic device may perform iterative training on the melody recognition model according to the difference (e.g., the vector distance) between the melody vector representations and the prototype vector representations until a preset training target is met, to obtain a trained melody recognition model, so that the trained melody recognition model learns target melody vector representations corresponding to the various sample song audios.
The electronic device may input the human voice audio in the song audio to be recognized to the trained singing voice recognition model to obtain a predicted phoneme sequence corresponding to the song audio to be recognized, and input the song audio to be recognized to the trained melody recognition model to obtain a melody vector representation corresponding to the song audio to be recognized. Then, a song vector representation corresponding to the song audio to be recognized is determined by the softmax layer 860 using the predicted phoneme sequence and the melody vector representation corresponding to the song audio to be recognized, and the similarity (e.g., cosine similarity or vector distance) between the song vector representation corresponding to the song audio to be recognized and each candidate song vector representation is calculated. The audio identification information of the candidate song audio corresponding to the candidate song vector representation with the highest similarity is then taken as the song recognition result for the song audio to be recognized.
In another embodiment, the singing voice recognition model comprises a first convolution layer, a first full connection layer and a first classification layer, and the inputting the first spectral feature into the singing voice recognition model to obtain the predicted phoneme sequence corresponding to the voice audio includes: inputting the first spectral feature into the first convolution layer, so that the first convolution layer extracts a first spectral convolution feature corresponding to the first spectral feature; inputting the first spectral convolution feature into the first full connection layer, so that the first full connection layer transforms the dimension type of the first spectral convolution feature from a spatial feature to a time-series feature to obtain a reduced-dimension spectral convolution feature; and inputting the reduced-dimension spectral convolution feature into the first classification layer, so that the first classification layer classifies the reduced-dimension spectral convolution feature to obtain the predicted phoneme sequence corresponding to the voice audio.
The electronic device may input the sample song audio to the melody recognition model, and in the process of obtaining the melody vector representation corresponding to the sample song audio, the electronic device may extract a Mel Frequency Cepstrum Coefficient (MFCC) corresponding to the sample song audio, so as to implement extraction of a second spectral feature (i.e., an acoustic feature) corresponding to the sample song audio.
Then, the electronic device inputs the second spectral feature to the second convolution layer, so that the second convolution layer extracts a second spectral convolution feature corresponding to the second spectral feature. And the electronic equipment inputs the second spectrum convolution characteristic to the second full-connection layer so that the second full-connection layer transforms the dimension type of the second spectrum convolution characteristic from the space characteristic to the time sequence characteristic to obtain melody vector representation corresponding to the sample song audio.
The electronic device may determine a loss function value of the singing voice recognition model by applying a preset loss function (e.g., the CTC (Connectionist Temporal Classification) loss function) to the actual phoneme sequence and the predicted phoneme sequence, and update the network parameters (e.g., weights and biases) of the singing voice recognition model by using the loss function value until a preset training ending condition is met, thereby obtaining a trained singing voice recognition model. In the process of iteratively training the melody recognition model according to the difference between the melody vector representation and the prototype vector representation, the electronic device may cluster the melody vector representations to obtain clustered vector representations of at least one category, and then determine the prototype vector representation corresponding to each category based on the distance relation between the clustered vector representations of the same category. Specifically, for the clustered vector representations of any category, the electronic device may determine, from the mean of the clustered vector representations of that category, the class center corresponding to that category, i.e., the prototype vector representation.
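As a non-limiting illustration, the two computations in this paragraph may be sketched as follows: the CTC loss uses PyTorch's built-in nn.CTCLoss, and the prototype vector representations are the per-category means of the clustered melody embeddings, with the cluster assignments here produced by k-means (the choice of clustering algorithm is an assumption of the example).

    import torch
    import torch.nn as nn
    from sklearn.cluster import KMeans

    # CTC loss between the predicted phoneme log-probs and the actual phoneme sequence:
    # loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

    def compute_prototypes(embeddings: torch.Tensor, n_clusters: int) -> torch.Tensor:
        # Cluster the melody vector representations, then take the mean of each
        # cluster as that category's prototype vector representation (class center).
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
            embeddings.detach().cpu().numpy())
        labels = torch.as_tensor(labels)
        return torch.stack([embeddings[labels == c].mean(dim=0) for c in range(n_clusters)])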
The electronic device may perform iterative training on the melody recognition model according to the difference (e.g., the vector distance) between the melody vector representation and the prototype vector representation until a preset training target is met, so as to obtain a trained melody recognition model; in this way, the trained melody recognition model can learn the target melody vector representations corresponding to the various sample song audios.
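As a non-limiting illustration, one training step under this objective may minimize the squared vector distance between each melody embedding and the prototype of its assigned category; the squared-distance form is an assumption of the example.

    import torch

    def prototype_loss(embeddings: torch.Tensor, prototypes: torch.Tensor,
                       labels: torch.Tensor) -> torch.Tensor:
        # Mean squared vector distance between each melody embedding and the
        # prototype of its assigned category; minimizing this pulls the embeddings
        # toward their class centers over the training iterations.
        return ((embeddings - prototypes[labels]) ** 2).sum(dim=1).mean()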
The electronic device may input the vocal audio of the song audio to be identified (e.g., cover-song audio) to the singing voice recognition model in the trained audio recognition model to generate a predicted phoneme sequence of the song audio to be identified, and input the song audio to be identified to the melody recognition model in the trained audio recognition model to generate a melody vector representation corresponding to the song audio to be identified. The electronic device may then fuse the vector representation corresponding to the predicted phoneme sequence (the phoneme sequence embedding) with the melody vector representation corresponding to the song audio to be identified (the melody embedding), and output the song recognition result corresponding to the song audio to be identified based on the fused vector representation.
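As a non-limiting illustration, the fusion step may be sketched as a simple concatenation of the two representations; concatenation is used here as an assumption, since the text above does not fix the fusion operator.

    import torch

    def fuse(phoneme_embedding: torch.Tensor, melody_embedding: torch.Tensor) -> torch.Tensor:
        # Concatenate the phoneme sequence representation and the melody
        # representation into a single song vector representation.
        return torch.cat([phoneme_embedding, melody_embedding], dim=-1)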
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps, sub-steps, or stages.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement an audio recognition method.
It will be appreciated by persons skilled in the art that the structure shown in fig. 9 is merely a block diagram of part of the structure relevant to the present application and does not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided that includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the training method of an audio recognition model and the audio recognition method described above. These steps may be the steps of the training method of an audio recognition model and the audio recognition method in the above embodiments.
In one embodiment, a computer program product is provided that includes a computer program which, when executed by a processor, causes the processor to perform the steps of the training method of an audio recognition model and the audio recognition method described above. These steps may be the steps of the training method of an audio recognition model and the audio recognition method in the above embodiments.
The user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) involved in the present application are information and data authorized by the user or fully authorized by all parties.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may include the steps of the embodiments of the methods described above. Any reference to the memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. The volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, the RAM may take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The foregoing examples represent only a few embodiments of the present application, and although they are described in detail, they should not be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the present application, and all of these fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.