Song generation method, song generation device, electronic equipment and storage medium

Technical Field

The disclosure relates to the technical field of computers, and in particular to a song generating method, a song generating device, electronic equipment and a storage medium.
Background

Song synthesis refers to the generation of corresponding singing audio from lyrics and a musical score. The corresponding synthesis algorithms have evolved from the early unit-splicing (concatenative) technology, through statistical parametric synthesis, to the current deep-learning-based synthesis. Song synthesis technology enables a machine to sing, which further increases the interest of human-machine interaction and therefore has high commercial value.
In the related art, song synthesis generally places high requirements on the quantity and quality of the training corpus, so the song generation process is complicated and the song generation effect cannot be guaranteed.
Disclosure of Invention
The embodiments of the disclosure provide a song generation method, a song generation device, electronic equipment and a storage medium, which can be applied to the technical field of data processing. In the song generation process, the real Mel spectrum characteristics of a target user are effectively combined with the song template corresponding to the target song, which effectively reduces the dependence on the amount of user voice data, thereby improving the convenience of song generation while effectively improving the song generation effect.
In a first aspect, an embodiment of the present disclosure provides a song generating method, including:
acquiring voice audio input by a target user and a unique identification number of a target song;
extracting the Mel spectrum characteristics of the voice audio to obtain the real Mel spectrum characteristics of the target user;
obtaining a song template corresponding to the unique identification number according to the unique identification number of the target song;
inputting the real Mel spectrum characteristics of the target user and the song template into a preset song generation model to obtain target Mel spectrum characteristics output by the song generation model, wherein the song generation model is obtained through machine learning training by using a training set, the training set is from a plurality of sampling users, the training set comprises a plurality of samples, each sampling user corresponds to at least one sample, and each sample comprises singing audio picked up when the sampling user sings a certain song and lyric text corresponding to the singing audio;
and generating a target song according to the target Mel spectrum characteristics.
In a second aspect, an embodiment of the present disclosure provides a training method of a song generating model, including:
acquiring a training set from a plurality of sampling users, wherein the training set comprises a plurality of samples, each sampling user corresponds to at least one sample, and each sample comprises singing audio picked up when the sampling user sings a certain song and lyric text corresponding to the singing audio;
acquiring a pre-built initial neural network model, wherein the initial neural network model comprises initial weight parameters and a loss function;
acquiring a first sample from the training set, inputting the first sample into the initial neural network model to obtain real Mel spectrum characteristics and predicted Mel spectrum characteristics, wherein the real Mel spectrum characteristics represent Mel spectrum characteristics of singing audio in the first sample, and the predicted Mel spectrum characteristics represent Mel spectrum characteristics predicted by the initial neural network model;
calculating an error between the predicted Mel spectrum characteristics and the real Mel spectrum characteristics according to the loss function;
adjusting the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model;
and acquiring subsequent samples one by one from the training set and repeatedly inputting them into the latest neural network model until the loss function converges, so as to obtain a trained song generation model.
In a third aspect, an embodiment of the present disclosure provides a song generating apparatus, including a first obtaining module, configured to obtain voice audio input by a target user and a unique identification number of the target song;
The first processing module is used for extracting the Mel spectrum characteristics of the voice audio to obtain the real Mel spectrum characteristics of the target user;
The second acquisition module is used for acquiring a song template corresponding to the unique identification number according to the unique identification number of the target song;
The second processing module is used for inputting the real Mel spectrum characteristics of the target user and the song template into a preset song generation model to obtain target Mel spectrum characteristics output by the song generation model, wherein the song generation model is obtained through machine learning training by using a training set, the training set is from a plurality of sampling users, the training set comprises a plurality of samples, each sampling user corresponds to at least one sample, and each sample comprises singing audio picked up when the sampling user sings a certain song and lyric text corresponding to the singing audio;
And the generating module is used for generating target songs according to the target Mel spectrum characteristics.
In a fourth aspect, an embodiment of the present disclosure provides a training apparatus for a song generating model, including:
The third acquisition module is used for acquiring a training set from a plurality of sampling users, wherein the training set comprises a plurality of samples, each sampling user corresponds to at least one sample, and each sample comprises singing audio picked up when the sampling user sings a certain song and lyric text corresponding to the singing audio;
The fourth acquisition module is used for acquiring a pre-built initial neural network model, wherein the initial neural network model comprises initial weight parameters and a loss function;
A fifth obtaining module, configured to obtain a first sample from the training set, and input the first sample into the initial neural network model, to obtain a real mel spectrum feature and a predicted mel spectrum feature, where the real mel spectrum feature represents a mel spectrum feature of singing audio in the first sample, and the predicted mel spectrum feature represents a mel spectrum feature predicted by the initial neural network model;
The third processing module is used for calculating errors between the predicted Mel-spectrum characteristics and the real Mel-spectrum characteristics according to the loss function;
the fourth processing module is used for adjusting the initial weight parameters of the initial neural network model according to the errors to obtain an updated neural network model;
And a sixth acquisition module, configured to acquire subsequent samples from the training set one by one, and repeatedly input the subsequent samples to the latest neural network model until the loss function converges, so as to obtain a song generation model after training is completed.
In a fifth aspect, embodiments of the present disclosure provide an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the song generation method set forth in the embodiments of the first aspect of the present disclosure, or implements the training method of a song generation model set forth in the embodiments of the second aspect of the present disclosure.
In a sixth aspect, embodiments of the present disclosure provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a song generating method as set forth in the embodiments of the first aspect of the present disclosure, or implements a training method of a song generating model as set forth in the embodiments of the second aspect of the present disclosure.
In a seventh aspect, embodiments of the present disclosure provide a computer program product, which when executed by a processor, performs a song generation method as set forth in the embodiments of the first aspect of the present disclosure, or performs a training method of a song generation model as set forth in the embodiments of the second aspect of the present disclosure.
In summary, the song generating method, apparatus, electronic device, storage medium, computer program and computer program product provided in the embodiments of the present disclosure may implement the following technical effects:
Voice audio input by a target user and the unique identification number of the target song are obtained; Mel spectrum feature extraction is performed on the voice audio to obtain the real Mel spectrum characteristics of the target user; a song template corresponding to the unique identification number is obtained according to the unique identification number of the target song; the real Mel spectrum characteristics of the target user and the song template are input into a preset song generation model to obtain target Mel spectrum characteristics output by the song generation model; and the target song is generated according to the target Mel spectrum characteristics. In the song generation process, the real Mel spectrum characteristics of the target user and the song template corresponding to the target song can thus be effectively combined, so that the dependence on the amount of user voice data is effectively reduced, thereby improving the convenience of song generation while effectively improving the song generation effect.
Drawings

In order to more clearly illustrate the technical solutions in the embodiments or the background of the present disclosure, the following description will explain the drawings that are required to be used in the embodiments or the background of the present disclosure.
FIG. 1 is a flow chart of a song generation method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a song generation method according to another embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a song template generation process according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a song generation method according to another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a timbre coding submodel according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a song generation process according to an embodiment of the present disclosure;
FIG. 7 is a flow chart of a training method of a song generation model according to an embodiment of the present disclosure;
FIG. 8 is a training flow diagram of an initial neural network model according to an embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of a song generating apparatus according to an embodiment of the present disclosure;
Fig. 10 is a schematic structural diagram of a song generating apparatus according to another embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a training device for a song-generation model according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a training device for song-generation models according to another embodiment of the present disclosure;
Fig. 13 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of embodiments of the present disclosure as detailed in the accompanying claims.
The terminology used in the embodiments of the disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the disclosure. As used in this disclosure of embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of embodiments of the present disclosure. The word "if" as used herein may be interpreted as "at the time of", "when" or "in response to determining", depending on the context.
For ease of understanding, the terms referred to in this disclosure are first introduced.
1. Mel spectrum
The Mel spectrum is a feature commonly used in deep learning for speech. An ordinary spectrogram has a linear frequency axis, whereas the Mel spectrum converts the frequency of the ordinary spectrogram from the linear scale to the Mel scale based on the characteristics of human hearing (relatively sensitive to low-frequency sounds and with relatively poor resolving power for high-frequency sounds); the Mel scale is a logarithmic scale that more closely matches human perception of frequency.
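As a purely illustrative sketch (not part of the disclosure), the conversion from frequency in hertz to the mel scale is commonly computed with the logarithmic formula below; the function name and the example frequencies are chosen only for illustration.

```python
import math

def hz_to_mel(frequency_hz: float) -> float:
    # Common HTK-style mel-scale formula: compresses high frequencies
    # logarithmically, mirroring the ear's reduced resolution there.
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

# 100 Hz vs 200 Hz are much farther apart on the mel scale than
# 7000 Hz vs 7100 Hz, reflecting higher sensitivity at low frequencies.
print(hz_to_mel(200) - hz_to_mel(100))    # roughly 133 mel
print(hz_to_mel(7100) - hz_to_mel(7000))  # roughly 15 mel
```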
2. Phonemes
A phoneme is the smallest phonetic unit divided according to the natural attributes of speech; it can be analyzed according to the articulatory actions within a syllable, where one action forms one phoneme. Phonemes are divided into two major classes: vowels and consonants.
3. Tone color
Tone color (timbre) means that different sounds always have distinctive characteristics in terms of waveform, because different objects vibrate in different ways.
Fig. 1 is a flowchart of a song generating method according to an embodiment of the present disclosure.
It should be noted that, the main execution body of the song generating method in this embodiment is a song generating device, and the device may be implemented in a software and/or hardware manner, and the device may be configured in an electronic device, where the electronic device may include, but is not limited to, a terminal, a server, and the like, and the terminal may be, for example, a smart phone, a smart television, a smart watch, a smart car, and the like.
As shown in fig. 1, the song generating method may include, but is not limited to, the following steps:
Step S101, acquiring voice audio input by a target user and a unique identification number of the target song.
The target user refers to a user who uses the song generating method. The voice audio refers to audio data input by the target user, and the voice audio may be audio data of the target user or audio data of other users, which is not limited. The target song refers to the song to be generated by the song generating method. The unique identification number refers to identification information, such as a number or a name, corresponding to the target song.
It will be appreciated that the number of candidate target songs may be plural, and accurate positioning of the target song during song generation can be achieved once the unique identification number of the target song is obtained.
In the embodiment of the present disclosure, when the voice audio input by the target user is acquired, an audio acquisition device may be configured in advance in the execution body of the embodiment of the present disclosure, and the voice audio of the target user is then acquired by the audio acquisition device; alternatively, a data interface may be configured in advance in the execution body of the embodiment of the present disclosure, a song generation request is received via the data interface, and the voice audio is then obtained by parsing the song generation request, which is not limited.
In the embodiment of the disclosure, when the unique identification number of the target song is obtained, a relationship table may be adopted, in which the unique identification number corresponding to the target song may be recorded, or a database may be established in advance based on the mapping relationship between a plurality of target songs and the unique identification number of the target song, and then the corresponding unique identification number is obtained from the database based on the target song, which is not limited.
Step S102, extracting the Mel spectrum characteristics of the voice audio to obtain the real Mel spectrum characteristics of the target user.
The Mel spectrum is a spectrogram extracted from audio data and belongs to a logarithmic spectrum. The Mel spectrum characteristics refer to the feature information corresponding to the Mel spectrum. It will be appreciated that the pitch heard by the human ear is not linearly related to the actual frequency in Hz, and the Mel spectrum characteristics are more consistent with the auditory characteristics of the human ear. The real Mel spectrum characteristics refer to the Mel spectrum characteristics obtained from the above-mentioned voice audio.
In the embodiment of the disclosure, when the Mel spectrum feature extraction is performed on the voice audio to obtain the real Mel spectrum feature of the target user, the feature extraction of the voice audio can be realized, so that reliable reference data is provided for the song generation process.
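A minimal sketch of this extraction step, assuming the librosa library is used and assuming illustrative parameter values (sampling rate, frame length, hop length, number of mel bands); the disclosure itself does not prescribe a particular library or parameter set.

```python
import librosa
import numpy as np

def extract_mel_features(audio_path: str,
                         sample_rate: int = 22050,
                         n_fft: int = 1024,
                         hop_length: int = 256,
                         n_mels: int = 80) -> np.ndarray:
    """Return a (n_mels, n_frames) log-mel spectrogram for one voice clip."""
    waveform, sr = librosa.load(audio_path, sr=sample_rate)
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Log compression is typically applied before feeding the features to a model.
    return librosa.power_to_db(mel, ref=np.max)
```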
Step S103, obtaining a song template corresponding to the unique identification number according to the unique identification number of the target song.
The song template refers to a template describing relevant information of a target song.
Optionally, in some embodiments, the song template may include lyric text information and song melody information, where the lyric text information includes a phoneme sequence and a phoneme duration, and the song melody information includes a song note sequence and a song energy sequence, so that the characterizing content in the song template may be enriched to a greater extent, so as to provide comprehensive reference information of the target song for the song generating model, so as to effectively improve the applicability of the song template in the song generating process.
The lyric text information refers to the text information of the lyrics corresponding to the target song. The song melody information may be used to describe the related information corresponding to the melody of the target song. A phoneme refers to the smallest phonetic unit divided according to the natural attributes of speech, and a phoneme sequence refers to a sequence composed of a plurality of phonemes. The phoneme duration refers to the duration information corresponding to a phoneme. A note is a symbol used to record tones of different lengths, and the song note sequence refers to a sequence formed by the notes corresponding to the song audio.
The energy may refer to energy contained in song audio, such as sound intensity, and a song energy sequence may be used to describe energy changes corresponding to song audio corresponding to different time points.
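To make the composition of the song template concrete, the following sketch models the fields described above as a simple data structure; the class name and field names are illustrative assumptions rather than names used by the disclosure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SongTemplate:
    song_id: str                  # unique identification number of the target song
    phonemes: List[str]           # phoneme sequence parsed from the song lyrics
    phoneme_durations: List[int]  # number of audio frames occupied by each phoneme
    note_sequence: List[int]      # note number per frame (song melody information)
    energy_sequence: List[int]    # quantized energy code per frame
```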
In the embodiment of the disclosure, when the song template corresponding to the unique identification number of the target song is obtained according to the unique identification number of the target song, a plurality of song templates may be obtained in advance, and then matching processing is performed based on the unique identification number and the plurality of song templates, so as to obtain the song template corresponding to the unique identification number, or the song template corresponding to the unique identification number may be obtained by the third party search device according to the unique identification number of the target song, which is not limited.
Step S104, inputting the real Mel spectrum characteristics of the target user and a song template into a preset song generation model to obtain target Mel spectrum characteristics output by the song generation model, wherein the song generation model is obtained by machine learning training by using a training set, the training set is from a plurality of sampling users, the training set comprises a plurality of samples, each sampling user corresponds to at least one sample, and each sample comprises singing audio picked up when the sampling user sings a certain song and lyric text corresponding to the singing audio.
The song generation model is a model used to process the real Mel spectrum characteristics and the song template and to output the target Mel spectrum characteristics. The song generation model may be a neural network model. The target Mel spectrum characteristics refer to the Mel spectrum characteristics obtained after the song generation model processes the real Mel spectrum characteristics of the target user and the song template. The training set refers to the sample set used in the training process of the song generation model.
Wherein, the sampling user refers to a user who provides a sample for the training process of the song generating model. And a sample may refer to singing audio and lyrics text used for model training. And singing audio refers to audio picked up by sampling when a user sings a certain song.
Optionally, in some embodiments, the song generating model includes a tone color coding sub-model, a text coding sub-model and an acoustic decoding sub-model, and the song generating model is obtained by jointly training the tone color coding sub-model, the text coding sub-model and the acoustic decoding sub-model by using the same training set, so that the structural rationality of the song generating model can be effectively improved, and when the tone color coding sub-model, the text coding sub-model and the acoustic decoding sub-model are jointly trained by using the same training set, the consistency among all the sub-models can be effectively improved, thereby effectively improving the output accuracy of the obtained song generating model.
The tone color refers to different characteristics of different sounds in terms of waveforms, and different object vibrations have different characteristics, that is, different tone colors of different users are different. The timbre coding submodel refers to a model used for processing the real mel spectrum characteristics to obtain timbre characteristic vectors of a target user.
The text coding sub-model refers to a model used for processing a phoneme sequence to obtain a text feature vector corresponding to a target song.
The acoustic decoding submodel refers to a model used for processing a plurality of feature information to obtain a target mel spectrum feature, and the acoustic decoding submodel can be a decoder of a fast end-to-end non-autoregressive synthesis system.
In the embodiment of the disclosure, when the real Mel-spectrum characteristics of the target user and the song template are input into the preset song generation model to obtain the target Mel-spectrum characteristics output by the song generation model, the real Mel-spectrum characteristics and the related information of the song template can be fused rapidly and accurately based on the song generation model, so that the model generation efficiency is effectively improved.
Step S105, generating a target song according to the target Mel spectrum characteristics.
When the target song is generated according to the target Mel spectrum characteristics, the target Mel spectrum characteristics are input into a vocoder, and the vocoder analyzes and processes the target Mel spectrum characteristics to obtain the target song.
For example, when generating a target song according to the target mel spectrum feature, the target mel spectrum feature may be input into a preset vocoder model to obtain a target linear spectrum feature, and the linear spectrum feature is subjected to inverse fourier transform to obtain audio data of the target song.
The vocoder model is a neural network model, and the vocoder model is also obtained by machine learning training, using a training set different from that of the song generation model. The vocoder model may be based on a generative adversarial network (GAN), a distillation-free adversarial generative network, or the like, and the training set may be a training set commonly used in the art.
The training process of the vocoder model may be as follows: input the real Mel spectrum characteristics in one sample into the built initial model to obtain predicted linear spectrum characteristics, calculate the error between the predicted linear spectrum characteristics and the real linear spectrum characteristics in the sample through a loss function, modify the weights of the initial model according to the error, and repeat this process with further samples until the loss function converges, so as to obtain the trained vocoder model.
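A minimal sketch of the mel-to-waveform step, assuming a PyTorch-style neural vocoder; the vocoder interface shown here is a hypothetical stand-in, since the disclosure does not fix a particular vocoder implementation.

```python
import numpy as np
import torch

def mel_to_waveform(target_mel: np.ndarray, vocoder: torch.nn.Module) -> np.ndarray:
    """Convert target mel features of shape (n_mels, n_frames) to audio samples.

    `vocoder` stands for any trained neural vocoder mapping mel spectra to
    waveforms (e.g. a GAN-based model); its exact interface will differ per
    implementation, so this call is only illustrative.
    """
    with torch.no_grad():
        mel = torch.from_numpy(target_mel).float().unsqueeze(0)  # add batch dimension
        audio = vocoder(mel)                                     # assumed shape (1, n_samples)
    return audio.squeeze(0).cpu().numpy()
```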
In the embodiments of the disclosure, voice audio input by a target user and the unique identification number of the target song are obtained; Mel spectrum feature extraction is performed on the voice audio to obtain the real Mel spectrum characteristics of the target user; a song template corresponding to the unique identification number is obtained according to the unique identification number of the target song; the real Mel spectrum characteristics of the target user and the song template are input into a preset song generation model to obtain target Mel spectrum characteristics output by the song generation model; and the target song is generated according to the target Mel spectrum characteristics. The real Mel spectrum characteristics of the target user and the song template corresponding to the target song can thus be effectively combined in the song generation process, so that the dependence on the amount of user voice data is effectively reduced, and song generation convenience is improved while the song generation effect is effectively improved.
Fig. 2 is a flow chart illustrating a song generating method according to another embodiment of the present disclosure.
As shown in fig. 2, the song generating method may include, but is not limited to, the following steps:
Step S201, obtaining the voice audio input by the target user and the unique identification number of the target song.
Step S202, extracting the Mel spectrum characteristics of the voice audio to obtain the real Mel spectrum characteristics of the target user.
Step S203, obtaining a song template corresponding to the unique identification number according to the unique identification number of the target song.
The descriptions of step S201 to step S203 may be specifically referred to the above embodiments, and are not repeated herein.
Step S204, inputting the real Mel spectrum characteristics of the target user into a tone coding submodel to obtain tone characteristic vectors of the target user.
The tone color feature vector refers to a vector used for representing the tone color feature corresponding to the target user.
Step S205, inputting the phoneme sequence into a text coding submodel to obtain text feature vectors of the lyric text in the song template.
The lyric text refers to text data describing corresponding lyric information of a target song in a song template.
The text feature vector refers to a vector used for representing corresponding text features of the lyric text.
Optionally, in some embodiments, the song template is configured by a phoneme sequence, a phoneme duration, a song note sequence, a song energy sequence of the target song, and a unique identification number of the target song, where the phoneme sequence and the phoneme duration of the target song are determined by song audio and song lyrics of the target song, and the song note sequence and the song energy sequence of the target song are determined by song audio, so that quick positioning of the target song can be achieved based on the unique identification number, so as to effectively improve the practicability of the obtained song template, and meanwhile, effectively improve the characterization accuracy of the song template on relevant information of the target lyrics.
The song audio refers to singing audio corresponding to the target song. And song lyrics refer to lyric information corresponding to a target song.
Optionally, in some embodiments, the phoneme sequence includes a plurality of phonemes obtained by parsing song lyrics, and the phoneme duration includes a first frame number occupied by each phoneme in song audio, so that suitability between the obtained phoneme sequence and the song lyrics can be effectively improved, and meanwhile accuracy of the obtained first frame number on the corresponding phonemes can be effectively improved.
The first frame number refers to the number of audio frames corresponding to a phoneme in the song audio.
Optionally, in some embodiments, the song energy sequence is obtained by quantization processing of a song energy feature of the song audio, and the song note sequence is obtained by quantization processing of a song fundamental frequency feature of the song audio, so that the clarity of characterization of the obtained song energy sequence and the song note sequence on the song energy feature and the song fundamental frequency feature can be effectively improved based on quantization processing, and meanwhile, the obtained song energy sequence and the song note sequence can provide reliable reference data for a subsequent calculation process.
Wherein the song energy characteristics may be used to describe the relevant characteristics corresponding to the song energy. And the fundamental frequency characteristics of the song may be used to describe the corresponding relevant characteristics of the fundamental frequency of the song.
Optionally, in some embodiments, the song energy features include a plurality of energy values, and the song energy sequence is formed from a plurality of range coding values, where a range coding value is obtained by performing one-hot encoding on the energy range corresponding to an energy value. The song energy features can be effectively expanded based on the one-hot encoding, so that the plurality of energy values are distinguished based on the obtained plurality of range coding values, and when the song energy sequence is formed from the plurality of range coding values, the characterization effect of the obtained song energy sequence on the song energy features can be effectively improved.
The energy value may refer to a value corresponding to energy of the song. The energy range refers to the value range corresponding to the energy value, such as 0-10.
One-hot encoding, which may also be referred to as one-bit-effective encoding, may use an N-bit state register to encode N states; each state has its own register bit, and at any time only one of the bits is valid.
The range coding value refers to the code value obtained by applying one-hot encoding to the energy range.
For example, suppose six states are to be encoded, and their natural binary codes are 000, 001, 010, 011, 100 and 101.
The corresponding one-hot codes may be configured as 000001, 000010, 000100, 001000, 010000 and 100000.
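A small sketch of this range-based one-hot encoding; the number of ranges and the maximum energy value are illustrative assumptions, since the disclosure only requires that each energy value be mapped to an energy range and that the range be one-hot encoded.

```python
import numpy as np

def one_hot_energy_code(energy_value: float, n_ranges: int = 6,
                        max_energy: float = 60.0) -> np.ndarray:
    """Map an energy value to the one-hot code of the range it falls into.

    The number of ranges and the maximum energy are illustrative assumptions;
    values above max_energy are clipped into the last range.
    """
    width = max_energy / n_ranges
    index = min(int(energy_value // width), n_ranges - 1)
    code = np.zeros(n_ranges, dtype=np.int64)
    code[index] = 1
    return code

# With 6 ranges of width 10, an energy value of 23.5 falls into the third range.
print(one_hot_energy_code(23.5))  # [0 0 1 0 0 0]
```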
Optionally, in some embodiments, the song fundamental frequency feature includes a plurality of fundamental frequency values, and the song note sequence includes a note number corresponding to each fundamental frequency value, so that the song note sequence can effectively combine the corresponding relationship between the fundamental frequency values and the note numbers to adapt to the personalized application scenario, thereby effectively improving the applicability of the obtained song note sequence in the song generation process.
The fundamental frequency value refers to the value corresponding to the fundamental frequency of the song. The note number is a number corresponding to a note and can be obtained from a related database in the music domain.
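A sketch of quantizing a fundamental frequency value to a note number, assuming the common MIDI numbering convention (A4 = 440 Hz maps to 69); the disclosure does not mandate this particular numbering, so the formula and the rest code are illustrative assumptions.

```python
import math

def f0_to_note_number(f0_hz: float) -> int:
    """Quantize a fundamental-frequency value (Hz) to a note number.

    MIDI numbering (A4 = 440 Hz -> 69) is used here as one common convention;
    an unvoiced frame (f0 == 0) is mapped to a reserved rest code 0.
    """
    if f0_hz <= 0.0:
        return 0
    return int(round(69 + 12 * math.log2(f0_hz / 440.0)))

print(f0_to_note_number(440.0))  # 69 (A4)
print(f0_to_note_number(261.6))  # 60 (approximately C4)
```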
For example, as shown in fig. 3, fig. 3 is a schematic diagram of a song template generation process according to an embodiment of the present disclosure. The initial data of the song template may include the song audio and the song lyrics corresponding to the target song, and the song template generation process may include: (1) processing the song lyrics based on a text transcription method to obtain the phoneme sequence corresponding to the target song; (2) processing the obtained phoneme sequence and the song audio based on a forced alignment method to obtain the phoneme duration of the target song, where the phoneme duration may be obtained by the forced alignment method alone, or may be manually calibrated after the forced alignment operation to improve the accuracy of the obtained phoneme duration; (3) processing the song audio based on an acoustic feature extraction method to obtain the song energy features and the song fundamental frequency features corresponding to the target song, and then changing the values of the song energy and pitch based on energy track translation and fundamental frequency track translation to improve the flexibility of the song template; (4) quantizing the song energy features and the song fundamental frequency features to obtain the song energy sequence and the song note sequence; and (5) generating the song template based on the phoneme sequence, the phoneme duration, the song energy sequence, the song note sequence and the unique identification number of the target song.
Step S206, performing duration normalization on the text feature vector and the tone feature vector according to the phoneme duration to obtain a frame-level text feature vector and a frame-level tone feature vector.
The frame-level text feature vector refers to a vector describing text features corresponding to a plurality of audio frames. And a frame-level tone color feature vector refers to a vector of tone color features corresponding to a plurality of audio frames.
It can be understood that the same phoneme may span a plurality of audio frames, and the plurality of audio frames corresponding to the same phoneme have a high similarity. When the text feature vector and the tone feature vector are duration-normalized according to the phoneme duration to obtain the frame-level text feature vector and the frame-level tone feature vector, the phoneme-level text feature vector and the tone feature vector can be converted into the frame-level text feature vector and the frame-level tone feature vector by duplication, so that the frame-level text feature vector and the frame-level tone feature vector can subsequently be added together.
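A minimal sketch of this duplication-based duration normalization, together with the frame-level copying of the tone feature vector described later; the function names and tensor shapes are illustrative assumptions.

```python
import torch

def regulate_length(phoneme_vectors: torch.Tensor,
                    phoneme_durations: torch.Tensor) -> torch.Tensor:
    """Expand phoneme-level vectors to frame level by duplication.

    phoneme_vectors:   (n_phonemes, dim) text feature vectors
    phoneme_durations: (n_phonemes,) number of frames per phoneme
    returns:           (n_frames, dim) frame-level text feature vectors
    """
    return torch.repeat_interleave(phoneme_vectors, phoneme_durations, dim=0)

def broadcast_timbre(timbre_vector: torch.Tensor, n_frames: int) -> torch.Tensor:
    """Copy one utterance-level tone feature vector of shape (dim,) to every frame."""
    return timbre_vector.unsqueeze(0).expand(n_frames, -1)
```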
Step S207, adding the frame-level text feature vector, the frame-level tone feature vector and the song melody information, and inputting the added result into an acoustic decoding submodel to obtain the target Mel spectrum feature.
The addition refers to an element-wise addition over dimensions; assuming the frame-level text feature vector, the frame-level tone feature vector, the song note sequence and the song energy sequence are all 10-dimensional, the values of the corresponding dimensions are added together.
That is, in the embodiment of the disclosure, after the song template corresponding to the unique identification number of the target song is obtained, the real Mel spectrum characteristics of the target user may be input into the tone coding submodel to obtain the tone feature vector of the target user, and the phoneme sequence is input into the text coding submodel to obtain the text feature vector of the lyric text in the song template. The text feature vector and the tone feature vector are then duration-normalized according to the phoneme durations to obtain the frame-level text feature vector and the frame-level tone feature vector, and the frame-level text feature vector, the frame-level tone feature vector and the song melody information are added and input into the acoustic decoding submodel to obtain the target Mel spectrum characteristics. In this way, feature extraction of the real Mel spectrum characteristics and the phoneme sequence can be realized quickly based on the tone coding submodel and the text coding submodel, the corresponding tone features and text features are quantized in the form of vectors, duration normalization is performed based on the phoneme durations, and the consistency between the obtained frame-level text feature vector and frame-level tone feature vector can be effectively improved, so that the processing effect of the acoustic decoding submodel on the frame-level text feature vector and the frame-level tone feature vector is effectively improved.
Step S208, generating a target song according to the target Mel spectrum characteristics.
The description of step S208 may be specifically referred to the above embodiments, and will not be repeated here.
In this embodiment, the real Mel spectrum characteristics of the target user are input into the tone coding submodel to obtain the tone feature vector of the target user, and the phoneme sequence is input into the text coding submodel to obtain the text feature vector of the lyric text in the song template. The text feature vector and the tone feature vector are duration-normalized according to the phoneme durations to obtain the frame-level text feature vector and the frame-level tone feature vector, and the frame-level text feature vector, the frame-level tone feature vector and the song melody information are added and input into the acoustic decoding submodel to obtain the target Mel spectrum characteristics. Thereby, feature extraction of the real Mel spectrum characteristics and the phoneme sequence can be realized quickly based on the tone coding submodel and the text coding submodel, the corresponding tone features and text features are quantized in the form of vectors and then duration-normalized based on the phoneme durations, and the consistency between the obtained frame-level text feature vector and frame-level tone feature vector can be effectively improved, so that the processing effect of the acoustic decoding submodel on the frame-level text feature vector and the frame-level tone feature vector is effectively improved.
Fig. 4 is a flowchart illustrating a song generating method according to another embodiment of the present disclosure.
As shown in fig. 4, the song generating method may include, but is not limited to, the following steps:
Step S401, acquiring voice audio input by a target user and a unique identification number of the target song.
Step S402, extracting the Mel spectrum characteristics of the voice audio to obtain the real Mel spectrum characteristics of the target user.
Step S403, obtaining the song template corresponding to the unique identification number according to the unique identification number of the target song.
The descriptions of step S401 to step S403 may be specifically referred to the above embodiments, and are not repeated herein.
Step S404, inputting the real Mel spectrum characteristics of the target user into a reference encoder to obtain the tone color hidden space distribution vector of the target user.
The reference encoder may be an encoder that is used to process the real mel spectrum feature to obtain a timbre hidden space distribution vector, and the timbre hidden space distribution vector output by the reference encoder may be a hidden layer variable corresponding to the real mel spectrum feature.
It can be understood that the tone color hidden space distribution vector follows a spherical gaussian distribution, and in the embodiment of the disclosure, the reference encoder can output the mean value and the variance corresponding to the spherical gaussian distribution while outputting the tone color hidden space distribution vector of the target user.
Step S405, inputting the tone color hidden space distribution vector into an autoregressive encoder to obtain a tone color distribution vector of a target user, wherein the tone color distribution vector is obtained by sampling the tone color hidden space distribution vector by the autoregressive encoder.
The autoregressive encoder is used for processing the tone color hidden space distribution vector to obtain the tone color distribution vector.
It will be appreciated that the above-described reference encoder and autoregressive encoder structures may be multi-layered linear layers or convolutional layers, without limitation.
Step S406, taking the tone color distribution vector as a tone color feature vector of the target user.
The tone color coding submodel comprises the reference encoder and the autoregressive encoder. After the song template corresponding to the unique identification number of the target song is acquired according to the unique identification number, the real Mel spectrum characteristics of the target user can be input into the reference encoder to obtain the tone color hidden space distribution vector of the target user, and the tone color hidden space distribution vector is input into the autoregressive encoder to obtain the tone color distribution vector of the target user, where the tone color distribution vector is obtained by sampling the tone color hidden space distribution vector by the autoregressive encoder, and the tone color distribution vector is used as the tone color feature vector of the target user. Thereby, redundant information in the obtained tone color feature vector can be effectively reduced, and the relatively complex real Mel spectrum characteristics are converted into vector form, so that the practicability of the obtained tone color feature vector is effectively improved.
For example, as shown in fig. 5, fig. 5 is a schematic diagram of a timbre coding submodel according to an embodiment of the disclosure, where the random sampling point ε refers to a random sampling point drawn from a standard Gaussian distribution, which may be represented as ε ~ N(0, I).
After receiving the real Mel spectrum characteristics, the tone color coding submodel can obtain a tone color hidden space distribution vector h and two parameters through the processing of the reference encoder; the two parameters can be used as the mean a1 and the variance b1 of a Gaussian distribution. The random sampling point ε is combined with the mean a1 and the variance b1 to obtain a random sampling point z of the approximate posterior distribution, z = b1 · ε + a1, where · denotes element-wise multiplication. The random sampling point z and the tone color hidden space distribution vector h are then processed by the autoregressive encoder to obtain a mean a2 and a variance b2 corresponding to the random sampling point z, and the tone color feature vector s is obtained from the random sampling point z, the mean a2 and the variance b2 as s = b2 · z + a2.
It will be appreciated that the timbre coding submodel may perform a sampling process based on an inverse autoregressive flow (IAF), which belongs to the normalizing flows. A normalizing flow can produce a distribution that is easy to sample from: it transforms a complex input distribution into an easy-to-handle probability distribution through a series of reversible transformation operations, and the output distribution is typically chosen to be an isotropic unit Gaussian distribution, i.e. a spherical unit Gaussian distribution, which allows smooth interpolation and efficient sampling. The tone feature vector is learned by means of the inverse autoregressive flow, and the generated tone color hidden space distribution vector h can be made to obey a spherical Gaussian distribution, so that the tone feature vector can be obtained by sampling from this distribution, and a more accurate vector distribution can still be obtained through learning for users not seen during training. In both the training and inference stages, sampling from the spherical Gaussian distribution is used to represent the tone feature vector, which ensures consistency between training and inference and better adapts to users outside the training set. At the same time, the tone color hidden space distribution vector h corresponding to a user is sampled rather than averaged, which further increases the convergence of the user space and allows smoother interpolation between tone feature vectors; that is, the tone feature vector of a user outside the training set can be learned from a single utterance of that user.
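A highly simplified sketch of the sampling described above (reparameterization followed by a single IAF-style affine step), assuming utterance-level pooling and linear layers as stand-ins for the reference encoder and autoregressive encoder; the module structure and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TimbreEncoderSketch(nn.Module):
    """Illustrative sketch: reference encoder plus one IAF-style affine step."""

    def __init__(self, n_mels: int = 80, hidden_dim: int = 128, timbre_dim: int = 64):
        super().__init__()
        self.reference_encoder = nn.Sequential(       # stands in for conv/linear layers
            nn.Linear(n_mels, hidden_dim), nn.ReLU())
        self.to_mean_logvar = nn.Linear(hidden_dim, 2 * timbre_dim)
        self.autoregressive = nn.Linear(hidden_dim + timbre_dim, 2 * timbre_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (n_frames, n_mels) real mel spectrum features of one utterance
        h = self.reference_encoder(mel).mean(dim=0)           # utterance-level hidden vector
        a1, log_b1 = self.to_mean_logvar(h).chunk(2, dim=-1)  # mean and log-variance
        eps = torch.randn_like(a1)                            # eps ~ N(0, I)
        z = torch.exp(log_b1) * eps + a1                      # z = b1 * eps + a1
        a2, log_b2 = self.autoregressive(torch.cat([h, z], dim=-1)).chunk(2, dim=-1)
        s = torch.exp(log_b2) * z + a2                        # s = b2 * z + a2, tone feature vector
        return s
```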
Step S407, inputting the phoneme sequence into a text coding submodel to obtain text feature vectors of the lyric text in the song template.
The description of step S407 may be specifically referred to the above embodiments, and will not be repeated here.
Step S408, determining an initial text code corresponding to each phoneme in the sequence of phonemes from the text feature vector.
The initial text code refers to a text code contained in the text feature vector.
In the embodiment of the disclosure, when determining the initial text codes corresponding to each phoneme in the phoneme sequence from the text feature vector, reliable reference data can be provided for the subsequent determination of the target text codes.
Step S409, determining a first frame number corresponding to the phonemes according to the phoneme duration.
The first frame number refers to the number of audio frames corresponding to each phoneme. For example, the phoneme duration of one phoneme may be 25 ms; if one audio frame is set to 5 ms, that phoneme corresponds to 5 frames of information.
And step S410, copying the initial text codes, and performing splicing processing on the initial text codes of the first frame number obtained by copying to obtain target text codes.
The target text code refers to a text code obtained by splicing initial text codes of a first frame number.
It can be understood that the phoneme duration corresponding to one phoneme may be smaller, so that more redundant information may exist between a plurality of audio frames corresponding to the same phoneme, and when the initial text codes are copied and the initial text codes of the first frame number obtained by the copying are spliced to obtain the target text codes, the practicability of the obtained target text codes can be effectively improved.
Step S411, forming a frame-level text feature vector according to a plurality of target text codes.
That is, in the embodiment of the disclosure, after inputting a phoneme sequence into a text coding submodel to obtain text feature vectors of lyrics text in a song template, an initial text code corresponding to each phoneme in the phoneme sequence may be determined from the text feature vectors, a first frame number corresponding to the phoneme is determined according to a phoneme duration, the initial text codes are copied, the initial text codes of the first frame number obtained by copying are spliced to obtain a target text code, and a frame-level text feature vector is formed according to a plurality of target text codes.
Step S412, determining a second frame number of the voice audio according to the phoneme duration.
The second frame number refers to the frame number of the voice audio determined based on the phoneme duration.
Step S413, copying the tone characteristic vector, and performing splicing processing on the tone characteristic vector of the second frame number obtained by copying to obtain a frame-level tone characteristic vector.
That is, after the frame-level text feature vector is formed according to the multiple target text codes, the embodiment of the disclosure may determine the second frame number of the voice audio according to the phoneme duration, copy the tone feature vector, and splice the tone feature vector of the second frame number obtained by copying to obtain the frame-level tone feature vector, so that the obtained frame-level tone feature vector may effectively represent relevant feature information of the voice to be processed from the magnitude of the audio frame, so as to effectively improve the suitability between the obtained frame-level tone feature vector and the frame-level text feature vector, and simultaneously effectively improve the characterization effect of the obtained frame-level tone feature vector.
Step S414, adding the frame-level text feature vector, the frame-level tone feature vector and the song melody information, and inputting the added result into an acoustic decoding submodel to obtain the target Mel spectrum feature.
Step S415, generating a target song according to the target Mel spectrum characteristics.
The descriptions of step S414 to step S415 may be specifically referred to the above embodiments, and are not repeated herein.
For example, as shown in fig. 6, fig. 6 is a schematic song generation flow chart according to an embodiment of the present disclosure. After a new user provides a piece of voice audio, the operation flow corresponding to the song generation model may include: (1) processing the voice audio based on an acoustic feature extraction method to obtain the real Mel spectrum characteristics; (2) inputting the real Mel spectrum characteristics into the timbre coding submodel to obtain the timbre feature vector; (3) inputting the phoneme sequence in the song template into the text coding submodel to obtain the text feature vector of the target song; (4) inputting the phoneme duration, the timbre feature vector and the text feature vector into the duration normalization submodel to obtain the frame-level text feature vector and the frame-level timbre feature vector; (5) adding the frame-level text feature vector, the frame-level timbre feature vector, the song note sequence and the song energy sequence, and inputting the result into the acoustic decoding submodel to obtain the target Mel spectrum characteristics; and (6) inputting the obtained target Mel spectrum characteristics into a vocoder to obtain the target song. The vocoder may be a neural network vocoder.
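A pseudocode-level sketch that chains the illustrative helpers defined earlier in this description (extract_mel_features, regulate_length, broadcast_timbre, mel_to_waveform); the `template` and `models` objects and their attribute names are assumptions made only to show how the submodels might fit together, not an interface defined by the disclosure.

```python
import torch

def generate_song_sketch(voice_audio_path: str, template, models):
    """Illustrative end-to-end flow mirroring fig. 6.

    Assumptions: all frame-level features share one dimensionality, the total
    phoneme duration equals the length of the melody sequences, and `models`
    bundles trained timbre_encoder, text_encoder, embed_melody,
    acoustic_decoder and vocoder modules.
    """
    real_mel = torch.from_numpy(extract_mel_features(voice_audio_path)).T.float()  # (n_frames, n_mels)
    timbre_vec = models.timbre_encoder(real_mel)                       # (dim,)
    text_vecs = models.text_encoder(template.phonemes)                 # (n_phonemes, dim)
    durations = torch.tensor(template.phoneme_durations)
    frame_text = regulate_length(text_vecs, durations)                 # (n_frames, dim)
    frame_timbre = broadcast_timbre(timbre_vec, frame_text.shape[0])   # (n_frames, dim)
    melody = models.embed_melody(template.note_sequence,
                                 template.energy_sequence)             # (n_frames, dim)
    target_mel = models.acoustic_decoder(frame_text + frame_timbre + melody)  # (n_frames, n_mels)
    return mel_to_waveform(target_mel.T.detach().numpy(), models.vocoder)
```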
That is, in the song generation process of the embodiment of the disclosure, a plurality of users can share the pre-trained song generation model, and a song performed in a user's own voice can be obtained based on only a short piece of that user's audio, so that convenience in the song generation process is effectively improved, computing resources are effectively reduced, and storage cost is reduced.
In this embodiment, the real Mel spectrum characteristics of the target user are input into the reference encoder to obtain the tone color hidden space distribution vector of the target user, and the tone color hidden space distribution vector is input into the autoregressive encoder to obtain the tone color distribution vector of the target user, where the tone color distribution vector is obtained by sampling the tone color hidden space distribution vector by the autoregressive encoder, and the tone color distribution vector is used as the tone color feature vector of the target user. Thereby, redundant information in the obtained tone color feature vector is effectively reduced, and the relatively complex real Mel spectrum characteristics are converted into vector form, which effectively improves the practicability of the obtained tone color feature vector. The initial text code corresponding to each phoneme in the phoneme sequence is determined from the text feature vector, the first frame number corresponding to the phoneme is determined according to the phoneme duration, the initial text code is copied, and the copied initial text codes of the first frame number are spliced to obtain the target text code; the frame-level text feature vector is then formed from a plurality of target text codes. Since the time range corresponding to each phoneme is small and the contents represented by different audio frames within the same phoneme have a high similarity, copying the initial text code and splicing the copied initial text codes of the first frame number to obtain the target text code can reduce the computation cost to a large extent, thereby effectively improving the efficiency of determining the frame-level text feature vector. The second frame number of the voice audio is determined according to the phoneme durations, the tone feature vector is copied, and the copied tone feature vectors of the second frame number are spliced to obtain the frame-level tone feature vector. Thereby, the obtained frame-level tone feature vector can effectively represent the relevant feature information of the voice to be processed at the granularity of audio frames, so as to effectively improve the suitability between the obtained frame-level tone feature vector and the frame-level text feature vector, and effectively improve the characterization effect of the obtained frame-level tone feature vector.
Fig. 7 is a flowchart of a training method of a song generating model according to an embodiment of the present disclosure.
It should be noted that, the execution body of the training method of the song generating model in this embodiment is a training device of the song generating model, and the device may be implemented in a software and/or hardware manner, and the device may be configured in an electronic device, where the electronic device may include, but is not limited to, a terminal, a server, and the like, and the terminal may be, for example, a smart phone, a smart television, a smart watch, a smart car, and the like.
As shown in fig. 7, the training method of the song generating model may include, but is not limited to, the following steps:
Step S701, a training set is obtained, wherein the training set is from a plurality of sampling users, the training set comprises a plurality of samples, each sampling user corresponds to at least one sample, and each sample comprises singing audio picked up when the sampling user sings a certain song and lyric text corresponding to the singing audio.
In the embodiment of the present disclosure, when the training set is acquired, the communication link between the execution body and the big data server in the embodiment of the present disclosure may be pre-established, and then the training set is acquired from the big data server, or the training set may be acquired from a plurality of sampling users based on the sample collection device, which is not limited.
Step S702, acquiring a pre-built initial neural network model, wherein the initial neural network model comprises initial weight parameters and a loss function.
Here, a neural network model is a complex network system formed by a large number of simple processing units (called neurons) that are widely interconnected, and it reflects many basic features of human brain function. The initial neural network model refers to the neural network model to be trained. The initial weight parameters refer to the weight parameters to be iteratively updated during model training. The loss function may be used to describe the error information between the predicted Mel spectrum characteristics output by the initial neural network model and the real Mel spectrum characteristics during the training process.
In the embodiment of the disclosure, the performance of the model can be evaluated in real time in the model training process based on the loss function, and whether the model converges or not can be timely judged.
Step S703, obtaining a first sample from the training set, inputting the first sample into the initial neural network model to obtain a real Mel spectrum feature and a predicted Mel spectrum feature, wherein the real Mel spectrum feature represents the Mel spectrum feature of singing audio in the first sample, and the predicted Mel spectrum feature represents the Mel spectrum feature predicted by the initial neural network model.
Wherein, the first sample refers to a sample used for model training among a plurality of samples in the training set.
When acquiring the first sample from the training set, the embodiment of the disclosure may randomly select one sample from the training set as the first sample, or may acquire the first sample from the training set based on the numbering information of the plurality of samples in the training set, which is not limited.
Optionally, in some embodiments, when the first sample is input into the initial neural network model to obtain the real mel spectrum feature and the predicted mel spectrum feature, text transcription may be performed on the lyric text in the first sample to obtain a phoneme sequence, and the singing audio in the first sample is aligned according to the phoneme sequence to obtain a phoneme duration. Acoustic feature extraction is performed on the singing audio in the first sample to obtain the real mel spectrum feature, the audio energy and the fundamental frequency track of the first sample. The phoneme sequence is input into an initial text coding sub-model to obtain a text feature vector of the first sample, and the real mel spectrum feature of the first sample is input into an initial timbre coding sub-model to obtain a timbre feature vector of the first sample. Duration regulation is performed on the text feature vector and the timbre feature vector according to the phoneme duration to obtain a frame-level text feature vector and a frame-level timbre feature vector. The frame-level text feature vector, the frame-level timbre feature vector, the audio energy and the fundamental frequency track are added and then input into the initial acoustic decoding sub-model to obtain the predicted mel spectrum feature of the first sample, so that the multiple features of the first sample can be accurately converted into the predicted mel spectrum feature and the multi-feature conversion process is realized at the same time.
The initial text coding sub-model refers to a text coding sub-model to be subjected to model training. The initial timbre coding submodel refers to a timbre coding submodel to be subjected to model training. The initial acoustic decoding sub-model refers to an acoustic decoding sub-model to be model trained.
The audio energy refers to energy information corresponding to singing audio in a first sample.
The track of the fundamental frequency refers to track information of the singing audio corresponding to the fundamental frequency in the first sample.
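As a rough illustration of the acoustic feature extraction step, the sketch below computes a mel spectrum, a frame-level energy contour and a fundamental frequency track from a singing audio file with librosa; the sampling rate, frame parameters and the choice of librosa itself are assumptions made for illustration, not parameters fixed by this disclosure.

    import librosa
    import numpy as np

    def extract_acoustic_features(wav_path, sr=22050, n_fft=1024, hop=256, n_mels=80):
        y, sr = librosa.load(wav_path, sr=sr)
        # Real mel spectrum feature: log-scaled mel spectrogram, shape (n_mels, frames).
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=n_mels)
        log_mel = np.log(mel + 1e-6)
        # Audio energy: root-mean-square energy per frame.
        energy = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)[0]
        # Fundamental frequency track: per-frame F0 estimated with pYIN.
        f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                                fmax=librosa.note_to_hz('C6'),
                                frame_length=n_fft, hop_length=hop)
        return log_mel, energy, f0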
Step S704, calculating the error between the predicted Mel-spectrum characteristic and the real Mel-spectrum characteristic according to the loss function.
Wherein the error may be used to describe difference information between the predicted mel-spectrum feature and the true mel-spectrum feature.
In the embodiment of the disclosure, when calculating the error between the predicted mel-spectrum characteristic and the real mel-spectrum characteristic according to the loss function, the output accuracy of the initial neural network model can be evaluated in real time to determine the model performance, and the obtained error can provide reliable reference data for determining the model optimization direction.
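A common way to realize such a loss function is a frame-wise distance between the predicted and real mel spectrum features; the L1 form used in the sketch below is only one plausible choice and is not mandated by this disclosure, and the tensor shapes are illustrative.

    import torch
    import torch.nn.functional as F

    def mel_loss(predicted_mel, real_mel):
        # Both tensors are assumed to have shape (batch, frames, n_mels)
        # and to already be aligned in time.
        return F.l1_loss(predicted_mel, real_mel)

    pred = torch.randn(2, 200, 80)
    real = torch.randn(2, 200, 80)
    error = mel_loss(pred, real)   # scalar tensor used to drive back-propagation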
Step S705, the initial weight parameters of the initial neural network model are adjusted according to the error to obtain an updated neural network model.
In the embodiment of the disclosure, when the initial weight parameters of the initial neural network model are adjusted according to the errors, the accurate adjustment of the initial weight parameters can be realized based on the errors, so that the training effect of the neural network model is effectively improved.
Step S706, obtaining subsequent samples one by one from the training set, and repeatedly inputting the subsequent samples into the latest neural network model until the loss function converges, so as to obtain a song generating model after training.
The subsequent samples refer to samples in the training set except for the first sample.
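Steps S703 to S706 can be pictured as the gradient-descent loop sketched below. The model, optimizer and data loader names are placeholders, and testing whether the loss has stopped decreasing is a simplified stand-in for whatever convergence criterion an actual implementation uses.

    import torch

    def train_song_model(model, train_loader, num_epochs=100, lr=1e-4, tol=1e-4):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        previous_loss = float('inf')
        for epoch in range(num_epochs):
            for sample in train_loader:                      # first sample, then subsequent samples
                predicted_mel, real_mel = model(sample)      # forward pass of the latest model
                loss = torch.nn.functional.l1_loss(predicted_mel, real_mel)
                optimizer.zero_grad()
                loss.backward()                              # gradient back-propagation of the error
                optimizer.step()                             # adjust the weight parameters
            if abs(previous_loss - loss.item()) < tol:       # treat a stalled loss as convergence
                break
            previous_loss = loss.item()
        return model                                         # trained song generation model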
For example, as shown in fig. 8, fig. 8 is a training flowchart of an initial neural network model according to an embodiment of the present disclosure, where the initial neural network model may include an initial timbre coding sub-model, an initial text coding sub-model, and an initial acoustic decoding sub-model, and the training flow may include:
(1) processing the song lyrics via text transcription to obtain a corresponding phoneme sequence, and processing the obtained phoneme sequence via the initial text coding sub-model to obtain a corresponding text feature vector;
(2) processing the text feature vector and the song phoneme duration based on forced alignment to obtain an initial text code;
(3) processing the song audio based on acoustic feature extraction to obtain a real mel spectrum feature, a song energy feature, and a song fundamental frequency feature;
(4) processing the real mel spectrum feature via the initial timbre coding sub-model to obtain a timbre feature vector;
(5) dividing the plurality of energy values in the song energy feature into different energy ranges (for example, the energy values may be divided into 10 energy ranges or 20 energy ranges depending on the application environment), and performing one-hot encoding on the energy range corresponding to each energy value to obtain a song energy sequence;
(6) quantizing each fundamental frequency value in the song fundamental frequency feature into a note number to obtain a song note sequence; for example, the note number corresponding to the fundamental frequency 261.63 Hz is 60, and the note number corresponding to the fundamental frequency 277.18 Hz is 61;
(7) processing the initial text code based on the song phoneme duration by a duration regulation method to obtain a frame-level text feature vector;
(8) processing the timbre feature vector based on the song phoneme duration by a duration regulation sub-model to obtain a frame-level timbre feature vector;
(9) inputting the frame-level text feature vector, the frame-level timbre feature vector, the song energy sequence and the song note sequence into the initial acoustic decoding sub-model to obtain a predicted mel spectrum feature; and
(10) determining the loss function of the song generation model based on the real mel spectrum feature and the predicted mel spectrum feature. Through the loss function, each weight parameter in the song generation model can be iteratively updated by gradient back-propagation, so that the loss function tends to converge.
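The quantization in steps (5) and (6) can be sketched as follows. The 10-band energy split and the MIDI-style note mapping (440 Hz taken as note 69) reproduce the 261.63 Hz → 60 example given above, but the exact band count and rounding rule are illustrative assumptions rather than values fixed by the disclosure.

    import numpy as np

    def energy_to_onehot(energy_values, num_bands=10):
        """Quantize frame energies into bands and one-hot encode each band index."""
        edges = np.linspace(energy_values.min(), energy_values.max(), num_bands + 1)
        bands = np.clip(np.digitize(energy_values, edges) - 1, 0, num_bands - 1)
        return np.eye(num_bands)[bands]          # (frames, num_bands) song energy sequence

    def f0_to_note(f0_hz):
        """Quantize fundamental frequency values (Hz) to note numbers."""
        f0_hz = np.asarray(f0_hz, dtype=float)
        notes = np.zeros_like(f0_hz, dtype=int)  # keep 0 for unvoiced / zero-F0 frames
        voiced = f0_hz > 0
        notes[voiced] = np.round(69 + 12 * np.log2(f0_hz[voiced] / 440.0)).astype(int)
        return notes

    print(f0_to_note([261.63, 277.18]))          # [60 61]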
In the embodiment of the disclosure, a training set is obtained, where the training set is from a plurality of sampling users, the training set comprises a plurality of samples, one sampling user corresponds to at least one sample, and each sample comprises singing audio picked up when the sampling user sings a certain song and lyric text corresponding to the singing audio; a pre-built initial neural network model is obtained, where the initial neural network model comprises initial weight parameters and a loss function; a first sample is obtained from the training set and input into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features, where the real mel spectrum features represent the mel spectrum features of the singing audio in the first sample and the predicted mel spectrum features represent the mel spectrum features predicted by the initial neural network model; the error between the predicted mel spectrum features and the real mel spectrum features is calculated according to the loss function; the initial weight parameters of the initial neural network model are adjusted according to the error to obtain an updated neural network model; and subsequent samples are obtained one by one from the training set and repeatedly input into the latest neural network model until the loss function converges, thereby obtaining a trained song generation model. In this way, the error between the predicted mel spectrum features and the real mel spectrum features can be accurately calculated according to the loss function, the model performance can be judged in real time during training, and the weight parameters can be adjusted accordingly, thereby effectively improving the training effect of the song generation model.
Fig. 9 is a schematic structural diagram of a song generating apparatus according to an embodiment of the present disclosure.
As shown in fig. 9, the song generating apparatus 90 includes:
a first obtaining module 901, configured to obtain voice audio input by a target user and a unique identifier of a target song;
The first processing module 902 is configured to extract mel spectrum features of the voice audio to obtain real mel spectrum features of the target user;
A second obtaining module 903, configured to obtain a song template corresponding to the unique identifier according to the unique identifier of the target song;
The second processing module 904 is configured to input a real mel spectrum feature of a target user and a song template into a preset song generating model to obtain a target mel spectrum feature output by the song generating model, where the song generating model is obtained by machine learning training using a training set, the training set is from a plurality of sampling users, the training set includes a plurality of samples, and one sampling user corresponds to at least one sample, and each sample includes singing audio picked up when the sampling user sings a certain song and lyric text corresponding to the singing audio;
A generating module 905 is configured to generate a target song according to the target mel spectrum feature.
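The final step performed by the generating module, turning the target mel spectrum feature into an audible target song, is not detailed above. As a rough stand-in, the sketch below inverts a mel spectrogram with the Griffin-Lim based inversion in librosa; a learned neural vocoder would normally replace this, so the inversion method and its parameters are purely illustrative assumptions.

    import librosa
    import soundfile as sf

    def mel_to_song(target_mel, sr=22050, n_fft=1024, hop=256, out_path="target_song.wav"):
        """Convert a (n_mels, frames) power mel spectrogram into a waveform and save it."""
        waveform = librosa.feature.inverse.mel_to_audio(
            target_mel, sr=sr, n_fft=n_fft, hop_length=hop)   # Griffin-Lim based inversion
        sf.write(out_path, waveform, sr)
        return waveform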
In some embodiments of the present disclosure, as shown in fig. 10, fig. 10 is a schematic structural diagram of a song generating apparatus according to another embodiment of the present disclosure, where a song generating model includes a timbre coding sub-model, a text coding sub-model, and an acoustic decoding sub-model, and the song generating model is obtained by jointly training the timbre coding sub-model, the text coding sub-model, and the acoustic decoding sub-model using the same training set.
In some embodiments of the present disclosure, the song template includes lyric text information including a phoneme sequence and a phoneme duration, and song melody information including a song note sequence and a song energy sequence.
In some embodiments of the disclosure, the second processing module 904 includes: a first processing sub-module 9041 configured to input the real mel spectrum feature of the target user into the timbre coding sub-model to obtain a timbre feature vector of the target user; a second processing sub-module 9042 configured to input the phoneme sequence into the text coding sub-model to obtain a text feature vector of the lyric text in the song template; a third processing sub-module 9043 configured to perform duration regulation on the text feature vector and the timbre feature vector according to the phoneme duration to obtain a frame-level text feature vector and a frame-level timbre feature vector; and a fourth processing sub-module 9044 configured to add the frame-level text feature vector, the frame-level timbre feature vector and the song melody information and input the result into the acoustic decoding sub-model to obtain the target mel spectrum feature.
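A minimal composition of these sub-modules at inference time might look like the sketch below; the sub-model classes, their constructors, and the add-then-decode combination rule are placeholders standing in for whatever concrete networks an implementation actually uses.

    import torch
    import torch.nn as nn

    class SongGenerationModel(nn.Module):
        """Sketch: timbre coding + text coding + acoustic decoding sub-models."""
        def __init__(self, timbre_encoder, text_encoder, acoustic_decoder):
            super().__init__()
            self.timbre_encoder = timbre_encoder
            self.text_encoder = text_encoder
            self.acoustic_decoder = acoustic_decoder

        def forward(self, real_mel, phonemes, durations, melody):
            # durations: 1-D long tensor of per-phoneme frame counts;
            # melody: (batch, frames, dim) song melody information projected to the same width.
            timbre = self.timbre_encoder(real_mel)              # (batch, dim) timbre feature vector
            text = self.text_encoder(phonemes)                  # (batch, phonemes, dim)
            # Duration regulation: repeat each phoneme code for the frames it spans.
            frame_text = torch.repeat_interleave(text, durations, dim=1)
            frame_timbre = timbre.unsqueeze(1).expand_as(frame_text)
            # Add frame-level text, frame-level timbre and melody information, then decode.
            return self.acoustic_decoder(frame_text + frame_timbre + melody)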
In some embodiments of the disclosure, the first processing sub-module 9041 is specifically configured to input the real mel spectrum feature of the target user into a reference encoder to obtain a timbre hidden space distribution vector of the target user, input the timbre hidden space distribution vector into an autoregressive encoder to obtain a timbre distribution vector of the target user, where the timbre distribution vector is obtained by sampling the timbre hidden space distribution vector by the autoregressive encoder, and use the timbre distribution vector as the timbre feature vector of the target user.
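One simple way to realize "sampling the timbre hidden space distribution vector" is the reparameterized Gaussian sampling sketched below; treating the hidden-space vector as a mean/log-variance pair is an assumption of this sketch, not a detail fixed by the disclosure.

    import torch

    def sample_timbre_vector(hidden_mean, hidden_log_var):
        """Draw a timbre distribution vector from the timbre hidden space distribution.

        hidden_mean, hidden_log_var: (batch, dim) tensors produced by the reference encoder.
        """
        std = torch.exp(0.5 * hidden_log_var)
        eps = torch.randn_like(std)      # reparameterization keeps the sampling differentiable
        return hidden_mean + eps * std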
In some embodiments of the present disclosure, the third processing sub-module 9043 is specifically configured to determine an initial text code corresponding to each phoneme in the phoneme sequence from the text feature vectors, determine a first frame number corresponding to the phoneme according to the phoneme duration, copy the initial text code, and splice the copied first frame number of the initial text code to obtain a target text code, and form a frame-level text feature vector according to the plurality of target text codes.
In some embodiments of the present disclosure, the third processing sub-module 9043 is further configured to determine a second frame number of the voice audio according to the phoneme duration, copy the timbre feature vector, and splice the timbre feature vectors of the second frame number obtained by copying to obtain a frame-level timbre feature vector.
In some embodiments of the present disclosure, the song template is constructed from a phoneme sequence, a phoneme duration, a song note sequence, a song energy sequence, and the unique identification number of the target song, wherein the phoneme sequence and the phoneme duration of the target song are determined from the song audio and song lyrics of the target song, and the song note sequence and the song energy sequence of the target song are determined from the song audio.
In some embodiments of the present disclosure, the phoneme sequence comprises a plurality of phonemes obtained by parsing lyrics of a song, and the phoneme duration comprises a first number of frames each phoneme occupies in the song audio.
In some embodiments of the present disclosure, the song energy sequence is derived from a quantization process of song energy characteristics of the song audio, and the song note sequence is derived from a quantization process of song fundamental frequency characteristics of the song audio.
In some embodiments of the present disclosure, the song energy characteristics include a plurality of energy values, and the song energy sequence is formed from a plurality of range encoding values, where the range encoding values are obtained by one-hot encoding the energy ranges corresponding to the energy values.
In some embodiments of the present disclosure, the song fundamental frequency feature comprises a plurality of fundamental frequency values, and the song note sequence comprises a note number corresponding to each fundamental frequency value.
It should be noted that the foregoing explanation of the song generation method is also applicable to the song generating apparatus of this embodiment, and will not be repeated here.
In this embodiment, by acquiring the voice audio input by the target user and the unique identification number of the target song, extracting the mel spectrum feature of the voice audio to obtain the real mel spectrum feature of the target user, acquiring the song template corresponding to the unique identification number according to the unique identification number of the target song, inputting the real mel spectrum feature of the target user and the song template into the preset song generation model to obtain the target mel spectrum feature output by the song generation model, and generating the target song according to the target mel spectrum feature, the real mel spectrum feature of the target user and the song template corresponding to the target song can be effectively combined in the song generation process, so that the degree of dependence on the data volume of the voice data of the user can be effectively reduced, and the song generation convenience can be improved.
Fig. 11 is a schematic structural diagram of a training device for a song generating model according to an embodiment of the present disclosure.
As shown in FIG. 11, the training device 110 of the song generating model includes: a third obtaining module 1101 configured to obtain a training set from a plurality of sampling users, where the training set includes a plurality of samples, one sampling user corresponds to at least one sample, and each sample includes singing audio picked up when the sampling user sings a certain song and lyric text corresponding to the singing audio; a fourth obtaining module 1102 configured to obtain a pre-built initial neural network model, where the initial neural network model includes initial weight parameters and a loss function; a fifth obtaining module 1103 configured to obtain a first sample from the training set and input the first sample into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features, where the real mel spectrum features represent the mel spectrum features of the singing audio in the first sample and the predicted mel spectrum features represent the mel spectrum features predicted by the initial neural network model; a third processing module 1104 configured to calculate the error between the predicted mel spectrum features and the real mel spectrum features according to the loss function; a fourth processing module 1105 configured to adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model; and a sixth obtaining module 1106 configured to obtain subsequent samples one by one from the training set and repeatedly input them into the latest neural network model until the loss function converges, so as to obtain a trained song generation model.
In some embodiments of the disclosure, as shown in fig. 12, fig. 12 is a schematic structural diagram of a training device of a song generating model according to another embodiment of the disclosure, where the initial neural network model includes an initial timbre coding sub-model, an initial text coding sub-model, and an initial acoustic decoding sub-model, and the fifth obtaining module 1103 includes: a fifth processing sub-module 11031 configured to perform text transcription on the lyric text in the first sample to obtain a phoneme sequence, and align the singing audio in the first sample according to the phoneme sequence to obtain a phoneme duration; a sixth processing sub-module 11032 configured to perform acoustic feature extraction on the singing audio in the first sample to obtain the real mel spectrum feature, the audio energy, and the fundamental frequency track of the first sample; a seventh processing sub-module 11033 configured to input the phoneme sequence into the initial text coding sub-model to obtain a text feature vector of the first sample; an eighth processing sub-module 11034 configured to input the real mel spectrum feature of the first sample into the initial timbre coding sub-model to obtain a timbre feature vector of the first sample; and a ninth processing sub-module configured to perform duration regulation on the text feature vector and the timbre feature vector according to the phoneme duration to obtain a frame-level text feature vector and a frame-level timbre feature vector, add the frame-level text feature vector, the frame-level timbre feature vector, the audio energy and the fundamental frequency track, and input the result into the initial acoustic decoding sub-model to obtain the predicted mel spectrum feature of the first sample.
It should be noted that the explanation of the foregoing method for training the song-generation model is also applicable to the training device for the song-generation model in this embodiment, and will not be repeated here.
In this embodiment, a training set is obtained, where the training set comprises a plurality of samples, one sampling user corresponds to at least one sample, and each sample comprises singing audio picked up when the sampling user sings a certain song and lyric text corresponding to the singing audio; a pre-built initial neural network model is obtained, where the initial neural network model comprises initial weight parameters and a loss function; a first sample is obtained from the training set and input into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features, where the real mel spectrum features represent the mel spectrum features of the singing audio in the first sample and the predicted mel spectrum features represent the mel spectrum features predicted by the initial neural network model; the error between the predicted mel spectrum features and the real mel spectrum features is calculated according to the loss function; the initial weight parameters of the initial neural network model are adjusted according to the error to obtain an updated neural network model; and subsequent samples are obtained one by one from the training set and repeatedly input into the latest neural network model until the loss function converges. In this way, the error between the predicted mel spectrum features and the real mel spectrum features can be accurately calculated, the model performance can be judged in real time during training, and a trained song generation model with a good training effect can be obtained.
Fig. 13 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present disclosure. The electronic device 12 shown in fig. 13 is merely an example and should not be construed as limiting the functionality and scope of use of the disclosed embodiments.
As shown in fig. 13, the electronic device 12 is in the form of a general purpose computing device. The components of the electronic device 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include an Industry Standard Architecture (hereinafter: ISA) bus, a Micro Channel Architecture (hereinafter: MCA) bus, an Enhanced ISA bus, a Video Electronics Standards Association (hereinafter: VESA) local bus, and a Peripheral Component Interconnect (hereinafter: PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter: RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 13, commonly referred to as a "hard disk drive").
Although not shown in fig. 13, a disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable nonvolatile optical disk (e.g., a compact disk read only memory (Compact Disc Read Only Memory; hereinafter CD-ROM), digital versatile read only optical disk (Digital Video Disc Read Only Memory; hereinafter DVD-ROM), or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the various embodiments of the disclosure.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods in the embodiments described in this disclosure.
The electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a person to interact with the electronic device 12, and/or any devices (e.g., network card, modem, etc.) that enable the electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks, such as a local area network (Local Area Network; hereinafter: LAN), a wide area network (Wide Area Network; hereinafter: WAN), and/or a public network, such as the Internet, through the network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 over the bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 12, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the song-generation method and the training method of the song-generation model mentioned in the foregoing embodiments.
To achieve the above-described embodiments, the present disclosure also proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a song-generation method and a training method of a song-generation model as proposed by the foregoing embodiments of the present disclosure.
To achieve the above-described embodiments, the present disclosure also proposes a computer program product which, when executed by an instruction processor in the computer program product, performs a song-generation method and a training method of a song-generation model as proposed by the foregoing embodiments of the present disclosure.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer programs. When the computer program is loaded and executed on a computer, the flow or functions described in accordance with the embodiments of the present disclosure are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer program may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means from one website, computer, server, or data center to another website, computer, server, or data center. The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a high-density digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a solid-state drive (solid-state drive, SSD)), or the like.
It will be appreciated by those of ordinary skill in the art that the various numbers of first, second, etc. referred to in this disclosure are for ease of description only and are not intended to limit the scope of the disclosed embodiments, nor to indicate sequencing.
In the present disclosure, "at least one" may also be described as one or more, and "a plurality" may be two, three, four or more, which is not limited by the present disclosure. In the embodiments of the disclosure, for a type of technical feature, individual technical features of that type are distinguished by "first", "second", "third", "A", "B", "C", and "D", and the technical features described by "first", "second", "third", "A", "B", "C", and "D" carry no order of precedence or order of magnitude.
The correspondence relationships shown in the tables in the present disclosure may be configured or predefined. The values of the information in each table are merely examples and may be configured as other values, which is not limited by the present disclosure. When configuring the correspondence between the configuration information and each parameter, it is not necessarily required to configure all of the correspondences shown in each table. For example, the correspondences shown by some rows in the tables in the present disclosure may not be configured. For another example, appropriate adjustments, such as splitting or merging, may be made based on the tables described above. The names of the parameters indicated in the tables may be other names understood by the communication device, and the values or expressions of the parameters may be other values or expressions understood by the communication device. When the tables are implemented, other data structures may also be used, for example, an array, a queue, a container, a stack, a linear table, a pointer, a linked list, a tree, a graph, a structure, a class, a heap, or a hash table.
Predefined in this disclosure may be understood as defining, predefining, storing, pre-negotiating, pre-configuring, curing, or pre-sintering.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope of the disclosure, and such changes or substitutions are intended to be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.