Disclosure of Invention
Embodiments of the invention provide an audio production method, an audio production apparatus, audio production equipment, and a storage medium, which can simplify the steps of audio production. The technical solution is as follows:
in one aspect, there is provided an audio production method, the method comprising:
displaying an audio editing interface of a first audio, the audio editing interface comprising at least one sentence of lyrics of the first audio and a lyric editing control, the at least one sentence of lyrics comprising first lyrics;
receiving a lyric editing operation on the lyric editing control for the first lyrics, wherein the lyric editing operation comprises inputting second lyrics;
and replacing the first lyrics in the first audio with the second lyrics to generate second audio, wherein the second audio comprises voice audio generated according to the second lyrics.
Optionally, the method further comprises:
acquiring a target tone, wherein the target tone is used for generating the voice audio;
the replacing the first lyrics in the first audio with the second lyrics to generate second audio includes:
and replacing the first lyrics in the first audio with the second lyrics according to the target tone color to generate the second audio.
Optionally, the replacing the first lyrics in the first audio with the second lyrics according to the target tone color, generating the second audio includes:
Generating the voice audio containing the second lyrics according to the target tone, the phonemes of the second lyrics and notes corresponding to the first lyrics in the first audio;
Acquiring template audio of the first audio, wherein the template audio comprises at least one of accompaniment audio and main melody audio;
and generating the second audio according to the template audio and the voice audio.
Optionally, the generating the voice audio including the second lyrics according to the target tone color, the phonemes of the second lyrics, and the notes corresponding to the first lyrics in the first audio includes:
inputting a tone mark of the target tone, the phonemes of the second lyrics and notes corresponding to the first lyrics in the first audio into an acoustic model to obtain a Mel frequency spectrum;
and calling a vocoder to convert the Mel frequency spectrum into the voice audio.
Optionally, the second audio includes:
audio whose duration is less than that of the first audio, in which the vocal audio segment of the second lyrics is generated according to the target timbre and the vocal audio segments of the lyrics other than the second lyrics use the original-timbre audio of the first audio;
or,
audio whose duration is equal to that of the first audio, in which the vocal audio segment of the second lyrics is generated according to the target timbre and the vocal audio segments of the lyrics other than the second lyrics use the original-timbre audio of the first audio;
or,
audio whose duration is less than that of the first audio, in which the vocal audio of all lyrics is generated according to the target timbre;
or,
audio whose duration is equal to that of the first audio, in which the vocal audio of all lyrics is generated according to the target timbre.
Optionally, the method further comprises:
Acquiring training data, the training data comprising: at least one of phonemes of training lyrics, notes of the training lyrics, phoneme position information of the training lyrics, note position information of the training lyrics, tone color identification of training audio, and mel frequency spectrum of the training audio;
and training an initial model according to the training data to obtain the acoustic model.
Optionally, the method further comprises:
Displaying an audio playing interface of the second audio, wherein the audio playing interface comprises a playing control;
And playing the second audio in response to receiving a play operation triggering the play control.
Optionally, the acquiring the target tone color includes:
Displaying a tone color selection interface, wherein the tone color selection interface comprises at least one candidate tone color and a selection control;
in response to receiving a selection operation triggering the selection control, determining the target timbre from the candidate timbres according to the selection operation;
after the step of replacing the first lyrics in the first audio with the second lyrics according to the target tone color and generating the second audio including the second lyrics, the method further includes:
playing the second audio.
In another aspect, there is provided an audio production apparatus, the apparatus comprising:
The display module is used for displaying an audio editing interface of the first audio, the audio editing interface comprises at least one sentence of lyrics of the first audio and a lyrics editing control, and the at least one sentence of lyrics comprises the first lyrics;
The interaction module is used for receiving lyric editing operation on the lyric editing control for the first lyrics, and the lyric editing operation comprises the step of inputting second lyrics;
and the generation module is used for replacing the first lyrics in the first audio with the second lyrics to generate second audio, and the second audio comprises voice audio generated according to the second lyrics.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring a target tone, wherein the target tone is used for generating the voice audio;
the generation module is further configured to replace the first lyrics in the first audio with the second lyrics according to the target tone color, and generate the second audio.
Optionally, the generating module is further configured to generate the vocal audio including the second lyrics according to the target tone, phonemes of the second lyrics, and notes corresponding to the first lyrics in the first audio;
The acquisition module is further configured to acquire template audio of the first audio, where the template audio includes at least one of accompaniment audio and main melody audio;
the generating module is further configured to generate the second audio according to the template audio and the voice audio.
Optionally, the generating module includes:
the model submodule is used for inputting the tone mark of the target tone, the phonemes of the second lyrics and notes corresponding to the first lyrics in the first audio into an acoustic model to obtain a Mel frequency spectrum;
and the vocoder sub-module is used for calling a vocoder to convert the Mel frequency spectrum into the voice audio.
Optionally, the second audio includes:
audio whose duration is less than that of the first audio, in which the vocal audio segment of the second lyrics is generated according to the target timbre and the vocal audio segments of the lyrics other than the second lyrics use the original-timbre audio of the first audio;
or,
audio whose duration is equal to that of the first audio, in which the vocal audio segment of the second lyrics is generated according to the target timbre and the vocal audio segments of the lyrics other than the second lyrics use the original-timbre audio of the first audio;
or,
audio whose duration is less than that of the first audio, in which the vocal audio of all lyrics is generated according to the target timbre;
or,
audio whose duration is equal to that of the first audio, in which the vocal audio of all lyrics is generated according to the target timbre.
Optionally, the apparatus further comprises:
The acquisition module is further configured to acquire training data, where the training data includes: at least one of phonemes of training lyrics, notes of the training lyrics, phoneme position information of the training lyrics, note position information of the training lyrics, tone color identification of training audio, and mel frequency spectrum of the training audio;
And the training module is used for training an initial model according to the training data to obtain the acoustic model.
Optionally, the apparatus further comprises:
the display module is further configured to display an audio playing interface of the second audio, where the audio playing interface includes a playing control;
The interaction module is further used for receiving a play operation triggering the play control;
And the playing module is used for responding to the receiving of the playing operation triggering the playing control and playing the second audio.
Optionally, the apparatus further comprises:
The display module is further used for displaying a tone color selection interface, and the tone color selection interface comprises at least one candidate tone color and a selection control;
The interaction module is further used for receiving a selection operation triggering the selection control;
the acquisition module is further configured to determine, in response to receiving a selection operation triggering the selection control, the target timbre from the candidate timbres according to the selection operation.
And the playing module is used for playing the second audio.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, wherein the memory stores instructions that, when executed by the processor, implement the above audio production method.
In another aspect, a computer-readable storage medium is provided, having instructions stored thereon that, when executed by a processor, implement the above audio production method.
In another aspect, a computer program product is provided comprising instructions which, when run on a computer, cause the computer to perform the above-described audio production method.
The technical scheme provided by the embodiment of the invention has the beneficial effects that:
By receiving the user's changes to the song lyrics on the audio editing interface and generating a changed song according to the user's changed lyrics and the original song, the user can modify the lyrics of a song with one tap and quickly generate a new song; this simplifies the operation steps for the user to generate audio and improves the efficiency of audio editing.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
Before the embodiment of the invention is described in detail, an application scene and an implementation environment related to the embodiment of the invention are briefly described.
First, terms related to the embodiments of the present invention will be briefly explained.
A User Interface (UI) control is any visual control or element that can be seen on the user interface of an application, such as a picture, an input box, a text box, a button, or a label. Some UI controls respond to user operations; for example, after a user triggers an edit control, text can be input. UI controls involved in embodiments of the present application include, but are not limited to: a lyric editing control, a play control, and a selection control.
Phoneme (phone): the smallest speech unit divided according to the natural attributes of speech; analyzed according to the pronunciation actions within a syllable, one action constitutes one phoneme. Phonemes are divided into two major classes, vowels and consonants. For example, the Mandarin syllable ā contains only one phoneme, ài contains two phonemes, and dài contains three phonemes. Phonemes are the smallest units constituting syllables, that is, the smallest linear speech units divided from the viewpoint of sound quality, and they are concrete physical phenomena. Phonemes are typically marked with the International Phonetic Alphabet (IPA), a set of phonetic symbols formulated and published by the International Phonetic Association in 1888 and revised many times since, whose symbols correspond to the phonemes of human languages; in IPA notation, [ ] is used for phonetic detail and / / is used for phonemes.
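As a small illustration of the decomposition above, the following Python sketch maps a few Mandarin syllables to their phonemes with a hand-written lookup table; a real system would use a grapheme-to-phoneme tool, and the table and function names here are illustrative only.

```python
# Minimal sketch: decompose Mandarin syllables into phonemes, matching the
# examples above (ā -> 1 phoneme, ài -> 2 phonemes, dài -> 3 phonemes).
SYLLABLE_TO_PHONEMES = {
    "a": ["a"],              # 啊 (ā)
    "ai": ["a", "i"],        # 爱 (ài)
    "dai": ["d", "a", "i"],  # 代 (dài)
}

def lyrics_to_phonemes(syllables):
    """Flatten a list of syllables into the phoneme sequence fed to the model."""
    phonemes = []
    for s in syllables:
        phonemes.extend(SYLLABLE_TO_PHONEMES[s])
    return phonemes

print(lyrics_to_phonemes(["dai", "ai", "a"]))  # ['d', 'a', 'i', 'a', 'i', 'a']
```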
Next, the implementation environment related to the embodiments of the present invention will be briefly described.
Referring to fig. 1, there is shown a schematic structure of a computer system including a terminal 120 and a server 140 according to an exemplary embodiment of the present application. The terminal 120 and the server 140 are connected to each other through a wired or wireless network.
Alternatively, the terminal 120 may be at least one of a notebook computer, a desktop computer, a smart phone, a tablet computer, a smart speaker, and a smart robot.
The terminal 120 includes a first memory and a first processor. The first memory stores a first program; the first program is called and executed by the first processor to implement the audio production method. The first memory may include, but is not limited to, the following: Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), and Electrically Erasable Programmable Read-Only Memory (EEPROM).
The first processor may be one or more integrated circuit chips. Alternatively, the first processor may be a general purpose processor, such as a central processing unit (Central Processing Unit, CPU) or a network processor (Network Processor, NP). Optionally, the first processor may implement the audio production method or the training method of the acoustic model provided by the present application.
The server 140 includes a second memory and a second processor. The second memory stores a second program; the second program is called and executed by the second processor to implement the audio production method provided by the present application. Optionally, the second memory may include, but is not limited to, the following: RAM, ROM, PROM, EPROM, and EEPROM. Optionally, the second processor may be a general-purpose processor, such as a CPU or an NP.
The audio production method provided by the application can be applied to scenes such as song adaptation, song production, song preview and the like.
The audio production method provided by the embodiment of the invention can be executed by the terminal or by the terminal and the server; the terminal has an audio production function, and further, the terminal also has an audio playing function. In some embodiments, the terminal may be a mobile phone, a tablet computer, a desktop computer, a portable computer, etc., which is not limited in this embodiment of the present invention.
Fig. 2 is a flowchart illustrating an audio production method according to another exemplary embodiment, which is applied to a terminal for illustration, and the audio production method may include the following steps:
step 210, displaying an audio editing interface of the first audio, wherein the audio editing interface comprises at least one sentence of lyrics of the first audio and a lyrics editing control, and the at least one sentence of lyrics comprises the first lyrics.
The audio editing interface is used for editing the first audio. For example, an audio selection interface may precede the audio editing interface; the audio selection interface is used to determine the target audio that requires audio editing. In this embodiment, the first audio selected by the user for editing is taken as an example.
Illustratively, the audio editing interface is for presenting audio information of the first audio. The audio information includes at least one of lyrics of the first audio, MV (Music Video), track, tone map (tone level for identifying the main melody), time domain signal, frequency domain signal, audio producer information, related pictures (album art, singer picture, etc.), play progress bar, audio duration.
The audio editing interface also includes, for example, an editing control for editing the first audio. The editing control comprises at least one of lyric editing control, tone editing control, music score editing control, resetting control, listening trial control, finishing control, saving control, tone selection control, sharing control and clipping control (selecting control).
The lyric editing control is used for editing the lyrics of the first audio. Optionally, the lyric editing control, after being triggered, is used to display a lyric editing interface in which the user can input lyrics, and the client receives the lyrics input by the user in the lyric editing interface. Illustratively, the lyric editing interface may be a new user interface, or may be an editing interface located above the audio editing interface; the lyric editing interface includes an edit box, and the edit box is used to receive text information input by the user. Illustratively, each lyric of the first audio corresponds to one lyric editing control, and after receiving a trigger operation on the lyric editing control, the client displays the lyric editing interface corresponding to that lyric. Illustratively, one section of lyrics of the first audio may also correspond to one lyric editing control; after receiving a trigger operation on the lyric editing control, the client displays the lyric editing interface corresponding to that section of lyrics, and the user can edit that section of lyrics in the lyric editing interface. For example, there may also be only one lyric editing control; after receiving a trigger operation on the lyric editing control, the client displays the lyric editing interface of the whole first audio, in which the user can edit all lyrics of the first audio. The lyric editing control may be an invisible UI control that is set on the audio editing interface and bound to the lyrics; the user triggers the lyric editing control corresponding to a lyric by clicking, double-clicking, or long-pressing the lyric or the area corresponding to the lyric, and enters the lyric editing interface of that lyric. Illustratively, the lyric editing control may also be an icon visible on the audio editing interface, and the user triggers the lyric editing control by clicking, double-clicking, or long-pressing it to enter the lyric editing interface.
Illustratively, the tone editing control is used to edit the tone of the first audio, for example, to adjust the pitch of the voice audio corresponding to a sentence of lyrics in the first audio. Illustratively, the melody editing control is used to edit the main melody, the accompaniment melody, or the vocal pitch of the first audio. The timbre selection control is used to select the timbre of the human voice that generates the second audio. Illustratively, the client provides the user with virtual singers of different timbres, from which the user may select a favorite timbre to generate a new song; for example, the available timbres include a child-like voice, a mature male voice, a mature female voice, and so on. Illustratively, the reset control is configured to clear the user's historical editing operations on the first audio so that the user can edit the first audio again. The listening-trial control is used to play the second audio obtained from the user's modification of the first audio. The completion control is used to finish the audio editing and generate the second audio. The save control is used to save the generated second audio. The sharing control is used to share the second audio. The clip control (selection control) is used by the user to select a partial audio segment from the first audio and generate the second audio based on that segment.
Illustratively, the first audio is the audio of a song. Illustratively, the first audio includes at least one of vocal audio, main melody audio, and accompaniment audio. The vocal audio is audio in which the lyrics of the first audio are sung. The main melody audio is audio of the main melody tune of the first audio. The first audio is, for example, audio synthesized from at least two of the vocal audio, the main melody audio, and the accompaniment audio. Illustratively, the first audio corresponds to at least one sentence of lyrics, the lyrics being the text information sung in the first audio.
For example, as shown in fig. 3, an audio editing interface of the first audio is provided, in which the lyrics 301 of the first audio and a lyric editing control 302 corresponding to the first lyrics are displayed. When the client receives a trigger operation on the lyric editing control 302, a lyric editing interface 303 for the first lyrics is displayed, and the user may input second lyrics on the lyric editing interface 303 to replace the first lyrics.
Step 230, receiving a lyric editing operation on the lyric editing control for the first lyrics, the lyric editing operation including inputting a second lyric.
The client receives a lyric editing operation of a user on a lyric editing control on a first lyric, and the client receives a second lyric input by the user in the lyric editing interface through the lyric editing control.
Illustratively, the number of words of the second lyrics may be the same as or different from that of the first lyrics. For example, in order to ensure the audio effect of the second audio generated from the second lyrics, the number of words of the second lyrics input by the user may be limited, for example, to within five words more or fewer than the number of words of the first lyrics.
For example, the first lyrics may be at least one lyric in the first audio, and the corresponding second lyrics are at least one lyric corresponding to the first lyrics.
Step 250, replacing the first lyrics in the first audio with second lyrics, generating second audio, the second audio comprising human voice audio generated from the second lyrics.
The client replaces the vocal audio segment corresponding to the first lyrics in the first audio with the vocal audio segment of the second lyrics, according to the second lyrics input by the user, so as to obtain the second audio. That is, the second audio is synthesized from the accompaniment audio and/or main melody audio of the first audio and the vocal audio of the second lyrics.
Illustratively, the audio duration of the second audio is no greater than that of the first audio. That is, the audio duration of the second audio may be equal to that of the first audio, or it may be less. For example, if the audio duration of the second audio is equal to that of the first audio, the second audio is synthesized from all of the accompaniment audio or all of the main melody audio of the first audio, the vocal audio corresponding to the lyrics of the first audio other than the first lyrics, and the vocal audio corresponding to the second lyrics. For another example, if the audio duration of the second audio is less than that of the first audio, the second audio is generated from an audio segment intercepted from the first audio whose duration equals that of the second audio; in this case, the second audio is synthesized from the accompaniment audio or main melody audio of that segment, the vocal audio corresponding to the lyrics other than the first lyrics, and the vocal audio corresponding to the second lyrics.
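The replacement described above can be sketched as follows, assuming mono WAV tracks at a common sample rate and a known time span for the first lyrics; the file names, the time span, and the mixing weights are illustrative assumptions rather than part of the claimed method.

```python
import numpy as np
import soundfile as sf  # any WAV I/O library would do

SR = 44100  # assumed common sample rate for all (mono) tracks

vocals, _ = sf.read("first_audio_vocals.wav")         # original vocal track
accomp, _ = sf.read("first_audio_accompaniment.wav")  # accompaniment track
new_clip, _ = sf.read("second_lyrics_vocals.wav")     # vocal generated for the second lyrics

# Hypothetical time span (in seconds) occupied by the first lyrics.
start, end = int(12.0 * SR), int(16.5 * SR)

# Fit the generated clip to the original slot: truncate or zero-pad.
slot = end - start
clip = new_clip[:slot]
if len(clip) < slot:
    clip = np.pad(clip, (0, slot - len(clip)))

# Splice the new vocal clip over the first-lyrics segment.
new_vocals = vocals.copy()
new_vocals[start:end] = clip

# Mix the edited vocals with the accompaniment to obtain the second audio.
length = min(len(new_vocals), len(accomp))
second_audio = 0.5 * new_vocals[:length] + 0.5 * accomp[:length]
sf.write("second_audio.wav", second_audio, SR)
```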
The step of generating the second audio may be performed by a client on the terminal or by a server, which sends the second audio to the client after generating the second audio.
Illustratively, the vocal audio of the second lyrics is audio generated by the computer according to a timbre selected by the user. As shown in fig. 4, step 250 is preceded by step 240, and step 250 further includes step 251.
In step 240, a target timbre is obtained, the target timbre being used to generate the vocal audio.
The target timbre is used for enabling the client to generate voice audio according to the sound characteristics of the target timbre.
The target timbre may be a default timbre of the client or a timbre selected by the user from a plurality of candidate timbres. Illustratively, one timbre represents one set of vocal characteristics, and the client may denote different timbres with different virtual singers. For example, virtual singer A corresponds to a child's voice and virtual singer B corresponds to an adult male voice.
For example, the client may display a timbre selection interface including at least one candidate timbre and a selection control; and the client responds to the receiving of the selection operation triggering the selection control, and determines the target tone color from the candidate tone colors according to the selection operation. For example, the client may generate the second audio in real time according to the tone color selected by the user, and play the second audio.
For example, as shown in fig. 5, a timbre selection interface is provided, which includes two candidate timbres, virtual singer A and virtual singer B, and a selection control 401. The user may select one of virtual singer A and virtual singer B as the target timbre and generate the second audio. For example, the currently selected timbre is virtual singer A, and the client is playing the second audio generated with the timbre of virtual singer A. If the user wants to generate the second audio using the timbre of virtual singer B, the user can click virtual singer B and then trigger the selection control 401 to switch the target timbre to virtual singer B.
For example, the vocal audio of the second audio may be generated entirely with the target timbre, only the part corresponding to the second lyrics may be generated with the target timbre, or only a partial segment containing the second lyrics may be generated with the target timbre. That is, the vocal audio of the second audio may contain a single timbre (the target timbre), or it may contain two timbres: the target timbre and the original timbre of the original audio.
Accordingly, the second audio is any one of the following: audio whose duration is less than that of the first audio, in which the vocal audio segment of the second lyrics is generated according to the target timbre and the vocal audio segments of the other lyrics use the original-timbre audio of the first audio; audio whose duration is equal to that of the first audio, in which the vocal audio segment of the second lyrics is generated according to the target timbre and the vocal audio segments of the other lyrics use the original-timbre audio of the first audio; audio whose duration is less than that of the first audio, in which the vocal audio of all lyrics is generated according to the target timbre; or audio whose duration is equal to that of the first audio, in which the vocal audio of all lyrics is generated according to the target timbre.
Step 251, replacing the first lyrics in the first audio with the second lyrics according to the target tone color, and generating the second audio.
Illustratively, the client replaces the vocal audio of the first lyrics in the first audio with the vocal audio of the second lyrics generated using the target timbre, and synthesizes the second audio with the accompaniment audio and/or main melody audio of the first audio.
Illustratively, a method for generating vocal audio using a target timbre is provided, as shown in fig. 6, step 251 further includes steps 2511 through 2513.
Step 2511, generating a vocal audio containing the second lyrics according to the target tone, the phonemes of the second lyrics, and the notes corresponding to the first lyrics in the first audio.
The client generates the vocal audio of the second lyrics by using the target tone, the phonemes of the second lyrics, and the notes corresponding to the first lyrics in the first audio. If the vocal audio of the other lyrics in the second audio is also generated with the target tone, the vocal audio of the other lyrics needs to be generated by using the target tone, phonemes of the other lyrics, and notes of the other lyrics corresponding to the first audio. If the voice audio of the other lyrics in the second audio uses the voice audio of the first audio, the voice audio of the first audio can be cut, and then the voice audio of the first audio is spliced with the voice audio of the second lyrics to obtain the complete voice audio of the second audio.
For example, the synthesis of the vocal audio also uses the position information of the phonemes, the position information of the notes, and so on. The position information of a phoneme marks the position of that phoneme in the audio; for example, the first phoneme occupies frames 1 to 100 of the audio, the second phoneme occupies frames 101 to 200, and so on. The position information of a note marks the position of that note in the audio; for example, the first note occupies frames 50 to 200. Illustratively, from this positional information, the phoneme and note corresponding to each frame of audio can be obtained.
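The frame-level expansion described above can be sketched as follows; the span values mirror the example frame ranges in the preceding paragraph and are otherwise hypothetical.

```python
def spans_to_frames(spans, num_frames, fill=None):
    """Expand (label, start_frame, end_frame) spans into one label per frame."""
    frames = [fill] * num_frames
    for label, start, end in spans:
        for i in range(start, min(end, num_frames)):
            frames[i] = label
    return frames

# Hypothetical position information, mirroring the example in the text:
phoneme_spans = [("d", 0, 100), ("a", 100, 200), ("i", 200, 260)]
note_spans = [("C4", 49, 200), ("E4", 200, 260)]

phoneme_per_frame = spans_to_frames(phoneme_spans, num_frames=260)
note_per_frame = spans_to_frames(note_spans, num_frames=260)
# frame 120 is sung as phoneme "a" on note "C4"
print(phoneme_per_frame[120], note_per_frame[120])
```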
At step 2512, template audio of the first audio is obtained, the template audio including at least one of accompaniment audio and main melody audio.
Illustratively, the client obtains template audio of the first audio, the template audio being audio other than human voice audio in the first audio. Illustratively, the template audio includes at least one of accompaniment audio and main melody audio.
Illustratively, for each song, the template audio of the song needs to be produced in advance. The template audio production steps are as follows: 1. Obtain the accompaniment audio of the song; illustratively, the accompaniment may be separated from the audio of the original song, or the accompaniment audio of the song may be obtained directly. Illustratively, a portion of the accompaniment may also be clipped, e.g., the chorus portion of the song. 2. Transcribe the music manually to produce a MIDI (Musical Instrument Digital Interface) file of the main melody audio. 3. Produce the template audio by aligning and mixing the accompaniment audio and the main melody audio.
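A minimal sketch of step 3, assuming the separated accompaniment and a rendering of the transcribed MIDI main melody are already available as mono WAV files; the file names and mixing weights are illustrative.

```python
import soundfile as sf

accomp, sr = sf.read("accompaniment.wav")            # step 1: separated accompaniment
melody, sr2 = sf.read("main_melody_from_midi.wav")   # step 2: MIDI main melody rendered to audio
assert sr == sr2, "both tracks are assumed to share one sample rate"

# Step 3: align (trim to the shorter track) and mix to obtain the template audio.
length = min(len(accomp), len(melody))
template = 0.6 * accomp[:length] + 0.4 * melody[:length]  # illustrative mixing weights

sf.write("template_audio.wav", template, sr)
```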
Step 2513, generating a second audio from the template audio and the human voice audio.
And the client synthesizes the second audio according to the template audio of the first audio and the voice audio containing the second lyrics.
That is, the second audio has the same melody as the first audio, but the lyrics are different.
The client may also play the second audio after obtaining the second audio. As shown in fig. 4, step 260 and step 270 are also included after step 250.
Step 260, displaying an audio playing interface of the second audio, wherein the audio playing interface comprises a playing control.
For example, the audio playback interface and the audio editing interface may be the same interface. That is, the client plays the second audio immediately after generating the second audio, so that the user can preview the generated second audio at the audio editing interface.
The audio playback interface and the audio editing interface may also be different interfaces, for example. That is, after the client generates the second audio, the user may click on the completion control to jump to the audio playing interface to play, save or share the second audio.
Step 270, playing the second audio in response to receiving a play operation triggering the play control.
In summary, in the method provided by this embodiment, the client receives the user's changes to the song lyrics on the audio editing interface and generates a changed song according to the user's changed lyrics and the original song, so that the user can modify the lyrics of a song with one tap and quickly generate a new song; this simplifies the operation steps for generating audio and improves the efficiency of audio editing.
In the method provided by this embodiment, after changing the lyrics, the user also selects a virtual singer whose voice is used to generate the vocal audio of the adapted lyrics, and this vocal audio replaces the vocal audio of the original lyrics of the original song to obtain a new song. The user can select different virtual singers to sing the adapted lyrics and thereby obtain different new songs, which enriches the user's ability to edit songs, simplifies the operation steps for generating audio, and improves editing efficiency.
In the method provided by this embodiment, the timbre of the virtual singer, the phonemes of the lyrics adapted by the user, and the notes of the original lyrics are first used to generate vocal audio that sings the adapted lyrics to the melody of the original lyrics; the accompaniment of the original song and the newly generated vocal audio are then synthesized to obtain a new song. In this way, the lyrics of a song can be replaced with one tap to generate a new song, simplifying the user's operations for generating audio.
In the method provided by this embodiment, the new song may be a partial segment of the original song or a complete song, and the new song may be a song in which only the changed lyrics are sung by the designated virtual singer, or a complete song sung by the designated virtual singer.
In the method provided by this embodiment, after the user selects a virtual singer, a new song is generated according to the selected singer and then played immediately, so that the user can preview in real time the song generated with the current singer; if the user is not satisfied with the result, the singer can be changed in real time.
According to the method provided by the embodiment, after the new song is generated, the preview playing interface of the new song can be displayed, so that the user previews the generated new song.
Illustratively, the present embodiment provides a method for obtaining vocal audio using a neural network model. Fig. 7 is a flowchart of an audio production method according to another exemplary embodiment, which is applied to a terminal for illustration, where step 2511 further includes steps 2511-1 to 2511-2.
Step 2511-1, inputting the tone mark of the target tone, the phonemes of the second lyrics, and the notes corresponding to the first lyrics in the first audio into an acoustic model to obtain a mel frequency spectrum.
Illustratively, the acoustic model is a deep neural network acoustic model. The acoustic model is used to generate a mel spectrum according to the input two-dimensional text information. Illustratively, the acoustic model is a neural network model employing a Long Short-Term Memory (LSTM) architecture.
The mel spectrum labels the frequency-domain features of audio using the mel scale. The time-domain waveform of the audio signal is windowed frame by frame and Fourier-transformed to obtain the frequency-domain signal of each frame; stacking the frequency-domain signals of all frames yields the spectrogram of the audio signal, and marking the frequencies in the spectrogram with the mel scale yields the mel spectrum of the audio signal. Likewise, the audio signal may be restored from its mel spectrum. The mel scale was named by Stevens, Volkmann, and Newman in 1937. The unit of frequency is hertz (Hz), and the range of frequencies audible to the human ear is 20-20000 Hz, but the human ear's perception of the Hz scale is not linear. For example, if a person adapts to a tone of 1000 Hz and the frequency is then raised to 2000 Hz, the person perceives only a small increase in pitch, far from a doubling. If the ordinary frequency scale is converted into the mel scale, the human ear's perception of frequency becomes approximately linear; that is, on the mel scale, if the mel frequencies of two pieces of speech differ by a factor of two, the pitches perceived by the human ear also differ by approximately a factor of two. The mapping between the mel scale and the frequency scale is as follows:
mel(f) = 2595 × log10(1 + f / 700)
where mel(f) is the mel scale value and f is the frequency in Hz.
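The mapping can be checked numerically, and a mel spectrum of the kind used below can be computed with librosa; the file name and spectrogram parameters in this sketch are illustrative assumptions.

```python
import math
import librosa

def hz_to_mel(f):
    """mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

# Doubling 1000 Hz to 2000 Hz raises the mel value far less than twofold.
print(hz_to_mel(1000.0), hz_to_mel(2000.0))

# Mel spectrogram of a (hypothetical) vocal recording, 80 mel bins as in typical acoustic models.
y, sr = librosa.load("vocals.wav", sr=22050)
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
print(mel_spec.shape)  # (80, number_of_frames)
```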
Illustratively, a timbre identifier of a target timbre, a phoneme of the second lyrics, position information of the phoneme, a note corresponding to the first lyrics in the first audio and position information of the note are input into an acoustic model to obtain a mel frequency spectrum.
The position information of the phonemes of the second lyrics may be determined according to the position information of the phonemes of the first lyrics. When the second lyrics have the same number of words as the first lyrics, the word positions (the positions of the individual words) of the second lyrics may be determined directly from the word positions of the first lyrics, and the positions of the phonemes are then determined from the word positions. When the numbers of words differ, various word-position layouts may be designed for the first lyrics in advance; for example, if the first lyrics originally have 5 words and 5 corresponding word positions, a layout of 6 word positions for the case where the second lyrics have 6 words and a layout of 8 word positions for the case where the second lyrics have 8 words may be preset, and the words of the second lyrics are then placed in order according to the preset layout, so that the positions of the phonemes of the second lyrics are determined. Alternatively, the word with the longest duration may be selected from the first lyrics, and its duration divided into a first number of equal parts, yielding additional word positions of equal duration; these additional word positions are inserted among the word positions of the first lyrics to obtain word positions matching the word count of the second lyrics, the words of the second lyrics are then filled into these word positions in order, the positions of the words are determined, and the positions of the phonemes are determined accordingly. The first number equals the number of words by which the second lyrics exceed the first lyrics, plus one. For example, the first lyrics consist of three words "ABC" and the second lyrics consist of five words "12345", where word A of the first lyrics lasts 3 seconds, B lasts 1 second, and C lasts 1 second. The duration of A is divided equally according to the word-count difference between the first and second lyrics plus one, i.e., the 3 seconds are split into one word position in the first second, one in the second second, and one in the third second, yielding two additional word positions. Merged with the original word positions of the first lyrics, this gives five word positions, the five words of the second lyrics are filled into them respectively, the positions of the words are determined, and the positions of the phonemes are then determined.
Alternatively, the positions of the phonemes of the second lyrics may be determined directly from the positions of the phonemes of the first lyrics; when the numbers of phonemes of the first and second lyrics are unequal, new gaps can be added among the original phoneme positions of the first lyrics using the above method, and the phonemes of the second lyrics are filled into these gaps.
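The word-position redistribution described in the preceding paragraphs can be sketched as follows; the timing values reuse the "ABC" / "12345" example, and the function name is illustrative.

```python
def redistribute_word_slots(first_slots, num_second_words):
    """first_slots: list of (start_sec, end_sec) for the first lyrics' words.
    Splits the longest slot into (difference + 1) equal parts so the total
    number of slots equals num_second_words."""
    extra = num_second_words - len(first_slots)
    if extra <= 0:
        return first_slots[:num_second_words]
    # pick the word occupying the longest duration
    longest = max(range(len(first_slots)),
                  key=lambda i: first_slots[i][1] - first_slots[i][0])
    start, end = first_slots[longest]
    parts = extra + 1
    step = (end - start) / parts
    split = [(start + k * step, start + (k + 1) * step) for k in range(parts)]
    return first_slots[:longest] + split + first_slots[longest + 1:]

# Example from the text: "ABC" with A lasting 3 s, B and C 1 s each; second lyrics "12345".
first_slots = [(0.0, 3.0), (3.0, 4.0), (4.0, 5.0)]
print(redistribute_word_slots(first_slots, 5))
# [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]
```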
The client invokes the acoustic model corresponding to the input timbre identifier of the target timbre, so as to obtain the mel spectrum.
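As one possible shape of such an acoustic model, the following PyTorch sketch embeds per-frame phoneme and note identifiers together with a timbre identifier, runs them through an LSTM, and projects the result to 80 mel bins; the vocabulary sizes, layer sizes, and number of mel bins are assumptions, not the model disclosed here.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, n_phonemes=100, n_notes=128, n_timbres=10, hidden=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, 64)
        self.note_emb = nn.Embedding(n_notes, 64)
        self.timbre_emb = nn.Embedding(n_timbres, 32)
        self.lstm = nn.LSTM(64 + 64 + 32, hidden, num_layers=2, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, phoneme_ids, note_ids, timbre_id):
        # phoneme_ids, note_ids: (batch, frames); timbre_id: (batch,)
        frames = phoneme_ids.size(1)
        timbre = self.timbre_emb(timbre_id).unsqueeze(1).expand(-1, frames, -1)
        x = torch.cat([self.phoneme_emb(phoneme_ids), self.note_emb(note_ids), timbre], dim=-1)
        out, _ = self.lstm(x)
        return self.to_mel(out)  # (batch, frames, n_mels)

model = AcousticModel()
mel = model(torch.randint(0, 100, (1, 260)),   # per-frame phoneme IDs
            torch.randint(0, 128, (1, 260)),   # per-frame note IDs
            torch.tensor([3]))                 # timbre identifier
print(mel.shape)  # torch.Size([1, 260, 80])
```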
Step 2511-2, call vocoder to convert mel spectrum into voice audio.
The client calls the vocoder to convert the mel spectrum to human voice audio. Illustratively, the vocoder may use WaveRNN or WaveGlow vocoders.
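Running a pretrained WaveRNN or WaveGlow requires model checkpoints that cannot be assumed here, so the sketch below substitutes librosa's Griffin-Lim mel inversion as a stand-in vocoder; its parameters must match those used when the mel spectrum was computed.

```python
import librosa
import soundfile as sf

def mel_to_waveform(mel_spec, sr=22050, n_fft=1024, hop_length=256):
    """Griffin-Lim stand-in for a neural vocoder: invert a (n_mels, frames) mel spectrogram."""
    return librosa.feature.inverse.mel_to_audio(
        mel_spec, sr=sr, n_fft=n_fft, hop_length=hop_length
    )

# mel_spec would come from the acoustic model (converted to a NumPy array):
# waveform = mel_to_waveform(mel_spec)
# sf.write("second_lyrics_vocals.wav", waveform, 22050)
```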
In summary, in the method provided by this embodiment, to obtain the vocal audio, an acoustic model is first used to obtain the mel spectrum of the vocal audio; the acoustic model is a deep neural network acoustic model that generates the mel spectrum of the audio from the input text information. After the mel spectrum is obtained, the vocoder converts the mel spectrum into vocal audio, yielding singing audio generated with the designated virtual singer's voice.
Illustratively, the present embodiment provides a method of training an acoustic model. Fig. 8 is a flowchart illustrating an acoustic model training method according to another exemplary embodiment, which is illustrated as a method applied to a terminal, and includes the following steps.
Step 310, obtaining training data, the training data comprising: at least one of phonemes of training lyrics, notes of training lyrics, phoneme position information of training lyrics, note position information of training lyrics, tone color identification of training audio, mel frequency spectrum of training audio.
Illustratively, the training audio includes a cappella audio (clean singing data), and the training lyrics are the lyrics of the a cappella audio. The client obtains the a cappella data as training data; the a cappella data contains only vocal audio. Labels such as phonemes, notes, phoneme position information, note position information, and timbre identifiers are then added to the a cappella data manually, and the mel spectrum of the a cappella data is generated from the audio, so that a set of training data (phonemes, notes, phoneme position information, note position information, timbre identifier, mel spectrum) corresponding to the a cappella data is obtained. Illustratively, the client obtains multiple sets of training data corresponding to multiple pieces of a cappella data.
Step 320, training the initial model according to the training data to obtain an acoustic model.
The client takes the mel spectrum as the expected output, inputs the phonemes, notes, phoneme position information, and note position information into the initial model, and trains the initial model to obtain the acoustic model.
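A sketch of one training step, reusing the hypothetical AcousticModel from the earlier sketch: the per-frame phonemes and notes (derived from the position information) and the timbre identifier are the inputs, the mel spectrum of the training audio is the expected output, and a mean-squared error is minimized; all shapes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, phoneme_ids, note_ids, timbre_id, target_mel):
    """One optimization step: predicted mel vs. labelled mel of the training audio."""
    optimizer.zero_grad()
    pred_mel = model(phoneme_ids, note_ids, timbre_id)    # (batch, frames, n_mels)
    loss = nn.functional.mse_loss(pred_mel, target_mel)   # mel spectrum is the expected output
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with the AcousticModel sketched earlier and one hypothetical 260-frame example:
# model = AcousticModel()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = train_step(model, optimizer,
#                   torch.randint(0, 100, (1, 260)),   # per-frame phoneme IDs
#                   torch.randint(0, 128, (1, 260)),   # per-frame note IDs
#                   torch.tensor([0]),                 # timbre identifier
#                   torch.randn(1, 260, 80))           # labelled mel spectrum
```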
In summary, in the method provided by this embodiment, by training the acoustic model, the client can use the acoustic model to obtain the mel spectrum and obtain the vocal audio from the mel spectrum. This simplifies the user's audio editing operations; completing this step with a deep neural network model can improve audio editing efficiency and reduce the level of expertise required of the user for audio editing, so that the user can generate the desired song through simple operations, improving the user's audio editing capability.
Fig. 9 is a schematic diagram of an audio production device that may be implemented in software, hardware, or a combination of both, according to an example embodiment. The audio production apparatus may include:
The display module 901 is configured to display an audio editing interface of a first audio, where the audio editing interface includes at least one sentence of lyrics of the first audio and a lyric editing control, and the at least one sentence of lyrics includes a first lyric;
An interaction module 902 for receiving a lyric editing operation on the lyric editing control for the first lyric, the lyric editing operation including inputting a second lyric;
A generating module 903, configured to replace the first lyrics in the first audio with the second lyrics, and generate second audio, where the second audio includes voice audio generated according to the second lyrics.
Optionally, the apparatus further comprises:
An obtaining module 904, configured to obtain a target tone, where the target tone is used to generate the voice audio;
The generating module 903 is further configured to replace the first lyrics in the first audio with the second lyrics according to the target tone color, and generate the second audio.
Optionally, the generating module 903 is further configured to generate the vocal audio including the second lyrics according to the target tone, phonemes of the second lyrics, and notes corresponding to the first lyrics in the first audio;
The obtaining module 904 is further configured to obtain a template audio of the first audio, where the template audio includes at least one of accompaniment audio and main melody audio;
the generating module 903 is further configured to generate the second audio according to the template audio and the voice audio.
Optionally, the generating module 903 includes:
A model submodule 905, configured to input a tone mark of the target tone, the phoneme of the second lyric, and a note corresponding to the first lyric in the first audio into an acoustic model to obtain a mel frequency spectrum;
a vocoder sub-module 906 for invoking a vocoder to convert the mel spectrum into the human voice audio.
Optionally, the second audio includes:
audio whose duration is less than that of the first audio, in which the vocal audio segment of the second lyrics is generated according to the target timbre and the vocal audio segments of the lyrics other than the second lyrics use the original-timbre audio of the first audio;
or,
audio whose duration is equal to that of the first audio, in which the vocal audio segment of the second lyrics is generated according to the target timbre and the vocal audio segments of the lyrics other than the second lyrics use the original-timbre audio of the first audio;
or,
audio whose duration is less than that of the first audio, in which the vocal audio of all lyrics is generated according to the target timbre;
or,
audio whose duration is equal to that of the first audio, in which the vocal audio of all lyrics is generated according to the target timbre.
Optionally, the apparatus further comprises:
The obtaining module 904 is further configured to obtain training data, where the training data includes: at least one of phonemes of training lyrics, notes of the training lyrics, phoneme position information of the training lyrics, note position information of the training lyrics, tone color identification of training audio, and mel frequency spectrum of the training audio;
And a training module 907, configured to train an initial model according to the training data to obtain the acoustic model.
Optionally, the apparatus further comprises:
The display module 901 is further configured to display an audio playing interface of the second audio, where the audio playing interface includes a playing control;
the interaction module 902 is further configured to receive a play operation that triggers the play control;
And a playing module 908, configured to play the second audio in response to receiving a play operation that triggers the play control.
Optionally, the apparatus further comprises:
The display module 901 is further configured to display a tone color selection interface, where the tone color selection interface includes at least one candidate tone color and a selection control;
the interaction module 902 is further configured to receive a selection operation that triggers the selection control;
the obtaining module 904 is further configured to determine, in response to receiving a selection operation triggering the selection control, the target timbre from the candidate timbres according to the selection operation.
A playing module 908, configured to play the second audio.
It should be noted that: in the audio production device provided in the above embodiment, when implementing the audio production method, only the division of the above functional modules is used for illustration, in practical application, the above functional allocation may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the audio production device and the audio production method provided in the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments, which are not described herein again.
Fig. 10 shows a block diagram of a terminal 1000 according to an exemplary embodiment of the present invention. The terminal 1000 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 1000 may also be called by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, terminal 1000 can include: a processor 1001 and a memory 1002.
The processor 1001 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1001 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1001 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 1001 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. Memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1002 is used to store at least one instruction for execution by processor 1001 to implement the audio production method provided by the method embodiments of the present application.
In some embodiments, terminal 1000 can optionally further include: a peripheral interface 1003, and at least one peripheral. The processor 1001, the memory 1002, and the peripheral interface 1003 may be connected by a bus or signal line. The various peripheral devices may be connected to the peripheral device interface 1003 via a bus, signal wire, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1004, touch display 1005, camera assembly 1006, audio circuitry 1007, positioning assembly 1008, and power supply 1009.
Peripheral interface 1003 may be used to connect I/O (Input/Output) related at least one peripheral to processor 1001 and memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 1001, memory 1002, and peripheral interface 1003 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1004 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1004 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1004 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1004 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1004 may further include NFC (Near Field Communication) related circuits, which is not limited by the present application.
The display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch screen, the display screen 1005 also has the ability to capture touch signals at or above its surface. The touch signal may be input to the processor 1001 as a control signal for processing. At this time, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1005, provided on the front panel of the terminal 1000; in other embodiments, there may be at least two display screens 1005, respectively provided on different surfaces of the terminal 1000 or in a folded configuration; in still other embodiments, the display screen 1005 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 1000. The display screen 1005 may even be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display screen 1005 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and Virtual Reality (VR) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 1006 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1001 for processing, or inputting the electric signals to the radio frequency circuit 1004 for voice communication. For purposes of stereo acquisition or noise reduction, the microphone may be multiple, each located at a different portion of terminal 1000. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuit 1007 may also include a headphone jack.
The positioning component 1008 is used to locate the current geographic location of terminal 1000 to enable navigation or LBS (Location Based Service). The positioning component 1008 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
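As a non-limiting sketch of how the navigation or LBS function described above could be exercised, the following Kotlin example requests location updates from a satellite positioning provider. The provider choice, update interval, and distance threshold are illustrative assumptions, and the ACCESS_FINE_LOCATION permission is assumed to have been granted already.

```kotlin
import android.annotation.SuppressLint
import android.content.Context
import android.location.Location
import android.location.LocationListener
import android.location.LocationManager

// Illustrative sketch only: obtain the terminal's current geographic location
// for navigation or LBS. Assumes location permission has already been granted.
@SuppressLint("MissingPermission")
fun requestLocation(context: Context, onFix: (Location) -> Unit) {
    val manager = context.getSystemService(Context.LOCATION_SERVICE) as LocationManager
    val listener = object : LocationListener {
        override fun onLocationChanged(location: Location) = onFix(location)
    }
    manager.requestLocationUpdates(
        LocationManager.GPS_PROVIDER,
        1_000L,   // minimum interval between updates, in milliseconds (assumed value)
        1f,       // minimum distance between updates, in meters (assumed value)
        listener
    )
}
```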
The power supply 1009 is used to supply power to the various components in terminal 1000. The power supply 1009 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 1009 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery charged through a wired line, and a wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast-charging technology.
In some embodiments, terminal 1000 can further include one or more sensors 1010. The one or more sensors 1010 include, but are not limited to: acceleration sensor 1011, gyroscope sensor 1012, pressure sensor 1013, fingerprint sensor 1014, optical sensor 1015, and proximity sensor 1016.
The acceleration sensor 1011 can detect the magnitude of acceleration on each of the three coordinate axes of the coordinate system established with terminal 1000. For example, the acceleration sensor 1011 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 1001 may control the touch display screen 1005 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1011. The acceleration sensor 1011 may also be used to acquire motion data of a game or of the user.
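As an illustrative, non-limiting sketch of the landscape/portrait behaviour described above, the following Kotlin example reads the gravity components reported by the acceleration sensor and switches the requested screen orientation accordingly. The comparison threshold and sensor delay are assumptions made only for demonstration.

```kotlin
import android.app.Activity
import android.content.Context
import android.content.pm.ActivityInfo
import android.hardware.Sensor
import android.hardware.SensorEvent
import android.hardware.SensorEventListener
import android.hardware.SensorManager
import kotlin.math.abs

// Illustrative sketch only: display the UI in landscape or portrait view
// according to the gravitational acceleration components on the x/y axes.
fun followGravityOrientation(activity: Activity) {
    val manager = activity.getSystemService(Context.SENSOR_SERVICE) as SensorManager
    val accelerometer = manager.getDefaultSensor(Sensor.TYPE_ACCELEROMETER) ?: return
    val listener = object : SensorEventListener {
        override fun onSensorChanged(event: SensorEvent) {
            val x = event.values[0]   // gravity component along the x axis
            val y = event.values[1]   // gravity component along the y axis
            activity.requestedOrientation =
                if (abs(x) > abs(y)) ActivityInfo.SCREEN_ORIENTATION_LANDSCAPE
                else ActivityInfo.SCREEN_ORIENTATION_PORTRAIT
        }
        override fun onAccuracyChanged(sensor: Sensor, accuracy: Int) = Unit
    }
    manager.registerListener(listener, accelerometer, SensorManager.SENSOR_DELAY_UI)
}
```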
The gyroscope sensor 1012 may detect the body direction and rotation angle of terminal 1000, and may cooperate with the acceleration sensor 1011 to capture the user's 3D actions on terminal 1000. Based on the data collected by the gyroscope sensor 1012, the processor 1001 may implement the following functions: motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1013 may be disposed on a side frame of terminal 1000 and/or on an underlying layer of the touch display screen 1005. When the pressure sensor 1013 is disposed on a side frame of terminal 1000, a grip signal of the user on terminal 1000 can be detected, and the processor 1001 performs left/right hand recognition or a quick operation according to the grip signal collected by the pressure sensor 1013. When the pressure sensor 1013 is disposed on the underlying layer of the touch display screen 1005, the processor 1001 controls an operable control on the UI according to the user's pressure operation on the touch display screen 1005. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The fingerprint sensor 1014 is used to collect a fingerprint of the user, and the processor 1001 identifies the identity of the user based on the fingerprint collected by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1001 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 1014 may be provided on the front, back, or side of terminal 1000. When a physical key or a manufacturer Logo is provided on terminal 1000, the fingerprint sensor 1014 may be integrated with the physical key or the manufacturer Logo.
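As a non-limiting sketch of gating a sensitive operation behind the identity check described above, the following Kotlin example uses the AndroidX BiometricPrompt API as a stand-in for the fingerprint verification flow. The prompt texts and the callback wiring are assumptions for illustration only, not part of the claimed method.

```kotlin
import androidx.biometric.BiometricPrompt
import androidx.core.content.ContextCompat
import androidx.fragment.app.FragmentActivity

// Illustrative sketch only: run a sensitive operation (e.g., viewing encrypted
// information) only after the user's identity is confirmed as trusted.
fun runAfterFingerprintCheck(activity: FragmentActivity, sensitiveOperation: () -> Unit) {
    val executor = ContextCompat.getMainExecutor(activity)
    val callback = object : BiometricPrompt.AuthenticationCallback() {
        override fun onAuthenticationSucceeded(result: BiometricPrompt.AuthenticationResult) {
            sensitiveOperation()   // trusted identity confirmed, proceed
        }
    }
    val prompt = BiometricPrompt(activity, executor, callback)
    val promptInfo = BiometricPrompt.PromptInfo.Builder()
        .setTitle("Verify identity")        // assumed prompt text
        .setNegativeButtonText("Cancel")    // assumed prompt text
        .build()
    prompt.authenticate(promptInfo)
}
```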
The optical sensor 1015 is used to collect the ambient light intensity. In one embodiment, the processor 1001 may control the display brightness of the touch display screen 1005 based on the ambient light intensity collected by the optical sensor 1015. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1005 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 1005 is turned down. In another embodiment, the processor 1001 may dynamically adjust the shooting parameters of the camera assembly 1006 according to the ambient light intensity collected by the optical sensor 1015.
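The brightness adjustment just described can be sketched, purely as an illustration, with the following Kotlin example: the ambient light intensity (in lux) from the light sensor is mapped onto the window's display brightness. The linear mapping and clamping range are assumptions chosen only for demonstration.

```kotlin
import android.app.Activity
import android.content.Context
import android.hardware.Sensor
import android.hardware.SensorEvent
import android.hardware.SensorEventListener
import android.hardware.SensorManager

// Illustrative sketch only: turn the display brightness up in bright environments
// and down in dim ones, based on the ambient light intensity.
fun followAmbientBrightness(activity: Activity) {
    val manager = activity.getSystemService(Context.SENSOR_SERVICE) as SensorManager
    val lightSensor = manager.getDefaultSensor(Sensor.TYPE_LIGHT) ?: return
    val listener = object : SensorEventListener {
        override fun onSensorChanged(event: SensorEvent) {
            val lux = event.values[0]
            val attrs = activity.window.attributes
            // Higher ambient light -> higher screen brightness, clamped to [0.05, 1.0].
            attrs.screenBrightness = (lux / 1_000f).coerceIn(0.05f, 1f)
            activity.window.attributes = attrs
        }
        override fun onAccuracyChanged(sensor: Sensor, accuracy: Int) = Unit
    }
    manager.registerListener(listener, lightSensor, SensorManager.SENSOR_DELAY_NORMAL)
}
```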
The proximity sensor 1016, also referred to as a distance sensor, is typically disposed on the front panel of terminal 1000. The proximity sensor 1016 is used to collect the distance between the user and the front face of terminal 1000. In one embodiment, when the proximity sensor 1016 detects that the distance between the user and the front face of terminal 1000 gradually decreases, the processor 1001 controls the touch display screen 1005 to switch from the screen-on state to the screen-off state; when the proximity sensor 1016 detects that the distance between the user and the front face of terminal 1000 gradually increases, the processor 1001 controls the touch display screen 1005 to switch from the screen-off state to the screen-on state.
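On Android-style terminals, the proximity-driven screen-off/on behaviour described above is commonly realized with a proximity wake lock, which turns the display off while the sensor is covered and back on when it is uncovered. The following Kotlin example is only an illustrative sketch of that mechanism; the wake-lock tag and timeout are assumed values.

```kotlin
import android.content.Context
import android.os.PowerManager

// Illustrative sketch only: hold a proximity wake lock (e.g., during a voice call)
// so the screen turns off when the user's face approaches the front panel and
// turns back on when the distance increases again.
fun holdProximityScreenOffLock(context: Context): PowerManager.WakeLock? {
    val powerManager = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    if (!powerManager.isWakeLockLevelSupported(PowerManager.PROXIMITY_SCREEN_OFF_WAKE_LOCK)) {
        return null   // the device has no proximity-controlled screen-off support
    }
    val lock = powerManager.newWakeLock(
        PowerManager.PROXIMITY_SCREEN_OFF_WAKE_LOCK,
        "example:proximityLock"            // assumed tag
    )
    lock.acquire(10 * 60 * 1000L)          // hold for at most 10 minutes; call release() when done
    return lock
}
```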
Those skilled in the art will appreciate that the structure shown in fig. 10 is not limiting and that terminal 1000 can include more or fewer components than shown, or certain components can be combined, or a different arrangement of components can be employed.
The embodiment of the application also provides a computer device, comprising a processor and a memory, wherein the memory stores instructions which are executed by the processor to implement the audio production method provided by the above embodiments.
The embodiment of the application also provides a non-transitory computer-readable storage medium, instructions in which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the audio production method provided by the above embodiments.
The embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the audio production method provided by the above embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above embodiments are merely examples of the present invention and are not intended to limit the present invention. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.