CN108831437B - Singing voice generation method, singing voice generation device, terminal and storage medium - Google Patents


Info

Publication number
CN108831437B
CN108831437B (application CN201810622548.8A)
Authority
CN
China
Prior art keywords
information
voice signal
standard
voice
acoustic characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810622548.8A
Other languages
Chinese (zh)
Other versions
CN108831437A (en)
Inventor
李昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810622548.8A
Publication of CN108831437A
Application granted
Publication of CN108831437B
Legal status: Active
Anticipated expiration

Abstract

Embodiments of the invention disclose a singing voice generation method, apparatus, terminal, and storage medium. The singing voice generation method comprises the following steps: acquiring a voice signal, entered by a user, that corresponds to a song; acquiring standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, and updating the acoustic feature information of the voice signal according to the standard acoustic feature information, where the acoustic feature template stores the standard acoustic feature information of at least one song; and storing or outputting the voice signal with the updated acoustic feature information as a target voice signal. The embodiments overcome the prior-art problems that voice-to-singing conversion requires training an acoustic model on a large amount of data and that the resulting singing voice does not contain the user's own voice, which leads to low user participation and a poor user experience. They achieve the effect of converting the user's voice into singing that retains the user's own voice, without training an acoustic model.

Description

Singing voice generation method, singing voice generation device, terminal and storage medium
Technical Field
Embodiments of the present invention relate to the field of computer technology, and in particular to a singing voice generation method, apparatus, terminal, and storage medium.
Background
Voice-to-singing conversion means converting a user's spoken voice into a corresponding singing voice. After converting the user's voice into singing, Internet products can combine it with accompaniment music to synthesize the user's own singing works, which has entertainment value, social value, and some commercial value.
The prior-art scheme for converting voice into singing voice mainly comprises the following steps. In the model training stage, model training is performed using the text data (lyrics and the like) of a number of songs by a professional singer A, together with the acoustic features of singer A's performances of those songs, to obtain an acoustic model of singer A. In the singing voice generation stage, voice data of a song sung or read by a user B is acquired; the lyrics are recognized from the voice data, and the acoustic features of user B are extracted. The recognized lyrics are input into singer A's acoustic model to obtain predicted acoustic features, and the fundamental frequency and duration in the predicted features are replaced with those from user B's acoustic features, yielding modified acoustic features that combine user B's fundamental frequency and duration with singer A's spectrum. Synthesizing from the modified acoustic features by a parametric statistical method or a sound-library concatenation method then produces a singing voice that has singer A's voice characteristics together with user B's pitch and rhythm, i.e., the effect of singer A imitating user B singing the song.
This scheme requires acoustic model training, which demands a large amount of sample data, makes the implementation complex, and degrades sound quality. Moreover, the synthesized singing voice carries the singer's voice characteristics rather than the user's, so user participation and experience are poor.
Disclosure of Invention
Embodiments of the present invention provide a singing voice generation method, apparatus, terminal, and storage medium, so as to achieve an effect of converting a user's voice into a singing voice that retains the user's own voice without performing acoustic model training.
In a first aspect, an embodiment of the present invention provides a singing voice generating method, where the method includes:
acquiring a voice signal corresponding to a song input by a user;
acquiring standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, and updating the acoustic feature information of the voice signal according to the standard acoustic feature information; wherein, standard acoustic characteristic information of at least one song is stored in the acoustic characteristic template;
and storing or outputting the voice signal with the updated acoustic characteristic information as a target voice signal.
In a second aspect, an embodiment of the present invention further provides a singing voice generating apparatus, where the apparatus includes:
the voice signal acquisition module is used for acquiring a voice signal which is recorded by a user and corresponds to the song;
the acoustic feature information updating module is used for acquiring standard acoustic feature information corresponding to the song from a pre-established acoustic feature template and updating the acoustic feature information of the voice signal according to the standard acoustic feature information; wherein, standard acoustic characteristic information of at least one song is stored in the acoustic characteristic template;
and the target voice signal determining module is used for storing or outputting the voice signal with the updated acoustic characteristic information as a target voice signal.
In a third aspect, an embodiment of the present invention further provides a singing voice generating terminal, where the terminal includes:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the singing voice generation method described in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the singing voice generating method according to the first aspect.
Embodiments of the invention acquire a voice signal, entered by a user, that corresponds to a song; acquire standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, in which the standard acoustic feature information of at least one song is stored; update the acoustic feature information of the voice signal according to the standard acoustic feature information; and store or output the voice signal with the updated acoustic feature information as a target voice signal. This overcomes the prior-art problems that voice-to-singing conversion requires training an acoustic model on a large amount of data and that the resulting singing voice does not contain the user's own voice, which leads to low user participation and a poor experience. The effect achieved is that the user's voice is converted into singing that retains the user's own voice, without acoustic model training, while good sound quality is also ensured.
Drawings
Fig. 1 is a flowchart of a singing voice generating method according to a first embodiment of the present invention;
fig. 2 is a flowchart of a singing voice generating method according to a second embodiment of the present invention;
fig. 3 is a flowchart of a singing voice generating method in a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a singing voice generating apparatus in a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a singing voice generating terminal in the fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a singing voice generation method according to the first embodiment of the present invention. This embodiment is applicable to the case where a user's voice is converted into singing. The method may be executed by a singing voice generating apparatus, which may be implemented in software and/or hardware and is generally integrated into a singing voice generating terminal. As shown in Fig. 1, the method of this embodiment specifically includes:
and S110, acquiring a voice signal corresponding to the song input by the user.
The voice signal corresponding to the song may be generated by the user reading or singing the content of a specific song. The voice signal can contain various kinds of information, for example the lyric information of the specific song and acoustic feature information, the latter including fundamental frequency information reflecting pitch, energy information reflecting volume, duration information reflecting rhythm, and the like. The gap between the user's reading or singing of the song and a professional singer's rendition can be judged from this acoustic feature information.
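As an illustrative sketch only (not part of the claimed method), the fundamental frequency and energy features described above could be estimated from a waveform roughly as follows. The autocorrelation pitch estimator and frame RMS energy used here are common textbook techniques chosen for illustration:

```python
import math

def rms_energy(frame):
    """Root-mean-square energy of a frame (reflects volume)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def autocorr_f0(frame, sample_rate, fmin=80.0, fmax=500.0):
    """Crude fundamental-frequency estimate: pick the autocorrelation peak
    within the plausible pitch-period range (reflects pitch)."""
    lo = int(sample_rate / fmax)                       # shortest period
    hi = min(int(sample_rate / fmin), len(frame) - 1)  # longest period
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, hi + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_val, best_lag = val, lag
    return sample_rate / best_lag

sr = 8000
tone = [math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]  # 200 Hz test tone
f0 = autocorr_f0(tone, sr)
vol = rms_energy(tone)
```

Duration information would come from the time spans of recognized words rather than from the waveform alone, as later embodiments describe.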
Preferably, the user may send a request for inputting a voice signal corresponding to the song to the singing voice generating terminal, and after receiving the request, the singing voice generating terminal may obtain the voice signal input by the user by turning on a microphone or the like. The singing voice generating terminal can be an independent hardware device, such as an intelligent sound box, a robot for man-machine conversation and the like, and can also be a client installed on each terminal (such as a mobile phone, a notebook, an intelligent television and the like).
And S120, acquiring standard acoustic feature information corresponding to the song from the pre-established acoustic feature template, and updating the acoustic feature information of the voice signal according to the standard acoustic feature information.
The acoustic feature template is obtained by extracting acoustic feature information of at least one song recorded by a professional singer, wherein standard acoustic feature information of the at least one song is stored. In this embodiment, after acquiring the voice signal corresponding to the specific song entered by the user, in order to update the acoustic feature information of the voice signal, it is preferable to acquire standard acoustic feature information corresponding to the specific song from a pre-established acoustic feature template, and update the acoustic feature information corresponding to the voice signal according to the standard acoustic feature information.
Illustratively, a user may preferably enter song a into a singing voice generating terminal by singing, in order to obtain a song having its own voice characteristics and the acoustic characteristics of a professional singer. In this case, in order to convert the acoustic features of the user singing song a into those of a professional singer, an acoustic feature template stored in advance in the singing voice generation terminal may be used. Specifically, which song the voice signal input by the user corresponds to may be determined according to the lyrics of the song a or the selection of the user, after the song is determined, the standard acoustic feature information corresponding to the song may be obtained from the acoustic feature template, and the acoustic feature information of the voice signal input by the user is updated by using the standard acoustic feature information.
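As a minimal sketch of this lookup-and-update step (the dictionary layout, the id "song_a", and the per-word feature records are hypothetical illustration, not the patent's data format):

```python
# Hypothetical template layout: identification info (here, a song id) maps
# to the per-word standard features of the professional recording.
ACOUSTIC_TEMPLATE = {
    "song_a": [
        {"word": "hello", "f0": 220.0, "duration": 0.50, "energy": 0.8},
        {"word": "world", "f0": 246.9, "duration": 0.75, "energy": 0.6},
    ],
}

def update_features(song_id, user_features, template=ACOUSTIC_TEMPLATE):
    """Replace the user's fundamental frequency, duration, and energy with
    the standard values, word by word, keeping everything else (e.g. the
    user's own timbre) untouched."""
    standard = template.get(song_id)
    if standard is None:          # no standard entry for this song
        return user_features
    updated = []
    for user_word, std_word in zip(user_features, standard):
        merged = dict(user_word)
        merged["f0"] = std_word["f0"]
        merged["duration"] = std_word["duration"]
        merged["energy"] = std_word["energy"]
        updated.append(merged)
    return updated

user = [{"word": "hello", "f0": 180.0, "duration": 0.4,
         "energy": 0.5, "spectrum": "user-timbre"}]
result = update_features("song_a", user)
```

Note how the user's own `spectrum` entry survives the update: only the standard acoustic features are overwritten.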
And S130, storing or outputting the voice signal with the updated acoustic characteristic information as a target voice signal.
Since the speech signal with the updated acoustic feature information has the standard acoustic feature information of the professional singer and the user's own voice feature information, it is preferable that the speech signal with the updated acoustic feature information be stored or outputted as the target speech signal.
The singing voice generation method provided by this embodiment acquires a voice signal, entered by a user, that corresponds to a song; acquires standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, in which the standard acoustic feature information of at least one song is stored; updates the acoustic feature information of the voice signal accordingly; and stores or outputs the voice signal with the updated acoustic feature information as a target voice signal. This overcomes the prior-art problems that voice-to-singing conversion requires acoustic model training on a large amount of data and that the resulting singing voice does not contain the user's own voice, which leads to low user participation and a poor experience. The effect achieved is that the user's voice is converted into singing that retains the user's own voice, without acoustic model training, while good sound quality is also ensured.
On the basis of the foregoing embodiments, further, before acquiring the voice signal corresponding to the song entered by the user, the method further includes:
respectively extracting acoustic characteristic information of a plurality of recorded songs as standard acoustic characteristic information of corresponding songs;
and storing the identification information of the plurality of songs and the corresponding standard acoustic characteristic information in the acoustic characteristic template.
In this embodiment, the acoustic feature template is obtained in advance from a plurality of songs recorded by a professional singer. Specifically, after a plurality of songs recorded by a professional singer are acquired, the acoustic feature information of each song can be respectively extracted, and each song corresponding to each acoustic feature information is recorded by the professional singer, so that each extracted acoustic feature information can be used as the standard acoustic feature information of the corresponding song.
If only the extracted standard acoustic feature information were stored in the acoustic feature template, there would be no basis for retrieving the standard acoustic feature information of a specific song from it. Therefore, while extracting each piece of standard acoustic feature information, the identification information of the corresponding song can be obtained, and the identification information and standard acoustic feature information of each song are stored together in the acoustic feature template. The identification information of a song may include its title, its lyrics, its title plus the professional singer's name, and so on. The singing voice generating terminal may obtain the identification information corresponding to the user's voice signal either by receiving input from the user or by extracting it from the acquired voice signal.
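The template-building step above could be sketched as follows. `extract_features` is a stand-in for any acoustic feature extractor, and the use of a song id as the identification key is just one of the identification options the text mentions:

```python
def build_template(recordings, extract_features):
    """Build an acoustic feature template: map each song's identification
    information to the standard features extracted from the professional
    recording of that song."""
    return {song_id: extract_features(samples)
            for song_id, samples in recordings.items()}

# Toy extractor for demonstration: pretend the 'standard features'
# are simply the number of samples in the recording.
template = build_template({"song_a": [0.1, 0.2, 0.3]}, extract_features=len)
```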
Example two
Fig. 2 is a flowchart of a singing voice generating method according to the second embodiment of the present invention. On the basis of the foregoing embodiments, this embodiment refines the step of updating the acoustic feature information of the voice signal according to the standard acoustic feature information as: acquiring the duration information corresponding to the voice signal, and performing a time-domain audio transform on the voice signal according to the duration information and the standard acoustic feature information so as to change the acoustic feature information of the voice signal. Correspondingly, storing or outputting the voice signal with the updated acoustic feature information as the target voice signal includes: storing or outputting the voice signal obtained after the time-domain audio transform as the target voice signal. As shown in Fig. 2, the method of this embodiment specifically includes:
and S210, acquiring a voice signal corresponding to the song input by the user.
And S220, acquiring standard acoustic characteristic information corresponding to the song from the acoustic characteristic template established in advance.
And S230, acquiring the duration information corresponding to the voice signal, and performing a time-domain audio transform on the voice signal according to the duration information and the standard acoustic feature information so as to change the acoustic feature information of the voice signal.
The voice signal may be regarded as a waveform varying with time. Each character, word, or phrase in the voice signal corresponds to a segment of the waveform, and each segment has time information such as a start time, an end time, and a length. The characters, words, or phrases together with their time information constitute the duration information corresponding to the voice signal.
After the duration information corresponding to the voice signal is obtained, a time-domain audio transform can be performed on the voice signal according to the duration information and the standard acoustic feature information so as to change the acoustic feature information of the voice signal. Specifically, based on the duration information, the waveform corresponding to the voice signal is transformed in the time domain using the standard acoustic feature information, so that the duration information, fundamental frequency information, and energy information of the transformed waveform respectively match the standard duration information, standard fundamental frequency information, and standard energy information. This operation takes the standard acoustic feature information as the reference against which the acoustic feature information of the voice signal is adjusted.
Preferably, the obtaining of the duration information corresponding to the voice signal may include:
and obtaining lyric information contained in the voice signal through voice recognition, and obtaining the duration information corresponding to the voice signal according to the lyric information.
Specifically, after the voice signal entered by the user is acquired, the lyric information in it can be obtained through speech recognition. The lyric information comprises characters, words, phrases, and the like, each with corresponding time information, and the duration information corresponding to the voice signal is obtained from this lyric information.
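A sketch of deriving duration information from recognizer output, assuming the recognizer returns (word, start, end) triples; the exact recognizer interface is hypothetical:

```python
def duration_info(recognized_words):
    """Turn recognized lyric words with start/end times (in seconds)
    into per-word duration entries."""
    return [{"word": w, "start": s, "end": e, "duration": e - s}
            for (w, s, e) in recognized_words]

info = duration_info([("shine", 0.0, 0.5), ("on", 0.5, 1.25)])
```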
And S240, storing or outputting the voice signal obtained after the time domain audio frequency conversion is carried out as a target voice signal.
The voice signal obtained after the time domain audio conversion can contain the own voice characteristics of the user and also can contain the acoustic characteristic information of a professional singer, and based on the voice signal, the voice signal obtained after the time domain audio conversion can be used as a target voice signal to be stored or output.
The singing voice generation method provided by this embodiment acquires a voice signal, entered by a user, that corresponds to a song; acquires standard acoustic feature information corresponding to the song from a pre-established acoustic feature template; acquires the duration information corresponding to the voice signal; performs a time-domain audio transform on the voice signal according to the duration information and the standard acoustic feature information to change the acoustic feature information of the voice signal; and stores or outputs the transformed voice signal as the target voice signal. This overcomes the prior-art problems that voice-to-singing conversion requires acoustic model training on a large amount of data and that the resulting singing voice does not contain the user's own voice, which leads to low user participation and a poor experience. The effect achieved is that, in the time domain and without acoustic model training, the user's voice is converted into singing that retains the user's own voice, while good sound quality is also ensured.
On the basis of the foregoing embodiments, further, performing time-domain audio transform on a speech signal according to the duration information and the standard acoustic feature information to change the acoustic feature information of the speech signal includes:
and carrying out sound element division on the voice signals according to the sound length information, and carrying out time domain audio frequency conversion on the voice signals after the sound element division according to standard fundamental frequency information, standard sound length information and standard energy information in the standard acoustic characteristic information so as to enable the fundamental frequency information of the voice signals after the time domain audio frequency conversion to be consistent with the standard fundamental frequency information, the sound length information of the voice signals after the time domain audio frequency conversion to be consistent with the standard sound length information, and the energy information of the voice signals after the time domain audio frequency conversion to be consistent with the standard energy information.
The acoustic feature information may include fundamental frequency information, duration information, and energy information of the voice signal, among others. The fundamental frequency information corresponds to the pitch of the voice signal, the duration information corresponds to the rhythm of the voice signal, and the energy information corresponds to the volume of the voice signal.
In this embodiment, the voice signal may be divided into sound elements according to its duration information. Preferably, the division is made according to each character in the duration information and its corresponding time information, so that each character maps to one sound element, i.e., one portion of the voice signal. For example, for a song whose lyrics contain 100 characters, suppose the time information of the 1st character is t1b-t1n, that of the 2nd character is t2b-t2n, ..., and that of the 100th character is t100b-t100n. Then the portion of the voice signal in the period t1b-t1n is the sound element of the 1st character, the portion in t2b-t2n is the sound element of the 2nd character, ..., and the portion in t100b-t100n is the sound element of the 100th character. Each sound element has its own fundamental frequency information, duration information, and energy information.
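The t1b-t1n style division above amounts to slicing the waveform by each character's time span. A sketch (the sample-index arithmetic is an implementation choice, not taken from the patent):

```python
def split_sound_elements(signal, sample_rate, duration_entries):
    """Slice the waveform into one sound element per lyric character/word,
    using each entry's start/end times in seconds."""
    elements = []
    for entry in duration_entries:
        start = int(entry["start"] * sample_rate)
        end = int(entry["end"] * sample_rate)
        elements.append((entry["word"], signal[start:end]))
    return elements

signal = list(range(16000))  # 2 s of placeholder samples at 8 kHz
parts = split_sound_elements(signal, 8000, [
    {"word": "la", "start": 0.0, "end": 0.5},
    {"word": "di", "start": 0.5, "end": 2.0},
])
```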
Then, taking the sound element as the unit, a time-domain audio transform can be applied to the divided voice signal using the standard fundamental frequency information, standard duration information, and standard energy information in the standard acoustic feature information, so that after the transform each sound element's fundamental frequency, duration, and energy information are consistent with the corresponding standard values. That is, the standard acoustic feature information of the song stores, for each character of the lyrics, the fundamental frequency, duration, and energy information of the corresponding sound element, and after the time-domain audio transform each sound element of the voice signal matches the standard fundamental frequency, duration, and energy information of its counterpart.
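One deliberately simplified way to realize the per-element time-domain transform is to resample each element to the standard duration and rescale it to the standard energy. Real systems would use pitch-preserving techniques such as PSOLA, which this sketch does not attempt:

```python
import math

def stretch(element, target_len):
    """Resample an element to target_len samples by linear interpolation,
    matching the standard duration (naive resampling also shifts pitch;
    pitch-preserving methods avoid that)."""
    src_len = len(element)
    if target_len == 1 or src_len == 1:
        return [element[0]] * target_len
    out = []
    for n in range(target_len):
        pos = n * (src_len - 1) / (target_len - 1)
        i = int(pos)
        frac = pos - i
        nxt = element[min(i + 1, src_len - 1)]
        out.append(element[i] * (1 - frac) + nxt * frac)
    return out

def match_energy(element, target_rms):
    """Rescale an element so its RMS energy equals the standard energy."""
    cur = math.sqrt(sum(x * x for x in element) / len(element))
    gain = target_rms / cur if cur > 0 else 0.0
    return [x * gain for x in element]

elem = [math.sin(2 * math.pi * n / 40) for n in range(400)]
transformed = match_energy(stretch(elem, 800), 0.5)  # double length, set RMS
```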
EXAMPLE III
Fig. 3 is a flowchart of a singing voice generating method according to the third embodiment of the present invention. On the basis of the foregoing embodiments, this embodiment adds, after acquiring the voice signal corresponding to the song entered by the user and before updating its acoustic feature information according to the standard acoustic feature information, the step of extracting the spectrum information of the voice signal. Updating the acoustic feature information of the voice signal according to the standard acoustic feature information here includes: acquiring the duration information corresponding to the voice signal and dividing the voice signal into sound elements according to the duration information; converting the divided voice signal from the time domain to the frequency domain; and updating the acoustic feature information of the voice signal in the frequency domain according to the standard acoustic feature information. Correspondingly, storing or outputting the voice signal with the updated acoustic feature information as the target voice signal includes: obtaining the target voice signal according to the updated acoustic feature information and the spectrum information, and storing or outputting the target voice signal. As shown in Fig. 3, the method of this embodiment specifically includes:
and S310, acquiring a voice signal corresponding to the song input by the user.
And S320, extracting the spectrum information of the voice signal.
In this embodiment, the spectrum information of the speech signal corresponds to the tone of the speech signal, which reflects the sound characteristics of the user. In the process of converting the voice signal into the singing voice, in order to preserve the voice characteristics of the user and enable the finally generated singing voice to have the voice characteristics of the user, the frequency spectrum information in the voice signal can be extracted in advance.
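The spectrum extraction above can be illustrated with a naive DFT magnitude computation. Production code would use an FFT and a spectral-envelope estimator; this is only a sketch:

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitude for the non-negative frequency bins; the shape
    of this spectrum (its envelope) carries the speaker's timbre."""
    n_samples = len(frame)
    spec = []
    for k in range(n_samples // 2 + 1):
        coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                    for n in range(n_samples))
        spec.append(abs(coeff) / n_samples)
    return spec

# A pure cosine at 4 cycles per 64-sample frame peaks at bin 4.
frame = [math.cos(2 * math.pi * 4 * n / 64) for n in range(64)]
spec = magnitude_spectrum(frame)
```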
S330, standard acoustic characteristic information corresponding to the song is obtained from the acoustic characteristic template established in advance.
S340, acquiring the duration information corresponding to the voice signal, and dividing the voice signal into sound elements according to the duration information; converting the divided voice signal from the time domain to the frequency domain, and updating the acoustic feature information of the voice signal in the frequency domain according to the standard acoustic feature information.
According to the methods for acquiring duration information and dividing sound elements described in the foregoing embodiments, the duration information can be acquired in the time domain from the waveform and content of the voice signal, and the voice signal can be divided into sound elements accordingly.
In addition to the updating of the acoustic feature information for the speech signal in the time domain, the updating of the acoustic feature information for the speech signal may also be performed in the frequency domain. Specifically, the voice signal after the sound element division may be converted from the time domain to the frequency domain by using each divided sound element as a unit, so as to obtain a representation form of each sound element on the frequency domain. And determining acoustic characteristic information of the voice signal in the frequency domain according to the representation form of each sound element in the frequency domain, and updating the acoustic characteristic information of the voice signal in the frequency domain obtained after conversion according to the standard acoustic characteristic information to obtain updated acoustic characteristic information.
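A sketch of updating a feature in the frequency domain: the element is carried to the frequency domain by a DFT, its bins are scaled so the reconstructed element has the standard energy, and an inverse DFT returns it to the time domain. The choice of energy as the feature being updated is illustrative; the same round-trip structure applies to other features:

```python
import cmath
import math

def dft(x):
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples)) for k in range(n_samples)]

def idft(bins):
    n_samples = len(bins)
    return [sum(bins[k] * cmath.exp(2j * math.pi * k * n / n_samples)
                for k in range(n_samples)).real / n_samples
            for n in range(n_samples)]

def update_energy_in_frequency_domain(element, target_rms):
    """Scale the element's spectrum so the reconstructed element has the
    standard RMS energy (by Parseval's theorem, scaling every bin scales
    the time-domain energy by the same factor), then transform back."""
    current_rms = math.sqrt(sum(x * x for x in element) / len(element))
    gain = target_rms / current_rms if current_rms > 0 else 0.0
    return idft([gain * c for c in dft(element)])

elem = [math.sin(2 * math.pi * n / 16) for n in range(64)]
updated = update_energy_in_frequency_domain(elem, 0.25)
```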
And S350, obtaining a target voice signal according to the updated acoustic feature information and the updated spectrum information, and storing or outputting the target voice signal.
The updated acoustic feature information is that of a professional singer, while the spectrum information reflects the user's own voice characteristics; the target voice signal obtained from the updated acoustic feature information and the spectrum information therefore combines the user's voice characteristics with the professional singer's acoustic features. After the target speech signal is obtained, it may be stored or output.
The singing voice generating method provided by this embodiment obtains a voice signal corresponding to a song recorded by a user, extracts the spectral features of the voice signal, and obtains the standard acoustic feature information corresponding to the song from a pre-established acoustic feature template. It then acquires the duration information corresponding to the voice signal, performs phoneme division on the voice signal according to that duration information, converts the phoneme-divided voice signal from the time domain to the frequency domain, and updates the acoustic feature information of the converted voice signal according to the standard acoustic feature information. Finally, it obtains a target voice signal from the updated acoustic feature information and the spectral information, and stores or outputs the target voice signal. This overcomes the problems in the prior art that voice-to-singing conversion requires training an acoustic model on a large amount of data, and that the resulting singing voice does not contain the user's own voice, leading to low user participation and poor experience. The method converts the user's voice, in the frequency domain and without acoustic model training, into a singing voice that retains the user's own timbre, while ensuring good sound quality.
On the basis of the foregoing embodiments, further, updating the acoustic feature information of the speech signal in the frequency domain obtained after the conversion according to the standard acoustic feature information includes:
replacing the fundamental frequency information of the converted voice signal in the frequency domain with the standard fundamental frequency information in the standard acoustic characteristic information; replacing the duration information of the converted voice signal with the standard duration information in the standard acoustic characteristic information; and replacing the energy information of the converted voice signal with the standard energy information in the standard acoustic characteristic information.
In this embodiment, the acoustic feature information of the voice signal in the frequency domain is determined phoneme by phoneme from the frequency-domain representation of each phoneme, and includes the fundamental frequency information, duration information and energy information of the voice signal. The fundamental frequency information of the converted voice signal in the frequency domain is then replaced with the standard fundamental frequency information in the standard acoustic characteristic information, its duration information is replaced with the standard duration information, and its energy information is replaced with the standard energy information.
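The replacement step amounts to substituting three feature fields while keeping the user's spectral information untouched. A minimal sketch, assuming a hypothetical per-phoneme feature record (the field names are illustrative, not from the patent):

```python
def replace_features(user_features, standard_features):
    """Replace F0, duration and energy of each phoneme with the professional
    singer's standard values; the spectral information (the user's own
    timbre) is deliberately left untouched."""
    updated = []
    for user, std in zip(user_features, standard_features):
        updated.append({
            "f0": std["f0"],               # standard fundamental frequency
            "duration": std["duration"],   # standard duration
            "energy": std["energy"],       # standard energy
            "spectrum": user["spectrum"],  # user's spectral info preserved
        })
    return updated

user = [{"f0": 180.0, "duration": 120, "energy": 0.4, "spectrum": [0.9, 0.1]}]
std = [{"f0": 220.0, "duration": 150, "energy": 0.7}]
out = replace_features(user, std)
```

Keeping the user's spectrum while swapping in the singer's melody features is what lets the output sound like the user singing in tune.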
On the basis of the foregoing embodiments, further, obtaining a target speech signal according to the updated acoustic feature information and the spectral information includes:
inputting the updated acoustic feature information and the spectrum information into a vocoder to obtain the target voice signal restored by the vocoder.
The target speech signal cannot be obtained directly from the acoustic feature information updated in the frequency domain and the pre-extracted spectrum information; it is therefore preferably restored by a vocoder. A vocoder, also called a speech analysis-and-synthesis system or a speech band compression system, is a codec that analyzes and synthesizes speech: combined with speech synthesis techniques, it restores the corresponding speech signal from the model parameters of that signal.
In this embodiment, the updated acoustic feature information and the pre-extracted spectrum information may be input into the vocoder, which restores the corresponding target speech signal from these input parameters using its internal speech synthesis techniques.
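As an illustrative stand-in for the vocoder's restoration step (a production system would use a real analysis-synthesis vocoder; this naive inverse DFT only demonstrates that a time-domain frame is recoverable from its frequency-domain parameters, and all names are assumptions):

```python
import cmath
import math

def idft(spectrum):
    """Naive inverse DFT: restore a time-domain frame from its spectrum,
    standing in for the vocoder's synthesis step."""
    N = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / N)
                 for k in range(N)) / N).real
            for n in range(N)]

# Round-trip a frame through the frequency domain and back.
N = 32
frame = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
spectrum = [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]
restored = idft(spectrum)
```

In the patent's pipeline the spectrum fed to this step has already had its F0, duration and energy replaced, so the restored waveform carries the standard melody with the user's timbre.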
Example four
Fig. 4 is a schematic structural diagram of a singing voice generating apparatus in a fourth embodiment of the present invention. As shown in fig. 4, the singing voice generating apparatus of the present embodiment includes:
a voice signal obtaining module 410, configured to obtain a voice signal corresponding to a song input by a user;
an acoustic feature information updating module 420, configured to obtain standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, and update the acoustic feature information of the speech signal according to the standard acoustic feature information; wherein the acoustic feature template stores standard acoustic feature information of at least one song;
and a target speech signal determining module 430, configured to store or output the speech signal with the updated acoustic feature information as a target speech signal.
The singing voice generating device provided by this embodiment obtains, through the voice signal obtaining module, a voice signal corresponding to a song and input by a user. Through the acoustic feature information updating module, it obtains the standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, in which the standard acoustic feature information of at least one song is stored, and updates the acoustic feature information of the voice signal accordingly. The voice signal with the updated acoustic feature information is then stored or output as a target voice signal through the target voice signal determining module. This overcomes the problems in the prior art that voice-to-singing conversion requires training an acoustic model on a large amount of data, and that the resulting singing voice does not contain the user's own voice, leading to low user participation and poor experience; the device converts the user's voice, without acoustic model training, into a singing voice that retains the user's own timbre, while ensuring good sound quality.
On the basis of the foregoing embodiments, further, the acoustic feature information updating module 420 may include:
a first duration information acquiring unit, configured to acquire the duration information corresponding to the voice signal;
a time-domain audio transformation unit, configured to perform time-domain audio transformation on the voice signal according to the duration information and the standard acoustic feature information, so as to change the acoustic feature information of the voice signal;
the target speechsignal determination module 430 may specifically include:
a first target voice signal determining unit, configured to store or output, as the target voice signal, the voice signal obtained after the time-domain audio transformation.
Further, the time-domain audio transform unit may specifically be configured to:
perform phoneme division on the voice signal according to the duration information, and perform time-domain audio transformation on the phoneme-divided voice signal according to the standard fundamental frequency information, standard duration information and standard energy information in the standard acoustic feature information, so that after the transformation the fundamental frequency information of the voice signal is consistent with the standard fundamental frequency information, its duration information is consistent with the standard duration information, and its energy information is consistent with the standard energy information.
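Under stated assumptions (hypothetical helper names; linear-interpolation resampling as a deliberately simple duration changer, where a real system would use pitch-preserving time stretching), the time-domain adjustment of energy and duration for one phoneme might be sketched as:

```python
import math

def scale_energy(frame, target_energy):
    """Scale a phoneme frame so its energy matches the standard energy."""
    current = sum(x * x for x in frame)
    gain = math.sqrt(target_energy / current) if current else 0.0
    return [gain * x for x in frame]

def stretch(frame, target_len):
    """Change a phoneme's duration by linear-interpolation resampling."""
    n = len(frame)
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1) if target_len > 1 else 0
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        out.append(frame[lo] * (1 - frac) + frame[hi] * frac)
    return out

frame = [0.0, 1.0, 0.0, -1.0]
louder = scale_energy(frame, 8.0)  # energy 2 -> 8 (gain of 2)
longer = stretch(frame, 7)         # 4 samples -> 7 samples
```

Applying such transforms phoneme by phoneme is what makes the user's recording line up with the standard song's timing and dynamics.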
Further, the apparatus may further include:
a spectrum information extraction module, configured to extract the spectrum information of the voice signal after acquiring the voice signal, input by the user, corresponding to the song, and before updating the acoustic feature information of the voice signal according to the standard acoustic feature information;
the acoustic feature information updating module 420 may further include:
a second duration information acquiring unit, configured to acquire the duration information corresponding to the voice signal;
a frequency-domain audio transformation unit, configured to perform phoneme division on the voice signal according to the duration information, convert the phoneme-divided voice signal from the time domain to the frequency domain, and update the acoustic feature information of the converted voice signal in the frequency domain according to the standard acoustic feature information;
the target speechsignal determination module 430 may further include:
a second target voice signal determining unit, configured to obtain a target voice signal according to the updated acoustic feature information and the spectrum information, and to store or output the target voice signal.
Further, the frequency domain audio transform unit may specifically be configured to:
replace the fundamental frequency information of the converted voice signal in the frequency domain with the standard fundamental frequency information in the standard acoustic characteristic information, replace its duration information with the standard duration information in the standard acoustic characteristic information, and replace its energy information with the standard energy information in the standard acoustic characteristic information.
Further, the second target speech determination unit may specifically be configured to:
input the updated acoustic feature information and the spectrum information into a vocoder, so as to obtain the target voice signal restored by the vocoder.
Further, the first duration information acquiring unit and the second duration information acquiring unit may be specifically configured to:
obtain lyric information contained in the voice signal through voice recognition, and obtain the duration information corresponding to the voice signal according to the lyric information.
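Assuming the recognizer returns lyric syllables with start and end timestamps (a hypothetical output format, not specified by the patent), duration information could be derived from the lyric alignment as:

```python
def durations_from_lyrics(aligned_lyrics):
    """Derive per-syllable duration info from recognized lyrics given as
    (syllable, start_ms, end_ms) tuples, a hypothetical recognizer output."""
    return [(syllable, end - start) for syllable, start, end in aligned_lyrics]

# Toy alignment for three recognized syllables.
aligned = [("shi", 0, 250), ("jie", 250, 420), ("ni", 420, 700)]
durations = durations_from_lyrics(aligned)
```

The resulting (syllable, duration) pairs are exactly the duration information that drives the phoneme division in the earlier steps.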
further, the apparatus may further include:
a standard acoustic feature information extraction module, configured to extract, before acquiring the voice signal input by the user and corresponding to the song, the acoustic feature information of each recorded song as the standard acoustic feature information of the corresponding song;
and the acoustic characteristic template generating module is used for storing the identification information of the plurality of songs and the corresponding standard acoustic characteristic information in the acoustic characteristic template.
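A minimal sketch of the acoustic feature template, assumed here to be a simple mapping from song identification information to per-phoneme standard features (class and field names are hypothetical, not from the patent):

```python
class AcousticFeatureTemplate:
    """Hypothetical template: song ID -> per-phoneme standard features."""

    def __init__(self):
        self._store = {}

    def add_song(self, song_id, features):
        """Store the standard acoustic features extracted for one song."""
        self._store[song_id] = features

    def lookup(self, song_id):
        """Return the song's standard features, or None if not stored."""
        return self._store.get(song_id)

template = AcousticFeatureTemplate()
template.add_song("song-001", [{"f0": 220.0, "duration": 150, "energy": 0.7}])
hit = template.lookup("song-001")
miss = template.lookup("song-999")
```

Pre-building such a template once per song is what lets the method skip acoustic model training at conversion time.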
The singing voice generating device provided by the embodiment of the invention can execute the singing voice generating method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the executing method.
EXAMPLE five
Fig. 5 is a schematic structural diagram of a singing voice generating terminal according to a fifth embodiment of the present invention. Fig. 5 illustrates a block diagram of an exemplary singing voice generating terminal 512 suitable for implementing embodiments of the present invention. The singing voice generating terminal 512 shown in Fig. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 5, the singing voice generating terminal 512 is represented in the form of a general-purpose computing device. The components of the singing voice generating terminal 512 may include, but are not limited to: one or more processors 516, a memory 528, and a bus 518 that couples the various system components (including the memory 528 and the processors 516).
Bus 518 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
The singing voice generating terminal 512 typically includes a variety of computer system readable media. Such media may be any available media that can be accessed by the singing voice generating terminal 512, and include both volatile and nonvolatile media, removable and non-removable media.
Memory 528 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 530 and/or cache memory 532. The singing voice generating terminal 512 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage 534 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in Fig. 5, and commonly referred to as a "hard drive"). Although not shown in Fig. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 518 through one or more data media interfaces. The memory 528 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 540 having a set (at least one) of program modules 542 may be stored in, for example, the memory 528. Such program modules 542 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 542 generally perform the functions and/or methods of the described embodiments of the invention.
The singing voice generating terminal 512 may also communicate with one or more external devices 514 (e.g., a keyboard, a pointing device, a display 524, etc., where the display 524 may or may not be configured as desired), with one or more devices that enable a user to interact with the singing voice generating terminal 512, and/or with any devices (e.g., a network card, a modem, etc.) that enable the singing voice generating terminal 512 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 522. Also, the singing voice generating terminal 512 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 520. As shown, the network adapter 520 communicates with the other modules of the singing voice generating terminal 512 via the bus 518. It should be appreciated that although not shown in Fig. 5, other hardware and/or software modules may be used in conjunction with the singing voice generating terminal 512, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processor 516 executes various functional applications and data processing, such as implementing the singing voice generating method provided by any embodiment of the present invention, by executing programs stored in the memory 528.
EXAMPLE six
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a singing voice generating method provided in an embodiment of the present invention, where the method includes:
acquiring a voice signal corresponding to a song input by a user;
acquiring standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, and updating the acoustic feature information of the voice signal according to the standard acoustic feature information; wherein, the acoustic characteristic template stores standard acoustic characteristic information of at least one song;
and storing or outputting the voice signal with the updated acoustic characteristic information as a target voice signal.
Of course, the computer program stored on the computer-readable storage medium provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the singing voice generation method provided by any embodiments of the present invention.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (11)

acquiring the duration information corresponding to the voice signal, performing phoneme division on the voice signal according to the duration information, and performing time-domain audio transformation on the phoneme-divided voice signal according to the standard fundamental frequency information, standard duration information and standard energy information in the standard acoustic characteristic information, so that the fundamental frequency information of the voice signal after the time-domain audio transformation is consistent with the standard fundamental frequency information, the duration information of the voice signal after the time-domain audio transformation is consistent with the standard duration information, and the energy information of the voice signal after the time-domain audio transformation is consistent with the standard energy information;
acquiring the duration information corresponding to the voice signal, and performing phoneme division on the voice signal according to the duration information; converting the phoneme-divided voice signal from the time domain to the frequency domain, replacing the fundamental frequency information of the converted voice signal in the frequency domain with the standard fundamental frequency information in the standard acoustic characteristic information, replacing the duration information of the converted voice signal with the standard duration information in the standard acoustic characteristic information, and replacing the energy information of the converted voice signal with the standard energy information in the standard acoustic characteristic information;
a time-domain audio transformation unit, configured to perform phoneme division on the voice signal according to the duration information, and to perform time-domain audio transformation on the phoneme-divided voice signal according to the standard fundamental frequency information, standard duration information and standard energy information in the standard acoustic characteristic information, so that the fundamental frequency information of the voice signal after the time-domain audio transformation is consistent with the standard fundamental frequency information, its duration information is consistent with the standard duration information, and its energy information is consistent with the standard energy information;
a frequency-domain audio transformation unit, configured to perform phoneme division on the voice signal according to the duration information, convert the phoneme-divided voice signal from the time domain to the frequency domain, replace the fundamental frequency information of the converted voice signal in the frequency domain with the standard fundamental frequency information in the standard acoustic characteristic information, replace its duration information with the standard duration information in the standard acoustic characteristic information, and replace its energy information with the standard energy information in the standard acoustic characteristic information;
Priority application: CN201810622548.8A, filed 2018-06-15 (priority date 2018-06-15), "Singing voice generation method, singing voice generation device, terminal and storage medium"; status: Active, granted as CN108831437B.
Publications: CN108831437A, published 2018-11-16; CN108831437B, granted 2020-09-01. Family ID: 64142414; country: CN.


Citations (7)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN1719514A (en)*2004-07-062006-01-11中国科学院自动化研究所 High-quality real-time voice change method based on speech analysis and synthesis
EP1185976B1 (en)*2000-02-252006-08-16Philips Electronics N.V.Speech recognition device with reference transformation means
CN1924994A (en)*2005-08-312007-03-07中国科学院自动化研究所Embedded language synthetic method and system
CN101064103A (en)*2006-04-242007-10-31中国科学院自动化研究所Chinese voice synthetic method and system based on syllable rhythm restricting relationship
CN105244041A (en)*2015-09-222016-01-13百度在线网络技术(北京)有限公司Song audition evaluation method and device
CN106652997A (en)*2016-12-292017-05-10腾讯音乐娱乐(深圳)有限公司Audio synthesis method and terminal
CN108053814A (en)*2017-11-062018-05-18芋头科技(杭州)有限公司A kind of speech synthesis system and method for analog subscriber song

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
KR100472904B1 (en)*2002-02-202005-03-08안호성Digital Recorder for Selectively Storing Only a Music Section Out of Radio Broadcasting Contents and Method thereof
EP1543497B1 (en)*2002-09-172006-06-07Koninklijke Philips Electronics N.V.Method of synthesis for a steady sound signal
JP2004287099A (en)*2003-03-202004-10-14Sony CorpMethod and apparatus for singing synthesis, program, recording medium, and robot device
TWI260582B (en)*2005-01-202006-08-21Sunplus Technology Co LtdSpeech synthesizer with mixed parameter mode and method thereof
US20130226957A1 (en)*2012-02-272013-08-29The Trustees Of Columbia University In The City Of New YorkMethods, Systems, and Media for Identifying Similar Songs Using Two-Dimensional Fourier Transform Magnitudes
JP6492933B2 (en)*2015-04-242019-04-03ヤマハ株式会社 CONTROL DEVICE, SYNTHETIC SINGING SOUND GENERATION DEVICE, AND PROGRAM
CN105845125B (en)*2016-05-182019-05-03百度在线网络技术(北京)有限公司Phoneme synthesizing method and speech synthetic device
CN106971703A (en)*2017-03-172017-07-21西北师范大学A kind of song synthetic method and device based on HMM
CN107863095A (en)*2017-11-212018-03-30广州酷狗计算机科技有限公司Acoustic signal processing method, device and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
EP1185976B1 (en) * | 2000-02-25 | 2006-08-16 | Philips Electronics N.V. | Speech recognition device with reference transformation means
CN1719514A (en) * | 2004-07-06 | 2006-01-11 | Institute of Automation, Chinese Academy of Sciences | High-quality real-time voice change method based on speech analysis and synthesis
CN1924994A (en) * | 2005-08-31 | 2007-03-07 | Institute of Automation, Chinese Academy of Sciences | Embedded speech synthesis method and system
CN101064103A (en) * | 2006-04-24 | 2007-10-31 | Institute of Automation, Chinese Academy of Sciences | Chinese speech synthesis method and system based on syllable prosody constraint relationships
CN105244041A (en) * | 2015-09-22 | 2016-01-13 | Baidu Online Network Technology (Beijing) Co., Ltd. | Song audition evaluation method and device
CN106652997A (en) * | 2016-12-29 | 2017-05-10 | Tencent Music Entertainment (Shenzhen) Co., Ltd. | Audio synthesis method and terminal
CN108053814A (en) * | 2017-11-06 | 2018-05-18 | Yutou Technology (Hangzhou) Co., Ltd. | Speech synthesis system and method for simulating a user's singing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
James P. Kirby, "Onset pitch perturbations and the cross-linguistic implementation of voicing: Evidence from tonal and non-tonal languages", Journal of Phonetics, 2018-10-22, full text. *
Zhang Danfeng et al., "A Survey of the Development and Research Status of Speech Synthesis Technology", Electronic Information, 2017-11-30, full text. *

Also Published As

Publication number | Publication date
CN108831437A (en) | 2018-11-16

Similar Documents

Publication | Publication Date | Title
CN108831437B (en) | Singing voice generation method, singing voice generation device, terminal and storage medium
US10614803B2 (en) | Wake-on-voice method, terminal and storage medium
CN112927674B (en) | Speech style transfer method, device, readable medium and electronic device
CN110069608B (en) | Voice interaction method, device, equipment and computer storage medium
CN111445892B (en) | Song generation method and device, readable medium and electronic equipment
JP7681793B2 (en) | Robust direct speech-to-speech translation
CN111782576B (en) | Background music generation method and device, readable medium and electronic equipment
CN111798821B (en) | Sound conversion method, device, readable storage medium and electronic equipment
CN111161695B (en) | Song generation method and device
CN112382274B (en) | Audio synthesis method, device, equipment and storage medium
CN113948062A (en) | Data conversion method and computer storage medium
CN113421571B (en) | Voice conversion method and device, electronic equipment and storage medium
US20230015112A1 (en) | Method and apparatus for processing speech, electronic device and storage medium
US20140236597A1 (en) | System and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
CN112908308B (en) | Audio processing method, device, equipment and medium
CN117597728A (en) | Personalized and dynamic text-to-speech sound cloning using a text-to-speech model that is not fully trained
CN112382269B (en) | Audio synthesis method, device, equipment and storage medium
CN113314096A (en) | Speech synthesis method, apparatus, device and storage medium
WO2023287360A2 (en) | Multimedia processing method and apparatus, electronic device, and storage medium
WO2022089097A1 (en) | Audio processing method and apparatus, electronic device, and computer-readable storage medium
CN113160849B (en) | Singing voice synthesizing method, singing voice synthesizing device, electronic equipment and computer readable storage medium
US11250837B2 (en) | Speech synthesis system, method and non-transitory computer readable medium with language option selection and acoustic models
CN112071287A (en) | Method, apparatus, electronic device and computer readable medium for generating song score
CN114822492B (en) | Speech synthesis method and device, electronic equipment and computer readable storage medium
CN113421544B (en) | Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
