CN108831437B - Singing voice generation method, singing voice generation device, terminal and storage medium - Google Patents


Info

Publication number
CN108831437B
CN108831437B (application CN201810622548.8A)
Authority
CN
China
Prior art keywords
information
voice signal
standard
voice
acoustic characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810622548.8A
Other languages
Chinese (zh)
Other versions
CN108831437A (en)
Inventor
李昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810622548.8A
Publication of CN108831437A
Application granted
Publication of CN108831437B
Legal status: Active
Anticipated expiration

Abstract

Embodiments of the invention disclose a singing voice generation method, apparatus, terminal, and storage medium. The singing voice generation method comprises the following steps: acquiring a voice signal, entered by a user, that corresponds to a song; acquiring standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, and updating the acoustic feature information of the voice signal according to the standard acoustic feature information, where the acoustic feature template stores the standard acoustic feature information of at least one song; and storing or outputting the voice signal with the updated acoustic feature information as a target voice signal. The embodiments overcome the prior-art problems that voice-to-singing conversion requires training an acoustic model on a large amount of data and that the resulting singing voice does not contain the user's own voice, which leads to low user participation and a poor user experience. They achieve the effect of converting the user's voice into singing that retains the user's own voice, without training an acoustic model.

Description

Singing voice generation method, singing voice generation device, terminal and storage medium
Technical Field
Embodiments of the present invention relate to the field of computer technology, and in particular to a singing voice generation method, apparatus, terminal, and storage medium.
Background
Voice-to-singing conversion means converting a user's spoken voice into a corresponding singing voice. After converting the user's voice into singing, Internet products can combine it with accompaniment music to synthesize the user's own singing works, which has entertainment value, social value, and some commercial value.
The prior-art scheme for converting voice into singing voice mainly comprises the following steps. In the model training stage, model training is performed using the text data (lyrics and the like) of a number of songs by a professional singer A, together with the acoustic features of singer A's performances of those songs, to obtain an acoustic model of singer A. In the singing voice generation stage, voice data of a song sung or read by a user B is acquired; the lyrics are recognized from the voice data, and the acoustic features of user B are extracted. The recognized lyrics are input into singer A's acoustic model to obtain predicted acoustic features, and the fundamental frequency and duration in the predicted features are replaced with those from user B's acoustic features, yielding modified acoustic features that combine user B's fundamental frequency and duration with singer A's spectrum. Synthesizing from the modified acoustic features by a parametric statistical method or a sound-library concatenation method then produces a singing voice that has singer A's voice characteristics together with user B's pitch and rhythm, i.e., the effect of singer A imitating user B singing the song.
This scheme requires acoustic model training, which demands a large amount of sample data, makes the implementation complex, and degrades sound quality. Moreover, the synthesized singing voice carries the singer's voice characteristics rather than the user's, so user participation and experience are poor.
Disclosure of Invention
Embodiments of the present invention provide a singing voice generation method, apparatus, terminal, and storage medium, so as to achieve an effect of converting a user's voice into a singing voice that retains the user's own voice without performing acoustic model training.
In a first aspect, an embodiment of the present invention provides a singing voice generating method, where the method includes:
acquiring a voice signal corresponding to a song input by a user;
acquiring standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, and updating the acoustic feature information of the voice signal according to the standard acoustic feature information; wherein, standard acoustic characteristic information of at least one song is stored in the acoustic characteristic template;
and storing or outputting the voice signal with the updated acoustic characteristic information as a target voice signal.
In a second aspect, an embodiment of the present invention further provides a singing voice generating apparatus, where the apparatus includes:
the voice signal acquisition module is used for acquiring a voice signal which is recorded by a user and corresponds to the song;
the acoustic feature information updating module is used for acquiring standard acoustic feature information corresponding to the song from a pre-established acoustic feature template and updating the acoustic feature information of the voice signal according to the standard acoustic feature information; wherein, standard acoustic characteristic information of at least one song is stored in the acoustic characteristic template;
and the target voice signal determining module is used for storing or outputting the voice signal with the updated acoustic characteristic information as a target voice signal.
In a third aspect, an embodiment of the present invention further provides a singing voice generating terminal, where the terminal includes:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the singing voice generation method described in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the singing voice generating method according to the first aspect.
Embodiments of the invention acquire a voice signal, entered by a user, that corresponds to a song; acquire standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, in which the standard acoustic feature information of at least one song is stored; update the acoustic feature information of the voice signal according to the standard acoustic feature information; and store or output the voice signal with the updated acoustic feature information as a target voice signal. This overcomes the prior-art problems that voice-to-singing conversion requires training an acoustic model on a large amount of data and that the resulting singing voice does not contain the user's own voice, which leads to low user participation and a poor experience. The effect achieved is that the user's voice is converted into singing that retains the user's own voice, without acoustic model training, while good sound quality is also ensured.
Drawings
Fig. 1 is a flowchart of a singing voice generating method according to a first embodiment of the present invention;
fig. 2 is a flowchart of a singing voice generating method according to a second embodiment of the present invention;
fig. 3 is a flowchart of a singing voice generating method in a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a singing voice generating apparatus in a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a singing voice generating terminal in the fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a singing voice generation method according to the first embodiment of the present invention. This embodiment is applicable to the case where a user's voice is converted into singing. The method may be executed by a singing voice generating apparatus, which may be implemented in software and/or hardware and is generally integrated into a singing voice generating terminal. As shown in Fig. 1, the method of this embodiment specifically includes:
and S110, acquiring a voice signal corresponding to the song input by the user.
The voice signal corresponding to the song may be generated by the user reading or singing the content of a specific song. The voice signal can contain various kinds of information, for example the lyric information of the specific song and acoustic feature information, the latter including fundamental frequency information reflecting pitch, energy information reflecting volume, duration information reflecting rhythm, and the like. The gap between the user's reading or singing of the song and a professional singer's rendition can be judged from this acoustic feature information.
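As an illustrative sketch only (not part of the claimed method), the fundamental frequency and energy features described above could be estimated from a waveform roughly as follows. The autocorrelation pitch estimator and frame RMS energy used here are common textbook techniques chosen for illustration:

```python
import math

def rms_energy(frame):
    """Root-mean-square energy of a frame (reflects volume)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def autocorr_f0(frame, sample_rate, fmin=80.0, fmax=500.0):
    """Crude fundamental-frequency estimate: pick the autocorrelation peak
    within the plausible pitch-period range (reflects pitch)."""
    lo = int(sample_rate / fmax)                       # shortest period
    hi = min(int(sample_rate / fmin), len(frame) - 1)  # longest period
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, hi + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_val, best_lag = val, lag
    return sample_rate / best_lag

sr = 8000
tone = [math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]  # 200 Hz test tone
f0 = autocorr_f0(tone, sr)
vol = rms_energy(tone)
```

Duration information would come from the time spans of recognized words rather than from the waveform alone, as later embodiments describe.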
Preferably, the user may send a request for inputting a voice signal corresponding to the song to the singing voice generating terminal, and after receiving the request, the singing voice generating terminal may obtain the voice signal input by the user by turning on a microphone or the like. The singing voice generating terminal can be an independent hardware device, such as an intelligent sound box, a robot for man-machine conversation and the like, and can also be a client installed on each terminal (such as a mobile phone, a notebook, an intelligent television and the like).
And S120, acquiring standard acoustic feature information corresponding to the song from the pre-established acoustic feature template, and updating the acoustic feature information of the voice signal according to the standard acoustic feature information.
The acoustic feature template is obtained by extracting acoustic feature information of at least one song recorded by a professional singer, wherein standard acoustic feature information of the at least one song is stored. In this embodiment, after acquiring the voice signal corresponding to the specific song entered by the user, in order to update the acoustic feature information of the voice signal, it is preferable to acquire standard acoustic feature information corresponding to the specific song from a pre-established acoustic feature template, and update the acoustic feature information corresponding to the voice signal according to the standard acoustic feature information.
Illustratively, a user may preferably enter song a into a singing voice generating terminal by singing, in order to obtain a song having its own voice characteristics and the acoustic characteristics of a professional singer. In this case, in order to convert the acoustic features of the user singing song a into those of a professional singer, an acoustic feature template stored in advance in the singing voice generation terminal may be used. Specifically, which song the voice signal input by the user corresponds to may be determined according to the lyrics of the song a or the selection of the user, after the song is determined, the standard acoustic feature information corresponding to the song may be obtained from the acoustic feature template, and the acoustic feature information of the voice signal input by the user is updated by using the standard acoustic feature information.
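As a minimal sketch of this lookup-and-update step (the dictionary layout, the id "song_a", and the per-word feature records are hypothetical illustration, not the patent's data format):

```python
# Hypothetical template layout: identification info (here, a song id) maps
# to the per-word standard features of the professional recording.
ACOUSTIC_TEMPLATE = {
    "song_a": [
        {"word": "hello", "f0": 220.0, "duration": 0.50, "energy": 0.8},
        {"word": "world", "f0": 246.9, "duration": 0.75, "energy": 0.6},
    ],
}

def update_features(song_id, user_features, template=ACOUSTIC_TEMPLATE):
    """Replace the user's fundamental frequency, duration, and energy with
    the standard values, word by word, keeping everything else (e.g. the
    user's own timbre) untouched."""
    standard = template.get(song_id)
    if standard is None:          # no standard entry for this song
        return user_features
    updated = []
    for user_word, std_word in zip(user_features, standard):
        merged = dict(user_word)
        merged["f0"] = std_word["f0"]
        merged["duration"] = std_word["duration"]
        merged["energy"] = std_word["energy"]
        updated.append(merged)
    return updated

user = [{"word": "hello", "f0": 180.0, "duration": 0.4,
         "energy": 0.5, "spectrum": "user-timbre"}]
result = update_features("song_a", user)
```

Note how the user's own `spectrum` entry survives the update: only the standard acoustic features are overwritten.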
And S130, storing or outputting the voice signal with the updated acoustic characteristic information as a target voice signal.
Since the speech signal with the updated acoustic feature information has the standard acoustic feature information of the professional singer and the user's own voice feature information, it is preferable that the speech signal with the updated acoustic feature information be stored or outputted as the target speech signal.
The singing voice generation method provided by this embodiment acquires a voice signal, entered by a user, that corresponds to a song; acquires standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, in which the standard acoustic feature information of at least one song is stored; updates the acoustic feature information of the voice signal accordingly; and stores or outputs the voice signal with the updated acoustic feature information as a target voice signal. This overcomes the prior-art problems that voice-to-singing conversion requires acoustic model training on a large amount of data and that the resulting singing voice does not contain the user's own voice, which leads to low user participation and a poor experience. The effect achieved is that the user's voice is converted into singing that retains the user's own voice, without acoustic model training, while good sound quality is also ensured.
On the basis of the foregoing embodiments, further, before acquiring the voice signal corresponding to the song entered by the user, the method further includes:
respectively extracting acoustic characteristic information of a plurality of recorded songs as standard acoustic characteristic information of corresponding songs;
and storing the identification information of the plurality of songs and the corresponding standard acoustic characteristic information in the acoustic characteristic template.
In this embodiment, the acoustic feature template is obtained in advance from a plurality of songs recorded by a professional singer. Specifically, after a plurality of songs recorded by a professional singer are acquired, the acoustic feature information of each song can be respectively extracted, and each song corresponding to each acoustic feature information is recorded by the professional singer, so that each extracted acoustic feature information can be used as the standard acoustic feature information of the corresponding song.
If only the extracted standard acoustic feature information were stored in the acoustic feature template, there would be no basis for retrieving the standard acoustic feature information of a specific song from it. Therefore, while extracting each piece of standard acoustic feature information, the identification information of the corresponding song can be obtained, and the identification information and standard acoustic feature information of each song are stored together in the acoustic feature template. The identification information of a song may include its title, its lyrics, its title plus the professional singer's name, and so on. The singing voice generating terminal may obtain the identification information corresponding to the user's voice signal either by receiving input from the user or by extracting it from the acquired voice signal.
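The template-building step above could be sketched as follows. `extract_features` is a stand-in for any acoustic feature extractor, and the use of a song id as the identification key is just one of the identification options the text mentions:

```python
def build_template(recordings, extract_features):
    """Build an acoustic feature template: map each song's identification
    information to the standard features extracted from the professional
    recording of that song."""
    return {song_id: extract_features(samples)
            for song_id, samples in recordings.items()}

# Toy extractor for demonstration: pretend the 'standard features'
# are simply the number of samples in the recording.
template = build_template({"song_a": [0.1, 0.2, 0.3]}, extract_features=len)
```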
Example two
Fig. 2 is a flowchart of a singing voice generating method according to the second embodiment of the present invention. On the basis of the foregoing embodiments, this embodiment refines the step of updating the acoustic feature information of the voice signal according to the standard acoustic feature information as: acquiring the duration information corresponding to the voice signal, and performing a time-domain audio transform on the voice signal according to the duration information and the standard acoustic feature information so as to change the acoustic feature information of the voice signal. Correspondingly, storing or outputting the voice signal with the updated acoustic feature information as the target voice signal includes: storing or outputting the voice signal obtained after the time-domain audio transform as the target voice signal. As shown in Fig. 2, the method of this embodiment specifically includes:
and S210, acquiring a voice signal corresponding to the song input by the user.
And S220, acquiring standard acoustic characteristic information corresponding to the song from the acoustic characteristic template established in advance.
And S230, acquiring the duration information corresponding to the voice signal, and performing a time-domain audio transform on the voice signal according to the duration information and the standard acoustic feature information so as to change the acoustic feature information of the voice signal.
The voice signal may be regarded as a waveform varying with time. Each character, word, or phrase in the voice signal corresponds to a segment of the waveform, and each segment has time information such as a start time, an end time, and a length. The characters, words, or phrases together with their time information constitute the duration information corresponding to the voice signal.
After the duration information corresponding to the voice signal is obtained, a time-domain audio transform can be performed on the voice signal according to the duration information and the standard acoustic feature information so as to change the acoustic feature information of the voice signal. Specifically, based on the duration information, the waveform corresponding to the voice signal is transformed in the time domain using the standard acoustic feature information, so that the duration information, fundamental frequency information, and energy information of the transformed waveform respectively match the standard duration information, standard fundamental frequency information, and standard energy information. This operation takes the standard acoustic feature information as the reference against which the acoustic feature information of the voice signal is adjusted.
Preferably, the obtaining of the duration information corresponding to the voice signal may include:
and obtaining lyric information contained in the voice signal through voice recognition, and obtaining the duration information corresponding to the voice signal according to the lyric information.
Specifically, after the voice signal entered by the user is acquired, the lyric information in it can be obtained through speech recognition. The lyric information comprises characters, words, phrases, and the like, each with corresponding time information, and the duration information corresponding to the voice signal is obtained from this lyric information.
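A sketch of deriving duration information from recognizer output, assuming the recognizer returns (word, start, end) triples; the exact recognizer interface is hypothetical:

```python
def duration_info(recognized_words):
    """Turn recognized lyric words with start/end times (in seconds)
    into per-word duration entries."""
    return [{"word": w, "start": s, "end": e, "duration": e - s}
            for (w, s, e) in recognized_words]

info = duration_info([("shine", 0.0, 0.5), ("on", 0.5, 1.25)])
```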
And S240, storing or outputting the voice signal obtained after the time domain audio frequency conversion is carried out as a target voice signal.
The voice signal obtained after the time domain audio conversion can contain the own voice characteristics of the user and also can contain the acoustic characteristic information of a professional singer, and based on the voice signal, the voice signal obtained after the time domain audio conversion can be used as a target voice signal to be stored or output.
The singing voice generation method provided by this embodiment acquires a voice signal, entered by a user, that corresponds to a song; acquires standard acoustic feature information corresponding to the song from a pre-established acoustic feature template; acquires the duration information corresponding to the voice signal; performs a time-domain audio transform on the voice signal according to the duration information and the standard acoustic feature information to change the acoustic feature information of the voice signal; and stores or outputs the transformed voice signal as the target voice signal. This overcomes the prior-art problems that voice-to-singing conversion requires acoustic model training on a large amount of data and that the resulting singing voice does not contain the user's own voice, which leads to low user participation and a poor experience. The effect achieved is that, in the time domain and without acoustic model training, the user's voice is converted into singing that retains the user's own voice, while good sound quality is also ensured.
On the basis of the foregoing embodiments, further, performing time-domain audio transform on a speech signal according to the duration information and the standard acoustic feature information to change the acoustic feature information of the speech signal includes:
and carrying out sound element division on the voice signals according to the sound length information, and carrying out time domain audio frequency conversion on the voice signals after the sound element division according to standard fundamental frequency information, standard sound length information and standard energy information in the standard acoustic characteristic information so as to enable the fundamental frequency information of the voice signals after the time domain audio frequency conversion to be consistent with the standard fundamental frequency information, the sound length information of the voice signals after the time domain audio frequency conversion to be consistent with the standard sound length information, and the energy information of the voice signals after the time domain audio frequency conversion to be consistent with the standard energy information.
The acoustic feature information may include fundamental frequency information, duration information, and energy information of the voice signal, among others. The fundamental frequency information corresponds to the pitch of the voice signal, the duration information corresponds to the rhythm of the voice signal, and the energy information corresponds to the volume of the voice signal.
In this embodiment, the voice signal may be divided into sound elements according to its duration information. Preferably, the division is made according to each character in the duration information and its corresponding time information, so that each character maps to one sound element, i.e., one portion of the voice signal. For example, for a song whose lyrics contain 100 characters, suppose the time information of the 1st character is t1b-t1n, that of the 2nd character is t2b-t2n, ..., and that of the 100th character is t100b-t100n. Then the portion of the voice signal in the period t1b-t1n is the sound element of the 1st character, the portion in t2b-t2n is the sound element of the 2nd character, ..., and the portion in t100b-t100n is the sound element of the 100th character. Each sound element has its own fundamental frequency information, duration information, and energy information.
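The t1b-t1n style division above amounts to slicing the waveform by each character's time span. A sketch (the sample-index arithmetic is an implementation choice, not taken from the patent):

```python
def split_sound_elements(signal, sample_rate, duration_entries):
    """Slice the waveform into one sound element per lyric character/word,
    using each entry's start/end times in seconds."""
    elements = []
    for entry in duration_entries:
        start = int(entry["start"] * sample_rate)
        end = int(entry["end"] * sample_rate)
        elements.append((entry["word"], signal[start:end]))
    return elements

signal = list(range(16000))  # 2 s of placeholder samples at 8 kHz
parts = split_sound_elements(signal, 8000, [
    {"word": "la", "start": 0.0, "end": 0.5},
    {"word": "di", "start": 0.5, "end": 2.0},
])
```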
Then, taking the sound element as the unit, a time-domain audio transform can be applied to the divided voice signal using the standard fundamental frequency information, standard duration information, and standard energy information in the standard acoustic feature information, so that after the transform each sound element's fundamental frequency, duration, and energy information are consistent with the corresponding standard values. That is, the standard acoustic feature information of the song stores, for each character of the lyrics, the fundamental frequency, duration, and energy information of the corresponding sound element, and after the time-domain audio transform each sound element of the voice signal matches the standard fundamental frequency, duration, and energy information of its counterpart.
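One deliberately simplified way to realize the per-element time-domain transform is to resample each element to the standard duration and rescale it to the standard energy. Real systems would use pitch-preserving techniques such as PSOLA, which this sketch does not attempt:

```python
import math

def stretch(element, target_len):
    """Resample an element to target_len samples by linear interpolation,
    matching the standard duration (naive resampling also shifts pitch;
    pitch-preserving methods avoid that)."""
    src_len = len(element)
    if target_len == 1 or src_len == 1:
        return [element[0]] * target_len
    out = []
    for n in range(target_len):
        pos = n * (src_len - 1) / (target_len - 1)
        i = int(pos)
        frac = pos - i
        nxt = element[min(i + 1, src_len - 1)]
        out.append(element[i] * (1 - frac) + nxt * frac)
    return out

def match_energy(element, target_rms):
    """Rescale an element so its RMS energy equals the standard energy."""
    cur = math.sqrt(sum(x * x for x in element) / len(element))
    gain = target_rms / cur if cur > 0 else 0.0
    return [x * gain for x in element]

elem = [math.sin(2 * math.pi * n / 40) for n in range(400)]
transformed = match_energy(stretch(elem, 800), 0.5)  # double length, set RMS
```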
EXAMPLE III
Fig. 3 is a flowchart of a singing voice generating method according to the third embodiment of the present invention. On the basis of the foregoing embodiments, this embodiment adds, after acquiring the voice signal corresponding to the song entered by the user and before updating its acoustic feature information according to the standard acoustic feature information, the step of extracting the spectrum information of the voice signal. Updating the acoustic feature information of the voice signal according to the standard acoustic feature information here includes: acquiring the duration information corresponding to the voice signal and dividing the voice signal into sound elements according to the duration information; converting the divided voice signal from the time domain to the frequency domain; and updating the acoustic feature information of the voice signal in the frequency domain according to the standard acoustic feature information. Correspondingly, storing or outputting the voice signal with the updated acoustic feature information as the target voice signal includes: obtaining the target voice signal according to the updated acoustic feature information and the spectrum information, and storing or outputting the target voice signal. As shown in Fig. 3, the method of this embodiment specifically includes:
and S310, acquiring a voice signal corresponding to the song input by the user.
And S320, extracting the spectrum information of the voice signal.
In this embodiment, the spectrum information of the speech signal corresponds to the tone of the speech signal, which reflects the sound characteristics of the user. In the process of converting the voice signal into the singing voice, in order to preserve the voice characteristics of the user and enable the finally generated singing voice to have the voice characteristics of the user, the frequency spectrum information in the voice signal can be extracted in advance.
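The spectrum extraction above can be illustrated with a naive DFT magnitude computation. Production code would use an FFT and a spectral-envelope estimator; this is only a sketch:

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitude for the non-negative frequency bins; the shape
    of this spectrum (its envelope) carries the speaker's timbre."""
    n_samples = len(frame)
    spec = []
    for k in range(n_samples // 2 + 1):
        coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                    for n in range(n_samples))
        spec.append(abs(coeff) / n_samples)
    return spec

# A pure cosine at 4 cycles per 64-sample frame peaks at bin 4.
frame = [math.cos(2 * math.pi * 4 * n / 64) for n in range(64)]
spec = magnitude_spectrum(frame)
```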
S330, standard acoustic characteristic information corresponding to the song is obtained from the acoustic characteristic template established in advance.
S340, acquiring the duration information corresponding to the voice signal, and dividing the voice signal into sound elements according to the duration information; converting the divided voice signal from the time domain to the frequency domain, and updating the acoustic feature information of the voice signal in the frequency domain according to the standard acoustic feature information.
According to the methods for acquiring duration information and dividing sound elements described in the foregoing embodiments, the duration information can be acquired in the time domain from the waveform and content of the voice signal, and the voice signal can be divided into sound elements accordingly.
In addition to the updating of the acoustic feature information for the speech signal in the time domain, the updating of the acoustic feature information for the speech signal may also be performed in the frequency domain. Specifically, the voice signal after the sound element division may be converted from the time domain to the frequency domain by using each divided sound element as a unit, so as to obtain a representation form of each sound element on the frequency domain. And determining acoustic characteristic information of the voice signal in the frequency domain according to the representation form of each sound element in the frequency domain, and updating the acoustic characteristic information of the voice signal in the frequency domain obtained after conversion according to the standard acoustic characteristic information to obtain updated acoustic characteristic information.
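A sketch of updating a feature in the frequency domain: the element is carried to the frequency domain by a DFT, its bins are scaled so the reconstructed element has the standard energy, and an inverse DFT returns it to the time domain. The choice of energy as the feature being updated is illustrative; the same round-trip structure applies to other features:

```python
import cmath
import math

def dft(x):
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples)) for k in range(n_samples)]

def idft(bins):
    n_samples = len(bins)
    return [sum(bins[k] * cmath.exp(2j * math.pi * k * n / n_samples)
                for k in range(n_samples)).real / n_samples
            for n in range(n_samples)]

def update_energy_in_frequency_domain(element, target_rms):
    """Scale the element's spectrum so the reconstructed element has the
    standard RMS energy (by Parseval's theorem, scaling every bin scales
    the time-domain energy by the same factor), then transform back."""
    current_rms = math.sqrt(sum(x * x for x in element) / len(element))
    gain = target_rms / current_rms if current_rms > 0 else 0.0
    return idft([gain * c for c in dft(element)])

elem = [math.sin(2 * math.pi * n / 16) for n in range(64)]
updated = update_energy_in_frequency_domain(elem, 0.25)
```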
And S350, obtaining a target voice signal according to the updated acoustic feature information and the updated spectrum information, and storing or outputting the target voice signal.
The updated acoustic feature information is that of a professional singer, while the spectrum information reflects the user's own voice characteristics; the target voice signal obtained from the updated acoustic feature information and the spectrum information therefore combines the user's voice characteristics with the professional singer's acoustic features. After the target speech signal is obtained, it may be stored or output.
The singing voice generating method provided by this embodiment obtains a voice signal corresponding to a song recorded by a user, extracts the spectral features of the voice signal, and obtains the standard acoustic feature information corresponding to the song from a pre-established acoustic feature template. It then acquires the duration information corresponding to the voice signal, performs phoneme division on the voice signal according to that duration information, converts the phoneme-divided voice signal from the time domain to the frequency domain, and updates the acoustic feature information of the converted voice signal according to the standard acoustic feature information. Finally, it obtains a target voice signal from the updated acoustic feature information and the spectral information, and stores or outputs the target voice signal. This overcomes the problems in the prior art that voice-to-singing conversion requires training an acoustic model on a large amount of data, and that the resulting singing voice does not contain the user's own voice, leading to low user participation and poor experience. The method converts the user's voice, in the frequency domain and without acoustic model training, into a singing voice that retains the user's own timbre, while ensuring good sound quality.
On the basis of the foregoing embodiments, further, updating the acoustic feature information of the speech signal in the frequency domain obtained after the conversion according to the standard acoustic feature information includes:
replacing the fundamental frequency information of the converted voice signal in the frequency domain with the standard fundamental frequency information in the standard acoustic characteristic information; replacing the duration information of the converted voice signal with the standard duration information in the standard acoustic characteristic information; and replacing the energy information of the converted voice signal with the standard energy information in the standard acoustic characteristic information.
In this embodiment, the acoustic feature information of the voice signal in the frequency domain is determined phoneme by phoneme from the frequency-domain representation of each phoneme, and includes the fundamental frequency information, duration information and energy information of the voice signal. The fundamental frequency information of the converted voice signal in the frequency domain is then replaced with the standard fundamental frequency information in the standard acoustic characteristic information, its duration information is replaced with the standard duration information, and its energy information is replaced with the standard energy information.
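The replacement step amounts to substituting three feature fields while keeping the user's spectral information untouched. A minimal sketch, assuming a hypothetical per-phoneme feature record (the field names are illustrative, not from the patent):

```python
def replace_features(user_features, standard_features):
    """Replace F0, duration and energy of each phoneme with the professional
    singer's standard values; the spectral information (the user's own
    timbre) is deliberately left untouched."""
    updated = []
    for user, std in zip(user_features, standard_features):
        updated.append({
            "f0": std["f0"],               # standard fundamental frequency
            "duration": std["duration"],   # standard duration
            "energy": std["energy"],       # standard energy
            "spectrum": user["spectrum"],  # user's spectral info preserved
        })
    return updated

user = [{"f0": 180.0, "duration": 120, "energy": 0.4, "spectrum": [0.9, 0.1]}]
std = [{"f0": 220.0, "duration": 150, "energy": 0.7}]
out = replace_features(user, std)
```

Keeping the user's spectrum while swapping in the singer's melody features is what lets the output sound like the user singing in tune.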
On the basis of the foregoing embodiments, further, obtaining a target speech signal according to the updated acoustic feature information and the spectral information includes:
inputting the updated acoustic feature information and the spectrum information into a vocoder to obtain the target voice signal restored by the vocoder.
The target speech signal cannot be obtained directly from the acoustic feature information updated in the frequency domain and the pre-extracted spectrum information; it is therefore preferably restored by a vocoder. A vocoder, also called a speech analysis-and-synthesis system or a speech band compression system, is a codec that analyzes and synthesizes speech: combined with speech synthesis techniques, it restores the corresponding speech signal from the model parameters of that signal.
In this embodiment, the updated acoustic feature information and the pre-extracted spectrum information may be input into the vocoder, which restores the corresponding target speech signal from these input parameters using its internal speech synthesis techniques.
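As an illustrative stand-in for the vocoder's restoration step (a production system would use a real analysis-synthesis vocoder; this naive inverse DFT only demonstrates that a time-domain frame is recoverable from its frequency-domain parameters, and all names are assumptions):

```python
import cmath
import math

def idft(spectrum):
    """Naive inverse DFT: restore a time-domain frame from its spectrum,
    standing in for the vocoder's synthesis step."""
    N = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / N)
                 for k in range(N)) / N).real
            for n in range(N)]

# Round-trip a frame through the frequency domain and back.
N = 32
frame = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
spectrum = [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]
restored = idft(spectrum)
```

In the patent's pipeline the spectrum fed to this step has already had its F0, duration and energy replaced, so the restored waveform carries the standard melody with the user's timbre.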
Example four
Fig. 4 is a schematic structural diagram of a singing voice generating apparatus in a fourth embodiment of the present invention. As shown in fig. 4, the singing voice generating apparatus of the present embodiment includes:
a voice signal obtaining module 410, configured to obtain a voice signal corresponding to a song input by a user;
an acoustic feature information updating module 420, configured to obtain standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, and update the acoustic feature information of the speech signal according to the standard acoustic feature information; wherein the acoustic feature template stores standard acoustic feature information of at least one song;
and a target speech signal determining module 430, configured to store or output the speech signal with the updated acoustic feature information as a target speech signal.
The singing voice generating device provided by this embodiment obtains, through the voice signal obtaining module, a voice signal corresponding to a song and input by a user. Through the acoustic feature information updating module, it obtains the standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, in which the standard acoustic feature information of at least one song is stored, and updates the acoustic feature information of the voice signal accordingly. The voice signal with the updated acoustic feature information is then stored or output as a target voice signal through the target voice signal determining module. This overcomes the problems in the prior art that voice-to-singing conversion requires training an acoustic model on a large amount of data, and that the resulting singing voice does not contain the user's own voice, leading to low user participation and poor experience; the device converts the user's voice, without acoustic model training, into a singing voice that retains the user's own timbre, while ensuring good sound quality.
On the basis of the foregoing embodiments, further, the acoustic feature information updating module 420 may include:
a first duration information acquiring unit, configured to acquire the duration information corresponding to the voice signal;
a time-domain audio transformation unit, configured to perform time-domain audio transformation on the voice signal according to the duration information and the standard acoustic feature information, so as to change the acoustic feature information of the voice signal;
the target speechsignal determination module 430 may specifically include:
a first target voice signal determining unit, configured to store or output, as the target voice signal, the voice signal obtained after the time-domain audio transformation.
Further, the time-domain audio transform unit may specifically be configured to:
perform phoneme division on the voice signal according to the duration information, and perform time-domain audio transformation on the phoneme-divided voice signal according to the standard fundamental frequency information, standard duration information and standard energy information in the standard acoustic feature information, so that after the transformation the fundamental frequency information of the voice signal is consistent with the standard fundamental frequency information, its duration information is consistent with the standard duration information, and its energy information is consistent with the standard energy information.
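Under stated assumptions (hypothetical helper names; linear-interpolation resampling as a deliberately simple duration changer, where a real system would use pitch-preserving time stretching), the time-domain adjustment of energy and duration for one phoneme might be sketched as:

```python
import math

def scale_energy(frame, target_energy):
    """Scale a phoneme frame so its energy matches the standard energy."""
    current = sum(x * x for x in frame)
    gain = math.sqrt(target_energy / current) if current else 0.0
    return [gain * x for x in frame]

def stretch(frame, target_len):
    """Change a phoneme's duration by linear-interpolation resampling."""
    n = len(frame)
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1) if target_len > 1 else 0
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        out.append(frame[lo] * (1 - frac) + frame[hi] * frac)
    return out

frame = [0.0, 1.0, 0.0, -1.0]
louder = scale_energy(frame, 8.0)  # energy 2 -> 8 (gain of 2)
longer = stretch(frame, 7)         # 4 samples -> 7 samples
```

Applying such transforms phoneme by phoneme is what makes the user's recording line up with the standard song's timing and dynamics.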
Further, the apparatus may further include:
a spectrum information extraction module, configured to extract the spectrum information of the voice signal after acquiring the voice signal, input by the user, corresponding to the song, and before updating the acoustic feature information of the voice signal according to the standard acoustic feature information;
the acoustic feature information updating module 420 may further include:
a second duration information acquiring unit, configured to acquire the duration information corresponding to the voice signal;
a frequency-domain audio transformation unit, configured to perform phoneme division on the voice signal according to the duration information, convert the phoneme-divided voice signal from the time domain to the frequency domain, and update the acoustic feature information of the converted voice signal in the frequency domain according to the standard acoustic feature information;
the target speechsignal determination module 430 may further include:
a second target voice signal determining unit, configured to obtain a target voice signal according to the updated acoustic feature information and the spectrum information, and to store or output the target voice signal.
Further, the frequency domain audio transform unit may specifically be configured to:
replace the fundamental frequency information of the converted voice signal in the frequency domain with the standard fundamental frequency information in the standard acoustic characteristic information, replace its duration information with the standard duration information in the standard acoustic characteristic information, and replace its energy information with the standard energy information in the standard acoustic characteristic information.
Further, the second target speech determination unit may specifically be configured to:
input the updated acoustic feature information and the spectrum information into a vocoder, so as to obtain the target voice signal restored by the vocoder.
Further, the first duration information acquiring unit and the second duration information acquiring unit may be specifically configured to:
obtain lyric information contained in the voice signal through voice recognition, and obtain the duration information corresponding to the voice signal according to the lyric information.
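Assuming the recognizer returns lyric syllables with start and end timestamps (a hypothetical output format, not specified by the patent), duration information could be derived from the lyric alignment as:

```python
def durations_from_lyrics(aligned_lyrics):
    """Derive per-syllable duration info from recognized lyrics given as
    (syllable, start_ms, end_ms) tuples, a hypothetical recognizer output."""
    return [(syllable, end - start) for syllable, start, end in aligned_lyrics]

# Toy alignment for three recognized syllables.
aligned = [("shi", 0, 250), ("jie", 250, 420), ("ni", 420, 700)]
durations = durations_from_lyrics(aligned)
```

The resulting (syllable, duration) pairs are exactly the duration information that drives the phoneme division in the earlier steps.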
further, the apparatus may further include:
a standard acoustic feature information extraction module, configured to extract, before acquiring the voice signal input by the user and corresponding to the song, the acoustic feature information of each recorded song as the standard acoustic feature information of the corresponding song;
and the acoustic characteristic template generating module is used for storing the identification information of the plurality of songs and the corresponding standard acoustic characteristic information in the acoustic characteristic template.
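A minimal sketch of the acoustic feature template, assumed here to be a simple mapping from song identification information to per-phoneme standard features (class and field names are hypothetical, not from the patent):

```python
class AcousticFeatureTemplate:
    """Hypothetical template: song ID -> per-phoneme standard features."""

    def __init__(self):
        self._store = {}

    def add_song(self, song_id, features):
        """Store the standard acoustic features extracted for one song."""
        self._store[song_id] = features

    def lookup(self, song_id):
        """Return the song's standard features, or None if not stored."""
        return self._store.get(song_id)

template = AcousticFeatureTemplate()
template.add_song("song-001", [{"f0": 220.0, "duration": 150, "energy": 0.7}])
hit = template.lookup("song-001")
miss = template.lookup("song-999")
```

Pre-building such a template once per song is what lets the method skip acoustic model training at conversion time.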
The singing voice generating device provided by the embodiment of the invention can execute the singing voice generating method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the executing method.
EXAMPLE five
Fig. 5 is a schematic structural diagram of a singing voice generating terminal according to a fifth embodiment of the present invention. Fig. 5 illustrates a block diagram of an exemplary singing voice generating terminal 512 suitable for implementing embodiments of the present invention. The singing voice generating terminal 512 shown in Fig. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 5, the singing voice generating terminal 512 is represented in the form of a general-purpose computing device. The components of the singing voice generating terminal 512 may include, but are not limited to: one or more processors 516, a memory 528, and a bus 518 that couples the various system components (including the memory 528 and the processors 516).
Bus 518 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
The singing voice generating terminal 512 typically includes a variety of computer system readable media. Such media may be any available media that can be accessed by the singing voice generating terminal 512, and include both volatile and nonvolatile media, removable and non-removable media.
Memory 528 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 530 and/or cache memory 532. The singing voice generating terminal 512 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage 534 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in Fig. 5, and commonly referred to as a "hard drive"). Although not shown in Fig. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 518 through one or more data media interfaces. The memory 528 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 540 having a set (at least one) of program modules 542 may be stored in, for example, the memory 528. Such program modules 542 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 542 generally perform the functions and/or methods of the described embodiments of the invention.
The singing voice generating terminal 512 may also communicate with one or more external devices 514 (e.g., a keyboard, a pointing device, a display 524, etc., where the display 524 may or may not be configured as desired), with one or more devices that enable a user to interact with the singing voice generating terminal 512, and/or with any devices (e.g., a network card, a modem, etc.) that enable the singing voice generating terminal 512 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 522. Also, the singing voice generating terminal 512 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 520. As shown, the network adapter 520 communicates with the other modules of the singing voice generating terminal 512 via the bus 518. It should be appreciated that although not shown in Fig. 5, other hardware and/or software modules may be used in conjunction with the singing voice generating terminal 512, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processor 516 executes various functional applications and data processing, such as implementing the singing voice generating method provided by any embodiment of the present invention, by executing programs stored in the memory 528.
EXAMPLE six
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a singing voice generating method provided in an embodiment of the present invention, where the method includes:
acquiring a voice signal corresponding to a song input by a user;
acquiring standard acoustic feature information corresponding to the song from a pre-established acoustic feature template, and updating the acoustic feature information of the voice signal according to the standard acoustic feature information; wherein, the acoustic characteristic template stores standard acoustic characteristic information of at least one song;
and storing or outputting the voice signal with the updated acoustic characteristic information as a target voice signal.
Of course, the computer program stored on the computer-readable storage medium provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the singing voice generation method provided by any embodiments of the present invention.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (11)

acquiring the duration information corresponding to the voice signal, performing phoneme division on the voice signal according to the duration information, and performing time-domain audio transformation on the phoneme-divided voice signal according to the standard fundamental frequency information, standard duration information and standard energy information in the standard acoustic characteristic information, so that the fundamental frequency information of the voice signal after the time-domain audio transformation is consistent with the standard fundamental frequency information, the duration information of the voice signal after the time-domain audio transformation is consistent with the standard duration information, and the energy information of the voice signal after the time-domain audio transformation is consistent with the standard energy information;
acquiring the duration information corresponding to the voice signal, and performing phoneme division on the voice signal according to the duration information; converting the phoneme-divided voice signal from the time domain to the frequency domain, replacing the fundamental frequency information of the converted voice signal in the frequency domain with the standard fundamental frequency information in the standard acoustic characteristic information, replacing the duration information of the converted voice signal with the standard duration information in the standard acoustic characteristic information, and replacing the energy information of the converted voice signal with the standard energy information in the standard acoustic characteristic information;
a time-domain audio transformation unit, configured to perform phoneme division on the voice signal according to the duration information, and to perform time-domain audio transformation on the phoneme-divided voice signal according to the standard fundamental frequency information, standard duration information and standard energy information in the standard acoustic characteristic information, so that the fundamental frequency information of the voice signal after the time-domain audio transformation is consistent with the standard fundamental frequency information, its duration information is consistent with the standard duration information, and its energy information is consistent with the standard energy information;
a frequency-domain audio transformation unit, configured to perform phoneme division on the voice signal according to the duration information, convert the phoneme-divided voice signal from the time domain to the frequency domain, replace the fundamental frequency information of the converted voice signal in the frequency domain with the standard fundamental frequency information in the standard acoustic characteristic information, replace its duration information with the standard duration information in the standard acoustic characteristic information, and replace its energy information with the standard energy information in the standard acoustic characteristic information;
Priority application: CN201810622548.8A, filed 2018-06-15 (priority date 2018-06-15), "Singing voice generation method, singing voice generation device, terminal and storage medium"; status: Active, granted as CN108831437B.
Publications: CN108831437A, published 2018-11-16; CN108831437B, granted 2020-09-01. Family ID: 64142414; country: CN.


Citations (7)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN1719514A (en)*2004-07-062006-01-11中国科学院自动化研究所 High-quality real-time voice change method based on speech analysis and synthesis
EP1185976B1 (en)*2000-02-252006-08-16Philips Electronics N.V.Speech recognition device with reference transformation means
CN1924994A (en)*2005-08-312007-03-07中国科学院自动化研究所Embedded language synthetic method and system
CN101064103A (en)*2006-04-242007-10-31中国科学院自动化研究所Chinese voice synthetic method and system based on syllable rhythm restricting relationship
CN105244041A (en)*2015-09-222016-01-13百度在线网络技术(北京)有限公司Song audition evaluation method and device
CN106652997A (en)*2016-12-292017-05-10腾讯音乐娱乐(深圳)有限公司Audio synthesis method and terminal
CN108053814A (en)*2017-11-062018-05-18芋头科技(杭州)有限公司A kind of speech synthesis system and method for analog subscriber song

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
KR100472904B1 (en)*2002-02-202005-03-08안호성Digital Recorder for Selectively Storing Only a Music Section Out of Radio Broadcasting Contents and Method thereof
EP1543497B1 (en)*2002-09-172006-06-07Koninklijke Philips Electronics N.V.Method of synthesis for a steady sound signal
JP2004287099A (en)*2003-03-202004-10-14Sony CorpMethod and apparatus for singing synthesis, program, recording medium, and robot device
TWI260582B (en)*2005-01-202006-08-21Sunplus Technology Co LtdSpeech synthesizer with mixed parameter mode and method thereof
US20130226957A1 (en)*2012-02-272013-08-29The Trustees Of Columbia University In The City Of New YorkMethods, Systems, and Media for Identifying Similar Songs Using Two-Dimensional Fourier Transform Magnitudes
JP6492933B2 (en)*2015-04-242019-04-03ヤマハ株式会社 CONTROL DEVICE, SYNTHETIC SINGING SOUND GENERATION DEVICE, AND PROGRAM
CN105845125B (en)*2016-05-182019-05-03百度在线网络技术(北京)有限公司Phoneme synthesizing method and speech synthetic device
CN106971703A (en)*2017-03-172017-07-21西北师范大学A kind of song synthetic method and device based on HMM
CN107863095A (en)*2017-11-212018-03-30广州酷狗计算机科技有限公司Acoustic signal processing method, device and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
EP1185976B1 (en) * | 2000-02-25 | 2006-08-16 | Philips Electronics N.V. | Speech recognition device with reference transformation means
CN1719514A (en) * | 2004-07-06 | 2006-01-11 | Institute of Automation, Chinese Academy of Sciences | High-quality real-time voice change method based on speech analysis and synthesis
CN1924994A (en) * | 2005-08-31 | 2007-03-07 | Institute of Automation, Chinese Academy of Sciences | Embedded speech synthesis method and system
CN101064103A (en) * | 2006-04-24 | 2007-10-31 | Institute of Automation, Chinese Academy of Sciences | Chinese speech synthesis method and system based on syllable prosody constraint relationships
CN105244041A (en) * | 2015-09-22 | 2016-01-13 | Baidu Online Network Technology (Beijing) Co., Ltd. | Song audition evaluation method and device
CN106652997A (en) * | 2016-12-29 | 2017-05-10 | Tencent Music Entertainment (Shenzhen) Co., Ltd. | Audio synthesis method and terminal
CN108053814A (en) * | 2017-11-06 | 2018-05-18 | Yutou Technology (Hangzhou) Co., Ltd. | Speech synthesis system and method for simulating a user's singing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
James P. Kirby, "Onset pitch perturbations and the cross-linguistic implementation of voicing: Evidence from tonal and non-tonal languages", Journal of Phonetics, 2018-10-22, full text. *
Zhang Danfeng et al., "A Survey of the Development and Research Status of Speech Synthesis Technology", Electronic Information, 2017-11-30, full text. *

Also Published As

Publication number | Publication date
CN108831437A (en) | 2018-11-16

Similar Documents

Publication | Publication Date | Title
CN108831437B (en) | Singing voice generation method, singing voice generation device, terminal and storage medium
US10614803B2 (en) | Wake-on-voice method, terminal and storage medium
CN112927674B (en) | Speech style transfer method, device, readable medium and electronic device
CN110069608B (en) | Voice interaction method, device, equipment and computer storage medium
CN111445892B (en) | Song generation method and device, readable medium and electronic equipment
JP7681793B2 (en) | Robust direct speech-to-speech translation
CN111782576B (en) | Background music generation method and device, readable medium and electronic equipment
CN111798821B (en) | Sound conversion method, device, readable storage medium and electronic equipment
CN111161695B (en) | Song generation method and device
CN112382274B (en) | Audio synthesis method, device, equipment and storage medium
CN113948062A (en) | Data conversion method and computer storage medium
CN113421571B (en) | Voice conversion method and device, electronic equipment and storage medium
US20230015112A1 (en) | Method and apparatus for processing speech, electronic device and storage medium
US20140236597A1 (en) | System and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
CN112908308B (en) | Audio processing method, device, equipment and medium
CN117597728A (en) | Personalized and dynamic text-to-speech sound cloning using a text-to-speech model that is not fully trained
CN112382269B (en) | Audio synthesis method, device, equipment and storage medium
CN113314096A (en) | Speech synthesis method, apparatus, device and storage medium
WO2023287360A2 (en) | Multimedia processing method and apparatus, electronic device, and storage medium
WO2022089097A1 (en) | Audio processing method and apparatus, electronic device, and computer-readable storage medium
CN113160849B (en) | Singing voice synthesizing method, singing voice synthesizing device, electronic equipment and computer readable storage medium
US11250837B2 (en) | Speech synthesis system, method and non-transitory computer readable medium with language option selection and acoustic models
CN112071287A (en) | Method, apparatus, electronic device and computer readable medium for generating song score
CN114822492B (en) | Speech synthesis method and device, electronic equipment and computer readable storage medium
CN113421544B (en) | Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
