CN111354332A

Movatterモバイル変換

Info

Publication number: CN111354332A
Application number: CN201811480060.2A
Authority: CN
Inventors: 赵帅江; 杨贵林; 蒋栋蔚
Original assignee: Beijing Didi Infinity Technology and Development Co Ltd
Current assignee: Beijing Didi Infinity Technology and Development Co Ltd
Priority date: 2018-12-05
Filing date: 2018-12-05
Publication date: 2020-06-30

Abstract

The embodiment of the application provides a singing voice synthesis method and a singing voice synthesis device, which are used for acquiring a target song and user voice to be simulated; acquiring song characteristics of the target song and acoustic characteristics of the user sound; determining a plurality of acoustic parameters of the synthesized singing voice based on the song characteristics, the acoustic characteristics and a pre-trained synthesis model; synthesizing the singing voice of the user based on the plurality of acoustic parameters. Therefore, the acoustic parameters can be determined by the synthesis model through acquiring the song characteristics of the preprocessed target song and the acoustic characteristics of the voice of the user, and the acoustic parameters are used for synthesizing the singing voice, so that the singing voice synthesis can realize automatic singing voice synthesis through model learning, the singing voice synthesis speed is improved, the synthesis factors are rich, and the synthesis singing voice has high reality degree.

Description

Singing voice synthesis method and device

Technical Field

The application relates to the technical field of voice synthesis, in particular to a singing voice synthesis method and device.

Background

With the continuous development of artificial intelligence technology, the voice interaction capability is more and more emphasized, and listening, speaking and singing become necessary skills. More and more application software and intelligent devices begin to use speech synthesis technology to simulate the content of human speech, for example, use speech synthesis technology to simulate the content of human speech to achieve the purpose of "learning tongue", or use speech synthesis technology to simulate the voice of singing people, etc. different from the ordinary speech scene.

However, in a scene of simulating singing of a user, a common method is to perform modeling for sound information of different users, perform tone conversion on the sound of a target user by using the established model, and adjust the tone of the song and the like by combining the tone of the user through manual experience and other experience modes on the basis of the inherent tone of the song so as to realize the singing sound effect of simulating the tone of the user.

Disclosure of Invention

In view of this, embodiments of the present application provide a singing voice synthesis method and apparatus, which can implement automatic singing voice synthesis through model learning, improve the singing voice synthesis rate, and obtain different user voices, with rich synthesis factors and high degree of reality.

In one aspect, an embodiment of the present application provides a singing voice synthesis method, where the method includes:

acquiring a target song and user sound to be simulated;

acquiring song characteristics of the target song and acoustic characteristics of the user sound;

determining a plurality of acoustic parameters of the synthesized singing voice based on the song characteristics, the acoustic characteristics and a pre-trained synthesis model;

synthesizing the singing voice of the user based on the plurality of acoustic parameters.

In one embodiment, the determining a plurality of acoustic parameters of the synthesized singing voice based on the song feature, the acoustic feature and a pre-trained synthesis model comprises:

constructing a song feature vector based on the song features, and constructing an acoustic feature vector based on the acoustic features;

and inputting the song feature vector and the acoustic feature vector into a pre-trained synthetic model, and determining various acoustic parameters of the synthetic song.

In the above embodiment, the plurality of acoustic parameters include a mel-frequency cepstrum and a fundamental frequency.

In some embodiments of the present application, the inputting the song feature vector and the acoustic feature vector into a pre-trained synthesis model to determine a plurality of acoustic parameters of a synthesized song includes:

splicing the song characteristic vector and the acoustic characteristic vector to obtain a spliced characteristic vector;

and inputting the splicing feature vector into the synthetic model, and determining the Mel cepstrum and fundamental frequency of the synthetic singing voice.

In the above embodiment, the obtaining of the song characteristics of the target song includes:

analyzing and processing the lyrics and the music score of the target song to obtain phoneme information and rhythm information;

and coding the phoneme information and the prosody information to obtain the song characteristics.

In the above embodiment, acquiring the acoustic feature of the user sound includes:

analyzing the audio frequency of the user sound to obtain audio information;

determining index information of the user voice;

and coding the audio information and the index information of the user voice to obtain the acoustic characteristics.

In some embodiments, the method further comprises:

acquiring a song sample, a voice sample and a singing voice sample obtained by singing the song sample by a user corresponding to the voice sample;

acquiring song characteristics of the song sample and acoustic characteristics of the sound sample;

and training a synthetic model according to the song characteristics of the song sample, the acoustic characteristics of the sound sample and the singing voice sample.

In some embodiments, training a synthetic model based on the song characteristics of the song sample, the acoustic characteristics of the sound sample, and the song sound sample comprises:

constructing a song feature vector of the song sample based on the song features of the song sample, and constructing an acoustic feature vector of the sound sample based on the acoustic features of the sound sample;

acquiring acoustic parameters of the singing voice sample;

training the synthetic model based on the acoustic parameters of the song sample as an output result, and the song feature vector of the song sample and the acoustic feature vector of the sound sample as model input features.

In some embodiments, training the synthetic model based on the acoustic parameters of the song sample as output results, and the song feature vector of the song sample and the acoustic feature vector of the sound sample as model input features comprises:

splicing the song characteristic vector of the song sample and the acoustic characteristic vector of the sound sample to obtain a spliced characteristic vector sample;

inputting the spliced feature vector sample into the synthesis model, and determining acoustic parameters of the synthesized singing voice;

calculating an error of the synthetic model based on the acoustic parameters of the singing voice sample and the acoustic parameters output by the synthetic model;

and updating the training parameters of the synthesis model based on the error, and selecting a new song sample, a sound sample and a singing sound sample obtained by singing the song sample by the user corresponding to the sound sample for training until a training cut-off condition is met.

In the above embodiment, the training cutoff condition comprises one of the following conditions:

the error no longer becomes small; the training times reach a set threshold; the number of training samples reaches a set threshold.

In the above embodiment, obtaining the song characteristics of the song sample includes:

analyzing and processing the lyrics and the music score of the song sample to obtain phoneme information and rhythm information;

and coding the phoneme information and the prosody information of the song sample to obtain the song characteristics of the song sample.

In the above embodiment, acquiring the acoustic training features of the sound sample includes:

analyzing the audio frequency of the sound sample to obtain audio information;

determining index information of the sound sample;

and coding the audio information and the index information of the sound sample to obtain the acoustic characteristics of the sound sample.

On the other hand, an embodiment of the present application further provides a singing voice synthesizing apparatus, including:

the first acquisition module is used for acquiring a target song and user voice to be simulated;

the second acquisition module is used for acquiring the song characteristics of the target song and the acoustic characteristics of the user sound;

the determining module is used for determining a plurality of acoustic parameters of the synthesized singing voice based on the song characteristics, the acoustic characteristics and a pre-trained synthetic model;

and the synthesis module is used for synthesizing the singing voice of the user based on the plurality of acoustic parameters.

In one embodiment, the determining module is specifically configured to:

In some embodiments of the present application, the determining module is further specifically configured to:

In some embodiments of the present application, the second obtaining module is specifically configured to:

In some embodiments of the present application, the second obtaining module is further specifically configured to:

analyzing the audio frequency of the user sound to obtain audio information;

determining index information of the user voice;

In some embodiments, the singing voice synthesizing apparatus comprises:

the third acquisition module is used for acquiring a song sample, a voice sample and a singing voice sample obtained by singing the song sample by a user corresponding to the voice sample;

the fourth acquisition module is used for acquiring the song characteristics of the song sample and the acoustic characteristics of the sound sample;

and the training module is used for training a synthesis model according to the song characteristics of the song sample, the acoustic characteristics of the sound sample and the song sound sample.

In some embodiments, the training module is specifically configured to:

acquiring acoustic parameters of the singing voice sample;

In the above embodiment, the training module is further specifically configured to:

In the foregoing embodiment, the fourth obtaining module is specifically configured to:

In the above embodiment, the fourth obtaining module is further specifically configured to:

analyzing the audio frequency of the sound sample to obtain audio information;

determining index information of the sound sample;

On the other hand, an embodiment of the present application further provides an electronic device, including: the singing voice synthesizing method comprises a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor, when the electronic equipment runs, the processor is communicated with the storage medium through the bus, and the processor executes the machine-readable instructions to execute the steps of the singing voice synthesizing method.

On the other hand, the embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to perform the steps of the singing voice synthesizing method.

The singing voice synthesis method and the singing voice synthesis device provided by the embodiment of the application acquire a target song and user voice to be simulated; acquiring song characteristics of the target song and acoustic characteristics of the user sound; determining a plurality of acoustic parameters of the synthesized singing voice based on the song characteristics, the acoustic characteristics and a pre-trained synthesis model; synthesizing the singing voice of the user based on the plurality of acoustic parameters. Therefore, the synthesis model can be used for determining various acoustic parameters by acquiring the song characteristics of the preprocessed target song and the acoustic characteristics of the user voice, and the determined acoustic parameters are used for synthesizing the singing voice, so that the singing voice synthesis can realize automatic singing voice synthesis through model learning, the singing voice synthesis rate is improved, different user voice training synthesis models can be acquired, and the synthesis factors for synthesizing the singing voice based on the synthesis model are rich and high in reality degree.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

FIG. 1 is a diagram of exemplary hardware and software components of an electronic device that may implement the concepts of the present application, according to some embodiments of the present application;

fig. 2 is a flowchart of a singing voice synthesizing method according to an embodiment of the present application;

fig. 3 is a flowchart of a singing voice synthesizing method according to another embodiment of the present application;

fig. 4 is a block diagram of a singing voice synthesizing apparatus according to an embodiment of the present application;

fig. 5 is a second block diagram of a singing voice synthesizer according to an embodiment of the present invention.

Detailed Description

In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.

In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

In order to enable those skilled in the art to use the present disclosure, the following embodiments are given in conjunction with a specific application scenario "singing voice synthesis using a user's voice". It will be apparent to those skilled in the art that the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the application. Although the present application is described primarily in the context of singing voice synthesis, it should be understood that this is merely one exemplary embodiment. The present application may also include any service system for sound synthesis, for example, a system for synthesizing dubbing of audio or video using user's voice, and the like.

It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.

It is worth noting that, in the current singing simulation of user voice, a common method is to perform modeling aiming at voice information of different users, perform tone conversion on the voice of a target user by using an established model, and adjust the tone of a song and the like through manual experience by combining the tone of the user on the basis of the inherent tone of the song so as to realize the singing effect of simulating the voice of the user singing, but the voice synthesis process is slow in processing speed, and the inherent tone is used for adjustment, so that the obtained simulation result is rigid and inconsistent with the actual singing voice.

In view of the above, one aspect of the present application provides a singing voice synthesis method, which implements automatic synthesis through model learning, thereby improving the singing voice synthesis rate, and can obtain different user voice training synthesis models, wherein the synthesis factors for synthesizing the singing voice based on the synthesis models are rich, and the synthesized singing voice is fine and smooth and has high reality.

The embodiment of the present application may be implemented by a cloud platform, a server, or a server group (hereinafter, referred to as a service system) interacting with a user side, where the user side in the embodiment of the present application may be understood as a terminal device, and may also be understood as a client registered by a user on a platform, where the terminal device may be understood as a mobile terminal, a Personal Computer (PC), or any device capable of installing a platform application program.

In some embodiments, the service system may be a single server or a group of servers. The server group may be centralized or distributed (e.g., the service system may be a distributed system). In some embodiments, the service system may be local or remote with respect to the terminal. For example, the service system may access information and/or data stored in the user side or a database, or any combination thereof, via a network. In some embodiments, the service system may be implemented on a cloud platform; by way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud (communicuted), a distributed cloud, an inter-cloud, a multi-cloud, and the like, or any combination thereof. In some embodiments, the service system may be implemented on an electronic device 100 having one or more of the components shown in FIG. 1 in the present application.

In some embodiments, the service system may include a processor. The processor may process information and/or data related to the service request to perform one or more of the functions described herein. In some embodiments, a processor may include one or more processing cores (e.g., a single-core processor (S) or a multi-core processor (S)). Merely by way of example, a Processor may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), an Application Specific Instruction Set Processor (ASIP), a Graphics Processing Unit (GPU), a Physical Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller Unit, a reduced Instruction Set computer (reduced Instruction Set computer), a microprocessor, or the like, or any combination thereof.

FIG. 1 is a diagram of exemplary hardware and software components of an electronic device 100 that may implement the concepts of the present application, according to some embodiments of the present application. For example, the processor 120 may be used on the electronic device 100 and to perform the functions in the present application.

The electronic device 100 may be a general-purpose computer or a special-purpose computer, both of which may be used to implement the singing voice synthesis method of the present application. Although only a single computer is shown, for convenience, the functions described herein may be implemented in a distributed fashion across multiple similar platforms to balance processing loads.

For example, the electronic device 100 may include a network port 110 connected to a network, one or more processors 120 for executing program instructions, a communication bus 130, and a storage medium 140 of different form, such as a disk, ROM, or RAM, or any combination thereof. Illustratively, the computer platform may also include program instructions stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. The method of the present application may be implemented in accordance with these program instructions. The electronic device 100 also includes an Input/Output (I/O) interface 150 between the computer and other Input/Output devices (e.g., keyboard, display screen).

For ease of illustration, only one processor is depicted in electronic device 100. However, it should be noted that the electronic device 100 in the present application may also comprise a plurality of processors, and thus the steps performed by one processor described in the present application may also be performed by a plurality of processors in combination or individually. For example, if the processor of the electronic device 100 executes steps a and B, it should be understood that steps a and B may also be executed by two different processors together or separately in one processor. For example, a first processor performs step a and a second processor performs step B, or the first processor and the second processor perform steps a and B together.

The idea of the embodiment of the present application is further described below by taking a travel service scenario as an example from the perspective of an execution method.

Fig. 2 is a flowchart of a singing voice synthesizing method provided in an embodiment of the present application, as shown in fig. 2, the method including:

s201: and acquiring a target song and the user voice to be simulated.

The target song and the user voice to be simulated are obtained, the target song and the user voice can be obtained from a pre-stored database, and for obtaining the user voice, the user voice can be collected in real time, and then the collected user voice is obtained.

In addition, the acquired user voice can be the speaking voice of a certain user which exists actually, and can also be the voice of a certain user which exists actually, so that the voice of a virtual user is obtained, and therefore, the synthesis of virtual tone can be realized, and the interestingness of the synthesis of singing voice is increased.

S202: and acquiring the song characteristics of the target song and the acoustic characteristics of the user voice.

In this step, after the target song and the user sound are acquired, the target song and the user sound may be processed, so as to acquire the song characteristics of the target song and the acoustic characteristics of the user sound, which are acquired after processing.

The processing of the target song and the user sound may be coding the target song and the user sound through a coding model, so as to obtain coded song characteristics and acoustic characteristics.

Specifically, the song characteristics may be characteristics of the target song, such as a song name, lyrics, a music score, a tune, a singing method, a genre and the like, obtained by encoding, may be specific characteristics obtained by encoding the above characteristics and the like, or may be an obtained representation of the target song, that is, the song and the song characteristics are represented by a simplified code or Identifier (ID) and the like, and when in use, the target song and the song characteristics of the target song associated with the representation can be extracted through the representation; the acoustic features may be obtained by encoding features of the user's voice that characterize the user's voice, such as timbre, frequency, pitch, and loudness of the user's voice, or by integrating specific features obtained by encoding the above features, or may be an obtained representation of the user, i.e., representing the user and the user's voice in a simplified code or ID format, and, in use, the acoustic features of the user and the user's voice that are associated with the representation may be extracted from the representation.

Specifically, a coding model based on a Long Short-Term Memory network (LSTM) may be used, a plurality of LSTMs are set, a full connection layer is even arranged behind each LSTM, a Back Propagation Time (BPTT) is used to train the coding model, the trained coding model is used to code the target song and the user voice, and after the user voice is input, the last layer of output is normalized by L2 to obtain the final coding characteristics of the user voice.

Therefore, the model end is used for processing the characteristics of the user voice and the song, so that the processing result can be quickly and effectively obtained, and preparation is made for subsequent voice synthesis.

S203: and determining a plurality of acoustic parameters of the synthesized singing voice based on the song characteristics, the acoustic characteristics and a pre-trained synthesis model.

In this step, after the song feature and the acoustic feature are obtained, the song feature and the acoustic feature may be used as input features of a model, and the input features are input into a pre-trained synthesis model for processing, so as to obtain and determine a plurality of acoustic features required for synthesizing the song voice.

The pre-trained synthetic model may be a model of attention-based neural network natural language processing for singing voice synthesis, which is trained by adopting an attention-based system in combination with a convolutional neural network.

Step S204: synthesizing the singing voice of the user based on the plurality of acoustic parameters.

In this step, after obtaining the acoustic parameters through the synthesis model, a synthesizer or the like may be used to simulate and synthesize the singing voice of the user singing the target song using the determined multiple acoustic parameters.

The synthesizer may be a part of the synthesis model, or may be provided independently.

According to the singing voice synthesis method provided by the embodiment of the application, a target song and user voice to be simulated are obtained; acquiring song characteristics of the target song and acoustic characteristics of the user sound; determining a plurality of acoustic parameters of the synthesized singing voice based on the song characteristics, the acoustic characteristics and a pre-trained synthesis model; synthesizing the singing voice of the user based on the plurality of acoustic parameters.

Therefore, the acoustic parameters can be determined by using the synthesis model through acquiring the song characteristics of the preprocessed target song and the acoustic characteristics of the user voice, the singing voice is synthesized by using the acoustic parameters, the automatic singing voice synthesis can be realized through model learning, the singing voice synthesis speed is improved, different user voice training synthesis models can be acquired, and the synthesis factors for synthesizing the singing voice based on the synthesis models are rich and high in degree of reality.

Fig. 3 is a flowchart of a singing voice synthesizing method according to another embodiment of the present application, as shown in fig. 3, the method including:

s301: and acquiring a target song and the user voice to be simulated.

In addition, the acquired user voice can be the speaking voice of a certain user which exists actually, and can also be the voice of a virtual user obtained by adjusting the voice of the certain user which exists actually, so that the synthesis of virtual timbre can be realized, and the interestingness of the synthesis of singing voice is increased.

S302: and acquiring the song characteristics of the target song and the acoustic characteristics of the user voice.

Specifically, the song characteristics may be characteristics of the target song, such as a song name, lyrics, a music score, a tune, a gramophone record, a genre and the like, obtained by encoding, may be specific characteristics obtained by encoding the above characteristics and the like, or may be an obtained representation of the target song, that is, the song and the song characteristics are represented by a simplified code or Identifier (ID), and when in use, the target song and the song characteristics of the target song associated with the representation may be extracted by the representation; the acoustic features may be obtained by encoding features of the user's voice that characterize the user's voice, such as timbre, frequency, pitch, and loudness of the user's voice, or by integrating specific features obtained by encoding the features, or may be an obtained representation of the user, i.e., representing the user and the user's voice in a simplified code or ID format, and the acoustic features of the user and the user's voice that are associated with the representation may be extracted from the representation during use.

S303: constructing a song feature vector based on the song features, and constructing an acoustic feature vector based on the acoustic features.

In this step, after the song feature and the acoustic feature are obtained, the song feature and the acoustic feature may be processed to determine an input feature that may be subsequently input as a model, specifically, a song diagnosis vector may be constructed according to the song feature, and an acoustic feature vector may be constructed according to the acoustic feature, so as to facilitate subsequent use.

The song feature vector may be a fixed-length vector, the acoustic feature vector, or a fixed-length vector, and for convenience of subsequent use and processing, the vector length of the song feature vector and the vector length of the acoustic feature may be the same.

The pre-trained synthetic model may be a synthetic model for singing voice synthesis trained by using an attention mechanism and combining with a convolutional neural network.

And S304, inputting the song characteristic vector and the acoustic characteristic vector into a pre-trained synthetic model, and determining various acoustic parameters of the synthetic song.

In this step, after the song feature vector and the acoustic feature vector are obtained by constructing a vector, the song feature vector and the acoustic feature vector may be input to a synthesis model trained in advance as input features of the model, so that a plurality of acoustic parameters of the synthesized song are determined based on an output result of the synthesis model.

Among them, the plurality of acoustic parameters include a meier cepstrum and a fundamental frequency for synthesizing the singing voice.

Specifically, the song feature vector and the acoustic feature vector are input into a pre-trained synthetic model, and a plurality of acoustic parameters of the synthetic song are determined, which may be that vector splicing is performed on the song feature vector and the acoustic feature vector to obtain a spliced vector, and then the spliced vector is input into the synthetic model as the input feature of the synthetic model, so as to obtain the determined mel cepstrum and fundamental frequency of the synthetic song according to the output result of the synthetic model.

Specifically, the splicing processing of the song feature vector and the acoustic feature vector may be performed in a conventional vector splicing manner, or may be performed in a special vector form according to the needs or settings of the synthetic model.

Therefore, the singing feature and the coding of the acoustic feature and a series of vectorization processing are carried out, then the synthesis model is used for synthesizing to obtain the acoustic parameters of the synthesized singing voice, the voice synthesis is carried out in an end-to-end mode, the automation degree is high, the synthesis speed and the synthesis efficiency can be effectively improved, each processing process can be independently set and processed, the cascade errors among the joint processing can be effectively reduced or avoided, and the training and the deployment are simple and flexible.

S305: synthesizing the singing voice of the user based on the plurality of acoustic parameters.

In some implementations, the song characteristics of the target song may be obtained by:

analyzing and processing the lyrics and the music score of the target song to obtain phoneme information and rhythm information; and coding the phoneme information and the prosody information to obtain the song characteristics.

In this step, after the target song is obtained, song information of the target song may be obtained, lyrics and a music score of the target song are extracted from the song information of the target song, then, the lyrics and the music score of the target song may be analyzed, so that phoneme information and prosody information of the song are obtained through analysis, and then, the phoneme information and the prosody information are encoded, so that song characteristics of the target song may be obtained.

In some embodiments, the acoustic characteristics of the user's voice may be obtained by:

analyzing the audio frequency of the user sound to obtain audio information; determining index information of the user voice; and coding the audio information and the index information of the user voice to obtain the acoustic characteristics.

In this step, after the user sound is obtained, the audio of the user sound, that is, the frequency of the user sound, may be detected, and then the detected audio is analyzed, so as to obtain the audio information of the user sound, and after the index information of the user sound is determined, then the audio information and the index information may be encoded, so as to obtain the acoustic feature of the user sound.

The index information of the user voice may be an index for indicating which user the user voice corresponds to, that is, the index information may be used to determine which user voice is specific, and the index may be represented by setting an index such as a label, a symbol, or a tag to the user.

In this embodiment, the analyzing of the audio of the user sound is performed before the index information of the user sound is determined, but is not limited to the order of the two, and the analyzing of the audio of the user sound may be performed, or may be performed after the index information of the user sound is determined, and may be performed in synchronization with the determining of the index information of the user sound. In other embodiments of the present application, the order explanation of each step is the same as that here, and is not described in detail.

In some embodiments, the method further comprises:

acquiring a song sample, a voice sample and a singing voice sample obtained by singing the song sample by a user corresponding to the voice sample; acquiring song characteristics of the song sample and acoustic characteristics of the sound sample; and training a synthetic model according to the song characteristics of the song sample, the acoustic characteristics of the sound sample and the singing voice sample.

In this step, after a synthesis model for synthesizing singing voice is established, the synthesis model needs to be trained to perform singing voice synthesis using the synthesis model, specifically, a song sample for training the synthesis model and a voice sample of a user are obtained, a singing voice sample obtained based on the song sample is performed by the user corresponding to the voice sample, then song features and acoustic features of the song sample and the voice sample are respectively obtained for the song sample and the voice sample, and then the song features of the song sample, the acoustic features of the voice sample and the singing voice sample can be used to train the synthesis model.

In the foregoing embodiment, specifically, the acquiring of the song characteristics of the song sample may be that the lyrics and the music score of the song sample are firstly analyzed to obtain the phoneme information and the prosody information of the song sample, and then the phoneme information and the prosody information of the song sample are coded to obtain the song characteristics of the song sample, which are used as the model input characteristics for training the synthesis model.

In the foregoing embodiment, specifically, the obtaining of the acoustic training feature of the sound sample may be that audio of the sound sample is analyzed to obtain audio information, and index information of the sound sample is determined, and then the audio information and the index information of the sound sample are encoded to obtain the acoustic feature of the sound sample, which is used as a model input feature for training the synthesis model.

constructing a song feature vector of the song sample based on the song features of the song sample, and constructing an acoustic feature vector of the sound sample based on the acoustic features of the sound sample; acquiring acoustic parameters of the singing voice sample; training the synthetic model based on the acoustic parameters of the song sample as an output result, and the song feature vector of the song sample and the acoustic feature vector of the sound sample as model input features.

In this step, after acquiring the song characteristics of the song sample and the acoustic characteristics of the sound sample, by vectorizing the song characteristics of the song sample and the acoustic characteristics of the sound sample, thereby constructing a song feature vector of the song sample according to the song features of the song sample, and constructing an acoustic feature vector of the sound sample according to the acoustic features of the sound sample so as to determine the model input features which can be used for training the synthesis model subsequently, after obtaining the acoustic parameters of the singing voice sample, the acoustic parameters of the singing voice sample may be used as a model output result of the synthetic model, and the song feature vector of the song sample and the acoustic feature vector of the voice sample may be used as model input features of the synthetic model to train the synthetic model.

Specifically, the synthesis model is trained based on the acoustic parameters of the singing voice sample as the output result, and the song feature vector of the song sample and the acoustic feature vector of the sound sample as the model input features, after determining the song feature vector of the song sample and the acoustic feature vector of the sound sample, the song feature vector of the song sample and the acoustic feature vector of the sound sample are spliced to obtain a spliced feature vector sample of the input vector that can be used by the synthesis model, then the spliced feature vector sample is input to the synthesis model, the acoustic parameters of the synthesized singing voice are determined through the synthesis result output by the synthesis model, and then the acoustic parameters output by the synthesis model are compared with the acquired acoustic parameters of the singing voice sample, and calculating the error of the synthetic model according to the difference between the two, updating the training parameters of the synthetic model according to the error, reselecting a new song sample, a sound sample and a singing voice sample obtained by singing the song sample by a user corresponding to the new sound sample in order to improve the accuracy of the synthetic model, and continuing training the synthetic model until a training cut-off condition is met.

Wherein, the error of the synthesis model is calculated, which may be a mean square error of the acoustic parameters of the singing voice sample and the acoustic parameters output by the synthesis model; based on the error, the training parameters of the synthetic model are updated, and the network weights of the synthetic model may be updated through a back propagation algorithm according to the mean square error.

Wherein the training cutoff condition comprises one of:

the error is no longer small, i.e. the mean square error is no longer decreasing; the training times reach a set threshold; the number of training samples reaches a set threshold.

According to the singing voice synthesis method provided by the embodiment of the application, a target song and user voice to be simulated are obtained; acquiring song characteristics of the target song and acoustic characteristics of the user sound; constructing a song feature vector based on the song features, and constructing an acoustic feature vector based on the acoustic features; inputting the song feature vector and the acoustic feature vector into a pre-trained synthesis model, and determining various acoustic parameters of the synthesized song; synthesizing the singing voice of the user based on the plurality of acoustic parameters.

Referring to fig. 4 and 5, fig. 4 is a first structural diagram of a singing voice synthesizing apparatus according to an embodiment of the present application, and fig. 5 is a second structural diagram of the singing voice synthesizing apparatus according to the embodiment of the present application. The singingvoice synthesizing apparatus 400 may implement the steps performed by the above-described singing voice synthesizing method. The device can be understood as the server or the processor of the server, and can also be understood as a component which is independent of the server or the processor and realizes the functions of the application under the control of the server. As shown in fig. 4, the singingvoice synthesizing apparatus 400 includes:

a first obtainingmodule 410, configured to obtain a target song and a user sound to be simulated.

A second obtainingmodule 420, configured to obtain song characteristics of the target song and acoustic characteristics of the user sound.

A determiningmodule 430, configured to determine a plurality of acoustic parameters of the synthesized singing voice based on the song feature, the acoustic feature and a pre-trained synthesis model.

A synthesizingmodule 440 for synthesizing the singing voice of the user based on the plurality of acoustic parameters.

In some embodiments, the determiningmodule 430 is specifically configured to:

constructing a song feature vector based on the song features, and constructing an acoustic feature vector based on the acoustic features.

In some embodiments, the determiningmodule 430 is further specifically configured to:

and splicing the song characteristic vector and the acoustic characteristic vector to obtain a spliced characteristic vector.

In some embodiments, the second obtainingmodule 420 is specifically configured to:

and analyzing and processing the lyrics and the music score of the target song to obtain phoneme information and prosody information.

In some embodiments, the second obtainingmodule 420 is further specifically configured to:

and analyzing the audio frequency of the user sound to obtain audio information.

And determining index information of the user voice.

In some embodiments, as shown in fig. 5, the singing voice synthesizing apparatus includes:

the third obtainingmodule 450 is configured to obtain a song sample, a voice sample, and a singing voice sample obtained by singing the song sample by a user corresponding to the voice sample.

A fourth obtainingmodule 460, configured to obtain song characteristics of the song sample and acoustic characteristics of the sound sample.

And atraining module 470, configured to train a synthesis model according to the song characteristics of the song sample, the acoustic characteristics of the sound sample, and the singing sound sample.

In some embodiments, thetraining module 470 is specifically configured to:

and constructing a song feature vector of the song sample based on the song features of the song sample, and constructing an acoustic feature vector of the sound sample based on the acoustic features of the sound sample.

And acquiring the acoustic parameters of the singing voice sample.

In some embodiments, thetraining module 470 is further specifically configured to:

and splicing the song characteristic vector of the song sample and the acoustic characteristic vector of the sound sample to obtain a spliced characteristic vector sample.

And inputting the spliced feature vector sample into the synthesis model, and determining the acoustic parameters of the synthesized singing voice.

And calculating the error of the synthetic model based on the acoustic parameters of the singing voice sample and the acoustic parameters output by the synthetic model.

In some embodiments, the training cutoff condition comprises one of:

In some embodiments, the fourth obtainingmodule 460 is specifically configured to:

and analyzing and processing the lyrics and the music score of the song sample to obtain phoneme information and prosody information.

In some embodiments, the fourth obtainingmodule 460 is further specifically configured to:

and analyzing the audio frequency of the sound sample to obtain audio information.

Index information for the sound sample is determined.

The singing voice synthesizing device provided by the embodiment of the application acquires a target song and user voice to be simulated; acquiring song characteristics of the target song and acoustic characteristics of the user sound; determining a plurality of acoustic parameters of the synthesized singing voice based on the song characteristics, the acoustic characteristics and a pre-trained synthesis model; synthesizing the singing voice of the user based on the plurality of acoustic parameters.

An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the singing voice synthesizing method in the method embodiments shown in fig. 2 and fig. 3 may be executed.

The modules may be connected or in communication with each other via a wired or wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, etc., or any combination thereof. The wireless connection may comprise a connection over a LAN, WAN, bluetooth, ZigBee, NFC, or the like, or any combination thereof. Two or more modules may be combined into a single module, and any one module may be divided into two or more units.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to corresponding processes in the method embodiments, and are not described in detail in this application. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and there may be other divisions in actual implementation, and for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or modules through some communication interfaces, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of synthesizing singing voice, the method comprising:

acquiring a target song and user sound to be simulated;

2. The method of claim 1, wherein determining a plurality of acoustic parameters of a synthesized song voice based on the song features, the acoustic features, and a pre-trained synthesis model comprises:

3. The method of claim 2, wherein the plurality of acoustic parameters include a mel-frequency cepstrum and a fundamental frequency.

4. The method of claim 3, wherein inputting the song feature vectors and the acoustic feature vectors into a pre-trained synthesis model to determine a plurality of acoustic parameters of a synthesized song comprises:

5. The method of claim 2, wherein obtaining song characteristics of the target song comprises:

6. The method of claim 2, wherein obtaining the acoustic features of the user's voice comprises:

analyzing the audio frequency of the user sound to obtain audio information;

determining index information of the user voice;

7. The method of claim 1, wherein the method further comprises:

8. The method of claim 7, wherein training a synthetic model based on the song characteristics of the song sample, the acoustic characteristics of the sound sample, and the song sample comprises:

acquiring acoustic parameters of the singing voice sample;

9. The method of claim 8, wherein training the synthetic model based on the acoustic parameters of the song sample as an output result, and the song feature vector of the song sample and the acoustic feature vector of the sound sample as model input features comprises:

10. The method of claim 9, wherein the training cutoff condition comprises one of:

11. The method of claim 7, wherein obtaining song characteristics of the song sample comprises:

12. The method of claim 7, wherein obtaining acoustic training features for the sound sample comprises:

analyzing the audio frequency of the sound sample to obtain audio information;

determining index information of the sound sample;

13. A singing voice synthesizing apparatus, characterized in that the singing voice synthesizing apparatus comprises:

14. The singing voice synthesis apparatus of claim 13, wherein the determining module is specifically configured to:

15. The singing voice synthesizing apparatus according to claim 14, wherein the plurality of acoustic parameters include a mel-frequency cepstrum and a fundamental frequency.

16. The singing voice synthesis apparatus of claim 15, wherein the determining module is further specifically configured to:

17. The singing voice synthesis apparatus of claim 14, wherein the second obtaining module is specifically configured to:

18. The singing voice synthesis apparatus of claim 14, wherein the second obtaining module is further specifically configured to:

analyzing the audio frequency of the user sound to obtain audio information;

determining index information of the user voice;

19. The singing voice synthesizing apparatus according to claim 13, wherein the singing voice synthesizing apparatus comprises:

20. The singing voice synthesis apparatus of claim 19, wherein the training module is specifically configured to:

acquiring acoustic parameters of the singing voice sample;

21. The singing voice synthesis apparatus of claim 20, wherein the training module is further specifically configured to:

22. The singing voice synthesizing apparatus according to claim 21, wherein the training cutoff condition includes one of the following conditions:

23. The singing voice synthesis apparatus of claim 19, wherein the fourth obtaining module is specifically configured to:

24. The singing voice synthesis apparatus of claim 19, wherein the fourth obtaining module is further specifically configured to:

analyzing the audio frequency of the sound sample to obtain audio information;

determining index information of the sound sample;

25. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the singing voice synthesizing method according to any one of claims 1 to 12.

26. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the steps of the singing voice synthesizing method according to any one of claims 1 to 12.