
Audio generation method and device, computer readable storage medium and computing device

Info

Publication number
CN111028823A
Authority
CN
China
Prior art keywords
audio
phoneme
pronunciation information
sample
liaison
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911267158.4A
Other languages
Chinese (zh)
Other versions
CN111028823B (en)
Inventor
肖纯智
劳振锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN201911267158.4A
Publication of CN111028823A
Application granted
Publication of CN111028823B
Status: Active
Anticipated expiration

Abstract

The application relates to an audio generation method and device, a computer-readable storage medium and a computing device, and belongs to the field of electronic technology applications. The method comprises: acquiring a plurality of pieces of pronunciation information, the plurality of pieces of pronunciation information comprising at least one piece of first pronunciation information, each piece of first pronunciation information comprising the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the phonemes adjacent to the target phoneme comprise the previous phoneme and the next phoneme of the target phoneme, and the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes; and inputting the plurality of pieces of pronunciation information into an audio synthesis model to obtain a target audio output by the audio synthesis model, wherein the audio frame corresponding to each piece of pronunciation information is one audio frame in the target audio. The method and device can improve the quality of the output audio.

Description

Audio generation method and device, computer readable storage medium and computing device
Technical Field
The present application relates to the field of electronic technology application, and in particular, to an audio generation method, an audio generation device, a computer-readable storage medium, and a computing device.
Background
An audio synthesis model is a model for performing audio synthesis. The audio of songs and the like can be synthesized through the audio synthesis model.
The current process of generating audio with an audio synthesis model is as follows: an audio synthesis model is obtained through a model training process, a plurality of pieces of pronunciation information (conditions) are input into the audio synthesis model, and the audio synthesis model outputs the target audio. The plurality of pieces of pronunciation information correspond one-to-one to the plurality of audio frames included in the output target audio, and each piece of pronunciation information describes the audio features of the corresponding audio frame. Typically, each piece of pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the previous phoneme of the target phoneme, and the content of the next phoneme.
However, a song sung by a real person is produced by changes of the human vocal tract, and a song generated by the audio synthesis model cannot effectively reflect this change process, so the quality of the output audio is poor.
Disclosure of Invention
The embodiment of the application provides an audio generation method, an audio generation device, a computer readable storage medium and a computing device, which can improve the quality of generated audio. The technical scheme is as follows:
according to a first aspect of embodiments of the present application, there is provided an audio generation method, including:
acquiring a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information comprise at least one piece of first pronunciation information, and each piece of first pronunciation information comprises: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the phonemes adjacent to any target phoneme comprise the previous phoneme and the next phoneme of that target phoneme, the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes, and the audio frame corresponding to each piece of the plurality of pieces of pronunciation information is one audio frame in the target audio;
inputting the plurality of pieces of pronunciation information into an audio synthesis model to obtain the target audio output by the audio synthesis model.
Optionally, before the obtaining the plurality of pronunciation information, the method further comprises:
analyzing sample audio to obtain a plurality of pieces of sample pronunciation information, wherein the plurality of pieces of sample pronunciation information comprise at least one piece of second pronunciation information, and each piece of second pronunciation information comprises: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the audio frame corresponding to each piece of the plurality of pieces of sample pronunciation information is one audio frame in the sample audio;
performing model training based on the plurality of pieces of sample pronunciation information to obtain the audio synthesis model.
Optionally, the analyzing the sample audio to obtain a plurality of pieces of sample pronunciation information comprises:
obtaining the pitch of each audio frame in the sample audio;
detecting whether there is liaison between each phoneme in the sample audio and its adjacent phonemes to obtain a liaison detection result;
generating the plurality of pieces of sample pronunciation information based on the pitch of each audio frame and the liaison detection result.
Optionally, the detecting whether there is liaison between each phoneme in the sample audio and its adjacent phonemes to obtain a liaison detection result comprises:
when, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the start point of the sample audio frame set corresponding to any phoneme are all pitched frames, determining that the phoneme has pre-liaison, wherein a pitched frame is an audio frame whose pitch is greater than 0, N and M are positive integers, and the sample audio frame set corresponding to any phoneme is the set of audio frames formed by that phoneme in the sample audio during pronunciation;
when, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the end point of the sample audio frame set corresponding to any phoneme are all pitched frames, determining that the phoneme has post-liaison.
Optionally, the liaison indicator comprises a pre-liaison indicator and a post-liaison indicator, the pre-liaison indicator indicating whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the post-liaison indicator indicating whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme;
or, the liaison indicator comprises a single indicator indicating both whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme.
According to a second aspect of embodiments of the present application, there is provided an audio generating apparatus, comprising:
an obtaining module, configured to obtain a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information comprise at least one piece of first pronunciation information, and each piece of first pronunciation information comprises: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the phonemes adjacent to any target phoneme comprise the previous phoneme and the next phoneme of that target phoneme, the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes, and the audio frame corresponding to each piece of the plurality of pieces of pronunciation information is one audio frame in the target audio;
a processing module, configured to input the plurality of pieces of pronunciation information into an audio synthesis model to obtain the target audio output by the audio synthesis model.
Optionally, the apparatus further comprises:
an analysis module, configured to analyze sample audio before the plurality of pieces of pronunciation information are obtained, to obtain a plurality of pieces of sample pronunciation information, wherein the plurality of pieces of sample pronunciation information comprise at least one piece of second pronunciation information, and each piece of second pronunciation information comprises: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the audio frame corresponding to each piece of the plurality of pieces of sample pronunciation information is one audio frame in the sample audio;
a training module, configured to perform model training based on the plurality of pieces of sample pronunciation information to obtain the audio synthesis model.
Optionally, the analysis module comprises:
an obtaining sub-module, configured to obtain a pitch of each audio frame in the sample audio;
a detection sub-module, configured to detect whether there is liaison between each phoneme in the sample audio and its adjacent phonemes to obtain a liaison detection result;
a generating sub-module, configured to generate the plurality of pieces of sample pronunciation information based on the pitch of each audio frame and the liaison detection result.
Optionally, the detection submodule is configured to:
when, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the start point of the sample audio frame set corresponding to any phoneme are all pitched frames, determining that the phoneme has pre-liaison, wherein a pitched frame is an audio frame whose pitch is greater than 0, N and M are positive integers, and the sample audio frame set corresponding to any phoneme is the set of audio frames formed by that phoneme in the sample audio during pronunciation;
when, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the end point of the sample audio frame set corresponding to any phoneme are all pitched frames, determining that the phoneme has post-liaison.
Optionally, the liaison indicator comprises a pre-liaison indicator and a post-liaison indicator, the pre-liaison indicator indicating whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the post-liaison indicator indicating whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme;
or, the liaison indicator comprises a single indicator indicating both whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme.
According to a third aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored therein a computer program, which when executed by a processor causes the processor to implement the audio generation method according to any one of the preceding first aspects.
According to a fourth aspect of embodiments herein, there is provided a computing device comprising a processor and a memory;
the memory stores computer instructions; the processor executes the computer instructions stored by the memory to cause the computing device to perform the audio generation method of any of the first aspects.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
According to the audio generation method and device provided by the embodiments of the application, the pronunciation information input into the audio synthesis model includes a liaison indicator, and the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes. Because the liaison state of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the liaison that should appear, which improves the smoothness of the sound at liaison positions. The change process of the human vocal tract can therefore be effectively reflected, improving the quality of the output audio.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to illustrate the embodiments of the present application more clearly, the drawings that are needed in the description of the embodiments will be briefly described below, it being apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be derived from those drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flow diagram illustrating a method of audio generation according to an example embodiment.
FIG. 2 is a flow diagram illustrating another audio generation method according to an example embodiment.
Fig. 3 is a block diagram illustrating an audio generation apparatus according to an example embodiment.
Fig. 4 is a block diagram illustrating another audio generation apparatus according to an example embodiment.
FIG. 5 is a block diagram illustrating an analysis module in accordance with an exemplary embodiment.
Fig. 6 is a schematic diagram illustrating a structure of a terminal according to an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating a configuration of a server according to an example embodiment.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Phonemes (phones) are the smallest phonetic units divided according to the natural attributes of speech; they are analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. The types of phonemes differ under different pronunciation rules. For example, under English pronunciation rules, phonemes comprise two classes, vowel phonemes and consonant phonemes, each of which is subdivided into a plurality of specific phonemes, and the symbols of the International Phonetic Alphabet correspond one-to-one to the phonemes. Under Chinese pronunciation rules, the pronunciation of each Chinese character can be decomposed into an initial and a final; phonemes comprise initial phonemes and final phonemes, each class is subdivided into a plurality of specific phonemes, and the symbols in the Chinese initial and final lists correspond one-to-one to the phonemes.
Producing different phonemes requires changing the vocal tract into different shapes, and changing the vocal tract is a process that can be roughly divided into three stages: opening, steady, and closing. Opening and closing are the processes in which the vocal tract opens and closes. If, of two adjacent phonemes, the pronunciation of the first phoneme is similar to that of the second, then when the two phonemes are pronounced continuously the change of the vocal tract is not obvious, and the steady stage of the first phoneme can transition directly into the steady stage of the second phoneme; this situation can be called liaison. For example, when liaison occurs between two consecutive phonemes, the closing stage of the first phoneme and the opening stage of the second phoneme disappear.
Taking Chinese pronunciation rules as an example, liaison can occur when the same word is read twice in quick succession. When the same words are sung in a song, however, there may be a pause between them, so no liaison occurs. Therefore, in actual pronunciation, the same pair of adjacent phonemes may be pronounced differently in different situations.
When generating audio, a conventional audio synthesis model uses a plurality of pieces of pronunciation information, each of which includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the previous phoneme of the target phoneme, and the content of the next phoneme. The audio synthesized by such an audio synthesis model cannot reflect the liaison that should appear, so the smoothness of the sound at liaison positions is poor. The change process of the human vocal tract therefore cannot be effectively reflected, resulting in poor quality of the output audio.
The embodiments of the application provide an audio generation method that can be applied to the generation of various types of audio, such as Chinese songs, English songs, or other audio that includes a human voice, such as commentary or musical performance audio. The method can simulate a human voice, thereby providing users with artificial-intelligence singing functions such as virtual singing.
As shown in fig. 1, fig. 1 is a flowchart of the audio generation method, including:
Step 101: obtain a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information include at least one piece of first pronunciation information, and each piece of first pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator.
The phonemes adjacent to any target phoneme include the previous phoneme and the next phoneme of that target phoneme, the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes, and the audio frame corresponding to each piece of the plurality of pieces of pronunciation information is one audio frame in the target audio.
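For concreteness, the per-frame pronunciation information described in step 101 could be represented as a simple record. The following is a minimal Python sketch; the class and field names are illustrative assumptions rather than part of the patent:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PronunciationInfo:
    """One entry per audio frame of the target audio (field names are illustrative)."""
    pitch_hz: float               # pitch of the corresponding audio frame; 0 for unpitched frames
    target_phoneme: str           # content of the target phoneme for this frame, e.g. "i"
    prev_phoneme: Optional[str]   # content of the previous phoneme, or None if there is none
    next_phoneme: Optional[str]   # content of the next phoneme, or None if there is none
    pre_liaison: bool             # liaison between the target phoneme and its previous phoneme?
    post_liaison: bool            # liaison between the target phoneme and its next phoneme?
```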
Step 102: input the plurality of pieces of pronunciation information into the audio synthesis model to obtain the target audio output by the audio synthesis model.
In summary, in the audio generation method provided by the embodiments of the application, the pronunciation information input into the audio synthesis model includes a liaison indicator, which indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes. Because the liaison state of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the liaison that should appear, improving the smoothness of the sound at liaison positions. The change process of the human vocal tract can therefore be effectively reflected, improving the quality of the output audio.
The embodiment of the present application provides another audio generation method, which may be performed by an audio generation apparatus, where the audio generation apparatus may be a terminal or a server, and the terminal may be a display, a computer, a smart phone, a tablet computer, a laptop computer, and the like. The server may be a single server or a server cluster consisting of several servers. The method relates to a model training process and a model using process, as shown in fig. 2, fig. 2 is a flow chart of the audio generating method, the method comprising:
Step 201: analyze the sample audio to obtain a plurality of pieces of sample pronunciation information.
The sample audio may be one or more pieces of pre-recorded, designated audio, which may be song audio or other audio including a human voice, such as commentary or musical performance audio.
The sample audio may include a plurality of audio frames, the plurality of audio frames corresponding one-to-one to a plurality of pieces of sample pronunciation information, and each piece of sample pronunciation information representing the audio features of the corresponding audio frame. The plurality of pieces of sample pronunciation information include at least one piece of second pronunciation information, and each piece of second pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator. The phonemes adjacent to any target phoneme include the previous phoneme and the next phoneme of that target phoneme; the previous phoneme and the next phoneme are generally different from the target phoneme. Taking Chinese pronunciation rules as an example, the phonemes included in "hello" (Chinese "你好") are, in order, "n, i, h, ao"; for the phoneme "i" (a final), the previous phoneme is the initial "n" and the next phoneme is the initial "h". The audio frame corresponding to each piece of the plurality of pieces of sample pronunciation information is one audio frame in the sample audio, and the speech content of the corresponding audio frame includes the content of the corresponding phoneme.
Optionally, the analyzing the sample audio to obtain a plurality of pieces of sample pronunciation information may include:
Step A1: obtain the pitch of each audio frame in the sample audio.
For example, specified software may be used to identify the pitch of each audio frame in the sample audio. In silent segments, unvoiced segments, and the transient phoneme-transition regions where no liaison occurs, the human vocal cords do not vibrate, the audio has no periodicity, and no pitch can be extracted; in voiced segments and in liaison transition regions (i.e., the region from one of two phonemes with liaison to the other), the vocal cords vibrate continuously, the audio is periodic, and a pitch can be extracted. The pitch may be recorded as a sequence of pitch values or in the form of a pitch chart.
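As an illustration of step A1, the frame-level pitch could be extracted with an off-the-shelf pitch tracker. The sketch below uses librosa's pYIN implementation purely as an assumed example; the patent does not name any particular software:

```python
import librosa
import numpy as np

def frame_pitches(path: str, sr: int = 16000, hop_ms: int = 10) -> np.ndarray:
    """Return one pitch value per hop_ms frame; 0.0 marks unpitched (silent/unvoiced) frames."""
    y, sr = librosa.load(path, sr=sr)
    hop = sr * hop_ms // 1000
    f0, voiced, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
        hop_length=hop,
    )
    return np.where(voiced, f0, 0.0)  # pitch > 0 only where the vocal cords vibrate
```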
Step A2: detect whether there is liaison between each phoneme in the sample audio and its adjacent phonemes to obtain a liaison detection result.
There are various ways to detect whether there is liaison between each phoneme in the sample audio and its adjacent phonemes. The embodiments of the application are described using the following two alternatives as examples.
In a first alternative, whether there is liaison between each phoneme and its adjacent phonemes is determined by detecting whether the audio frames adjacent to each phoneme boundary in the sample audio are all pitched frames, where a pitched frame is an audio frame whose pitch is greater than 0.
In the embodiments of the application, the set of audio frames formed by any phoneme during pronunciation is the set of audio frames corresponding to that phoneme. In the following embodiments, for the reader's convenience, the set of audio frames formed by any phoneme during pronunciation in the sample audio is called the sample audio frame set corresponding to that phoneme, and the set of audio frames formed by any phoneme during pronunciation in the target audio is called the target audio frame set corresponding to that phoneme.
For each phoneme in the sample audio, it is detected whether the M audio frames immediately before the start point of the sample audio frame set corresponding to that phoneme and the N audio frames immediately after it (i.e., M + N consecutive audio frames) are all pitched frames, where N and M are both positive integers. When, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the start point of the sample audio frame set corresponding to any phoneme are all pitched frames, it is determined that the phoneme has pre-liaison. When any unpitched frame exists among the M audio frames immediately before and the N audio frames immediately after that start point, it is determined that the phoneme has no pre-liaison, where an unpitched frame is an audio frame whose pitch is equal to 0. The sample audio frame set corresponding to any phoneme is the set of audio frames formed by the pronunciation of that phoneme, that is, the set of one or more consecutive audio frames produced while that phoneme is pronounced. For example, suppose the initial "n" is short and its pronunciation lasts only 70 ms (milliseconds), and the duration of one audio frame is 10 ms; then the sample audio frame set corresponding to the initial "n" contains 7 audio frames, and the speech content of each of these audio frames contains the phoneme "n". As another example, suppose the final "i" is longer and lasts 300 ms; then the audio frame set corresponding to the final "i" contains 30 audio frames, and the speech content of each of these audio frames contains the phoneme "i".
For each phoneme in the sample audio, it is likewise detected whether the M audio frames immediately before the end point of the sample audio frame set corresponding to that phoneme and the N audio frames immediately after it (i.e., M + N consecutive audio frames) are all pitched frames. When, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the end point of the sample audio frame set corresponding to any phoneme are all pitched frames, it is determined that the phoneme has post-liaison. When any unpitched frame exists among the M audio frames immediately before and the N audio frames immediately after that end point, it is determined that the phoneme has no post-liaison. M and N may be the same or different; for example, M and N each take a value in the range of 1 to 5.
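A minimal sketch of the boundary check described in the two paragraphs above, assuming that the frame-level pitches and the frame-aligned phoneme boundaries are already known (function and variable names are illustrative):

```python
def is_pitched(pitches, i):
    """A pitched frame has pitch > 0; out-of-range indices count as unpitched."""
    return 0 <= i < len(pitches) and pitches[i] > 0

def all_pitched_around(pitches, boundary, m=3, n=3):
    """True if the M frames before and the N frames after a phoneme boundary are all pitched."""
    before = all(is_pitched(pitches, boundary - k) for k in range(1, m + 1))
    after = all(is_pitched(pitches, boundary + k) for k in range(n))
    return before and after

def liaison_flags(pitches, start, end, m=3, n=3):
    """Pre-/post-liaison flags for a phoneme whose sample audio frame set is frames [start, end)."""
    return all_pitched_around(pitches, start, m, n), all_pitched_around(pitches, end, m, n)
```

With M = N = 3, a call such as `liaison_flags(pitches, start, end)` reproduces the check described above for a single phoneme.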
In one optional example, the start point and the end point of the sample audio frame set corresponding to each phoneme may be represented by the start time and the end time of that sample audio frame set in the audio, for example a start time of 9:00 and an end time of 9:02. In another optional example, each audio frame in the sample audio is assigned a sequence number that identifies the position of the corresponding audio frame in the sample audio, and the start point and the end point of the sample audio frame set corresponding to each phoneme may be represented by the sequence number of the first audio frame and the sequence number of the last audio frame of that sample audio frame set, respectively. The embodiments of the application do not limit the way the sample audio frame set is represented.
For each phoneme, the start point of the first audio frame of its sample audio frame set is the front boundary point of the phoneme, and the end point of the last audio frame of its sample audio frame set is the rear boundary point of the phoneme. The foregoing step A2 essentially queries whether the M audio frames before and the N audio frames after the front boundary point of each phoneme are all pitched frames, and whether the M audio frames before and the N audio frames after the rear boundary point of each phoneme are all pitched frames, so as to determine whether there is liaison between each phoneme and its adjacent phonemes. That is, for each boundary point of each phoneme, it is queried whether the M audio frames before the boundary point and the N audio frames after it are all pitched frames. With this liaison detection method, the check performed at every phoneme boundary point is the same. Moreover, the influence of errors in the determined sample audio frame set on the liaison detection result is effectively avoided, so the detected liaison state is more accurate.
It should be noted that, when performing liaison detection for each phoneme, the foregoing liaison detection process may be performed by traversing all sample audio frame sets in the sample audio in order and skipping the other, unrelated audio frames, or by traversing all audio frames in the sample audio directly and performing the foregoing liaison detection process at the sample audio frame set corresponding to each phoneme; the embodiments of the application do not limit this.
For example, suppose phonemes are divided according to Chinese pronunciation rules, M = N = 3, the text content of the sample audio is "we are all the same" (Chinese "我们都一样"), and the included phonemes are, in order, "w, o, m, en, d, ou, y, i, y, ang". For each phoneme in the sample audio, it is detected whether the 3 audio frames before and the 3 audio frames after the start point of the sample audio frame set corresponding to that phoneme (i.e., 6 adjacent audio frames) are all pitched frames, and whether the 3 audio frames before and the 3 audio frames after the end point of that sample audio frame set are all pitched frames. For the phoneme "i" (a final), if the 3 audio frames before and the 3 audio frames after the start point of its sample audio frame set are all detected to be pitched frames, and the 3 audio frames before and the 3 audio frames after the end point of its sample audio frame set are all pitched frames, then the phoneme "i" has both pre-liaison and post-liaison.
It should be noted that the sample audio frame set corresponding to each phoneme in the sample audio is known. In one optional way, the sample audio frame set corresponding to each phoneme may be manually calibrated in advance; in another optional way, the sample audio frame set corresponding to each phoneme may be identified by audio recognition software; in yet another optional way, the sample audio is pre-generated audio in which the content of each phoneme is known, such as a song with lyrics downloaded from a network, and the sample audio frame set corresponding to each phoneme is calibrated when the sample audio is obtained. The embodiments of the application do not limit the way the sample audio frame set corresponding to each phoneme is obtained.
In a second alternative, whether there is liaison between each phoneme and its adjacent phonemes is determined through manual calibration.
As in step A1, the pitch of each audio frame may be recorded as a sequence of pitch values or in the form of a pitch chart. The audio generation apparatus may present the pitch of the sample audio together with the sequence number (or icon) of each audio frame in the recorded form. A worker can, by manual marking, mark the audio frames in which a phoneme with pre-liaison and/or a phoneme with post-liaison is located. Accordingly, the audio generation apparatus receives the marking instruction and determines, based on the marking instruction, whether there is liaison between the phoneme in each audio frame and its adjacent phonemes.
It should be noted that the aforementioned liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes. The liaison indicator may be implemented in a variety of ways; the embodiments of the application are described below by way of examples.
In a first optional implementation, the liaison indicator includes a pre-liaison indicator and a post-liaison indicator. The pre-liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the post-liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme. The pre-liaison indicator and the post-liaison indicator may each consist of one or more characters. The character may be a binary character such as 0 or 1; for example, 0 may indicate that liaison is present and 1 may indicate that liaison is absent. The character may also be another type of character, such as a letter, which the embodiments of the application do not limit. The pre-liaison indicator and the post-liaison indicator may each occupy one field in the pronunciation information, so together they occupy two fields.
In a second optional implementation, the liaison indicator includes a single indicator that indicates both whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme. This liaison indicator may consist of one or more characters. The characters may be binary characters; for example, the liaison indicator may take the values 00, 01, 10, and 11, where 00 may indicate that the target phoneme has no liaison with either its adjacent previous phoneme or its adjacent next phoneme, 01 may indicate liaison with the previous phoneme but not with the next phoneme, 10 may indicate liaison with the next phoneme but not with the previous phoneme, and 11 may indicate liaison with both the previous and the next phoneme. The characters may also be other types of characters, such as letters, which the embodiments of the application do not limit. This single indicator may occupy one field in the pronunciation information.
In this second optional implementation, one indicator can simultaneously indicate what the pre-liaison indicator and the post-liaison indicator would indicate, which reduces field occupation and improves the operating efficiency of the subsequent model.
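A sketch of the second implementation's single two-character indicator, following the 00/01/10/11 correspondence stated above; the function and table names are illustrative:

```python
LIAISON_CODES = {
    (False, False): "00",  # no liaison on either side
    (True, False): "01",   # liaison with the previous phoneme only
    (False, True): "10",   # liaison with the next phoneme only
    (True, True): "11",    # liaison on both sides
}
DECODE_LIAISON = {code: flags for flags, code in LIAISON_CODES.items()}

def encode_liaison(pre_liaison: bool, post_liaison: bool) -> str:
    """Pack both liaison flags into the single two-character field described above."""
    return LIAISON_CODES[(pre_liaison, post_liaison)]
```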
When the liaison indicator is set using the indication methods provided by the first and second optional implementations, every piece of pronunciation information corresponding to the audio frames in the sample audio is second pronunciation information; that is, every piece of pronunciation information includes a liaison indicator, so the liaison situation can be indicated effectively.
In practical implementations, when the target phoneme in the pronunciation information has no liaison with any of its adjacent phonemes, the pronunciation information may omit the liaison indicator; when the target phoneme in the pronunciation information has liaison with an adjacent phoneme, a liaison indicator may be carried. That is, the sample pronunciation information corresponding to the audio frames of the sample audio then includes two types of pronunciation information, namely the second pronunciation information and other pronunciation information, and the liaison indicator carried by the second pronunciation information may take the forms described in the first and second optional implementations. The content of the other pronunciation information may be obtained with simple modifications based on the content of the second pronunciation information or with reference to conventional pronunciation information. When the plurality of pieces of sample pronunciation information include both types of pronunciation information, compared with the case where all sample pronunciation information is second pronunciation information, the number of pieces of pronunciation information carrying a liaison indicator is reduced, which reduces field occupation and improves the operating efficiency of the subsequent model.
Step A3: generate the plurality of pieces of sample pronunciation information based on the pitch of each audio frame and the liaison detection result.
The audio generation apparatus may generate, for all audio frames, the corresponding plurality of pieces of sample pronunciation information based on the pitch of each audio frame and the liaison detection result.
It should be noted that other information describing the corresponding audio frame may also be added to the aforementioned sample pronunciation information according to the actual situation. Illustratively, the sample pronunciation information further includes position information of the corresponding audio frame, the position information describing the position of the corresponding audio frame within a sample audio frame set, where the sample audio frame set is the set of audio frames corresponding to the target phoneme of the corresponding audio frame. For its explanation, reference may be made to the explanation in the aforementioned step A2.
For example, the position information of the corresponding audio frame may be represented by the segment of the sample audio frame set in which the audio frame lies. Optionally, the sample audio frame set may be divided into w segments according to a preset segmentation rule (for example, an equal-division rule), where w is a positive integer, and the segment position is one of the w segments. Optionally, w is a fixed value and w > 1. For example, w = 3, that is, the sample audio frame set is divided into 3 segments, and according to the equal-division rule the 3 segments are an opening segment, a steady segment, and a closing segment of equal (or similar) duration. If the audio frame corresponding to the sample pronunciation information lies in the opening segment, the position information of that audio frame indicates the opening segment.
For example, the aforementioned position information may identify the segment position using one or more characters. The characters may be binary characters; for example, the position information may take the values 00, 01, and 10, where 00 may represent the opening segment, 01 the steady segment, and 10 the closing segment. The characters may also be other types of characters, such as letters, which the embodiments of the application do not limit. The position information may occupy one field in the pronunciation information.
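A sketch of the position-information labelling described above, assuming w = 3 equal segments coded 00 (opening), 01 (steady), and 10 (closing):

```python
def position_code(frame_index: int, start: int, end: int) -> str:
    """Code the segment in which `frame_index` lies within a phoneme's frame set [start, end)."""
    codes = ["00", "01", "10"]                      # opening, steady, closing segments
    length = max(end - start, 1)
    segment = min(2, (frame_index - start) * 3 // length)
    return codes[segment]
```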
Step 202: perform model training based on the plurality of pieces of sample pronunciation information to obtain the audio synthesis model.
Because the sample audio is known, the sample audio can be used as the label and the plurality of pieces of sample pronunciation information as the input information, and model training is carried out until the loss value of the preset loss function converges to a target range, yielding the audio synthesis model.
Performing model training with the plurality of pieces of sample pronunciation information effectively helps the audio synthesis model learn the different pronunciation states that phonemes form in the liaison state and in the non-liaison state, which effectively improves the pronunciation smoothness at liaison positions in the audio generated by the trained audio synthesis model.
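The training step itself is model-agnostic. The following is a purely illustrative PyTorch-style sketch; the patent only requires training until the loss value of a preset loss function converges to a target range, so the model class, feature layout, and choice of loss here are assumptions:

```python
import torch
from torch import nn

def train(model: nn.Module, cond_frames: torch.Tensor, target_audio: torch.Tensor,
          target_loss: float = 1e-3, max_steps: int = 100_000) -> nn.Module:
    """cond_frames: (num_frames, feat_dim) encoded sample pronunciation information;
    target_audio: the known sample audio used as the label."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()                 # placeholder loss; the patent does not fix one
    for _ in range(max_steps):
        optimizer.zero_grad()
        prediction = model(cond_frames)    # synthesize audio from the conditioning frames
        loss = loss_fn(prediction, target_audio)
        loss.backward()
        optimizer.step()
        if loss.item() < target_loss:      # "converges to a target range"
            break
    return model
```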
Step 203: obtain a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information include at least one piece of first pronunciation information, and each piece of first pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator.
For the explanation of the adjacent phonemes and the liaison indicator, reference may be made to the explanation in the aforementioned step 201. The target audio to be synthesized subsequently may include a plurality of audio frames, the plurality of audio frames corresponding one-to-one to the plurality of pieces of pronunciation information, with each piece of pronunciation information representing the audio features of the corresponding audio frame. One audio frame can be generated based on each piece of pronunciation information. The audio frame corresponding to each piece of pronunciation information is one of the audio frames formed by the corresponding phoneme during pronunciation, and the speech content of the corresponding audio frame contains the content of the corresponding phoneme.
In the embodiment of the present application, the process of acquiring multiple pieces of pronunciation information may have multiple implementation manners:
In a first implementation, the audio generation apparatus may receive the pronunciation information of the plurality of phonemes. Alternatively, initial audio may be used: the initial audio may be audio recorded by the user or audio acquired by other means, for example audio downloaded from a network. The user can obtain different types of initial audio based on their own requirements, so the target audio subsequently generated by the method can effectively meet the user's requirements, enabling customized and personalized audio synthesis and improving the user experience.
For example, when the audio generation apparatus is a mobile phone, a notebook computer, a desktop computer, or the like, a user (or a programmer) may input the pronunciation information of the plurality of phonemes through an input/output (I/O) device such as a keyboard or a touch screen, and accordingly the audio generation apparatus receives the pronunciation information of the plurality of phonemes. Optionally, the process in which the audio generation apparatus receives the pronunciation information of the plurality of phonemes may be illustrated by the following two optional examples. In a first optional example, the audio generation apparatus receives first information to be edited, the first information to be edited including: the pitch of each target audio frame to be generated, the content of the target phoneme corresponding to each target audio frame, the content of the phonemes adjacent to each target phoneme, and the liaison indicator corresponding to each target phoneme; the audio generation apparatus encodes the received first information to be edited to obtain the pronunciation information of the plurality of phonemes. For example, the audio generation apparatus may encode the first information to be edited using a one-hot encoding method or an embedding encoding method. In a second optional example, the audio generation apparatus may directly receive the pronunciation information of the plurality of phonemes, where the pronunciation information of each phoneme has already been encoded using a one-hot encoding method, an embedding encoding method, or the like.
In a second implementation, the audio generation apparatus may receive at least one piece of initial audio and analyze the at least one piece of initial audio to obtain the pronunciation information of the plurality of phonemes. The analysis process for each piece of initial audio may refer to the process of analyzing the sample audio in step 201. Optionally, the process of analyzing the at least one piece of initial audio to obtain the pronunciation information of the plurality of phonemes may include: analyzing the at least one piece of initial audio to obtain second information to be edited, the second information to be edited including: the pitch of each target audio frame to be generated, the content of the target phoneme corresponding to each target audio frame, the content of the phonemes adjacent to each target phoneme, and the liaison indicator corresponding to each target phoneme; the audio generation apparatus then encodes the second information to be edited to obtain the pronunciation information of the plurality of phonemes. For example, the audio generation apparatus may encode the second information to be edited using a one-hot encoding method or an embedding encoding method.
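As an illustration of the encoding step, the phoneme fields of the information to be edited could be one-hot encoded and concatenated with the numeric fields into one feature vector per frame. This sketch reuses the illustrative PronunciationInfo record from earlier; the phoneme inventory is an assumption:

```python
import numpy as np

PHONEMES = ["null", "n", "i", "h", "ao", "y", "ang"]   # illustrative inventory only

def one_hot(phoneme: str) -> np.ndarray:
    vec = np.zeros(len(PHONEMES), dtype=np.float32)
    vec[PHONEMES.index(phoneme)] = 1.0
    return vec

def encode_frame(info: "PronunciationInfo") -> np.ndarray:
    """Concatenate pitch, one-hot phoneme contents, and the liaison flags for one audio frame."""
    return np.concatenate([
        np.array([info.pitch_hz], dtype=np.float32),
        one_hot(info.target_phoneme),
        one_hot(info.prev_phoneme or "null"),
        one_hot(info.next_phoneme or "null"),
        np.array([float(info.pre_liaison), float(info.post_liaison)], dtype=np.float32),
    ])
```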
In practical implementation of the embodiment of the application, the audio generating device may receive a plurality of initial audios, analyze the plurality of initial audios, and obtain pronunciation information of the plurality of phonemes, so that in a subsequent process, the synthesized target audio is equivalent to an audio obtained by combining the plurality of initial audios.
Referring to step 201, other information describing the corresponding audio frame may also be added to the aforementioned sample pronunciation information according to the actual situation. Accordingly, the pronunciation information obtained in step 203 is consistent in content with the sample pronunciation information, and other information describing the corresponding audio frame may likewise be added. Illustratively, the pronunciation information further includes position information of the corresponding audio frame, the position information describing the position of the corresponding audio frame (i.e., the audio frame to be generated) within the audio frame set corresponding to the corresponding phoneme. Suppose the phoneme corresponding to the corresponding audio frame is a first phoneme; the audio frame set corresponding to the first phoneme is then the target audio frame set, that is, the set of audio frames formed by the first phoneme during pronunciation in the target audio. For the explanation of the position information, reference may be made to the aforementioned step 201, which the embodiments of the application do not limit.
For the reader's convenience, Table 1 schematically shows the contents of a plurality of pieces of pronunciation information whose text content is the Chinese word "一样" ("the same"), with phonemes divided according to Chinese pronunciation rules. In Table 1, the position information takes the three values 00, 01, and 10, where 00 indicates the opening segment, 01 the steady segment, and 10 the closing segment. The liaison indicator comprises a pre-liaison indicator and a post-liaison indicator, where 0 indicates that liaison is present and 1 indicates that liaison is absent; "null" means absent. Taking the pronunciation information whose corresponding audio frame has sequence number 4 as an example, its contents are: a pitch of 150 Hz; target phoneme: the final "i" (indicating that the speech content of the audio frame with sequence number 4 includes the phoneme "i"); previous phoneme: the initial "y"; next phoneme: the initial "y"; pre-liaison indicator 0 (pre-liaison present); post-liaison indicator 0 (post-liaison present); and position information 00 (located in the opening segment). The other pieces of pronunciation information can be interpreted in the same way and are not described in detail in the embodiments of the application.
TABLE 1
[Table 1 is reproduced as an image in the original publication; it lists, for each audio-frame sequence number, the pitch, target phoneme, previous phoneme, next phoneme, pre-liaison indicator, post-liaison indicator, and position information, as in the example described above.]
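Using the illustrative record sketched earlier, the entry with sequence number 4 described above would look roughly like the following (values taken from the description of Table 1; the class and helper names are the assumed ones from the earlier sketches):

```python
frame_4 = PronunciationInfo(
    pitch_hz=150.0,        # pitch of 150 Hz
    target_phoneme="i",    # the final "i"
    prev_phoneme="y",      # preceding initial "y"
    next_phoneme="y",      # following initial "y"
    pre_liaison=True,      # pre-liaison indicator 0 = liaison present, per Table 1's convention
    post_liaison=True,     # post-liaison indicator 0 = liaison present
)
# Its position information would be "00" (opening segment), e.g. as returned by position_code().
```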
Step 204: input the plurality of pieces of pronunciation information into the audio synthesis model to obtain the target audio output by the audio synthesis model.
The audio generation apparatus inputs the plurality of pieces of pronunciation information into the audio synthesis model, and the audio output by the audio synthesis model is the target audio. In the embodiments of the application, the audio synthesis model is a model for audio synthesis; audio such as songs can be synthesized by the audio synthesis model. The audio synthesis model is typically a deep learning model; for example, the audio synthesis model may be a WaveNet model or an NPSS model.
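Inference is then a single pass of the conditioning sequence through the trained model. A minimal sketch, where `model` stands for the trained WaveNet- or NPSS-style synthesizer and `encode_frame` for the illustrative feature encoding above:

```python
import numpy as np
import torch

def synthesize(model, pronunciation_infos):
    """Stack the per-frame features and let the trained model emit the target audio."""
    features = torch.from_numpy(np.stack([encode_frame(p) for p in pronunciation_infos]))
    with torch.no_grad():
        return model(features)   # one output audio frame per piece of pronunciation information
```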
Steps 201 to 202 constitute the model training process, and steps 203 to 204 constitute the model using process. In the audio generation method provided by the embodiments of the application, the pronunciation information input into the audio synthesis model includes a liaison indicator, which indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes. Because the liaison state of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the liaison that should appear, improving the smoothness of the sound at liaison positions. Thus, in the embodiments of the application, the pronunciation information is extended with information on whether liaison exists before and after the target phoneme, which effectively helps the audio synthesis model learn the composition of the pronunciation states with and without liaison, effectively improves pronunciation smoothness at liaison positions, effectively reflects the change process of the human vocal tract, and improves the quality of the output audio.
It should be noted that the foregoing audio synthesis method may be executed by a terminal, by a server, or by both in cooperation. In the first case, when the audio synthesis method is executed by a terminal, the audio synthesis apparatus is the terminal, and steps 201 to 204 are executed by the terminal. In the second case, when the audio synthesis method is executed by a server, the audio synthesis apparatus is the server, and steps 201 to 204 are executed by the server; the sample audio in step 201 may be sent to the server by a terminal or obtained by the server itself; in the first implementation of step 203, the plurality of pieces of pronunciation information may be sent to the server by the terminal or obtained by the server itself; in the second implementation of step 203, the at least one piece of initial audio may be sent to the server by the terminal or obtained by the server itself; and after step 204, the server may transmit the generated target audio to the terminal. In the third case, when the audio synthesis method is executed by a terminal and a server in cooperation, the audio synthesis apparatus is regarded as a system consisting of the terminal and the server, steps 201 to 202 are executed by the server, steps 203 to 204 are executed by the terminal, and after step 202 the server transmits the trained audio synthesis model to the terminal.
The order of the steps of the audio generation method provided in the embodiments of the application may be appropriately adjusted, and steps may be added or removed according to the situation. Any method that can be easily conceived by a person skilled in the art within the technical scope disclosed in the application shall be covered by the protection scope of the application, and is therefore not described in detail.
An embodiment of the application provides an audio generation apparatus 30, as shown in Fig. 3, including:
an obtaining module 301, configured to obtain a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information include at least one piece of first pronunciation information, and each piece of first pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the phonemes adjacent to any target phoneme include the previous phoneme and the next phoneme of that target phoneme, the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes, and the audio frame corresponding to each piece of the plurality of pieces of pronunciation information is one audio frame in the target audio;
a processing module 302, configured to input the plurality of pieces of pronunciation information into the audio synthesis model to obtain the target audio output by the audio synthesis model.
In the audio generation apparatus provided by the embodiments of the application, the pronunciation information input into the audio synthesis model includes a liaison indicator, which indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes. Because the liaison state of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the liaison that should appear, improving the smoothness of the sound at liaison positions. The change process of the human vocal tract can therefore be effectively reflected, improving the quality of the output audio.
Optionally, as shown in fig. 4, the apparatus 30 further includes:
an analysis module 303, configured to analyze a sample audio before the plurality of pronunciation information is obtained, to obtain a plurality of sample pronunciation information, where the plurality of sample pronunciation information includes at least one second pronunciation information, and each second pronunciation information includes: a pitch of a corresponding audio frame, content of a target phoneme corresponding to the audio frame, content of the phonemes adjacent to the target phoneme, and a hyphen indicator, where the audio frame corresponding to each sample pronunciation information in the plurality of sample pronunciation information is one audio frame in the sample audio;
and the training module 304, configured to perform model training based on the plurality of sample pronunciation information to obtain the audio synthesis model.
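For concreteness, the sketch below pairs each sample pronunciation-information record with the sample audio frame it corresponds to and fits a model on those pairs, in the spirit of the analysis and training modules above. The linear least-squares map is only a stand-in for whatever acoustic model is actually trained, and the feature layout follows the illustrative to_feature function shown earlier.

    import numpy as np

    def train_audio_synthesis_model(sample_features, sample_frames):
        """Fit a toy frame-level synthesis map by linear least squares.

        sample_features: (num_frames, feature_dim) array built from the sample
            pronunciation information (pitch, phoneme ids, hyphen flags).
        sample_frames:   (num_frames, frame_size) array of audio frames cut from
            the sample audio, one frame per pronunciation-information record.
        """
        X = np.asarray(sample_features, dtype=float)
        Y = np.asarray(sample_frames, dtype=float)
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # W maps a feature vector to one frame
        return W

    def synthesize_frames(W, features):
        """Apply the trained map to new pronunciation information, frame by frame."""
        return np.asarray(features, dtype=float) @ W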
Optionally, as shown in fig. 5, the analysis module 303 includes:
an obtaining submodule 3031, configured to obtain a pitch of each audio frame in the sample audio;
a detection submodule 3032, configured to detect whether a hyphen exists between each phoneme in the sample audio and its adjacent phonemes, to obtain a hyphen detection result;
a generating submodule 3033, configured to generate the plurality of sample pronunciation information based on the pitch of each audio frame and the hyphen detection result.
Optionally, the detection submodule 3032 is configured to:
determine that a preceding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the start of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after that start are all pitch frames, where a pitch frame is an audio frame whose pitch is greater than 0, N and M are positive integers, and the sample audio frame set corresponding to the phoneme is the set of audio frames formed while the phoneme is pronounced;
determine that a succeeding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the end of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after that end are all pitch frames.
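A minimal sketch of this boundary test is given below; the exclusive-end span convention, the function name, and the parameter names are assumptions made for the example.

    def detect_hyphens(pitch, phoneme_spans, m, n):
        """Detect preceding/succeeding hyphens for each phoneme of a sample audio.

        pitch:         per-frame pitch values of the sample audio (0 means unvoiced).
        phoneme_spans: one (start, end) pair of frame indices per phoneme, i.e. the
                       sample audio frame set formed while that phoneme is pronounced
                       (end is exclusive in this sketch).
        m, n:          number of frames checked before / after a span boundary.
        Returns one (has_preceding_hyphen, has_succeeding_hyphen) pair per phoneme.
        """
        def all_pitch_frames(lo, hi):
            # A pitch frame is an audio frame whose pitch is greater than 0; the
            # whole window must also lie inside the sample audio.
            return lo >= 0 and hi <= len(pitch) and all(p > 0 for p in pitch[lo:hi])

        results = []
        for start, end in phoneme_spans:
            preceding = all_pitch_frames(start - m, start) and all_pitch_frames(start, start + n)
            succeeding = all_pitch_frames(end - m, end) and all_pitch_frames(end, end + n)
            results.append((preceding, succeeding))
        return results

For instance, with pitch = [0, 0, 120, 121, 119, 118, 0], a phoneme span of (3, 5), and m = n = 1, the frames around both the start and the end of the span have pitch greater than 0, so both a preceding and a succeeding hyphen are detected.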
Optionally, the hyphen indicator includes a preceding hyphen indicator and a succeeding hyphen indicator, where the preceding hyphen indicator is used to indicate whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the succeeding hyphen indicator is used to indicate whether a hyphen exists between the target phoneme and its adjacent next phoneme;
alternatively, the hyphen indicator is a single indicator used to indicate both whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether a hyphen exists between the target phoneme and its adjacent next phoneme.
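The two indicator layouts can be illustrated with a short sketch; the particular numeric coding of the single combined indicator is an assumption of the example, since the text above only requires that both relations be expressed.

    def encode_separate(pre_hyphen: bool, post_hyphen: bool):
        """First option: two indicators, one for the previous and one for the next phoneme."""
        return [int(pre_hyphen), int(post_hyphen)]

    def encode_combined(pre_hyphen: bool, post_hyphen: bool):
        """Second option: a single indicator carrying both relations (values 0-3 here)."""
        return (int(pre_hyphen) << 1) | int(post_hyphen)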
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as a memory comprising instructions, executable by a processor of a computing device to perform the audio generation method illustrated in the various embodiments of the present application is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
An embodiment of the present application provides a computing device, which includes a processor and a memory;
the memory stores computer instructions; the processor executes the computer instructions stored by the memory to cause the computing device to perform any of the audio generation methods provided by the embodiments of the present application.
In this embodiment of the present application, the foregoing computing device may be a terminal. Fig. 6 shows a block diagram of a terminal 600 according to an exemplary embodiment of the present application. The terminal 600 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that needs to be displayed on the display screen. In some embodiments, the processor 601 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include a high-speed random access memory and a non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 602 is used to store at least one instruction, which is executed by the processor 601 to implement the audio generation method provided by the method embodiments of the present application.
In some embodiments, the terminal 600 may optionally further include: a peripheral interface 603 and at least one peripheral. The processor 601, the memory 602, and the peripheral interface 603 may be connected by buses or signal lines. Each peripheral may be connected to the peripheral interface 603 via a bus, a signal line, or a circuit board. Specifically, the peripheral includes: at least one of a radio frequency circuit 604, a touch display screen 605, a camera assembly 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one I/O related peripheral to the processor 601 and the memory 602. In some embodiments, the processor 601, the memory 602, and the peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 604 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 604 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 604 may further include an NFC (Near Field Communication) related circuit, which is not limited in this application.
The display screen 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the capability to collect touch signals on or above the surface of the display screen 605. The touch signal may be input to the processor 601 as a control signal for processing. In this case, the display screen 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 605, disposed on the front panel of the terminal 600; in other embodiments, there may be at least two display screens 605, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display screen 605 may be a flexible display disposed on a curved surface or a folded surface of the terminal 600. The display screen 605 may even be arranged in a non-rectangular irregular pattern, that is, a shaped screen. The display screen 605 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. The dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 607 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 601 for processing, or input them to the radio frequency circuit 604 to implement voice communication. For stereo sound collection or noise reduction, there may be multiple microphones disposed at different parts of the terminal 600. The microphone may also be an array microphone or an omnidirectional microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a traditional film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal into a sound wave audible to humans, or convert an electrical signal into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 607 may also include a headphone jack.
The positioning component 608 is used to locate the current geographic position of the terminal 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 609 is used to supply power to the various components in the terminal 600. The power supply 609 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast-charging technology.
In some embodiments, the terminal 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 601 may control the touch display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used to collect motion data of a game or a user.
The gyro sensor 612 may detect the body direction and rotation angle of the terminal 600, and the gyro sensor 612 may cooperate with the acceleration sensor 611 to collect a 3D motion of the user on the terminal 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a tilt operation of the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 613 may be disposed on a side frame of the terminal 600 and/or on a lower layer of the touch display screen 605. When the pressure sensor 613 is disposed on the side frame of the terminal 600, a holding signal of the user on the terminal 600 can be detected, and the processor 601 performs left-hand/right-hand recognition or a shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed on the lower layer of the touch display screen 605, the processor 601 controls an operability control on the UI according to a pressure operation of the user on the touch display screen 605. The operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 614 is used to collect a fingerprint of the user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the identity of the user is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 614 may be disposed on the front, back, or side of the terminal 600. When a physical button or a vendor logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical button or the vendor logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, the processor 601 may control the display brightness of the touch display screen 605 based on the ambient light intensity collected by the optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 605 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 605 is decreased. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
The proximity sensor 616, also known as a distance sensor, is typically disposed on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front of the terminal 600 gradually decreases, the processor 601 controls the touch display screen 605 to switch from a screen-on state to a screen-off state; when the proximity sensor 616 detects that the distance between the user and the front of the terminal 600 gradually increases, the processor 601 controls the touch display screen 605 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the structure shown in fig. 6 does not constitute a limitation on the terminal 600, and the terminal may include more or fewer components than those shown, combine certain components, or use a different arrangement of components.
In this embodiment of the present application, the aforementioned computing device may be a server. Fig. 7 is a schematic structural diagram of a server according to an exemplary embodiment. The server 700 includes a central processing unit (CPU) 701, a system memory 704 including a random access memory (RAM) 702 and a read-only memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the central processing unit 701. The server 700 also includes a basic input/output system (I/O system) 706, which facilitates the transfer of information between devices within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.
The basic input/output system 706 includes a display 708 for displaying information and an input device 709, such as a mouse or a keyboard, for a user to input information. The display 708 and the input device 709 are both connected to the central processing unit 701 through an input/output controller 710 connected to the system bus 705. The basic input/output system 706 may also include the input/output controller 710 for receiving and processing input from a number of other devices such as a keyboard, a mouse, or an electronic stylus. Similarly, the input/output controller 710 may also provide output to a display screen, a printer, or another type of output device.
The mass storage device 707 is connected to the central processing unit 701 through a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and its associated computer-readable media provide non-volatile storage for the server 700. That is, the mass storage device 707 may include a computer-readable medium (not shown), such as a hard disk or a CD-ROM drive.
Without loss of generality, the computer-readable media may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid-state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 704 and the mass storage device 707 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 700 may also be connected to a remote computer on a network, such as the Internet, to run. That is, the server 700 may be connected to the network 712 through a network interface unit 711 connected to the system bus 705, or may be connected to another type of network or a remote computer system (not shown) using the network interface unit 711.
The memory further includes one or more programs, which are stored in the memory, and the central processing unit 701 implements the audio generation method provided by the embodiments of the present application by executing the one or more programs.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
In this application, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means two or more unless expressly limited otherwise. "A refers to B" means that A is the same as B or that A is a simple modification of B. The term "and/or" in this application merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may indicate that A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. A method of audio generation, comprising:
acquiring a plurality of pronunciation information;
inputting the plurality of pronunciation information into an audio synthesis model to obtain a target audio output by the audio synthesis model;
wherein the plurality of pronunciation information includes at least one first pronunciation information, and each of the first pronunciation information includes: a pitch of a corresponding audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme, and a hyphen indicator, wherein the adjacent phonemes of any target phoneme include a previous phoneme and a next phoneme of the target phoneme, the hyphen indicator is used for indicating whether a hyphen exists between the target phoneme and the adjacent phonemes in the pronunciation information, and the audio frame corresponding to each piece of pronunciation information in the plurality of pieces of pronunciation information is one audio frame in the target audio.
2. The method of claim 1, wherein prior to said obtaining the plurality of pronunciation information, the method further comprises:
analyzing the sample audio to obtain a plurality of sample pronunciation information, wherein the plurality of sample pronunciation information comprises at least one second pronunciation information, and each second pronunciation information comprises: the pitch of the corresponding audio frame, the content of a target phoneme corresponding to the corresponding audio frame, the content of a neighboring phoneme of the target phoneme, and a hyphen indicator, wherein the audio frame corresponding to each of the plurality of sample pronunciation information is one audio frame in the sample audio;
and performing model training based on the plurality of sample pronunciation information to obtain the audio synthesis model.
3. The method of claim 2, wherein analyzing the sample audio to obtain a plurality of sample pronunciation information comprises:
obtaining a pitch of each audio frame in the sample audio;
detecting whether a hyphen exists between each phoneme and an adjacent phoneme in the sample audio to obtain a hyphen detection result;
generating the plurality of sample pronunciation information based on the pitch of each audio frame and the hyphen detection result.
4. The method of claim 3, wherein the detecting whether a hyphen exists between each phoneme and adjacent phonemes in the sample audio to obtain a hyphen detection result comprises:
determining that a preceding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the start of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after the start are all pitch frames, wherein a pitch frame is an audio frame whose pitch is greater than 0, N and M are positive integers, and the sample audio frame set corresponding to the phoneme is the set of audio frames formed by the phoneme in the sample audio during pronunciation;
and determining that a succeeding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the end of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after the end are all pitch frames.
5. The method according to any one of claims 1 to 4, wherein the hyphen indicator includes a preceding hyphen indicator and a succeeding hyphen indicator, the preceding hyphen indicator being used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the succeeding hyphen indicator being used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent next phoneme;
or, the hyphen indicator is a single indicator used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent next phoneme.
6. An audio generation apparatus, comprising:
the acquisition module is used for acquiring a plurality of pronunciation information;
the processing module is used for inputting the plurality of pronunciation information into an audio synthesis model to obtain a target audio output by the audio synthesis model;
wherein the plurality of pronunciation information includes at least one first pronunciation information, and each of the first pronunciation information includes: a pitch of a corresponding audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme, and a hyphen indicator, wherein the adjacent phonemes of any target phoneme include a previous phoneme and a next phoneme of the target phoneme, the hyphen indicator is used for indicating whether a hyphen exists between the target phoneme and the adjacent phonemes in the pronunciation information, and the audio frame corresponding to each piece of pronunciation information in the plurality of pieces of pronunciation information is one audio frame in the target audio.
7. The apparatus of claim 6, further comprising:
an analysis module, configured to analyze the sample audio before the plurality of pronunciation information is obtained, to obtain a plurality of sample pronunciation information, wherein the plurality of sample pronunciation information includes at least one second pronunciation information, and each second pronunciation information includes: the pitch of the corresponding audio frame, the content of a target phoneme corresponding to the corresponding audio frame, the content of a neighboring phoneme of the target phoneme, and a hyphen indicator, wherein the audio frame corresponding to each of the plurality of sample pronunciation information is one audio frame in the sample audio;
and the training module is used for carrying out model training based on the plurality of sample pronunciation information to obtain the audio synthesis model.
8. The apparatus of claim 7, wherein the analysis module comprises:
an obtaining sub-module, configured to obtain a pitch of each audio frame in the sample audio;
the detection submodule is used for detecting whether a hyphen exists between each phoneme in the sample audio and the adjacent phoneme to obtain a hyphen detection result;
a generating sub-module for generating the plurality of sample pronunciation information based on the pitch of each audio frame and the hyphen detection result.
9. The apparatus of claim 8, wherein the detection submodule is configured to:
determining that a preceding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the start of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after the start are all pitch frames, wherein a pitch frame is an audio frame whose pitch is greater than 0, N and M are positive integers, and the sample audio frame set corresponding to the phoneme is the set of audio frames formed by the phoneme in the sample audio during pronunciation;
and determining that a succeeding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the end of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after the end are all pitch frames.
10. The apparatus according to any one of claims 6 to 9, wherein the hyphen indicator includes a preceding hyphen indicator and a succeeding hyphen indicator, the preceding hyphen indicator being used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the succeeding hyphen indicator being used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent next phoneme;
or, the hyphen indicator is a single indicator used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent next phoneme.
11. A computer-readable storage medium, in which a computer program is stored, which, when executed by a processor, causes the processor to implement the audio generation method according to any one of claims 1 to 5.
12. A computing device, wherein the computing device comprises a processor and a memory;
the memory stores computer instructions; the processor executes the computer instructions stored by the memory to cause the computing device to perform the audio generation method of any of claims 1 to 5.
CN201911267158.4A — Priority Date: 2019-12-11 — Filing Date: 2019-12-11 — Audio generation method, device, computer readable storage medium and computing equipment — Active — granted as CN111028823B

Priority Applications (1)

Application Number — Priority Date — Filing Date — Title
CN201911267158.4A (granted as CN111028823B) — 2019-12-11 — 2019-12-11 — Audio generation method, device, computer readable storage medium and computing equipment

Publications (2)

Publication Number — Publication Date
CN111028823A — 2020-04-17
CN111028823B — 2024-06-07

Family

ID=70208741

Family Applications (1)

Application Number — Title — Priority Date — Filing Date
CN201911267158.4A (Active; granted as CN111028823B) — Audio generation method, device, computer readable storage medium and computing equipment — 2019-12-11 — 2019-12-11

Country Status (1)

Country — Link
CN (1) — CN111028823B

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
CN113035228A* — 2021-03-23 — 2021-06-25 — Guangzhou Kugou Computer Technology Co Ltd — Acoustic feature extraction method, device, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
US4692941A (en)*1984-04-101987-09-08First ByteReal-time text-to-speech conversion system
CN1257271A (en)*1998-12-022000-06-21松下电器产业株式会社Continuous sound processor for Chinese phonetic systhesis
CN1267863A (en)*1999-03-222000-09-27Lg电子株式会社Image apparatus with education function and its controlling method
JP2001343987A (en)*2000-05-312001-12-14Sanyo Electric Co LtdMethod and device for voice synthesis
JP2002333896A (en)*2001-05-102002-11-22Matsushita Electric Ind Co Ltd Speech synthesis apparatus, speech synthesis system, and speech synthesis method
CN1455386A (en)*2002-11-012003-11-12中国科学院声学研究所Imbedded voice synthesis method and system
CN1938756A (en)*2004-03-052007-03-28莱塞克技术公司Prosodic speech text codes and their use in computerized speech systems
CN104464751A (en)*2014-11-212015-03-25科大讯飞股份有限公司Method and device for detecting pronunciation rhythm problem
CN104934028A (en)*2015-06-172015-09-23百度在线网络技术(北京)有限公司Depth neural network model training method and device used for speech synthesis
US20150371626A1 (en)*2014-06-192015-12-24Baidu Online Network Technology (Beijing) Co., LtdMethod and apparatus for speech synthesis based on large corpus


Also Published As

Publication number — Publication date
CN111028823B — 2024-06-07

Similar Documents

Publication — Title
CN110556127B — Method, device, equipment and medium for detecting voice recognition result
CN110933330A — Video dubbing method and device, computer equipment and computer-readable storage medium
CN110992927B — Audio generation method, device, computer readable storage medium and computing equipment
CN112735429B — Method for determining lyric timestamp information and training method of acoustic model
CN111564152B — Voice conversion method and device, electronic equipment and storage medium
CN111524501B — Voice playing method, device, computer equipment and computer readable storage medium
CN112116904B — Voice conversion method, device, equipment and storage medium
CN109003621B — Audio processing method and device and storage medium
CN110322760B — Voice data generation method, device, terminal and storage medium
CN111105788B — Sensitive word score detection method and device, electronic equipment and storage medium
CN111370025A — Audio recognition method and device and computer storage medium
CN110931048A — Voice endpoint detection method and device, computer equipment and storage medium
CN110600034B — Singing voice generation method, singing voice generation device, singing voice generation equipment and storage medium
CN114283825A — Voice processing method and device, electronic equipment and storage medium
CN111048109A — Acoustic feature determination method and apparatus, computer device, and storage medium
CN113362836A — Vocoder training method, terminal and storage medium
CN111081277B — Audio evaluation method, device, equipment and storage medium
CN111223475B — Voice data generation method and device, electronic equipment and storage medium
CN111428079B — Text content processing method, device, computer equipment and storage medium
CN115394285B — Voice cloning method, device, equipment and storage medium
CN109829067B — Audio data processing method and device, electronic equipment and storage medium
CN113920979B — Voice data acquisition method, device, equipment and computer readable storage medium
CN110337030B — Video playing method, device, terminal and computer readable storage medium
CN111028823B — Audio generation method, device, computer readable storage medium and computing equipment
CN115862586B — Method and device for training timbre feature extraction model and audio synthesis

Legal Events

Code — Title
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant
