
Audio generation method and device, computer readable storage medium and computing device

Info

Publication number
CN111028823A
Authority
CN
China
Prior art keywords
audio
phoneme
pronunciation information
sample
liaison
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911267158.4A
Other languages
Chinese (zh)
Other versions
CN111028823B (en)
Inventor
肖纯智
劳振锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN201911267158.4A
Publication of CN111028823A
Application granted
Publication of CN111028823B
Status: Active
Anticipated expiration

Abstract

The application relates to an audio generation method and device, a computer-readable storage medium and a computing device, and belongs to the field of electronic technology applications. The method comprises: acquiring a plurality of pieces of pronunciation information, the plurality of pieces of pronunciation information comprising at least one piece of first pronunciation information, each piece of first pronunciation information comprising the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the phonemes adjacent to the target phoneme comprise the previous phoneme and the next phoneme of the target phoneme, and the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes; and inputting the plurality of pieces of pronunciation information into an audio synthesis model to obtain a target audio output by the audio synthesis model, wherein the audio frame corresponding to each piece of pronunciation information is one audio frame in the target audio. The method and device can improve the quality of the output audio.

Description

Audio generation method and device, computer readable storage medium and computing device
Technical Field
The present application relates to the field of electronic technology application, and in particular, to an audio generation method, an audio generation device, a computer-readable storage medium, and a computing device.
Background
An audio synthesis model is a model for performing audio synthesis. The audio of songs and the like can be synthesized through the audio synthesis model.
The current process of generating audio with an audio synthesis model is as follows: an audio synthesis model is obtained through a model training process, a plurality of pieces of pronunciation information (conditions) are input into the audio synthesis model, and the audio synthesis model outputs the target audio. The plurality of pieces of pronunciation information correspond one-to-one to the plurality of audio frames included in the output target audio, and each piece of pronunciation information describes the audio features of the corresponding audio frame. Typically, each piece of pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the previous phoneme of the target phoneme, and the content of the next phoneme.
However, a song sung by a real person is produced by changes of the human vocal tract, and a song generated by the audio synthesis model cannot effectively reflect this change process, so the quality of the output audio is poor.
Disclosure of Invention
The embodiment of the application provides an audio generation method, an audio generation device, a computer readable storage medium and a computing device, which can improve the quality of generated audio. The technical scheme is as follows:
according to a first aspect of embodiments of the present application, there is provided an audio generation method, including:
acquiring a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information comprise at least one piece of first pronunciation information, and each piece of first pronunciation information comprises: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the phonemes adjacent to any target phoneme comprise the previous phoneme and the next phoneme of that target phoneme, the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes, and the audio frame corresponding to each piece of the plurality of pieces of pronunciation information is one audio frame in the target audio;
inputting the plurality of pieces of pronunciation information into an audio synthesis model to obtain the target audio output by the audio synthesis model.
Optionally, before the obtaining the plurality of pronunciation information, the method further comprises:
analyzing sample audio to obtain a plurality of pieces of sample pronunciation information, wherein the plurality of pieces of sample pronunciation information comprise at least one piece of second pronunciation information, and each piece of second pronunciation information comprises: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the audio frame corresponding to each piece of the plurality of pieces of sample pronunciation information is one audio frame in the sample audio;
performing model training based on the plurality of pieces of sample pronunciation information to obtain the audio synthesis model.
Optionally, the analyzing the sample audio to obtain a plurality of pieces of sample pronunciation information comprises:
obtaining the pitch of each audio frame in the sample audio;
detecting whether there is liaison between each phoneme in the sample audio and its adjacent phonemes to obtain a liaison detection result;
generating the plurality of pieces of sample pronunciation information based on the pitch of each audio frame and the liaison detection result.
Optionally, the detecting whether there is liaison between each phoneme in the sample audio and its adjacent phonemes to obtain a liaison detection result comprises:
when, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the start point of the sample audio frame set corresponding to any phoneme are all pitched frames, determining that the phoneme has pre-liaison, wherein a pitched frame is an audio frame whose pitch is greater than 0, N and M are positive integers, and the sample audio frame set corresponding to any phoneme is the set of audio frames formed by that phoneme in the sample audio during pronunciation;
when, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the end point of the sample audio frame set corresponding to any phoneme are all pitched frames, determining that the phoneme has post-liaison.
Optionally, the liaison indicator comprises a pre-liaison indicator and a post-liaison indicator, the pre-liaison indicator indicating whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the post-liaison indicator indicating whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme;
or, the liaison indicator comprises a single indicator indicating both whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme.
According to a second aspect of embodiments of the present application, there is provided an audio generating apparatus, comprising:
an obtaining module, configured to obtain a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information comprise at least one piece of first pronunciation information, and each piece of first pronunciation information comprises: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the phonemes adjacent to any target phoneme comprise the previous phoneme and the next phoneme of that target phoneme, the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes, and the audio frame corresponding to each piece of the plurality of pieces of pronunciation information is one audio frame in the target audio;
a processing module, configured to input the plurality of pieces of pronunciation information into an audio synthesis model to obtain the target audio output by the audio synthesis model.
Optionally, the apparatus further comprises:
an analysis module, configured to analyze sample audio before the plurality of pieces of pronunciation information are obtained, to obtain a plurality of pieces of sample pronunciation information, wherein the plurality of pieces of sample pronunciation information comprise at least one piece of second pronunciation information, and each piece of second pronunciation information comprises: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the audio frame corresponding to each piece of the plurality of pieces of sample pronunciation information is one audio frame in the sample audio;
a training module, configured to perform model training based on the plurality of pieces of sample pronunciation information to obtain the audio synthesis model.
Optionally, the analysis module comprises:
an obtaining sub-module, configured to obtain a pitch of each audio frame in the sample audio;
a detection sub-module, configured to detect whether there is liaison between each phoneme in the sample audio and its adjacent phonemes to obtain a liaison detection result;
a generating sub-module, configured to generate the plurality of pieces of sample pronunciation information based on the pitch of each audio frame and the liaison detection result.
Optionally, the detection submodule is configured to:
when, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the start point of the sample audio frame set corresponding to any phoneme are all pitched frames, determining that the phoneme has pre-liaison, wherein a pitched frame is an audio frame whose pitch is greater than 0, N and M are positive integers, and the sample audio frame set corresponding to any phoneme is the set of audio frames formed by that phoneme in the sample audio during pronunciation;
when, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the end point of the sample audio frame set corresponding to any phoneme are all pitched frames, determining that the phoneme has post-liaison.
Optionally, the liaison indicator comprises a pre-liaison indicator and a post-liaison indicator, the pre-liaison indicator indicating whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the post-liaison indicator indicating whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme;
or, the liaison indicator comprises a single indicator indicating both whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme.
According to a third aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored therein a computer program, which when executed by a processor causes the processor to implement the audio generation method according to any one of the preceding first aspects.
According to a fourth aspect of embodiments herein, there is provided a computing device comprising a processor and a memory;
the memory stores computer instructions; the processor executes the computer instructions stored by the memory to cause the computing device to perform the audio generation method of any of the first aspects.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
According to the audio generation method and device provided by the embodiments of the application, the pronunciation information input into the audio synthesis model includes a liaison indicator, and the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes. Because the liaison state of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the liaison that should appear, which improves the smoothness of the sound at liaison positions. The change process of the human vocal tract can therefore be effectively reflected, improving the quality of the output audio.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to illustrate the embodiments of the present application more clearly, the drawings that are needed in the description of the embodiments will be briefly described below, it being apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be derived from those drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flow diagram illustrating a method of audio generation according to an example embodiment.
FIG. 2 is a flow diagram illustrating another audio generation method according to an example embodiment.
Fig. 3 is a block diagram illustrating an audio generation apparatus according to an example embodiment.
Fig. 4 is a block diagram illustrating another audio generation apparatus according to an example embodiment.
FIG. 5 is a block diagram illustrating an analysis module in accordance with an exemplary embodiment.
Fig. 6 is a schematic diagram illustrating a structure of a terminal according to an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating a configuration of a server according to an example embodiment.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Phonemes (phones) are the smallest phonetic units divided according to the natural attributes of speech; they are analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. The types of phonemes differ under different pronunciation rules. For example, under English pronunciation rules, phonemes comprise two classes, vowel phonemes and consonant phonemes, each of which is subdivided into a plurality of specific phonemes, and the symbols of the International Phonetic Alphabet correspond one-to-one to the phonemes. Under Chinese pronunciation rules, the pronunciation of each Chinese character can be decomposed into an initial and a final; phonemes comprise initial phonemes and final phonemes, each class is subdivided into a plurality of specific phonemes, and the symbols in the Chinese initial and final lists correspond one-to-one to the phonemes.
Producing different phonemes requires changing the vocal tract into different shapes, and changing the vocal tract is a process that can be roughly divided into three stages: opening, steady, and closing. Opening and closing are the processes in which the vocal tract opens and closes. If, of two adjacent phonemes, the pronunciation of the first phoneme is similar to that of the second, then when the two phonemes are pronounced continuously the change of the vocal tract is not obvious, and the steady stage of the first phoneme can transition directly into the steady stage of the second phoneme; this situation can be called liaison. For example, when liaison occurs between two consecutive phonemes, the closing stage of the first phoneme and the opening stage of the second phoneme disappear.
Taking Chinese pronunciation rules as an example, liaison can occur when the same word is read twice in quick succession. When the same words are sung in a song, however, there may be a pause between them, so no liaison occurs. Therefore, in actual pronunciation, the same pair of adjacent phonemes may be pronounced differently in different situations.
When generating audio, a conventional audio synthesis model uses a plurality of pieces of pronunciation information, each of which includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the previous phoneme of the target phoneme, and the content of the next phoneme. The audio synthesized by such an audio synthesis model cannot reflect the liaison that should appear, so the smoothness of the sound at liaison positions is poor. The change process of the human vocal tract therefore cannot be effectively reflected, resulting in poor quality of the output audio.
The embodiments of the application provide an audio generation method that can be applied to the generation of various types of audio, such as Chinese songs, English songs, or other audio that includes a human voice, such as commentary or musical performance audio. The method can simulate a human voice, thereby providing users with artificial-intelligence singing functions such as virtual singing.
As shown in fig. 1, fig. 1 is a flowchart of the audio generation method, including:
Step 101: obtain a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information include at least one piece of first pronunciation information, and each piece of first pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator.
The phonemes adjacent to any target phoneme include the previous phoneme and the next phoneme of that target phoneme, the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes, and the audio frame corresponding to each piece of the plurality of pieces of pronunciation information is one audio frame in the target audio.
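For concreteness, the per-frame pronunciation information described in step 101 could be represented as a simple record. The following is a minimal Python sketch; the class and field names are illustrative assumptions rather than part of the patent:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PronunciationInfo:
    """One entry per audio frame of the target audio (field names are illustrative)."""
    pitch_hz: float               # pitch of the corresponding audio frame; 0 for unpitched frames
    target_phoneme: str           # content of the target phoneme for this frame, e.g. "i"
    prev_phoneme: Optional[str]   # content of the previous phoneme, or None if there is none
    next_phoneme: Optional[str]   # content of the next phoneme, or None if there is none
    pre_liaison: bool             # liaison between the target phoneme and its previous phoneme?
    post_liaison: bool            # liaison between the target phoneme and its next phoneme?
```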
Step 102: input the plurality of pieces of pronunciation information into the audio synthesis model to obtain the target audio output by the audio synthesis model.
In summary, in the audio generation method provided by the embodiments of the application, the pronunciation information input into the audio synthesis model includes a liaison indicator, which indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes. Because the liaison state of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the liaison that should appear, improving the smoothness of the sound at liaison positions. The change process of the human vocal tract can therefore be effectively reflected, improving the quality of the output audio.
The embodiment of the present application provides another audio generation method, which may be performed by an audio generation apparatus, where the audio generation apparatus may be a terminal or a server, and the terminal may be a display, a computer, a smart phone, a tablet computer, a laptop computer, and the like. The server may be a single server or a server cluster consisting of several servers. The method relates to a model training process and a model using process, as shown in fig. 2, fig. 2 is a flow chart of the audio generating method, the method comprising:
Step 201: analyze the sample audio to obtain a plurality of pieces of sample pronunciation information.
The sample audio may be one or more pieces of pre-recorded, designated audio, which may be song audio or other audio including a human voice, such as commentary or musical performance audio.
The sample audio may include a plurality of audio frames, the plurality of audio frames corresponding one-to-one to a plurality of pieces of sample pronunciation information, and each piece of sample pronunciation information representing the audio features of the corresponding audio frame. The plurality of pieces of sample pronunciation information include at least one piece of second pronunciation information, and each piece of second pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator. The phonemes adjacent to any target phoneme include the previous phoneme and the next phoneme of that target phoneme; the previous phoneme and the next phoneme are generally different from the target phoneme. Taking Chinese pronunciation rules as an example, the phonemes included in "hello" (Chinese "你好") are, in order, "n, i, h, ao"; for the phoneme "i" (a final), the previous phoneme is the initial "n" and the next phoneme is the initial "h". The audio frame corresponding to each piece of the plurality of pieces of sample pronunciation information is one audio frame in the sample audio, and the speech content of the corresponding audio frame includes the content of the corresponding phoneme.
Optionally, the analyzing the sample audio to obtain a plurality of pieces of sample pronunciation information may include:
Step A1: obtain the pitch of each audio frame in the sample audio.
For example, specified software may be used to identify the pitch of each audio frame in the sample audio. In silent segments, unvoiced segments, and the transient phoneme-transition regions where no liaison occurs, the human vocal cords do not vibrate, the audio has no periodicity, and no pitch can be extracted; in voiced segments and in liaison transition regions (i.e., the region from one of two phonemes with liaison to the other), the vocal cords vibrate continuously, the audio is periodic, and a pitch can be extracted. The pitch may be recorded as a sequence of pitch values or in the form of a pitch chart.
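As an illustration of step A1, the frame-level pitch could be extracted with an off-the-shelf pitch tracker. The sketch below uses librosa's pYIN implementation purely as an assumed example; the patent does not name any particular software:

```python
import librosa
import numpy as np

def frame_pitches(path: str, sr: int = 16000, hop_ms: int = 10) -> np.ndarray:
    """Return one pitch value per hop_ms frame; 0.0 marks unpitched (silent/unvoiced) frames."""
    y, sr = librosa.load(path, sr=sr)
    hop = sr * hop_ms // 1000
    f0, voiced, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
        hop_length=hop,
    )
    return np.where(voiced, f0, 0.0)  # pitch > 0 only where the vocal cords vibrate
```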
Step A2: detect whether there is liaison between each phoneme in the sample audio and its adjacent phonemes to obtain a liaison detection result.
There are various ways to detect whether there is liaison between each phoneme in the sample audio and its adjacent phonemes. The embodiments of the application are described using the following two alternatives as examples.
In a first alternative, whether there is liaison between each phoneme and its adjacent phonemes is determined by detecting whether the audio frames adjacent to each phoneme boundary in the sample audio are all pitched frames, where a pitched frame is an audio frame whose pitch is greater than 0.
In the embodiments of the application, the set of audio frames formed by any phoneme during pronunciation is the set of audio frames corresponding to that phoneme. In the following embodiments, for the reader's convenience, the set of audio frames formed by any phoneme during pronunciation in the sample audio is called the sample audio frame set corresponding to that phoneme, and the set of audio frames formed by any phoneme during pronunciation in the target audio is called the target audio frame set corresponding to that phoneme.
For each phoneme in the sample audio, it is detected whether the M audio frames immediately before the start point of the sample audio frame set corresponding to that phoneme and the N audio frames immediately after it (i.e., M + N consecutive audio frames) are all pitched frames, where N and M are both positive integers. When, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the start point of the sample audio frame set corresponding to any phoneme are all pitched frames, it is determined that the phoneme has pre-liaison. When any unpitched frame exists among the M audio frames immediately before and the N audio frames immediately after that start point, it is determined that the phoneme has no pre-liaison, where an unpitched frame is an audio frame whose pitch is equal to 0. The sample audio frame set corresponding to any phoneme is the set of audio frames formed by the pronunciation of that phoneme, that is, the set of one or more consecutive audio frames produced while that phoneme is pronounced. For example, suppose the initial "n" is short and its pronunciation lasts only 70 ms (milliseconds), and the duration of one audio frame is 10 ms; then the sample audio frame set corresponding to the initial "n" contains 7 audio frames, and the speech content of each of these audio frames contains the phoneme "n". As another example, suppose the final "i" is longer and lasts 300 ms; then the audio frame set corresponding to the final "i" contains 30 audio frames, and the speech content of each of these audio frames contains the phoneme "i".
For each phoneme in the sample audio, it is likewise detected whether the M audio frames immediately before the end point of the sample audio frame set corresponding to that phoneme and the N audio frames immediately after it (i.e., M + N consecutive audio frames) are all pitched frames. When, in the sample audio, the M audio frames immediately before and the N audio frames immediately after the end point of the sample audio frame set corresponding to any phoneme are all pitched frames, it is determined that the phoneme has post-liaison. When any unpitched frame exists among the M audio frames immediately before and the N audio frames immediately after that end point, it is determined that the phoneme has no post-liaison. M and N may be the same or different; for example, M and N each take a value in the range of 1 to 5.
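A minimal sketch of the boundary check described in the two paragraphs above, assuming that the frame-level pitches and the frame-aligned phoneme boundaries are already known (function and variable names are illustrative):

```python
def is_pitched(pitches, i):
    """A pitched frame has pitch > 0; out-of-range indices count as unpitched."""
    return 0 <= i < len(pitches) and pitches[i] > 0

def all_pitched_around(pitches, boundary, m=3, n=3):
    """True if the M frames before and the N frames after a phoneme boundary are all pitched."""
    before = all(is_pitched(pitches, boundary - k) for k in range(1, m + 1))
    after = all(is_pitched(pitches, boundary + k) for k in range(n))
    return before and after

def liaison_flags(pitches, start, end, m=3, n=3):
    """Pre-/post-liaison flags for a phoneme whose sample audio frame set is frames [start, end)."""
    return all_pitched_around(pitches, start, m, n), all_pitched_around(pitches, end, m, n)
```

With M = N = 3, a call such as `liaison_flags(pitches, start, end)` reproduces the check described above for a single phoneme.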
In one optional example, the start point and the end point of the sample audio frame set corresponding to each phoneme may be represented by the start time and the end time of that sample audio frame set in the audio, for example a start time of 9:00 and an end time of 9:02. In another optional example, each audio frame in the sample audio is assigned a sequence number that identifies the position of the corresponding audio frame in the sample audio, and the start point and the end point of the sample audio frame set corresponding to each phoneme may be represented by the sequence number of the first audio frame and the sequence number of the last audio frame of that sample audio frame set, respectively. The embodiments of the application do not limit the way the sample audio frame set is represented.
For each phoneme, the start point of the first audio frame of its sample audio frame set is the front boundary point of the phoneme, and the end point of the last audio frame of its sample audio frame set is the rear boundary point of the phoneme. The foregoing step A2 essentially queries whether the M audio frames before and the N audio frames after the front boundary point of each phoneme are all pitched frames, and whether the M audio frames before and the N audio frames after the rear boundary point of each phoneme are all pitched frames, so as to determine whether there is liaison between each phoneme and its adjacent phonemes. That is, for each boundary point of each phoneme, it is queried whether the M audio frames before the boundary point and the N audio frames after it are all pitched frames. With this liaison detection method, the check performed at every phoneme boundary point is the same. Moreover, the influence of errors in the determined sample audio frame set on the liaison detection result is effectively avoided, so the detected liaison state is more accurate.
It should be noted that, when performing liaison detection for each phoneme, the foregoing liaison detection process may be performed by traversing all sample audio frame sets in the sample audio in order and skipping the other, unrelated audio frames, or by traversing all audio frames in the sample audio directly and performing the foregoing liaison detection process at the sample audio frame set corresponding to each phoneme; the embodiments of the application do not limit this.
For example, suppose phonemes are divided according to Chinese pronunciation rules, M = N = 3, the text content of the sample audio is "we are all the same" (Chinese "我们都一样"), and the included phonemes are, in order, "w, o, m, en, d, ou, y, i, y, ang". For each phoneme in the sample audio, it is detected whether the 3 audio frames before and the 3 audio frames after the start point of the sample audio frame set corresponding to that phoneme (i.e., 6 adjacent audio frames) are all pitched frames, and whether the 3 audio frames before and the 3 audio frames after the end point of that sample audio frame set are all pitched frames. For the phoneme "i" (a final), if the 3 audio frames before and the 3 audio frames after the start point of its sample audio frame set are all detected to be pitched frames, and the 3 audio frames before and the 3 audio frames after the end point of its sample audio frame set are all pitched frames, then the phoneme "i" has both pre-liaison and post-liaison.
It should be noted that the sample audio frame set corresponding to each phoneme in the sample audio is known. In one optional way, the sample audio frame set corresponding to each phoneme may be manually calibrated in advance; in another optional way, the sample audio frame set corresponding to each phoneme may be identified by audio recognition software; in yet another optional way, the sample audio is pre-generated audio in which the content of each phoneme is known, such as a song with lyrics downloaded from a network, and the sample audio frame set corresponding to each phoneme is calibrated when the sample audio is obtained. The embodiments of the application do not limit the way the sample audio frame set corresponding to each phoneme is obtained.
In a second alternative, whether there is liaison between each phoneme and its adjacent phonemes is determined through manual calibration.
As in step A1, the pitch of each audio frame may be recorded as a sequence of pitch values or in the form of a pitch chart. The audio generation apparatus may present the pitch of the sample audio together with the sequence number (or icon) of each audio frame in the recorded form. A worker can, by manual marking, mark the audio frames in which a phoneme with pre-liaison and/or a phoneme with post-liaison is located. Accordingly, the audio generation apparatus receives the marking instruction and determines, based on the marking instruction, whether there is liaison between the phoneme in each audio frame and its adjacent phonemes.
It should be noted that the aforementioned liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes. The liaison indicator may be implemented in a variety of ways; the embodiments of the application are described below by way of examples.
In a first optional implementation, the liaison indicator includes a pre-liaison indicator and a post-liaison indicator. The pre-liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the post-liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme. The pre-liaison indicator and the post-liaison indicator may each consist of one or more characters. The character may be a binary character such as 0 or 1; for example, 0 may indicate that liaison is present and 1 may indicate that liaison is absent. The character may also be another type of character, such as a letter, which the embodiments of the application do not limit. The pre-liaison indicator and the post-liaison indicator may each occupy one field in the pronunciation information, so together they occupy two fields.
In a second optional implementation, the liaison indicator includes a single indicator that indicates both whether there is liaison between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether there is liaison between the target phoneme in the pronunciation information and its adjacent next phoneme. This liaison indicator may consist of one or more characters. The characters may be binary characters; for example, the liaison indicator may take the values 00, 01, 10, and 11, where 00 may indicate that the target phoneme has no liaison with either its adjacent previous phoneme or its adjacent next phoneme, 01 may indicate liaison with the previous phoneme but not with the next phoneme, 10 may indicate liaison with the next phoneme but not with the previous phoneme, and 11 may indicate liaison with both the previous and the next phoneme. The characters may also be other types of characters, such as letters, which the embodiments of the application do not limit. This single indicator may occupy one field in the pronunciation information.
In this second optional implementation, one indicator can simultaneously indicate what the pre-liaison indicator and the post-liaison indicator would indicate, which reduces field occupation and improves the operating efficiency of the subsequent model.
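A sketch of the second implementation's single two-character indicator, following the 00/01/10/11 correspondence stated above; the function and table names are illustrative:

```python
LIAISON_CODES = {
    (False, False): "00",  # no liaison on either side
    (True, False): "01",   # liaison with the previous phoneme only
    (False, True): "10",   # liaison with the next phoneme only
    (True, True): "11",    # liaison on both sides
}
DECODE_LIAISON = {code: flags for flags, code in LIAISON_CODES.items()}

def encode_liaison(pre_liaison: bool, post_liaison: bool) -> str:
    """Pack both liaison flags into the single two-character field described above."""
    return LIAISON_CODES[(pre_liaison, post_liaison)]
```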
When the liaison indicator is set using the indication methods provided by the first and second optional implementations, every piece of pronunciation information corresponding to the audio frames in the sample audio is second pronunciation information; that is, every piece of pronunciation information includes a liaison indicator, so the liaison situation can be indicated effectively.
In practical implementations, when the target phoneme in the pronunciation information has no liaison with any of its adjacent phonemes, the pronunciation information may omit the liaison indicator; when the target phoneme in the pronunciation information has liaison with an adjacent phoneme, a liaison indicator may be carried. That is, the sample pronunciation information corresponding to the audio frames of the sample audio then includes two types of pronunciation information, namely the second pronunciation information and other pronunciation information, and the liaison indicator carried by the second pronunciation information may take the forms described in the first and second optional implementations. The content of the other pronunciation information may be obtained with simple modifications based on the content of the second pronunciation information or with reference to conventional pronunciation information. When the plurality of pieces of sample pronunciation information include both types of pronunciation information, compared with the case where all sample pronunciation information is second pronunciation information, the number of pieces of pronunciation information carrying a liaison indicator is reduced, which reduces field occupation and improves the operating efficiency of the subsequent model.
Step A3: generate the plurality of pieces of sample pronunciation information based on the pitch of each audio frame and the liaison detection result.
The audio generation apparatus may generate, for all audio frames, the corresponding plurality of pieces of sample pronunciation information based on the pitch of each audio frame and the liaison detection result.
It should be noted that other information describing the corresponding audio frame may also be added to the aforementioned sample pronunciation information according to the actual situation. Illustratively, the sample pronunciation information further includes position information of the corresponding audio frame, the position information describing the position of the corresponding audio frame within a sample audio frame set, where the sample audio frame set is the set of audio frames corresponding to the target phoneme of the corresponding audio frame. For its explanation, reference may be made to the explanation in the aforementioned step A2.
For example, the position information of the corresponding audio frame may be represented by the segment of the sample audio frame set in which the audio frame lies. Optionally, the sample audio frame set may be divided into w segments according to a preset segmentation rule (for example, an equal-division rule), where w is a positive integer, and the segment position is one of the w segments. Optionally, w is a fixed value and w > 1. For example, w = 3, that is, the sample audio frame set is divided into 3 segments, and according to the equal-division rule the 3 segments are an opening segment, a steady segment, and a closing segment of equal (or similar) duration. If the audio frame corresponding to the sample pronunciation information lies in the opening segment, the position information of that audio frame indicates the opening segment.
For example, the aforementioned position information may identify the segment position using one or more characters. The characters may be binary characters; for example, the position information may take the values 00, 01, and 10, where 00 may represent the opening segment, 01 the steady segment, and 10 the closing segment. The characters may also be other types of characters, such as letters, which the embodiments of the application do not limit. The position information may occupy one field in the pronunciation information.
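A sketch of the position-information labelling described above, assuming w = 3 equal segments coded 00 (opening), 01 (steady), and 10 (closing):

```python
def position_code(frame_index: int, start: int, end: int) -> str:
    """Code the segment in which `frame_index` lies within a phoneme's frame set [start, end)."""
    codes = ["00", "01", "10"]                      # opening, steady, closing segments
    length = max(end - start, 1)
    segment = min(2, (frame_index - start) * 3 // length)
    return codes[segment]
```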
Step 202: perform model training based on the plurality of pieces of sample pronunciation information to obtain the audio synthesis model.
Because the sample audio is known, the sample audio can be used as the label and the plurality of pieces of sample pronunciation information as the input information, and model training is carried out until the loss value of the preset loss function converges to a target range, yielding the audio synthesis model.
Performing model training with the plurality of pieces of sample pronunciation information effectively helps the audio synthesis model learn the different pronunciation states that phonemes form in the liaison state and in the non-liaison state, which effectively improves the pronunciation smoothness at liaison positions in the audio generated by the trained audio synthesis model.
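The training step itself is model-agnostic. The following is a purely illustrative PyTorch-style sketch; the patent only requires training until the loss value of a preset loss function converges to a target range, so the model class, feature layout, and choice of loss here are assumptions:

```python
import torch
from torch import nn

def train(model: nn.Module, cond_frames: torch.Tensor, target_audio: torch.Tensor,
          target_loss: float = 1e-3, max_steps: int = 100_000) -> nn.Module:
    """cond_frames: (num_frames, feat_dim) encoded sample pronunciation information;
    target_audio: the known sample audio used as the label."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()                 # placeholder loss; the patent does not fix one
    for _ in range(max_steps):
        optimizer.zero_grad()
        prediction = model(cond_frames)    # synthesize audio from the conditioning frames
        loss = loss_fn(prediction, target_audio)
        loss.backward()
        optimizer.step()
        if loss.item() < target_loss:      # "converges to a target range"
            break
    return model
```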
Step 203: obtain a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information include at least one piece of first pronunciation information, and each piece of first pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator.
For the explanation of the adjacent phonemes and the liaison indicator, reference may be made to the explanation in the aforementioned step 201. The target audio to be synthesized subsequently may include a plurality of audio frames, the plurality of audio frames corresponding one-to-one to the plurality of pieces of pronunciation information, with each piece of pronunciation information representing the audio features of the corresponding audio frame. One audio frame can be generated based on each piece of pronunciation information. The audio frame corresponding to each piece of pronunciation information is one of the audio frames formed by the corresponding phoneme during pronunciation, and the speech content of the corresponding audio frame contains the content of the corresponding phoneme.
In the embodiment of the present application, the process of acquiring multiple pieces of pronunciation information may have multiple implementation manners:
In a first implementation, the audio generation apparatus may receive the pronunciation information of the plurality of phonemes. Alternatively, initial audio may be used: the initial audio may be audio recorded by the user or audio acquired by other means, for example audio downloaded from a network. The user can obtain different types of initial audio based on their own requirements, so the target audio subsequently generated by the method can effectively meet the user's requirements, enabling customized and personalized audio synthesis and improving the user experience.
For example, when the audio generation apparatus is a mobile phone, a notebook computer, a desktop computer, or the like, a user (or a programmer) may input the pronunciation information of the plurality of phonemes through an input/output (I/O) device such as a keyboard or a touch screen, and accordingly the audio generation apparatus receives the pronunciation information of the plurality of phonemes. Optionally, the process in which the audio generation apparatus receives the pronunciation information of the plurality of phonemes may be illustrated by the following two optional examples. In a first optional example, the audio generation apparatus receives first information to be edited, the first information to be edited including: the pitch of each target audio frame to be generated, the content of the target phoneme corresponding to each target audio frame, the content of the phonemes adjacent to each target phoneme, and the liaison indicator corresponding to each target phoneme; the audio generation apparatus encodes the received first information to be edited to obtain the pronunciation information of the plurality of phonemes. For example, the audio generation apparatus may encode the first information to be edited using a one-hot encoding method or an embedding encoding method. In a second optional example, the audio generation apparatus may directly receive the pronunciation information of the plurality of phonemes, where the pronunciation information of each phoneme has already been encoded using a one-hot encoding method, an embedding encoding method, or the like.
In a second implementation, the audio generation apparatus may receive at least one piece of initial audio and analyze the at least one piece of initial audio to obtain the pronunciation information of the plurality of phonemes. The analysis process for each piece of initial audio may refer to the process of analyzing the sample audio in step 201. Optionally, the process of analyzing the at least one piece of initial audio to obtain the pronunciation information of the plurality of phonemes may include: analyzing the at least one piece of initial audio to obtain second information to be edited, the second information to be edited including: the pitch of each target audio frame to be generated, the content of the target phoneme corresponding to each target audio frame, the content of the phonemes adjacent to each target phoneme, and the liaison indicator corresponding to each target phoneme; the audio generation apparatus then encodes the second information to be edited to obtain the pronunciation information of the plurality of phonemes. For example, the audio generation apparatus may encode the second information to be edited using a one-hot encoding method or an embedding encoding method.
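As an illustration of the encoding step, the phoneme fields of the information to be edited could be one-hot encoded and concatenated with the numeric fields into one feature vector per frame. This sketch reuses the illustrative PronunciationInfo record from earlier; the phoneme inventory is an assumption:

```python
import numpy as np

PHONEMES = ["null", "n", "i", "h", "ao", "y", "ang"]   # illustrative inventory only

def one_hot(phoneme: str) -> np.ndarray:
    vec = np.zeros(len(PHONEMES), dtype=np.float32)
    vec[PHONEMES.index(phoneme)] = 1.0
    return vec

def encode_frame(info: "PronunciationInfo") -> np.ndarray:
    """Concatenate pitch, one-hot phoneme contents, and the liaison flags for one audio frame."""
    return np.concatenate([
        np.array([info.pitch_hz], dtype=np.float32),
        one_hot(info.target_phoneme),
        one_hot(info.prev_phoneme or "null"),
        one_hot(info.next_phoneme or "null"),
        np.array([float(info.pre_liaison), float(info.post_liaison)], dtype=np.float32),
    ])
```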
In practical implementation of the embodiment of the application, the audio generating device may receive a plurality of initial audios, analyze the plurality of initial audios, and obtain pronunciation information of the plurality of phonemes, so that in a subsequent process, the synthesized target audio is equivalent to an audio obtained by combining the plurality of initial audios.
Referring to step 201, other information describing the corresponding audio frame may also be added to the aforementioned sample pronunciation information according to the actual situation. Accordingly, the pronunciation information obtained in step 203 is consistent in content with the sample pronunciation information, and other information describing the corresponding audio frame may likewise be added. Illustratively, the pronunciation information further includes position information of the corresponding audio frame, the position information describing the position of the corresponding audio frame (i.e., the audio frame to be generated) within the audio frame set corresponding to the corresponding phoneme. Suppose the phoneme corresponding to the corresponding audio frame is a first phoneme; the audio frame set corresponding to the first phoneme is then the target audio frame set, that is, the set of audio frames formed by the first phoneme during pronunciation in the target audio. For the explanation of the position information, reference may be made to the aforementioned step 201, which the embodiments of the application do not limit.
For the reader's convenience, Table 1 schematically shows the contents of a plurality of pieces of pronunciation information whose text content is the Chinese word "一样" ("the same"), with phonemes divided according to Chinese pronunciation rules. In Table 1, the position information takes the three values 00, 01, and 10, where 00 indicates the opening segment, 01 the steady segment, and 10 the closing segment. The liaison indicator comprises a pre-liaison indicator and a post-liaison indicator, where 0 indicates that liaison is present and 1 indicates that liaison is absent; "null" means absent. Taking the pronunciation information whose corresponding audio frame has sequence number 4 as an example, its contents are: a pitch of 150 Hz; target phoneme: the final "i" (indicating that the speech content of the audio frame with sequence number 4 includes the phoneme "i"); previous phoneme: the initial "y"; next phoneme: the initial "y"; pre-liaison indicator 0 (pre-liaison present); post-liaison indicator 0 (post-liaison present); and position information 00 (located in the opening segment). The other pieces of pronunciation information can be interpreted in the same way and are not described in detail in the embodiments of the application.
TABLE 1
[Table 1 is reproduced as an image in the original publication; it lists, for each audio-frame sequence number, the pitch, target phoneme, previous phoneme, next phoneme, pre-liaison indicator, post-liaison indicator, and position information, as in the example described above.]
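Using the illustrative record sketched earlier, the entry with sequence number 4 described above would look roughly like the following (values taken from the description of Table 1; the class and helper names are the assumed ones from the earlier sketches):

```python
frame_4 = PronunciationInfo(
    pitch_hz=150.0,        # pitch of 150 Hz
    target_phoneme="i",    # the final "i"
    prev_phoneme="y",      # preceding initial "y"
    next_phoneme="y",      # following initial "y"
    pre_liaison=True,      # pre-liaison indicator 0 = liaison present, per Table 1's convention
    post_liaison=True,     # post-liaison indicator 0 = liaison present
)
# Its position information would be "00" (opening segment), e.g. as returned by position_code().
```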
Step 204: input the plurality of pieces of pronunciation information into the audio synthesis model to obtain the target audio output by the audio synthesis model.
The audio generation apparatus inputs the plurality of pieces of pronunciation information into the audio synthesis model, and the audio output by the audio synthesis model is the target audio. In the embodiments of the application, the audio synthesis model is a model for audio synthesis; audio such as songs can be synthesized by the audio synthesis model. The audio synthesis model is typically a deep learning model; for example, the audio synthesis model may be a WaveNet model or an NPSS model.
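Inference is then a single pass of the conditioning sequence through the trained model. A minimal sketch, where `model` stands for the trained WaveNet- or NPSS-style synthesizer and `encode_frame` for the illustrative feature encoding above:

```python
import numpy as np
import torch

def synthesize(model, pronunciation_infos):
    """Stack the per-frame features and let the trained model emit the target audio."""
    features = torch.from_numpy(np.stack([encode_frame(p) for p in pronunciation_infos]))
    with torch.no_grad():
        return model(features)   # one output audio frame per piece of pronunciation information
```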
Steps 201 to 202 constitute the model training process, and steps 203 to 204 constitute the model using process. In the audio generation method provided by the embodiments of the application, the pronunciation information input into the audio synthesis model includes a liaison indicator, which indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes. Because the liaison state of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the liaison that should appear, improving the smoothness of the sound at liaison positions. Thus, in the embodiments of the application, the pronunciation information is extended with information on whether liaison exists before and after the target phoneme, which effectively helps the audio synthesis model learn the composition of the pronunciation states with and without liaison, effectively improves pronunciation smoothness at liaison positions, effectively reflects the change process of the human vocal tract, and improves the quality of the output audio.
It should be noted that the foregoing audio synthesis method may be executed by a terminal, by a server, or by both in cooperation. In the first case, when the audio synthesis method is executed by a terminal, the audio synthesis apparatus is the terminal, and steps 201 to 204 are executed by the terminal. In the second case, when the audio synthesis method is executed by a server, the audio synthesis apparatus is the server, and steps 201 to 204 are executed by the server; the sample audio in step 201 may be sent to the server by a terminal or obtained by the server itself; in the first implementation of step 203, the plurality of pieces of pronunciation information may be sent to the server by the terminal or obtained by the server itself; in the second implementation of step 203, the at least one piece of initial audio may be sent to the server by the terminal or obtained by the server itself; and after step 204, the server may transmit the generated target audio to the terminal. In the third case, when the audio synthesis method is executed by a terminal and a server in cooperation, the audio synthesis apparatus is regarded as a system consisting of the terminal and the server, steps 201 to 202 are executed by the server, steps 203 to 204 are executed by the terminal, and after step 202 the server transmits the trained audio synthesis model to the terminal.
The order of the steps of the audio generation method provided in the embodiments of the application may be appropriately adjusted, and steps may be added or removed according to the situation. Any method that can be easily conceived by a person skilled in the art within the technical scope disclosed in the application shall be covered by the protection scope of the application, and is therefore not described in detail.
An embodiment of the application provides an audio generation apparatus 30, as shown in Fig. 3, including:
an obtaining module 301, configured to obtain a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information include at least one piece of first pronunciation information, and each piece of first pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the phonemes adjacent to the target phoneme, and a liaison indicator, wherein the phonemes adjacent to any target phoneme include the previous phoneme and the next phoneme of that target phoneme, the liaison indicator indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes, and the audio frame corresponding to each piece of the plurality of pieces of pronunciation information is one audio frame in the target audio;
a processing module 302, configured to input the plurality of pieces of pronunciation information into the audio synthesis model to obtain the target audio output by the audio synthesis model.
In the audio generation apparatus provided by the embodiments of the application, the pronunciation information input into the audio synthesis model includes a liaison indicator, which indicates whether there is liaison between the target phoneme in the pronunciation information and its adjacent phonemes. Because the liaison state of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the liaison that should appear, improving the smoothness of the sound at liaison positions. The change process of the human vocal tract can therefore be effectively reflected, improving the quality of the output audio.
Optionally, as shown in fig. 4, the apparatus 30 further includes:
an analysis module 303, configured to analyze a sample audio before the plurality of pronunciation information is obtained, to obtain a plurality of sample pronunciation information, where the plurality of sample pronunciation information includes at least one second pronunciation information, and each second pronunciation information includes: a pitch of a corresponding audio frame, content of a target phoneme corresponding to the audio frame, content of the phonemes adjacent to the target phoneme, and a hyphen indicator, where the audio frame corresponding to each sample pronunciation information in the plurality of sample pronunciation information is one audio frame in the sample audio;
and the training module 304, configured to perform model training based on the plurality of sample pronunciation information to obtain the audio synthesis model.
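For concreteness, the sketch below pairs each sample pronunciation-information record with the sample audio frame it corresponds to and fits a model on those pairs, in the spirit of the analysis and training modules above. The linear least-squares map is only a stand-in for whatever acoustic model is actually trained, and the feature layout follows the illustrative to_feature function shown earlier.

    import numpy as np

    def train_audio_synthesis_model(sample_features, sample_frames):
        """Fit a toy frame-level synthesis map by linear least squares.

        sample_features: (num_frames, feature_dim) array built from the sample
            pronunciation information (pitch, phoneme ids, hyphen flags).
        sample_frames:   (num_frames, frame_size) array of audio frames cut from
            the sample audio, one frame per pronunciation-information record.
        """
        X = np.asarray(sample_features, dtype=float)
        Y = np.asarray(sample_frames, dtype=float)
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # W maps a feature vector to one frame
        return W

    def synthesize_frames(W, features):
        """Apply the trained map to new pronunciation information, frame by frame."""
        return np.asarray(features, dtype=float) @ W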
Optionally, as shown in fig. 5, the analysis module 303 includes:
an obtaining submodule 3031, configured to obtain a pitch of each audio frame in the sample audio;
a detection submodule 3032, configured to detect whether a hyphen exists between each phoneme in the sample audio and its adjacent phonemes, to obtain a hyphen detection result;
a generating submodule 3033, configured to generate the plurality of sample pronunciation information based on the pitch of each audio frame and the hyphen detection result.
Optionally, the detection submodule 3032 is configured to:
determine that a preceding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the start of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after that start are all pitch frames, where a pitch frame is an audio frame whose pitch is greater than 0, N and M are positive integers, and the sample audio frame set corresponding to the phoneme is the set of audio frames formed while the phoneme is pronounced;
determine that a succeeding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the end of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after that end are all pitch frames.
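A minimal sketch of this boundary test is given below; the exclusive-end span convention, the function name, and the parameter names are assumptions made for the example.

    def detect_hyphens(pitch, phoneme_spans, m, n):
        """Detect preceding/succeeding hyphens for each phoneme of a sample audio.

        pitch:         per-frame pitch values of the sample audio (0 means unvoiced).
        phoneme_spans: one (start, end) pair of frame indices per phoneme, i.e. the
                       sample audio frame set formed while that phoneme is pronounced
                       (end is exclusive in this sketch).
        m, n:          number of frames checked before / after a span boundary.
        Returns one (has_preceding_hyphen, has_succeeding_hyphen) pair per phoneme.
        """
        def all_pitch_frames(lo, hi):
            # A pitch frame is an audio frame whose pitch is greater than 0; the
            # whole window must also lie inside the sample audio.
            return lo >= 0 and hi <= len(pitch) and all(p > 0 for p in pitch[lo:hi])

        results = []
        for start, end in phoneme_spans:
            preceding = all_pitch_frames(start - m, start) and all_pitch_frames(start, start + n)
            succeeding = all_pitch_frames(end - m, end) and all_pitch_frames(end, end + n)
            results.append((preceding, succeeding))
        return results

For instance, with pitch = [0, 0, 120, 121, 119, 118, 0], a phoneme span of (3, 5), and m = n = 1, the frames around both the start and the end of the span have pitch greater than 0, so both a preceding and a succeeding hyphen are detected.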
Optionally, the hyphen indicator includes a preceding hyphen indicator and a succeeding hyphen indicator, where the preceding hyphen indicator is used to indicate whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the succeeding hyphen indicator is used to indicate whether a hyphen exists between the target phoneme and its adjacent next phoneme;
alternatively, the hyphen indicator is a single indicator used to indicate both whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether a hyphen exists between the target phoneme and its adjacent next phoneme.
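The two indicator layouts can be illustrated with a short sketch; the particular numeric coding of the single combined indicator is an assumption of the example, since the text above only requires that both relations be expressed.

    def encode_separate(pre_hyphen: bool, post_hyphen: bool):
        """First option: two indicators, one for the previous and one for the next phoneme."""
        return [int(pre_hyphen), int(post_hyphen)]

    def encode_combined(pre_hyphen: bool, post_hyphen: bool):
        """Second option: a single indicator carrying both relations (values 0-3 here)."""
        return (int(pre_hyphen) << 1) | int(post_hyphen)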
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as a memory comprising instructions, executable by a processor of a computing device to perform the audio generation method illustrated in the various embodiments of the present application is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
An embodiment of the present application provides a computing device, which includes a processor and a memory;
the memory stores computer instructions; the processor executes the computer instructions stored by the memory to cause the computing device to perform any of the audio generation methods provided by the embodiments of the present application.
In this embodiment of the present application, the foregoing computing device may be a terminal. Fig. 6 shows a block diagram of a terminal 600 according to an exemplary embodiment of the present application. The terminal 600 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that needs to be displayed on the display screen. In some embodiments, the processor 601 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include a high-speed random access memory and a non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 602 is used to store at least one instruction, which is executed by the processor 601 to implement the audio generation method provided by the method embodiments of the present application.
In some embodiments, the terminal 600 may optionally further include: a peripheral interface 603 and at least one peripheral. The processor 601, the memory 602, and the peripheral interface 603 may be connected by buses or signal lines. Each peripheral may be connected to the peripheral interface 603 via a bus, a signal line, or a circuit board. Specifically, the peripheral includes: at least one of a radio frequency circuit 604, a touch display screen 605, a camera assembly 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one I/O related peripheral to the processor 601 and the memory 602. In some embodiments, the processor 601, the memory 602, and the peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 604 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 604 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 604 may further include an NFC (Near Field Communication) related circuit, which is not limited in this application.
The display screen 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the capability to collect touch signals on or above the surface of the display screen 605. The touch signal may be input to the processor 601 as a control signal for processing. In this case, the display screen 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 605, disposed on the front panel of the terminal 600; in other embodiments, there may be at least two display screens 605, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display screen 605 may be a flexible display disposed on a curved surface or a folded surface of the terminal 600. The display screen 605 may even be arranged in a non-rectangular irregular pattern, that is, a shaped screen. The display screen 605 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. The dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 607 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 601 for processing, or input them to the radio frequency circuit 604 to implement voice communication. For stereo sound collection or noise reduction, there may be multiple microphones disposed at different parts of the terminal 600. The microphone may also be an array microphone or an omnidirectional microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a traditional film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal into a sound wave audible to humans, or convert an electrical signal into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 607 may also include a headphone jack.
The positioning component 608 is used to locate the current geographic position of the terminal 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 609 is used to supply power to the various components in the terminal 600. The power supply 609 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast-charging technology.
In some embodiments, the terminal 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 601 may control the touch display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used to collect motion data of a game or a user.
The gyro sensor 612 may detect the body direction and rotation angle of the terminal 600, and the gyro sensor 612 may cooperate with the acceleration sensor 611 to collect a 3D motion of the user on the terminal 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a tilt operation of the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 613 may be disposed on a side frame of the terminal 600 and/or on a lower layer of the touch display screen 605. When the pressure sensor 613 is disposed on the side frame of the terminal 600, a holding signal of the user on the terminal 600 can be detected, and the processor 601 performs left-hand/right-hand recognition or a shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed on the lower layer of the touch display screen 605, the processor 601 controls an operability control on the UI according to a pressure operation of the user on the touch display screen 605. The operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 614 is used to collect a fingerprint of the user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the identity of the user is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 614 may be disposed on the front, back, or side of the terminal 600. When a physical button or a vendor logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical button or the vendor logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, the processor 601 may control the display brightness of the touch display screen 605 based on the ambient light intensity collected by the optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 605 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 605 is decreased. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
The proximity sensor 616, also known as a distance sensor, is typically disposed on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front of the terminal 600 gradually decreases, the processor 601 controls the touch display screen 605 to switch from a screen-on state to a screen-off state; when the proximity sensor 616 detects that the distance between the user and the front of the terminal 600 gradually increases, the processor 601 controls the touch display screen 605 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the structure shown in fig. 6 does not constitute a limitation on the terminal 600, and the terminal may include more or fewer components than those shown, combine certain components, or use a different arrangement of components.
In this embodiment of the present application, the aforementioned computing device may be a server. Fig. 7 is a schematic structural diagram of a server according to an exemplary embodiment. The server 700 includes a central processing unit (CPU) 701, a system memory 704 including a random access memory (RAM) 702 and a read-only memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the central processing unit 701. The server 700 also includes a basic input/output system (I/O system) 706, which facilitates the transfer of information between devices within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.
The basic input/output system 706 includes a display 708 for displaying information and an input device 709, such as a mouse or a keyboard, for a user to input information. The display 708 and the input device 709 are both connected to the central processing unit 701 through an input/output controller 710 connected to the system bus 705. The basic input/output system 706 may also include the input/output controller 710 for receiving and processing input from a number of other devices such as a keyboard, a mouse, or an electronic stylus. Similarly, the input/output controller 710 may also provide output to a display screen, a printer, or another type of output device.
The mass storage device 707 is connected to the central processing unit 701 through a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and its associated computer-readable media provide non-volatile storage for the server 700. That is, the mass storage device 707 may include a computer-readable medium (not shown), such as a hard disk or a CD-ROM drive.
Without loss of generality, the computer-readable media may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid-state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 704 and the mass storage device 707 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 700 may also be connected to a remote computer on a network, such as the Internet, to run. That is, the server 700 may be connected to the network 712 through a network interface unit 711 connected to the system bus 705, or may be connected to another type of network or a remote computer system (not shown) using the network interface unit 711.
The memory further includes one or more programs, which are stored in the memory, and the central processing unit 701 implements the audio generation method provided by the embodiments of the present application by executing the one or more programs.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
In this application, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means two or more unless expressly limited otherwise. "A refers to B" means that A is the same as B or that A is a simple modification of B. The term "and/or" in this application merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may indicate that A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. A method of audio generation, comprising:
acquiring a plurality of pronunciation information;
inputting the plurality of pronunciation information into an audio synthesis model to obtain a target audio output by the audio synthesis model;
wherein the plurality of pronunciation information includes at least one first pronunciation information, and each of the first pronunciation information includes: a pitch of a corresponding audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme, and a hyphen indicator, wherein the adjacent phonemes of any target phoneme include a previous phoneme and a next phoneme of the target phoneme, the hyphen indicator is used for indicating whether a hyphen exists between the target phoneme and the adjacent phonemes in the pronunciation information, and the audio frame corresponding to each piece of pronunciation information in the plurality of pieces of pronunciation information is one audio frame in the target audio.
2. The method of claim 1, wherein prior to said obtaining the plurality of pronunciation information, the method further comprises:
analyzing the sample audio to obtain a plurality of sample pronunciation information, wherein the plurality of sample pronunciation information comprises at least one second pronunciation information, and each second pronunciation information comprises: the pitch of the corresponding audio frame, the content of a target phoneme corresponding to the corresponding audio frame, the content of a neighboring phoneme of the target phoneme, and a hyphen indicator, wherein the audio frame corresponding to each of the plurality of sample pronunciation information is one audio frame in the sample audio;
and performing model training based on the plurality of sample pronunciation information to obtain the audio synthesis model.
3. The method of claim 2, wherein analyzing the sample audio to obtain a plurality of sample pronunciation information comprises:
obtaining a pitch of each audio frame in the sample audio;
detecting whether a hyphen exists between each phoneme and an adjacent phoneme in the sample audio to obtain a hyphen detection result;
generating the plurality of sample pronunciation information based on the pitch of each audio frame and the hyphen detection result.
4. The method of claim 3, wherein the detecting whether a hyphen exists between each phoneme and adjacent phonemes in the sample audio to obtain a hyphen detection result comprises:
determining that a preceding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the start of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after the start are all pitch frames, wherein a pitch frame is an audio frame whose pitch is greater than 0, N and M are positive integers, and the sample audio frame set corresponding to the phoneme is the set of audio frames formed by the phoneme in the sample audio during pronunciation;
and determining that a succeeding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the end of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after the end are all pitch frames.
5. The method according to any one of claims 1 to 4, wherein the hyphen indicator includes a preceding hyphen indicator and a succeeding hyphen indicator, the preceding hyphen indicator being used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the succeeding hyphen indicator being used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent next phoneme;
or, the hyphen indicator is a single indicator used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent next phoneme.
6. An audio generation apparatus, comprising:
the acquisition module is used for acquiring a plurality of pronunciation information;
the processing module is used for inputting the plurality of pronunciation information into an audio synthesis model to obtain a target audio output by the audio synthesis model;
wherein the plurality of pronunciation information includes at least one first pronunciation information, and each of the first pronunciation information includes: a pitch of a corresponding audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme, and a hyphen indicator, wherein the adjacent phonemes of any target phoneme include a previous phoneme and a next phoneme of the target phoneme, the hyphen indicator is used for indicating whether a hyphen exists between the target phoneme and the adjacent phonemes in the pronunciation information, and the audio frame corresponding to each piece of pronunciation information in the plurality of pieces of pronunciation information is one audio frame in the target audio.
7. The apparatus of claim 6, further comprising:
an analysis module, configured to analyze the sample audio before the plurality of pronunciation information is obtained, to obtain a plurality of sample pronunciation information, wherein the plurality of sample pronunciation information includes at least one second pronunciation information, and each second pronunciation information includes: the pitch of the corresponding audio frame, the content of a target phoneme corresponding to the corresponding audio frame, the content of a neighboring phoneme of the target phoneme, and a hyphen indicator, wherein the audio frame corresponding to each of the plurality of sample pronunciation information is one audio frame in the sample audio;
and the training module is used for carrying out model training based on the plurality of sample pronunciation information to obtain the audio synthesis model.
8. The apparatus of claim 7, wherein the analysis module comprises:
an obtaining sub-module, configured to obtain a pitch of each audio frame in the sample audio;
the detection submodule is used for detecting whether a hyphen exists between each phoneme in the sample audio and the adjacent phoneme to obtain a hyphen detection result;
a generating sub-module for generating the plurality of sample pronunciation information based on the pitch of each audio frame and the hyphen detection result.
9. The apparatus of claim 8, wherein the detection submodule is configured to:
determining that a preceding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the start of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after the start are all pitch frames, wherein a pitch frame is an audio frame whose pitch is greater than 0, N and M are positive integers, and the sample audio frame set corresponding to the phoneme is the set of audio frames formed by the phoneme in the sample audio during pronunciation;
and determining that a succeeding hyphen exists for any phoneme when, in the sample audio, the M audio frames immediately before the end of the sample audio frame set corresponding to the phoneme and the N audio frames immediately after the end are all pitch frames.
10. The apparatus according to any one of claims 6 to 9, wherein the hyphen indicator includes a preceding hyphen indicator and a succeeding hyphen indicator, the preceding hyphen indicator being used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme, and the succeeding hyphen indicator being used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent next phoneme;
or, the hyphen indicator is a single indicator used for indicating whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent previous phoneme and whether a hyphen exists between the target phoneme in the pronunciation information and its adjacent next phoneme.
11. A computer-readable storage medium, in which a computer program is stored, which, when executed by a processor, causes the processor to implement the audio generation method according to any one of claims 1 to 5.
12. A computing device, wherein the computing device comprises a processor and a memory;
the memory stores computer instructions; the processor executes the computer instructions stored by the memory to cause the computing device to perform the audio generation method of any of claims 1 to 5.
CN201911267158.4A — Priority Date: 2019-12-11 — Filing Date: 2019-12-11 — Audio generation method, device, computer readable storage medium and computing equipment — Active — granted as CN111028823B

Priority Applications (1)

Application Number — Priority Date — Filing Date — Title
CN201911267158.4A (granted as CN111028823B) — 2019-12-11 — 2019-12-11 — Audio generation method, device, computer readable storage medium and computing equipment

Publications (2)

Publication Number — Publication Date
CN111028823A — 2020-04-17
CN111028823B — 2024-06-07

Family

ID=70208741

Family Applications (1)

Application Number — Title — Priority Date — Filing Date
CN201911267158.4A (Active; granted as CN111028823B) — Audio generation method, device, computer readable storage medium and computing equipment — 2019-12-11 — 2019-12-11

Country Status (1)

Country — Link
CN (1) — CN111028823B

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
CN113035228A* — 2021-03-23 — 2021-06-25 — Guangzhou Kugou Computer Technology Co Ltd — Acoustic feature extraction method, device, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
US4692941A (en)*1984-04-101987-09-08First ByteReal-time text-to-speech conversion system
CN1257271A (en)*1998-12-022000-06-21松下电器产业株式会社Continuous sound processor for Chinese phonetic systhesis
CN1267863A (en)*1999-03-222000-09-27Lg电子株式会社Image apparatus with education function and its controlling method
JP2001343987A (en)*2000-05-312001-12-14Sanyo Electric Co LtdMethod and device for voice synthesis
JP2002333896A (en)*2001-05-102002-11-22Matsushita Electric Ind Co Ltd Speech synthesis apparatus, speech synthesis system, and speech synthesis method
CN1455386A (en)*2002-11-012003-11-12中国科学院声学研究所Imbedded voice synthesis method and system
CN1938756A (en)*2004-03-052007-03-28莱塞克技术公司Prosodic speech text codes and their use in computerized speech systems
CN104464751A (en)*2014-11-212015-03-25科大讯飞股份有限公司Method and device for detecting pronunciation rhythm problem
CN104934028A (en)*2015-06-172015-09-23百度在线网络技术(北京)有限公司Depth neural network model training method and device used for speech synthesis
US20150371626A1 (en)*2014-06-192015-12-24Baidu Online Network Technology (Beijing) Co., LtdMethod and apparatus for speech synthesis based on large corpus


Also Published As

Publication number — Publication date
CN111028823B — 2024-06-07

Similar Documents

Publication — Title
CN110556127B — Method, device, equipment and medium for detecting voice recognition result
CN110933330A — Video dubbing method and device, computer equipment and computer-readable storage medium
CN110992927B — Audio generation method, device, computer readable storage medium and computing equipment
CN112735429B — Method for determining lyric timestamp information and training method of acoustic model
CN111564152B — Voice conversion method and device, electronic equipment and storage medium
CN111524501B — Voice playing method, device, computer equipment and computer readable storage medium
CN112116904B — Voice conversion method, device, equipment and storage medium
CN109003621B — Audio processing method and device and storage medium
CN110322760B — Voice data generation method, device, terminal and storage medium
CN111105788B — Sensitive word score detection method and device, electronic equipment and storage medium
CN111370025A — Audio recognition method and device and computer storage medium
CN110931048A — Voice endpoint detection method and device, computer equipment and storage medium
CN110600034B — Singing voice generation method, singing voice generation device, singing voice generation equipment and storage medium
CN114283825A — Voice processing method and device, electronic equipment and storage medium
CN111048109A — Acoustic feature determination method and apparatus, computer device, and storage medium
CN113362836A — Vocoder training method, terminal and storage medium
CN111081277B — Audio evaluation method, device, equipment and storage medium
CN111223475B — Voice data generation method and device, electronic equipment and storage medium
CN111428079B — Text content processing method, device, computer equipment and storage medium
CN115394285B — Voice cloning method, device, equipment and storage medium
CN109829067B — Audio data processing method and device, electronic equipment and storage medium
CN113920979B — Voice data acquisition method, device, equipment and computer readable storage medium
CN110337030B — Video playing method, device, terminal and computer readable storage medium
CN111028823B — Audio generation method, device, computer readable storage medium and computing equipment
CN115862586B — Method and device for training timbre feature extraction model and audio synthesis

Legal Events

Code — Title
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant
