CN113793589A - Speech synthesis method and device - Google Patents

Speech synthesis method and device

Info

Publication number
CN113793589A
CN113793589A
Authority
CN
China
Prior art keywords
duration
phoneme
text
converted
pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010456116.1A
Other languages
Chinese (zh)
Inventor
别凡虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010456116.1A
Priority to PCT/CN2021/080403 (WO2021238338A1)
Publication of CN113793589A
Legal status: Pending

Abstract


The present application relates to the field of terminal artificial intelligence and text-to-speech technology, and provides a speech synthesis method and device, including: determining a duration range for each phoneme corresponding to the text to be converted; determining any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme; and generating speech data according to the text to be converted and the phoneme duration of each phoneme. When multiple pieces of speech data are synthesized for the same text to be converted, the phoneme duration of the same phoneme may take different values within the same duration range, so that a variety of different speech data can be synthesized. This avoids producing identical speech data every time the same text is converted, reduces the mechanical quality of speech synthesis, and improves its naturalness and diversity.

Figure 202010456116

Description

Speech synthesis method and device
Technical Field
The application belongs to the technical field of terminal artificial intelligence and the field of text-to-speech technology, and particularly relates to a speech synthesis method and device.
Background
With the continuous development of artificial intelligence technology, a terminal device can not only receive voice information from the user but also play voice information to the user, so that the user can learn the information presented by the terminal device by listening alone, without having to read the characters it displays.
In the related art, the terminal device may obtain a text to be converted, perform feature extraction on the text to be converted to obtain linguistic features, determine the phoneme duration of each phoneme corresponding to the text to be converted from the linguistic features, and finally generate voice data according to each phoneme duration and the linguistic features.
However, in this process, the voice data generated for the same text to be converted is identical every time, so the synthesized speech sounds overly mechanical.
Disclosure of Invention
The embodiments of the application provide a speech synthesis method and a speech synthesis apparatus, which can solve the problem that synthesized speech is overly mechanical.
In a first aspect, an embodiment of the present application provides a speech synthesis method, including:
determining a duration range of each phoneme corresponding to the text to be converted;
determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme;
and generating voice data according to the text to be converted and the phoneme duration of each phoneme.
In a first possible implementation manner of the first aspect, the determining a duration range of each phoneme corresponding to the text to be converted includes:
determining the average pronunciation duration, pronunciation duration variance and pronunciation duration distribution density of each phoneme corresponding to the text to be converted;
and determining the duration range of each phoneme according to the average pronunciation duration, the pronunciation duration variance and the pronunciation duration distribution density of each phoneme.
Based on the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the determining the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme corresponding to the text to be converted includes:
inputting the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model;
and inputting the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation duration variance of each phoneme output by the duration model.
Based on the first possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the determining the duration range of each phoneme according to the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme includes:
and determining the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, the pronunciation duration variance and the pronunciation duration distribution density of each phoneme.
In a fourth possible implementation manner of the first aspect, the determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme includes:
for each phoneme, acquiring text semantic information of the phoneme according to the position of the character corresponding to the phoneme in the text to be converted;
determining a phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
In a fifth possible implementation manner of the first aspect, the determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme includes:
acquiring user data, wherein the user data comprises age information and personality information of a user;
determining a phoneme duration for each of the phonemes based on the duration range of the phonemes and the user data.
Based on any one of the possible implementation manners of the first aspect, in a sixth possible implementation manner of the first aspect, the generating voice data according to the text to be converted and the phoneme duration of each phoneme includes:
and generating the voice data through a preset acoustic model and a vocoder according to the text to be converted and the phoneme duration of each phoneme.
In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, including:
the range determining module is used for determining the duration range of each phoneme corresponding to the text to be converted;
a duration determining module, configured to determine any duration in the duration range of each phoneme as a phoneme duration of the corresponding phoneme;
and the generating module is used for generating voice data according to the text to be converted and the phoneme duration of each phoneme.
In a first possible implementation manner of the second aspect, the range determining module is specifically configured to determine an average pronunciation duration, a pronunciation duration variance, and a pronunciation duration distribution density of each phoneme corresponding to the text to be converted; and determining the duration range of each phoneme according to the average pronunciation duration, the pronunciation duration variance and the pronunciation duration distribution density of each phoneme.
Based on the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the range determining module is further specifically configured to input the text to be converted into a preset text analysis model, and obtain a pronunciation duration distribution density of each phoneme output by the text analysis model; and inputting the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation duration variance of each phoneme output by the duration model.
Based on the first possible implementation manner of the second aspect, in a third possible implementation manner of the second aspect, the range determining module is further specifically configured to determine the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme.
In a fourth possible implementation manner of the second aspect, the duration determining module is specifically configured to, for each phoneme, obtain text semantic information of the phoneme according to a position of a character corresponding to the phoneme in the text to be converted; determining a phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
In a fifth possible implementation manner of the second aspect, the duration determining module is specifically configured to obtain user data, where the user data includes age information and personality information of a user; determining a phoneme duration for each of the phonemes based on the duration range of the phonemes and the user data.
Based on any one of the possible implementation manners of the second aspect, in a sixth possible implementation manner of the second aspect, the generating module is specifically configured to generate the speech data through a preset acoustic model and a vocoder according to the text to be converted and the phoneme duration of each phoneme.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the speech synthesis method according to any one of the above first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, where the computer program is executed by a processor to implement the speech synthesis method according to any one of the above first aspects.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the speech synthesis method according to any one of the above first aspects.
In a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor, the processor is coupled with a memory, and the processor executes a computer program stored in the memory to implement the speech synthesis method according to any one of the first aspect.
The chip system can be a single chip or a chip module consisting of a plurality of chips.
Compared with the prior art, the embodiment of the application has the advantages that:
the embodiment of the application determines the duration range of each phoneme corresponding to the text to be converted, determines any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generates the voice data according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of voice data synthesized from the same text to be converted, the phoneme duration of the same phoneme may take different values within the same duration range, so that a variety of different voice data can be synthesized. This avoids obtaining identical voice data every time the same text is converted, reduces the mechanical quality of speech synthesis, and improves its naturalness and diversity.
Drawings
Fig. 1 is a scene schematic diagram of a speech synthesis scene related to a speech synthesis method provided in an embodiment of the present application;
fig. 2 is a scene schematic diagram of another speech synthesis scene related to a speech synthesis method provided in an embodiment of the present application;
fig. 3 is a block diagram of a terminal device provided in an embodiment of the present application;
fig. 4 is a schematic flow chart of a speech synthesis method provided in an embodiment of the present application;
fig. 5 is a schematic flow chart for determining the duration range of a phoneme provided in an embodiment of the present application;
fig. 6 is a schematic diagram of a duration range provided in an embodiment of the present application;
fig. 7 is a block diagram of a speech synthesis apparatus provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The terminology used in the following examples is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms, such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the embodiments of the present application, "one or more" means one, two, or more than two; "and/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: A alone, both A and B, and B alone, where A and B may each be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The speech synthesis method provided by the embodiment of the application can be applied to terminal devices such as a mobile phone, a tablet personal computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and the embodiment of the application does not limit the specific type of the terminal device at all.
For example, the terminal device may be a Station (ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, a vehicle-mounted device, a vehicle-networking terminal, a computer, a laptop, a handheld communication device, a handheld computing device, a satellite wireless device, a wireless modem card, a television set-top box (STB), Customer Premises Equipment (CPE), and/or another device for communicating over a wireless system and a next-generation communication system, such as a mobile terminal in a 5G network or in a future evolved Public Land Mobile Network (PLMN).
By way of example and not limitation, when the terminal device is a wearable device, the wearable device may be a general term for everyday wearables developed by applying wearable technology to intelligent design, such as glasses, gloves, watches, clothing, and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the user's clothing or accessories. A wearable device is not only a hardware device, but also realizes powerful functions through software support, data interaction, and cloud interaction. In a broad sense, wearable intelligent devices include full-featured, large-sized devices that can realize complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on only a certain type of application function and need to be used together with other devices such as a smartphone, for example various smart bracelets and smart jewelry for monitoring physical signs.
Fig. 1 is a scene schematic diagram of a speech synthesis scene related to a speech synthesis method provided in an embodiment of the present application. Referring to fig. 1, the speech synthesis scene may include a terminal device 110, and the terminal device 110 may obtain a text to be converted and adjust the phoneme duration of each phoneme corresponding to the text to be converted, so that different speech data may be generated based on the same text to be converted.
In a possible implementation manner, terminal device 110 may obtain a text to be converted, and input the text to be converted into a pre-trained text analysis model and a pre-trained duration model respectively to obtain an average pronunciation duration, a pronunciation duration variance, and a pronunciation duration distribution density of each phoneme corresponding to the text to be converted, so that a duration range of each phoneme may be determined according to the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme based on a normal distribution rule.
Then, terminal device 110 may determine any duration in the duration range as the phoneme duration of each phoneme, based on the duration range of each phoneme in combination with the text semantic information corresponding to each phoneme in the text to be converted and/or pre-stored user data including the user's age information and personality information, so as to generate the speech data according to each phoneme duration and the text to be converted.
The average pronunciation duration represents the mean of the phoneme durations of the same phoneme, the pronunciation duration variance represents how much the phoneme durations of the same phoneme deviate from the average pronunciation duration, and the pronunciation duration distribution density represents the probability that the same phoneme takes different phoneme durations.
In addition, a phoneme is the smallest unit of speech divided according to the natural attributes of speech; it is analyzed in terms of the pronunciation actions within a syllable, and one action constitutes one phoneme. For example, under the pronunciation rules of pinyin, the initial of each character's pinyin may be taken as one phoneme and the final as another phoneme; in "weather" (pinyin "tian qi"), the phonemes corresponding to the character "tian" may include "t" and "ian", and the phonemes corresponding to the character "qi" may include "q" and "i".
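As an illustration of this initial/final split, the following toy sketch separates pinyin syllables into phonemes; the syllable inventory and splitting rule are simplified assumptions made for the example, not the text front-end actually used by this application.

```python
# Toy sketch: splitting pinyin syllables into initial/final phonemes.
# The initial inventory and rules here are simplified assumptions.

_INITIALS = [
    "zh", "ch", "sh",  # two-letter initials must be matched first
    "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
    "j", "q", "x", "r", "z", "c", "s", "y", "w",
]

def split_syllable(syllable: str) -> list[str]:
    """Split one pinyin syllable into [initial, final] (or [final] if none)."""
    for initial in _INITIALS:
        if syllable.startswith(initial):
            final = syllable[len(initial):]
            return [initial, final] if final else [initial]
    return [syllable]  # zero-initial syllable, e.g. "an"

def text_to_phonemes(pinyin_syllables: list[str]) -> list[str]:
    """Flatten a sequence of pinyin syllables into a phoneme sequence."""
    phonemes = []
    for syllable in pinyin_syllables:
        phonemes.extend(split_syllable(syllable))
    return phonemes

# "weather" -> tian qi -> ['t', 'ian', 'q', 'i']
print(text_to_phonemes(["tian", "qi"]))
```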
It should be noted that, in practical application, referring to fig. 2, the speech synthesis scenario may further include a server 120, and the terminal device 110 may be connected to the server 120, so that the server 120 may convert the text to be converted and obtain different speech data based on the same text to be converted.
In the process of generating the voice data, the terminal device 110 may first send the text to be converted to the server 120. The server 120 may determine the duration range of each phoneme according to the text to be converted, and then, in combination with the semantic information of the text to be converted and pre-stored user data, determine the phoneme duration of each phoneme from each duration range, so that the voice data may be generated according to each phoneme duration and the text to be converted and sent to the terminal device 110, which may then receive and play the voice data generated by the server 120.
For simplicity, the following embodiments are described only for the case where the speech synthesis scenario includes the terminal device 110 and not the server 120; in practical applications, however, either the terminal device 110 or the server 120 may perform the conversion to obtain the speech data, which is not limited in this application.
Fig. 3 is a block diagram of a terminal device according to an embodiment of the present application. Referring to fig. 3, the terminal device may include a processor 310, an external memory interface 320, an internal memory 321, a Universal Serial Bus (USB) interface 330, a charging management module 340, a power management module 341, a battery 342, an antenna 1, an antenna 2, a mobile communication module 350, a wireless communication module 360, an audio module 370, a speaker 370A, a receiver 370B, a microphone 370C, an earphone interface 370D, a sensor module 380, a button 390, a motor 391, an indicator 392, a camera 393, a display 394, a Subscriber Identification Module (SIM) card interface 395, and the like. The sensor module 380 may include a pressure sensor 380A, a gyroscope sensor 380B, an air pressure sensor 380C, a magnetic sensor 380D, an acceleration sensor 380E, a distance sensor 380F, a proximity light sensor 380G, a fingerprint sensor 380H, a temperature sensor 380J, a touch sensor 380K, an ambient light sensor 380L, a bone conduction sensor 380M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the terminal device. In other embodiments of the present application, a terminal device may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components may be used. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 310 may include one or more processing units. For example, the processor 310 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller can be a neural center and a command center of the terminal equipment. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 310 for storing instructions and data. In some embodiments, the memory in the processor 310 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 310. If the processor 310 needs to reuse the instruction or data, it can be called directly from the memory. This avoids repeated accesses, reduces the latency of the processor 310, and thereby increases the efficiency of the system.
In some embodiments,processor 310 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, theprocessor 310 may include multiple sets of I2C buses. Theprocessor 310 may be coupled to the touch sensor 380K, the charger, the flash, the camera 393, etc., via different I2C bus interfaces. For example: theprocessor 310 may be coupled to the touch sensor 380K through an I2C interface, so that theprocessor 310 and the touch sensor 380K communicate through an I2C bus interface, thereby implementing the touch function of the terminal device.
The I2S interface may be used for audio communication. In some embodiments, theprocessor 310 may include multiple sets of I2S buses. Theprocessor 310 may be coupled to theaudio module 370 via an I2S bus to enable communication between theprocessor 310 and theaudio module 370. In some embodiments, theaudio module 370 may communicate audio signals to thewireless communication module 360 via an I2S interface, enabling answering of calls via a bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing and encoding analog signals. In some embodiments, theaudio module 370 and thewireless communication module 360 may be coupled by a PCM bus interface. In some embodiments, theaudio module 370 may also transmit audio signals to thewireless communication module 360 through the PCM interface, so as to implement a function of answering a call through a bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communications. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is generally used to connect theprocessor 310 with thewireless communication module 360. For example: theprocessor 310 communicates with the bluetooth module in thewireless communication module 360 through the UART interface to implement the bluetooth function. In some embodiments, theaudio module 370 may transmit the audio signal to thewireless communication module 360 through a UART interface, so as to realize the function of playing music through a bluetooth headset.
The MIPI interface may be used to connectprocessor 310 with peripheral devices such as display 394, camera 393, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments,processor 310 and camera 393 communicate over a CSI interface to implement the capture functionality of the terminal device. Theprocessor 310 and the display screen 394 communicate through a DSI interface to realize the display function of the terminal device.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect theprocessor 310 with the camera 393, the display 394, thewireless communication module 360, theaudio module 370, the sensor module 380, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The USB interface 330 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 330 may be used to connect a charger to charge the terminal device, or may be used to transmit data between the terminal device and the peripheral device. And the earphone can also be used for connecting an earphone and playing audio through the earphone. The interface may also be used to connect other electronic devices, such as AR devices and the like.
It should be understood that the interface connection relationship between the modules in the embodiment of the present invention is only an exemplary illustration, and does not form a structural limitation on the terminal device. In other embodiments of the present application, the terminal device may also adopt different interface connection manners or a combination of multiple interface connection manners in the foregoing embodiments.
The charging management module 340 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 340 may receive charging input from a wired charger via the USB interface 330. In some wireless charging embodiments, the charging management module 340 may receive a wireless charging input through a wireless charging coil of the terminal device. The charging management module 340 may also supply power to the electronic device through the power management module 341 while charging the battery 342.
The power management module 341 is configured to connect the battery 342, the charging management module 340 and theprocessor 310. The power management module 341 receives input from the battery 342 and/or the charge management module 340 and provides power to theprocessor 310, the internal memory 321, the external memory, the display 394, the camera 393, and thewireless communication module 360. The power management module 341 may also be configured to monitor parameters such as battery capacity, battery cycle count, and battery state of health (leakage, impedance). In other embodiments, the power management module 341 may also be disposed in theprocessor 310. In other embodiments, the power management module 341 and the charging management module 340 may be disposed in the same device.
The wireless communication function of the terminal device can be realized by the antenna 1, theantenna 2, themobile communication module 350, thewireless communication module 360, the modem processor, the baseband processor, and the like.
Theantennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in a terminal device may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
Themobile communication module 350 may provide a solution including 2G/3G/4G/5G wireless communication applied on the terminal device. Themobile communication module 350 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. Themobile communication module 350 may receive the electromagnetic wave from the antenna 1, filter, amplify, etc. the received electromagnetic wave, and transmit the filtered electromagnetic wave to the modem processor for demodulation. Themobile communication module 350 may also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave. In some embodiments, at least some of the functional modules of themobile communication module 350 may be disposed in theprocessor 310. In some embodiments, at least some of the functional modules of themobile communication module 350 may be disposed in the same device as at least some of the modules of theprocessor 310.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating a low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then passes the demodulated low frequency baseband signal to a baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to thespeaker 370A, thereceiver 370B, etc.) or displays images or video through the display 394. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be separate from theprocessor 310, and may be disposed in the same device as themobile communication module 350 or other functional modules.
Thewireless communication module 360 may provide solutions for wireless communication applied to the terminal device, including Wireless Local Area Networks (WLANs) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), Global Navigation Satellite Systems (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. Thewireless communication module 360 may be one or more devices integrating at least one communication processing module. Thewireless communication module 360 receives electromagnetic waves via theantenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to theprocessor 310. Thewireless communication module 360 may also receive a signal to be transmitted from theprocessor 310, frequency-modulate and amplify the signal, and convert the signal into electromagnetic waves via theantenna 2 to radiate the electromagnetic waves.
In some embodiments, the terminal device's antenna 1 is coupled to themobile communication module 350 andantenna 2 is coupled to thewireless communication module 360 so that the terminal device can communicate with the network and other devices via wireless communication techniques. The wireless communication technology may include global system for mobile communications (GSM), General Packet Radio Service (GPRS), code division multiple access (code division multiple access, CDMA), Wideband Code Division Multiple Access (WCDMA), time-division code division multiple access (time-division code division multiple access, TD-SCDMA), Long Term Evolution (LTE), LTE, BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a Global Positioning System (GPS), a global navigation satellite system (GLONASS), a beidou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a Satellite Based Augmentation System (SBAS).
The terminal device implements the display function through the GPU, the display screen 394, and the application processor, etc. The GPU is an image processing microprocessor coupled to a display 394 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Theprocessor 310 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 394 is used to display images, video, and the like. The display screen 394 includes a display panel. The display panel may adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device may include 1 or N display screens 394, where N is a positive integer greater than 1.
The terminal device may implement the shooting function through the ISP, the camera 393, the video codec, the GPU, the display 394, the application processor, and the like.
The ISP is used to process the data fed back by the camera 393. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be located in camera 393.
Camera 393 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the terminal device may include 1 or N cameras 393, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the terminal device selects the frequency point, the digital signal processor is used for performing fourier transform and the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The terminal device may support one or more video codecs. In this way, the terminal device can play or record videos in a plurality of coding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can realize the intelligent cognition and other applications of the terminal equipment, such as: image recognition, face recognition, speech recognition, text understanding, and the like.
Theexternal memory interface 320 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the terminal device. The external memory card communicates with theprocessor 310 through theexternal memory interface 320 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
The internal memory 321 may be used to store computer-executable program code, which includes instructions. Theprocessor 310 executes various functional applications of the terminal device and data processing by executing instructions stored in the internal memory 321. The internal memory 321 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (such as audio data, a phonebook, etc.) created during use of the terminal device, and the like. In addition, the internal memory 321 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The terminal device may implement an audio function through theaudio module 370, thespeaker 370A, thereceiver 370B, themicrophone 370C, theearphone interface 370D, and the application processor. Such as music playing, recording, etc.
Theaudio module 370 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. Theaudio module 370 may also be used to encode and decode audio signals. In some embodiments, theaudio module 370 may be disposed in theprocessor 310, or some functional modules of theaudio module 370 may be disposed in theprocessor 310.
Thespeaker 370A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. The terminal device can listen to music through thespeaker 370A or listen to a handsfree call.
Thereceiver 370B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. When the terminal device answers a call or voice information, it is possible to answer a voice by placing thereceiver 370B close to the human ear.
Microphone 370C, also known as a "microphone," is used to convert sound signals into electrical signals. When making a call or transmitting voice information, the user can input a voice signal into themicrophone 370C by speaking the user's mouth near themicrophone 370C. The terminal device may be provided with at least onemicrophone 370C. In other embodiments, the terminal device may be provided with twomicrophones 370C, which may implement a noise reduction function in addition to collecting sound signals. In other embodiments, the terminal device may further include three, four, ormore microphones 370C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
The headphone interface 370D is used to connect wired headphones. The headset interface 370D may be the USB interface 330, a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
The pressure sensor 380A is used for sensing a pressure signal, and converting the pressure signal into an electrical signal. In some embodiments, the pressure sensor 380A may be disposed on the display screen 394. The pressure sensor 380A can be of a wide variety, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, or the like. The capacitive pressure sensor may be a sensor comprising at least two parallel plates having an electrically conductive material. When a force acts on the pressure sensor 380A, the capacitance between the electrodes changes. The terminal device determines the intensity of the pressure from the change in capacitance. When a touch operation is applied to the display screen 394, the terminal device detects the intensity of the touch operation according to the pressure sensor 380A. The terminal device can also calculate the touched position from the detection signal of the pressure sensor 380A. In some embodiments, the touch operations that are applied to the same touch position but different touch operation intensities may correspond to different operation instructions. For example: and when the touch operation with the touch operation intensity smaller than the first pressure threshold value acts on the short message application icon, executing an instruction for viewing the short message. And when the touch operation with the touch operation intensity larger than or equal to the first pressure threshold value acts on the short message application icon, executing an instruction of newly building the short message.
The gyro sensor 380B may be used to determine the motion attitude of the terminal device. In some embodiments, the angular velocity of the terminal device about three axes (i.e., the x, y, and z axes) may be determined by gyroscope sensor 380B. The gyro sensor 380B may be used for photographing anti-shake. Illustratively, when the shutter is pressed, the gyroscope sensor 380B detects the shake angle of the terminal device, calculates the distance to be compensated for by the lens module according to the shake angle, and enables the lens to counteract the shake of the terminal device through reverse movement, thereby achieving anti-shake. The gyro sensor 380B may also be used for navigation, somatosensory gaming scenes.
The air pressure sensor 380C is used to measure air pressure. In some embodiments, the terminal device calculates altitude, aiding positioning and navigation, from the barometric pressure values measured by barometric pressure sensor 380C.
The magnetic sensor 380D includes a hall sensor. The terminal device may detect the opening and closing of the flip holster using the magnetic sensor 380D. In some embodiments, when the terminal device is a flip, the terminal device may detect the opening and closing of the flip according to the magnetic sensor 380D. And then according to the opening and closing state of the leather sheath or the opening and closing state of the flip cover, the automatic unlocking of the flip cover is set.
The acceleration sensor 380E can detect the magnitude of the terminal device acceleration in various directions (typically three axes). When the terminal equipment is static, the size and the direction of gravity can be detected. The method can also be used for recognizing the posture of the electronic equipment, and is applied to horizontal and vertical screen switching, pedometers and other applications.
A distance sensor 380F for measuring distance. The terminal device may measure the distance by infrared or laser. In some embodiments, the scene is photographed and the terminal device may range using the distance sensor 380F to achieve fast focus.
The proximity light sensor 380G may include, for example, a Light Emitting Diode (LED) and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode. The terminal device emits infrared light to the outside through the light emitting diode. The terminal device detects infrared reflected light from a nearby object using a photodiode. When sufficient reflected light is detected, it can be determined that there is an object near the terminal device. When insufficient reflected light is detected, the terminal device may determine that there is no object near the terminal device. The terminal device can detect that the user holds the terminal device by the proximity light sensor 380G and calls near the ear, so that the screen is automatically turned off to achieve the purpose of saving power. The proximity light sensor 380G may also be used in a holster mode, a pocket mode automatically unlocks and locks the screen.
The ambient light sensor 380L is used to sense the ambient light level. The terminal device may adaptively adjust the brightness of the display 394 based on the perceived ambient light level. The ambient light sensor 380L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 380L may also cooperate with the proximity light sensor 380G to detect whether the terminal device is in a pocket to prevent accidental touches.
The fingerprint sensor 380H is used to capture a fingerprint. The terminal equipment can utilize the collected fingerprint characteristics to realize fingerprint unlocking, access to an application lock, fingerprint photographing, fingerprint incoming call answering and the like.
The temperature sensor 380J is used to detect temperature. In some embodiments, the terminal device executes a temperature processing strategy using the temperature detected by the temperature sensor 380J. For example, when the temperature reported by the temperature sensor 380J exceeds the threshold, the terminal device performs a reduction in the performance of the processor located near the temperature sensor 380J, so as to reduce power consumption and implement thermal protection. In other embodiments, the terminal device heats the battery 342 when the temperature is below another threshold to avoid abnormal shutdown of the terminal device due to low temperature. In other embodiments, the terminal device performs boosting of the output voltage of the battery 342 when the temperature is below a further threshold value to avoid abnormal shutdown due to low temperature.
The touch sensor 380K is also referred to as a "touch panel". The touch sensor 380K may be disposed on the display screen 394, and the touch sensor 380K and the display screen 394 form a touch screen, which is also referred to as a "touch screen". The touch sensor 380K is used to detect a touch operation applied thereto or thereabout. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided via the display 394. In other embodiments, the touch sensor 380K may be disposed on the surface of the terminal device at a different location than the display 394.
Thebone conduction sensor 380M can acquire a vibration signal. In some embodiments, thebone conduction transducer 380M can acquire a vibration signal of the vibrating bone mass of the human voice. Thebone conduction sensor 380M may also contact the human body pulse to receive the blood pressure pulsation signal. In some embodiments, thebone conduction sensor 380M may also be disposed in a headset, integrated into a bone conduction headset. Theaudio module 370 may analyze a voice signal based on the vibration signal of the bone mass vibrated by the sound part acquired by thebone conduction sensor 380M, so as to implement a voice function. The application processor can analyze heart rate information based on the blood pressure beating signal acquired by thebone conduction sensor 380M, so as to realize the heart rate detection function.
Keys 390 include a power-on key, a volume key, etc. The keys 390 may be mechanical keys. Or may be touch keys. The terminal device may receive a key input, and generate a key signal input related to user setting and function control of the terminal device.
The motor 391 may generate a vibration cue. The motor 391 may be used for both incoming call vibration prompting and touch vibration feedback. For example, touch operations applied to different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. The motor 391 may also respond to different vibration feedback effects by performing touch operations on different areas of the display 394. Different application scenes (such as time reminding, receiving information, alarm clock, game and the like) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.
Indicator 392 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.
The SIM card interface 395 is used to connect a SIM card. The SIM card can be attached to or detached from the terminal device by being inserted into or pulled out of the SIM card interface 395. The terminal device can support 1 or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 395 may support a Nano SIM card, a Micro SIM card, a SIM card, and the like. Multiple cards can be inserted into the same SIM card interface 395 at the same time. The types of the multiple cards may be the same or different. The SIM card interface 395 may also be compatible with different types of SIM cards. The SIM card interface 395 may also be compatible with an external memory card. The terminal device interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the terminal device employs an eSIM, that is, an embedded SIM card. The eSIM card can be embedded in the terminal device and cannot be separated from the terminal device.
Fig. 4 is a schematic flow chart of a speech synthesis method provided in an embodiment of the present application, which may be applied to the terminal device described above by way of example and not limitation, and referring to fig. 4, the method includes:
step 401, determining a duration range of each phoneme corresponding to the text to be converted.
In the process of generating the voice data, the terminal device may determine the phoneme duration of each phoneme corresponding to the text to be converted, and generate the voice data through a preset acoustic model and a vocoder in combination with the linguistic features of the text to be converted. Moreover, in the process of determining the phoneme durations, the duration range of each phoneme may be determined first, so that in the subsequent steps different phoneme durations may be selected based on these duration ranges to improve the naturalness and diversity of the generated speech data.
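As a rough illustration of this flow, the sketch below strings the steps together; all of the callables (extract_phonemes, duration_range, acoustic_model, vocoder) are placeholders standing in for the preset models described in this application, not real library APIs.

```python
# High-level sketch of the flow described above. All callables are placeholders
# for the preset models; their interfaces are assumptions for illustration only.
import random

def synthesize(text, extract_phonemes, duration_range, acoustic_model, vocoder):
    phonemes = extract_phonemes(text)                      # linguistic front-end
    ranges = [duration_range(text, p) for p in phonemes]   # step 401
    # Any value inside a phoneme's range is a valid duration (step 402), so
    # repeated calls on the same text can yield different prosody.
    durations = [random.uniform(lo, hi) for lo, hi in ranges]
    features = acoustic_model(text, phonemes, durations)   # acoustic features
    return vocoder(features)                               # waveform samples
```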
In a specific implementation, the terminal device may extract each phoneme in the text to be converted, then obtain pronunciation information of each phoneme based on a pre-trained model, and then determine the duration range of each phoneme according to the pronunciation information of each phoneme. Referring to fig. 5, step 401 may include step 401a and step 401b.
401a, determining the average pronunciation duration, pronunciation duration variance and pronunciation duration distribution density of each phoneme corresponding to the text to be converted.
The terminal device may input the text to be converted into a pre-trained model, extract each phoneme of the text to be converted through the model, and analyze the text to be converted to determine pronunciation information of each phoneme, such as the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density, so that in subsequent steps the duration range of each phoneme may be determined from this pronunciation information.
In the process of extracting the phonemes through the model, the terminal device may split the text to be converted to obtain a plurality of characters arranged in sequence, extract at least one phoneme of each character based on the pronunciation rule of each character, and obtain a plurality of phonemes of the text to be converted after the extraction of the phonemes of each character is completed.
For example, based on the pronunciation rules of pinyin, the terminal device may use the initial and the final corresponding to each character as the phonemes of that character. If the text to be converted is "today is sunny" (pinyin "jin tian tian qi hao qing lang"), the phonemes of the text to be converted may be "j", "in", "t", "ian", "t", "ian", "q", "i", "h", "ao", "q", "ing", "l", and "ang", respectively.
In addition, since different types of information can be included in the pronunciation information of each phoneme, the terminal device can input the text to be converted into different models to obtain different pronunciation information. For example, the terminal device may input the text to be converted into a preset text analysis model, and obtain a pronunciation duration distribution density of each phoneme output by the text analysis model, where the text analysis model may be a Deep Neural Network (DNN) model. And/or the terminal equipment can input the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation duration variance of each phoneme output by the duration model.
401b, determining the duration range of each phoneme according to the average pronunciation duration, the pronunciation duration variance and the pronunciation duration distribution density of each phoneme.
After obtaining the pronunciation information of each phoneme, the terminal device may substitute the parameters included in the pronunciation information into a preset calculation formula to obtain the duration range of each phoneme. For example, assuming that the phoneme duration of each phoneme follows a normal distribution, the duration range of each phoneme may be calculated with a normal distribution algorithm from the average pronunciation duration, the pronunciation duration variance and the pronunciation duration distribution density of each phoneme.
For example, referring to fig. 6, the average pronunciation duration of the x-th phoneme is denoted t(x), the pronunciation duration variance is std²(x), and the pronunciation duration distribution density is p(x). Substituting these into the normal-distribution density formula p(x) = N(t(x), std²(x)) and solving for x yields two roots x1 and x2; if x1 is smaller than x2, the interval [x1, x2] may be taken as the duration range of the x-th phoneme.
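A minimal numerical sketch of this computation, assuming the phoneme duration follows a normal distribution with the predicted mean and variance; the function name and the example numbers are chosen only for illustration.

import math

def duration_range(mean_t, var, density):
    """Solve N(x; mean_t, var) = density for x and return (x1, x2) with x1 < x2."""
    std = math.sqrt(var)
    # Peak of the normal density; the requested density cannot exceed it.
    peak = 1.0 / (std * math.sqrt(2.0 * math.pi))
    if density > peak:
        raise ValueError("density exceeds the maximum of the normal PDF")
    # Distance d from the mean at which the density drops to the given value.
    d = std * math.sqrt(-2.0 * math.log(density / peak))
    return mean_t - d, mean_t + d

# e.g. a phoneme with mean duration 120 ms, variance 100 ms^2, density 0.02
x1, x2 = duration_range(120.0, 100.0, 0.02)
print(round(x1, 1), round(x2, 1))   # approximately 108.2 131.8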
Step 402, determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme.
After obtaining the duration range of each phoneme, the terminal device may determine the phoneme duration of each phoneme based on the text semantic information of each phoneme in the text to be converted, based on user data such as the user's age and personality, or randomly, so that the terminal device may generate different speech data based on the same text to be converted.
Optionally, for each phoneme, the terminal device may obtain text semantic information of the phoneme according to a position of the character corresponding to the phoneme in the text to be converted, and then determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
In a possible implementation manner, for each phoneme, the terminal device may first determine a word corresponding to the phoneme, then determine a sentence in which the word is located in the text to be converted, analyze and determine the semantics of the sentence, and determine the text semantic information of the phoneme in combination with the semantics expressed by all sentences in the text to be converted. Then, the terminal device may select the phoneme duration of the phoneme from a plurality of durations corresponding to the duration range according to the text semantic information.
For example, if the text semantic information of a phoneme indicates a happy mood, a shorter duration may be selected from the duration range as the phoneme duration of the phoneme.
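A small sketch of this selection step, assuming a hypothetical mood label produced by the semantic analysis; the label names and the 0.25/0.75 positions within the range are illustrative choices, not values prescribed by the method.

import random

def pick_duration(x1, x2, mood=None):
    """Choose a phoneme duration inside [x1, x2] from a hypothetical mood label."""
    if mood == "happy":       # lively text: brisker, shorter phonemes
        return x1 + 0.25 * (x2 - x1)
    if mood == "calm":        # steady text: slower, longer phonemes
        return x1 + 0.75 * (x2 - x1)
    return random.uniform(x1, x2)   # no semantic cue: any duration in the range

print(pick_duration(108.2, 131.8, mood="happy"))   # about 114.1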
Alternatively, the terminal device may acquire user data, which may include the age information and personality information of the user, and the terminal device may determine the phoneme duration of each phoneme from the duration range of each phoneme according to the age and personality of the user, thereby generating voice data matching the user.
In a possible implementation manner, the terminal device may obtain pre-stored user data, or request the user data from the server, determine a voice type matched with the user according to the user data, and then select the phoneme duration of the phoneme from a plurality of durations corresponding to the duration range according to the voice type of the user.
For example, if the user data indicates that the user is a middle-aged person with a steady personality, the speech type matching the user may be an unhurried, slow-paced speaking style, and accordingly a longer duration may be selected as the phoneme duration of the phoneme.
It should be noted that, in practical applications, the user data may further include other information indicating the type of the user speaking, for example, the user data may include search data indicating the emotion of the user, and may further include shopping data indicating whether the user has recently purchased a commodity, and the like, which is not limited in this embodiment of the present application.
In addition, if the terminal device can obtain both the text semantic information and the user data, the phoneme duration of each phoneme may be further determined according to weights respectively assigned to the text semantic information and the user data. If the terminal device can obtain neither the text semantic information nor the user data, the phoneme duration of each phoneme may be determined under the assumption that the duration of each phoneme in the text to be converted follows a normal distribution. The manner of determining the phoneme duration is not limited in the embodiments of the present application.
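The three cases above (user data only, both sources combined with weights, neither source available) could be sketched as follows; the age threshold, the weights and the mapping from personality to speaking style are hypothetical examples rather than values defined by the method.

import random

def duration_from_user(x1, x2, age, personality):
    """Map hypothetical user data to a duration: older or steadier users speak more slowly."""
    slow = 1.0 if (age >= 40 or personality == "steady") else 0.3
    return x1 + slow * (x2 - x1)

def combine_durations(sem_duration, user_duration, w_sem=0.5, w_user=0.5):
    """Weighted combination when both text semantic information and user data are available."""
    return (w_sem * sem_duration + w_user * user_duration) / (w_sem + w_user)

def fallback_duration(mean_t, var, x1, x2):
    """Neither source available: sample the assumed normal distribution, truncated to [x1, x2]."""
    while True:
        d = random.gauss(mean_t, var ** 0.5)
        if x1 <= d <= x2:
            return d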
Step 403, generating voice data according to the text to be converted and the phoneme duration of each phoneme.
In the process of synthesizing the speech data, the terminal device may generate the speech data based on the phoneme duration of each phoneme in different manners. For example, the terminal device may generate the speech data using a parametric method, a concatenation method, or an end-to-end method; regardless of which method is used, the phoneme duration of each phoneme corresponding to the text to be converted may be determined in the manner described above.
Taking the parametric method as an example, the terminal device may determine the phoneme duration of each phoneme in the manner described above, input the phoneme durations and the extracted linguistic features of the text to be converted into the acoustic model to obtain parameters such as the fundamental frequency for generating the speech data, and then generate the speech data from these parameters through the vocoder.
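Putting the steps together, the following toy sketch shows how such a parametric pipeline could be wired up end to end. It reuses extract_phonemes, duration_range and pick_duration from the sketches above; the dummy acoustic model, dummy vocoder and the constant per-phoneme statistics are placeholders standing in for whatever trained models the terminal device actually deploys.

def predict_pronunciation_info(phoneme):
    # Hypothetical stand-in for the text analysis / duration models:
    # returns (average duration in ms, variance in ms^2, distribution density).
    return 120.0, 100.0, 0.02

class DummyAcousticModel:
    def predict(self, phonemes, durations):
        # Pretend frame-level parameters, e.g. one F0 value per 10 ms frame.
        return [220.0] * int(sum(durations) // 10)

class DummyVocoder:
    def generate(self, params):
        # Pretend waveform: a fixed number of samples per parameter frame.
        return [0.0] * (len(params) * 80)

def synthesize(text, acoustic_model, vocoder):
    phonemes = extract_phonemes(text)                      # phoneme extraction
    durations = []
    for ph in phonemes:
        mean_t, var, density = predict_pronunciation_info(ph)
        x1, x2 = duration_range(mean_t, var, density)      # step 401
        durations.append(pick_duration(x1, x2))            # step 402
    params = acoustic_model.predict(phonemes, durations)   # step 403: acoustic model
    return vocoder.generate(params)                        # step 403: vocoder

wave = synthesize("今天天气好晴朗", DummyAcousticModel(), DummyVocoder())
print(len(wave), "samples")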
The process of generating voice data by using the splicing method or the end-to-end method is similar to the process of generating voice data by using the parameter method, and is not described herein again.
To sum up, in the speech synthesis method provided in the embodiments of the present application, the duration range of each phoneme corresponding to the text to be converted is determined, any duration within the duration range of each phoneme is then determined as the phoneme duration of the corresponding phoneme, and finally voice data is generated according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of voice data synthesized from the same text to be converted, the phoneme duration of the same phoneme may take different values within the same duration range, so that a variety of different voice data can be synthesized. This avoids obtaining identical voice data every time the same text to be converted is synthesized, reduces the mechanical feel of speech synthesis, and improves the naturalness and diversity of the synthesized speech.
Moreover, because the duration range of each phoneme is determined first and the phoneme duration is selected within that range, the value of the phoneme duration cannot deviate excessively, which further avoids abnormal voice data caused by an excessively large or small phoneme duration and improves the stability of speech synthesis.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 7 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application, which corresponds to the speech synthesis method described in the foregoing embodiment, and only shows portions related to the embodiment of the present application for convenience of description.
Referring to fig. 7, the apparatus includes:
a range determining module 701, configured to determine a duration range of each phoneme corresponding to the text to be converted;
a duration determining module 702, configured to determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme;
a generating module 703, configured to generate voice data according to the text to be converted and the phoneme duration of each phoneme.
Optionally, the range determining module 701 is specifically configured to determine an average pronunciation duration, a pronunciation duration variance, and a pronunciation duration distribution density of each phoneme corresponding to the text to be converted; and determine the duration range of each phoneme according to the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme.
Optionally, the range determining module 701 is further specifically configured to input the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model; and input the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation duration variance of each phoneme output by the duration model.
Optionally, the range determining module 701 is further specifically configured to determine the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme.
Optionally, the duration determining module 702 is specifically configured to, for each phoneme, obtain text semantic information of the phoneme according to a position of the character corresponding to the phoneme in the text to be converted; and determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
Optionally, the duration determining module 702 is specifically configured to obtain user data, where the user data includes age information and personality information of the user; and determine the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
Optionally, the generating module 703 is specifically configured to generate the voice data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.
To sum up, the speech synthesis apparatus provided in the embodiments of the present application determines the duration range of each phoneme corresponding to the text to be converted, determines any duration within the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generates voice data according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of voice data synthesized from the same text to be converted, the phoneme duration of the same phoneme may take different values within the same duration range, so that a variety of different voice data can be synthesized. This avoids obtaining identical voice data every time the same text to be converted is synthesized, reduces the mechanical feel of speech synthesis, and improves the naturalness and diversity of the synthesized speech.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to a terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random-access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, computer-readable media may not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

Translated from Chinese
1. A speech synthesis method, comprising:
determining a duration range of each phoneme corresponding to a text to be converted;
determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme; and
generating speech data according to the text to be converted and the phoneme duration of each phoneme.
2. The speech synthesis method according to claim 1, wherein the determining a duration range of each phoneme corresponding to the text to be converted comprises:
determining an average pronunciation duration, a pronunciation duration variance, and a pronunciation duration distribution density of each phoneme corresponding to the text to be converted; and
determining the duration range of each phoneme according to the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme.
3. The speech synthesis method according to claim 2, wherein the determining an average pronunciation duration, a pronunciation duration variance, and a pronunciation duration distribution density of each phoneme corresponding to the text to be converted comprises:
inputting the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model; and
inputting the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation duration variance of each phoneme output by the duration model.
4. The speech synthesis method according to claim 2, wherein the determining the duration range of each phoneme according to the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme comprises:
determining the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme.
5. The speech synthesis method according to claim 1, wherein the determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme comprises:
for each phoneme, obtaining text semantic information of the phoneme according to a position, in the text to be converted, of the character corresponding to the phoneme; and
determining the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
6. The speech synthesis method according to claim 1, wherein the determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme comprises:
obtaining user data, wherein the user data comprises age information and personality information of a user; and
determining the phoneme duration of each phoneme based on the duration range of the phoneme and the user data.
7. The speech synthesis method according to any one of claims 1 to 6, wherein the generating speech data according to the text to be converted and the phoneme duration of each phoneme comprises:
generating the speech data through a preset acoustic model and a vocoder according to the text to be converted and the phoneme duration of each phoneme.
8. A speech synthesis apparatus, comprising:
a range determining module, configured to determine a duration range of each phoneme corresponding to a text to be converted;
a duration determining module, configured to determine any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme; and
a generating module, configured to generate speech data according to the text to be converted and the phoneme duration of each phoneme.
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202010456116.1A | 2020-05-26 | 2020-05-26 | Speech synthesis method and device | Pending | CN113793589A (en)

Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
CN202010456116.1A (CN113793589A) | 2020-05-26 | 2020-05-26 | Speech synthesis method and device
PCT/CN2021/080403 (WO2021238338A1) | 2020-05-26 | 2021-03-12 | Speech synthesis method and device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010456116.1A (CN113793589A) | 2020-05-26 | 2020-05-26 | Speech synthesis method and device

Publications (1)

Publication Number | Publication Date
CN113793589A (en) | 2021-12-14

Family

ID=78745521

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010456116.1A (CN113793589A, Pending) | Speech synthesis method and device | 2020-05-26 | 2020-05-26

Country Status (2)

Country | Link
CN (1) | CN113793589A (en)
WO (1) | WO2021238338A1 (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US4470150A (en) * | 1982-03-18 | 1984-09-04 | Federal Screw Works | Voice synthesizer with automatic pitch and speech rate modulation
KR19980047261A (en) * | 1996-12-14 | 1998-09-15 | 양승택 | Duration Modeling Device of Speech Synthesizer and Modeling Method Using the Same
US6064960A (en) * | 1997-12-18 | 2000-05-16 | Apple Computer, Inc. | Method and apparatus for improved duration modeling of phonemes
US20020107688A1 (en) * | 1998-03-10 | 2002-08-08 | Mitsuru Otsuka | Speech synthesizing method and apparatus
JP2006215110A (en) * | 2005-02-01 | 2006-08-17 | Nippon Hoso Kyokai <NHK> | Duration time length generating device and duration time length generating program
WO2008083689A1 (en) * | 2007-01-14 | 2008-07-17 | The Engineering Company For The Development Of Computer Systems (RDI) | System and method for qur'an recitation rules
JP2013015829A (en) * | 2011-06-07 | 2013-01-24 | Yamaha Corp | Voice synthesizer
CN106601226A (en) * | 2016-11-18 | 2017-04-26 | 中国科学院自动化研究所 | Phoneme duration prediction modeling method and phoneme duration prediction method
CN107705782A (en) * | 2017-09-29 | 2018-02-16 | 百度在线网络技术(北京)有限公司 | Method and apparatus for determining phoneme pronunciation duration
CN108597492A (en) * | 2018-05-02 | 2018-09-28 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method and device
CN109192224A (en) * | 2018-09-14 | 2019-01-11 | 科大讯飞股份有限公司 | A kind of speech evaluating method, device, equipment and readable storage medium
CN109599092A (en) * | 2018-12-21 | 2019-04-09 | 秒针信息技术有限公司 | A kind of audio synthesis method and device
CN109841202A (en) * | 2019-01-04 | 2019-06-04 | 平安科技(深圳)有限公司 | Rhythm generation method, device and terminal device based on speech synthesis
CN109979428A (en) * | 2019-04-02 | 2019-07-05 | 北京地平线机器人技术研发有限公司 | Audio generation method and device, storage medium, electronic equipment
CN110459200A (en) * | 2019-07-05 | 2019-11-15 | 深圳壹账通智能科技有限公司 | Phoneme synthesizing method, device, computer equipment and storage medium
US20200027440A1 (en) * | 2017-03-23 | 2020-01-23 | D&M Holdings, Inc. | System Providing Expressive and Emotive Text-to-Speech
CN110992927A (en) * | 2019-12-11 | 2020-04-10 | 广州酷狗计算机科技有限公司 | Audio generation method and device, computer readable storage medium and computing device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JPH0580791A (en) * | 1991-09-20 | 1993-04-02 | Hitachi Ltd | Device and method for speech rule synthesis
CN107481715B (en) * | 2017-09-29 | 2020-12-08 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating information


Also Published As

Publication number | Publication date
WO2021238338A1 (en) | 2021-12-02

Similar Documents

Publication | Publication Date | Title
CN111669515B (en) A video generation method and related device
CN114242037B (en) Method and device for generating virtual character
CN113838490A (en)Video synthesis method and device, electronic equipment and storage medium
CN112651510B (en)Model updating method, working node and model updating system
CN111563466B (en) Face detection methods and related products
CN113467735A (en)Image adjusting method, electronic device and storage medium
CN111176465A (en) Use state identification method, device, storage medium and electronic device
CN112256868A (en)Zero-reference resolution method, method for training zero-reference resolution model and electronic equipment
CN112783330A (en)Electronic equipment operation method and device and electronic equipment
CN113542580A (en) Method, device and electronic device for removing speckle from glasses
CN114005016A (en)Image processing method, electronic equipment, image processing system and chip system
WO2022062884A1 (en)Text input method, electronic device, and computer-readable storage medium
CN111460942B (en) Proximity detection method and device, computer readable medium and terminal equipment
CN116843583B (en)Image processing method, device, electronic equipment and storage medium
CN113380240A (en)Voice interaction method and electronic equipment
CN114120987A (en)Voice awakening method, electronic equipment and chip system
CN113509145A (en) Sleep risk monitoring method, electronic device and storage medium
CN114093368B (en) Cross-device voiceprint registration method, electronic device and storage medium
CN113793590B (en)Speech synthesis method and device
CN113793589A (en) Speech synthesis method and device
CN115393676A (en) Gesture control optimization method, device, terminal and storage medium
CN114201609A (en)Fault data processing method and device and terminal equipment
CN113823379A (en) Recommended methods, apparatus and equipment for paddling frequency
CN113364067A (en)Charging precision calibration method and electronic equipment
CN114079694B (en) Control labeling method and device

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2021-12-14

