Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The terminology used in the following examples is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include expressions such as "one or more" as well, unless the context clearly indicates otherwise. It should also be understood that in the embodiments of the present application, "one or more" means one, two, or more than two; "and/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A alone, both A and B, and B alone, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The speech synthesis method provided by the embodiment of the application can be applied to terminal devices such as a mobile phone, a tablet personal computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and the embodiment of the application does not limit the specific type of the terminal device at all.
For example, the terminal device may be a station (ST) in a WLAN, which may be a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, a vehicle-mounted device, a vehicle-mounted networking terminal, a computer, a laptop, a handheld communication device, a handheld computing device, a satellite wireless device, a wireless modem card, a television set-top box (STB), a Customer Premises Equipment (CPE), and/or another device for communicating over a wireless system or a next-generation communication system, such as a mobile terminal in a 5G network or a mobile terminal in a future evolved Public Land Mobile Network (PLMN), etc.
By way of example and not limitation, when the terminal device is a wearable device, the wearable device may be a general term for devices developed by applying wearable technology to the intelligent design of everyday wear, such as glasses, gloves, watches, clothing, shoes, and the like. A wearable device is a portable device that is worn directly on the body or integrated into the clothing or accessories of the user. A wearable device is not only a hardware device; it also realizes powerful functions through software support, data interaction, and cloud interaction. In a broad sense, wearable smart devices include devices that are full-featured and large in size and can realize complete or partial functions without depending on a smartphone, such as smart watches or smart glasses, as well as devices that focus on only one type of application function and need to be used together with another device such as a smartphone, such as various smart bracelets for monitoring physical signs, smart jewelry, and the like.
Fig. 1 is a schematic diagram of a speech synthesis scenario to which the speech synthesis method provided in an embodiment of the present application relates. Referring to Fig. 1, the speech synthesis scenario may include a terminal device 110. The terminal device 110 may obtain a text to be converted and adjust the phoneme duration of each phoneme corresponding to the text to be converted, so that different speech data may be generated based on the same text to be converted.
In a possible implementation, the terminal device 110 may obtain a text to be converted and input it into a pre-trained text analysis model and a pre-trained duration model, respectively, to obtain the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme corresponding to the text to be converted. Based on the normal distribution, the duration range of each phoneme may then be determined from the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of that phoneme.
Then, based on the duration range of each phoneme, and in combination with the text semantic information corresponding to each phoneme in the text to be converted and/or pre-stored user data including age information and personality information of the user, the terminal device 110 may determine any duration within the duration range as the phoneme duration of that phoneme, so as to generate the speech data according to the phoneme durations and the text to be converted.
The average pronunciation duration is used for representing the average value of each phoneme duration of the same phoneme, the pronunciation duration variance is used for representing the difference degree of each phoneme duration of the same phoneme and the average pronunciation duration, and the pronunciation duration distribution density is used for representing the probability that the same phoneme corresponds to different phoneme durations.
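The following is a minimal sketch of how a duration range might be derived and a phoneme duration chosen from it. It rests on assumptions that the application does not state: the range is taken as the average pronunciation duration plus or minus `k` standard deviations, the lower bound is clipped at a hypothetical `floor_ms`, and a hypothetical `position` value in [0, 1] (which could be derived from the text semantic information or the user data) selects a point in the range. All function and parameter names are illustrative only.

```python
import math


def duration_range(mean_ms, variance, k=1.0, floor_ms=1.0):
    """Return (low, high) as mean -/+ k standard deviations.

    k and floor_ms are illustrative assumptions, not values from the application.
    """
    std = math.sqrt(variance)
    low = max(floor_ms, mean_ms - k * std)
    high = mean_ms + k * std
    return low, high


def pick_duration(low, high, position):
    """Linearly map a hypothetical position in [0, 1] onto the duration range."""
    position = min(1.0, max(0.0, position))
    return low + position * (high - low)


# One phoneme with an average duration of 80 ms and a variance of 400 ms^2.
low, high = duration_range(mean_ms=80.0, variance=400.0)
slow = pick_duration(low, high, 0.9)  # longer duration, e.g. a slower speaking style
fast = pick_duration(low, high, 0.1)  # shorter duration, e.g. a faster speaking style
```

Because any value between `low` and `high` is a valid phoneme duration in this sketch, the same text can yield different speech data simply by varying `position` per phoneme.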
In addition, a phoneme is the minimum speech unit divided according to the natural attributes of speech. Phonemes are analyzed according to the pronunciation actions in a syllable, and one action constitutes one phoneme. For example, taking the pronunciation rules of pinyin, the initial of the pinyin of each character may be taken as one phoneme and the final of the pinyin as another phoneme. For instance, in "tianqi" ("weather"), the phonemes corresponding to the character "tian" may include "t" and "ian", and the phonemes corresponding to the character "qi" may include "q" and "i".
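The initial/final split described above can be sketched as follows. This is a simplified illustration, not the text analysis model of the application: the initial table, the treatment of "y" and "w" as initials, and the function names are assumptions made for the example.

```python
# Pinyin initials. Two-letter initials are listed first so that "zh", "ch",
# and "sh" are matched before "z", "c", and "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable):
    """Split one toneless pinyin syllable into [initial, final]."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # syllable without an initial, such as "an"


def text_to_phonemes(syllables):
    """Flatten a list of pinyin syllables into a phoneme sequence."""
    phonemes = []
    for syllable in syllables:
        phonemes.extend(split_syllable(syllable))
    return phonemes


phonemes = text_to_phonemes(["tian", "qi"])
```

Here `phonemes` holds the four units "t", "ian", "q", and "i", matching the example in the text.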
It should be noted that, in practical applications, referring to Fig. 2, the speech synthesis scenario may further include a server 120. The terminal device 110 may be connected to the server 120, so that the server 120 may convert the text to be converted and obtain different speech data based on the same text to be converted.
In the process of generating the speech data, the terminal device 110 may first send the text to be converted to the server 120. The server 120 may determine the duration range of each phoneme according to the text to be converted and then, in combination with the semantic information of the text to be converted and the pre-stored user data, determine the phoneme duration of each phoneme from the corresponding duration range. The server 120 may then generate the speech data according to the phoneme durations and the text to be converted and send the generated speech data to the terminal device 110, so that the terminal device 110 may receive and play the speech data generated by the server 120.
For simplicity, the following embodiments are described only by taking as an example a speech synthesis scenario that includes the terminal device 110 and does not include the server 120. In practical applications, however, either the terminal device 110 or the server 120 may convert the text to be converted into the speech data, which is not limited in this application.
Fig. 3 is a block diagram of a terminal device according to an embodiment of the present disclosure. Referring to Fig. 3, the terminal device may include a processor 310, an external memory interface 320, an internal memory 321, a Universal Serial Bus (USB) interface 330, a charging management module 340, a power management module 341, a battery 342, an antenna 1, an antenna 2, a mobile communication module 350, a wireless communication module 360, an audio module 370, a speaker 370A, a receiver 370B, a microphone 370C, an earphone interface 370D, a sensor module 380, a button 390, a motor 391, an indicator 392, a camera 393, a display 394, a Subscriber Identity Module (SIM) card interface 395, and the like. The sensor module 380 may include a pressure sensor 380A, a gyroscope sensor 380B, an air pressure sensor 380C, a magnetic sensor 380D, an acceleration sensor 380E, a distance sensor 380F, a proximity light sensor 380G, a fingerprint sensor 380H, a temperature sensor 380J, a touch sensor 380K, an ambient light sensor 380L, a bone conduction sensor 380M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the terminal device. In other embodiments of the present application, a terminal device may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components may be used. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 310 may include one or more processing units. For example, the processor 310 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be the nerve center and command center of the terminal device. The controller can generate an operation control signal according to the instruction operation code and the timing signal to control instruction fetching and instruction execution.
A memory may also be provided in the processor 310 for storing instructions and data. In some embodiments, the memory in the processor 310 is a cache memory. The memory may hold instructions or data that the processor 310 has just used or uses cyclically. If the processor 310 needs to use the instructions or data again, it can call them directly from the memory. This avoids repeated accesses and reduces the waiting time of the processor 310, thereby increasing the efficiency of the system.
In some embodiments, the processor 310 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
The I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 310 may include multiple sets of I2C buses. The processor 310 may be coupled to the touch sensor 380K, the charger, the flash, the camera 393, etc., via different I2C bus interfaces. For example, the processor 310 may be coupled to the touch sensor 380K through an I2C interface, so that the processor 310 and the touch sensor 380K communicate through the I2C bus interface, thereby implementing the touch function of the terminal device.
The I2S interface may be used for audio communication. In some embodiments, the processor 310 may include multiple sets of I2S buses. The processor 310 may be coupled to the audio module 370 via an I2S bus to enable communication between the processor 310 and the audio module 370. In some embodiments, the audio module 370 may transmit audio signals to the wireless communication module 360 via the I2S interface, so as to implement the function of answering a call through a Bluetooth headset.
The PCM interface may also be used for audio communication, and for sampling, quantizing, and encoding analog signals. In some embodiments, the audio module 370 and the wireless communication module 360 may be coupled by a PCM bus interface. In some embodiments, the audio module 370 may also transmit audio signals to the wireless communication module 360 through the PCM interface, so as to implement the function of answering a call through a Bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communication. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is generally used to connect the processor 310 with the wireless communication module 360. For example, the processor 310 communicates with the Bluetooth module in the wireless communication module 360 through the UART interface to implement the Bluetooth function. In some embodiments, the audio module 370 may transmit the audio signal to the wireless communication module 360 through the UART interface, so as to implement the function of playing music through a Bluetooth headset.
The MIPI interface may be used to connect the processor 310 with peripheral devices such as the display 394, the camera 393, and the like. The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), and the like. In some embodiments, the processor 310 and the camera 393 communicate over a CSI interface to implement the shooting function of the terminal device. The processor 310 and the display screen 394 communicate through a DSI interface to implement the display function of the terminal device.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 310 with the camera 393, the display 394, the wireless communication module 360, the audio module 370, the sensor module 380, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The USB interface 330 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 330 may be used to connect a charger to charge the terminal device, or to transmit data between the terminal device and a peripheral device. It may also be used to connect an earphone and play audio through the earphone. The interface may also be used to connect other electronic devices, such as AR devices and the like.
It should be understood that the interface connection relationship between the modules in the embodiment of the present invention is only an exemplary illustration, and does not form a structural limitation on the terminal device. In other embodiments of the present application, the terminal device may also adopt different interface connection manners or a combination of multiple interface connection manners in the foregoing embodiments.
The charging management module 340 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 340 may receive charging input from a wired charger via the USB interface 330. In some wireless charging embodiments, the charging management module 340 may receive a wireless charging input through a wireless charging coil of the terminal device. The charging management module 340 may also supply power to the electronic device through the power management module 341 while charging the battery 342.
The power management module 341 is configured to connect the battery 342, the charging management module 340, and the processor 310. The power management module 341 receives input from the battery 342 and/or the charging management module 340 and provides power to the processor 310, the internal memory 321, the external memory, the display 394, the camera 393, and the wireless communication module 360. The power management module 341 may also be configured to monitor parameters such as battery capacity, battery cycle count, and battery state of health (leakage, impedance). In other embodiments, the power management module 341 may also be disposed in the processor 310. In other embodiments, the power management module 341 and the charging management module 340 may be disposed in the same device.
The wireless communication function of the terminal device can be realized by the antenna 1, the antenna 2, the mobile communication module 350, the wireless communication module 360, the modem processor, the baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the terminal device may be used to cover a single communication band or multiple communication bands. Different antennas can also be multiplexed to improve antenna utilization. For example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 350 may provide wireless communication solutions, including 2G/3G/4G/5G, applied to the terminal device. The mobile communication module 350 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 350 may receive electromagnetic waves from the antenna 1, perform filtering, amplification, and other processing on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 350 may also amplify the signal modulated by the modem processor and convert it into electromagnetic waves that are radiated through the antenna 1. In some embodiments, at least some of the functional modules of the mobile communication module 350 may be disposed in the processor 310. In some embodiments, at least some of the functional modules of the mobile communication module 350 may be disposed in the same device as at least some of the modules of the processor 310.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating a low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then passes the demodulated low-frequency baseband signal to the baseband processor for processing. The low-frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 370A, the receiver 370B, etc.) or displays images or video through the display 394. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be separate from the processor 310 and may be disposed in the same device as the mobile communication module 350 or other functional modules.
The wireless communication module 360 may provide solutions for wireless communication applied to the terminal device, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite systems (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The wireless communication module 360 may be one or more devices integrating at least one communication processing module. The wireless communication module 360 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on the electromagnetic wave signals, and transmits the processed signals to the processor 310. The wireless communication module 360 may also receive a signal to be transmitted from the processor 310, frequency-modulate and amplify the signal, and convert it into electromagnetic waves that are radiated via the antenna 2.
In some embodiments, the antenna 1 of the terminal device is coupled to the mobile communication module 350 and the antenna 2 is coupled to the wireless communication module 360, so that the terminal device can communicate with networks and other devices via wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a BeiDou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a satellite based augmentation system (SBAS).
The terminal device implements the display function through the GPU, the display screen 394, the application processor, and the like. The GPU is a microprocessor for image processing and is coupled to the display 394 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 310 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 394 is used to display images, video, and the like. The display screen 394 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and the like. In some embodiments, the terminal device may include 1 or N display screens 394, N being a positive integer greater than 1.
The terminal device may implement the shooting function through the ISP, the camera 393, the video codec, the GPU, the display 394, the application processor, and the like.
The ISP is used to process the data fed back by the camera 393. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be located in camera 393.
The camera 393 is used to capture still images or video. An object generates an optical image through the lens, and the optical image is projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the ISP, where it is converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the terminal device may include 1 or N cameras 393, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the terminal device selects the frequency point, the digital signal processor is used for performing fourier transform and the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The terminal device may support one or more video codecs. In this way, the terminal device can play or record videos in a plurality of coding formats, such as Moving Picture Experts Group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, and the like.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer mode between neurons of the human brain, it processes input information quickly and can also learn continuously by itself. The NPU can implement applications such as intelligent cognition of the terminal device, for example image recognition, face recognition, speech recognition, text understanding, and the like.
The external memory interface 320 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the terminal device. The external memory card communicates with the processor 310 through the external memory interface 320 to implement a data storage function. For example, files such as music and video are saved in the external memory card.
The internal memory 321 may be used to store computer-executable program code, which includes instructions. The processor 310 executes various functional applications and data processing of the terminal device by executing the instructions stored in the internal memory 321. The internal memory 321 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like. The data storage area may store data created during use of the terminal device (such as audio data and a phonebook), and the like. In addition, the internal memory 321 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
The terminal device may implement audio functions, such as music playing and recording, through the audio module 370, the speaker 370A, the receiver 370B, the microphone 370C, the earphone interface 370D, the application processor, and the like.
The audio module 370 is used to convert digital audio information into an analog audio signal for output and also to convert an analog audio input into a digital audio signal. The audio module 370 may also be used to encode and decode audio signals. In some embodiments, the audio module 370 may be disposed in the processor 310, or some functional modules of the audio module 370 may be disposed in the processor 310.
The speaker 370A, also called a "horn", is used to convert an electrical audio signal into an acoustic signal. The user can listen to music or to a hands-free call through the speaker 370A of the terminal device.
The receiver 370B, also called an "earpiece", is used to convert an electrical audio signal into an acoustic signal. When the terminal device is used to answer a call or play voice information, the voice can be heard by placing the receiver 370B close to the human ear.
The microphone 370C, also called a "mic", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can input a sound signal into the microphone 370C by speaking with the mouth close to the microphone 370C. The terminal device may be provided with at least one microphone 370C. In other embodiments, the terminal device may be provided with two microphones 370C, which may implement a noise reduction function in addition to collecting sound signals. In other embodiments, the terminal device may further include three, four, or more microphones 370C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
The earphone interface 370D is used to connect wired earphones. The earphone interface 370D may be the USB interface 330, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The pressure sensor 380A is used for sensing a pressure signal and converting the pressure signal into an electrical signal. In some embodiments, the pressure sensor 380A may be disposed on the display screen 394. The pressure sensor 380A can be of many types, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, or the like. The capacitive pressure sensor may be a sensor comprising at least two parallel plates made of an electrically conductive material. When a force acts on the pressure sensor 380A, the capacitance between the electrodes changes. The terminal device determines the intensity of the pressure from the change in capacitance. When a touch operation is applied to the display screen 394, the terminal device detects the intensity of the touch operation by means of the pressure sensor 380A. The terminal device can also calculate the touched position from the detection signal of the pressure sensor 380A. In some embodiments, touch operations that are applied to the same touch position but have different touch operation intensities may correspond to different operation instructions. For example, when a touch operation with a touch operation intensity smaller than a first pressure threshold acts on the short message application icon, an instruction for viewing the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, an instruction for creating a new short message is executed.
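The threshold rule in the short message example can be sketched as below. The normalized intensity scale, the threshold value of 0.5, and the instruction names are hypothetical and chosen only for illustration; the application does not specify them.

```python
# Hypothetical first pressure threshold on a normalized 0..1 intensity scale.
FIRST_PRESSURE_THRESHOLD = 0.5


def sms_icon_instruction(intensity, threshold=FIRST_PRESSURE_THRESHOLD):
    """Map the touch intensity on the short message icon to an instruction."""
    if intensity < threshold:
        return "view_short_message"
    # Intensity greater than or equal to the threshold.
    return "create_short_message"


light_press = sms_icon_instruction(0.2)
firm_press = sms_icon_instruction(0.5)
```

Note that an intensity exactly equal to the threshold falls into the "create" branch, consistent with the "greater than or equal to" wording above.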
The gyroscope sensor 380B may be used to determine the motion attitude of the terminal device. In some embodiments, the angular velocity of the terminal device about three axes (i.e., the x, y, and z axes) may be determined by the gyroscope sensor 380B. The gyroscope sensor 380B may be used for image stabilization during shooting. For example, when the shutter is pressed, the gyroscope sensor 380B detects the shake angle of the terminal device and calculates, from the shake angle, the distance that the lens module needs to compensate, so that the lens counteracts the shake of the terminal device through reverse movement, thereby achieving stabilization. The gyroscope sensor 380B may also be used in navigation and motion-sensing game scenarios.
The air pressure sensor 380C is used to measure air pressure. In some embodiments, the terminal device calculates the altitude from the air pressure value measured by the air pressure sensor 380C to assist positioning and navigation.
The magnetic sensor 380D includes a Hall sensor. The terminal device may detect the opening and closing of a flip holster using the magnetic sensor 380D. In some embodiments, when the terminal device is a flip phone, the terminal device may detect the opening and closing of the flip cover by means of the magnetic sensor 380D. Features such as automatic unlocking upon opening are then set according to the detected open or closed state of the holster or of the flip cover.
The acceleration sensor 380E can detect the magnitude of the acceleration of the terminal device in various directions (typically along three axes). When the terminal device is stationary, the magnitude and direction of gravity can be detected. The acceleration sensor 380E can also be used to recognize the posture of the terminal device and is applied in landscape/portrait switching, pedometers, and other applications.
The distance sensor 380F is used for measuring distance. The terminal device may measure distance by infrared or laser. In some embodiments, in a shooting scenario, the terminal device may measure distance using the distance sensor 380F to achieve fast focusing.
The proximity light sensor 380G may include, for example, a light emitting diode (LED) and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode. The terminal device emits infrared light outward through the light emitting diode. The terminal device detects infrared light reflected from a nearby object using the photodiode. When sufficient reflected light is detected, it can be determined that there is an object near the terminal device. When insufficient reflected light is detected, the terminal device may determine that there is no object near the terminal device. The terminal device can use the proximity light sensor 380G to detect that the user is holding the terminal device close to the ear during a call, so that the screen is automatically turned off to save power. The proximity light sensor 380G may also be used in holster mode and pocket mode to automatically unlock and lock the screen.
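The application does not state how altitude is computed from air pressure. One common approximation, shown here purely as an illustration, is the international barometric formula, which assumes a standard atmosphere and a sea-level reference pressure of 1013.25 hPa. The function name and the default reference are choices made for this sketch.

```python
SEA_LEVEL_HPA = 1013.25  # standard-atmosphere sea-level pressure (assumed reference)


def altitude_m(pressure_hpa, sea_level_hpa=SEA_LEVEL_HPA):
    """Approximate altitude in meters using the international barometric formula."""
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))


at_sea_level = altitude_m(1013.25)
higher_up = altitude_m(900.0)
```

In practice the sea-level reference varies with the weather, so a device would typically calibrate `sea_level_hpa` from another source before relying on the result for positioning.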
The ambient light sensor 380L is used to sense the ambient light level. The terminal device may adaptively adjust the brightness of the display 394 based on the perceived ambient light level. The ambient light sensor 380L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 380L may also cooperate with the proximity light sensor 380G to detect whether the terminal device is in a pocket to prevent accidental touches.
The fingerprint sensor 380H is used to capture a fingerprint. The terminal device can use the captured fingerprint characteristics to implement fingerprint unlocking, access to an application lock, fingerprint-triggered photographing, fingerprint-based answering of incoming calls, and the like.
The temperature sensor 380J is used to detect temperature. In some embodiments, the terminal device executes a temperature processing strategy using the temperature detected by the temperature sensor 380J. For example, when the temperature reported by the temperature sensor 380J exceeds a threshold, the terminal device reduces the performance of a processor located near the temperature sensor 380J, so as to reduce power consumption and implement thermal protection. In other embodiments, the terminal device heats the battery 342 when the temperature is below another threshold, to avoid abnormal shutdown of the terminal device due to low temperature. In still other embodiments, the terminal device boosts the output voltage of the battery 342 when the temperature is below a further threshold, to avoid abnormal shutdown due to low temperature.
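The temperature processing strategy described above can be sketched as follows. The three threshold values, the function name `thermal_actions`, and the action labels are hypothetical; the application describes the three behaviours as separate embodiments, and combining them in a single function is an assumption made for illustration.

```python
from typing import List


def thermal_actions(temp_c: float,
                    high_threshold: float = 45.0,    # assumed value
                    heat_threshold: float = 0.0,     # assumed value
                    boost_threshold: float = -10.0   # assumed value
                    ) -> List[str]:
    """Return the protective actions for a temperature reported by the sensor."""
    actions = []
    if temp_c > high_threshold:
        # Too hot: throttle the nearby processor for thermal protection.
        actions.append("reduce_processor_performance")
    if temp_c < heat_threshold:
        # Too cold: heat the battery to avoid abnormal shutdown.
        actions.append("heat_battery")
    if temp_c < boost_threshold:
        # Colder still: boost the battery output voltage as well.
        actions.append("boost_battery_output_voltage")
    return actions
```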
The touch sensor 380K is also referred to as a "touch panel". The touch sensor 380K may be disposed on the display screen 394, and the touch sensor 380K and the display screen 394 together form a touch screen. The touch sensor 380K is used to detect a touch operation applied on or near it. The touch sensor 380K can pass the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided via the display screen 394. In other embodiments, the touch sensor 380K may be disposed on the surface of the terminal device at a location different from that of the display screen 394.
The bone conduction sensor 380M can acquire a vibration signal. In some embodiments, the bone conduction sensor 380M can acquire the vibration signal of the bone mass vibrated by the human vocal part. The bone conduction sensor 380M may also contact the human pulse to receive a blood pressure pulsation signal. In some embodiments, the bone conduction sensor 380M may also be disposed in a headset to form a bone conduction headset. The audio module 370 may parse a voice signal from the vibration signal of the bone mass vibrated by the vocal part, as acquired by the bone conduction sensor 380M, so as to implement a voice function. The application processor can parse heart rate information from the blood pressure pulsation signal acquired by the bone conduction sensor 380M, so as to implement a heart rate detection function.
The keys 390 include a power key, a volume key, and the like. The keys 390 may be mechanical keys or touch keys. The terminal device may receive a key input and generate a key signal input related to user settings and function control of the terminal device.
The motor 391 may generate a vibration cue. The motor 391 may be used both for incoming call vibration prompts and for touch vibration feedback. For example, touch operations applied to different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. Touch operations applied to different areas of the display screen 394 may also cause the motor 391 to produce different vibration feedback effects. Different application scenarios (such as time reminders, receiving information, alarm clocks, games, and the like) may also correspond to different vibration feedback effects. The touch vibration feedback effect may also be customized.
The indicator 392 may be an indicator light, which may be used to indicate a charging state or a change in battery level, or to indicate a message, a missed call, a notification, and the like.
The SIM card interface 395 is used to connect a SIM card. A SIM card can be brought into contact with or separated from the terminal device by being inserted into or pulled out of the SIM card interface 395. The terminal device can support 1 or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 395 may support a Nano SIM card, a Micro SIM card, a SIM card, and the like. Multiple cards can be inserted into the same SIM card interface 395 at the same time. The types of the multiple cards may be the same or different. The SIM card interface 395 may also be compatible with different types of SIM cards, and may also be compatible with an external memory card. The terminal device interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the terminal device employs an eSIM, namely an embedded SIM card. The eSIM card can be embedded in the terminal device and cannot be separated from the terminal device.
Fig. 4 is a schematic flowchart of a speech synthesis method provided in an embodiment of the present application. By way of example and not limitation, the method may be applied to the terminal device described above. Referring to Fig. 4, the method includes:
Step 401, determining a duration range of each phoneme corresponding to the text to be converted.
In the process of generating speech data, the terminal device can determine the phoneme duration of each phoneme corresponding to the text to be converted, and generate the speech data through a preset acoustic model and a vocoder in combination with the language features of the text to be converted. Moreover, in the process of determining the phoneme duration, the duration range of each phoneme may be determined first, so that in the subsequent step different phoneme durations may be selected based on the duration ranges to improve the naturalness and diversity of the generated speech data.
In a specific implementation, the terminal device may extract each phoneme in the text to be converted, obtain pronunciation information of each phoneme based on a pre-trained model, and then determine the duration range of each phoneme according to that pronunciation information. Referring to Fig. 5, step 401 may include step 401a and step 401b.
Step 401a, determining the average pronunciation duration, pronunciation duration variance and pronunciation duration distribution density of each phoneme corresponding to the text to be converted.
The terminal device may input the text to be converted into a pre-trained model, extract each phoneme of the text to be converted through the model, and analyze the text to be converted to determine pronunciation information of each phoneme, such as the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density, so that in subsequent steps the duration range of each phoneme may be determined according to the pronunciation information.
In the process of extracting the phonemes through the model, the terminal device may split the text to be converted to obtain a plurality of characters arranged in sequence, extract at least one phoneme of each character based on the pronunciation rule of each character, and obtain a plurality of phonemes of the text to be converted after the extraction of the phonemes of each character is completed.
For example, based on the pronunciation rules of pinyin, the terminal device may use the initial and the final corresponding to each character as the phonemes of the character. If the text to be converted is "today is sunny" (in pinyin, "jin tian tian qi hao qing lang"), the phonemes of the text to be converted may be "j", "in", "t", "ian", "t", "ian", "q", "i", "h", "ao", "q", "ing", "l", and "ang", respectively.
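The initial/final split in the example above can be sketched as follows. This is a simplified illustration that assumes the text has already been converted to space-separated pinyin syllables without tone marks; the function names and the table of initials (which treats "y" and "w" as initials) are assumptions rather than part of the present application.

```python
from typing import List

# Pinyin initials; two-letter initials come first so that "zh", "ch", "sh"
# are matched before "z", "c", "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable: str) -> List[str]:
    """Split one pinyin syllable into [initial, final], or [final] if no initial."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # zero-initial syllable such as "an"


def text_to_phonemes(pinyin: str) -> List[str]:
    """Return the ordered phonemes of a space-separated pinyin string."""
    phonemes: List[str] = []
    for syllable in pinyin.split():
        phonemes.extend(split_syllable(syllable))
    return phonemes
```

Calling `text_to_phonemes("jin tian tian qi hao qing lang")` reproduces the fourteen phonemes listed in the example.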
In addition, since the pronunciation information of each phoneme can include different types of information, the terminal device can input the text to be converted into different models to obtain the different pronunciation information. For example, the terminal device may input the text to be converted into a preset text analysis model to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model, where the text analysis model may be a Deep Neural Network (DNN) model; and/or the terminal device may input the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation duration variance of each phoneme output by the duration model.
Step 401b, determining the duration range of each phoneme according to the average pronunciation duration, the pronunciation duration variance and the pronunciation duration distribution density of each phoneme.
After obtaining the pronunciation information of each phoneme, the terminal device may substitute the parameters included in the pronunciation information into a preset calculation formula to obtain the duration range of each phoneme. For example, assuming that the phoneme duration of each phoneme follows a normal distribution, the duration range of each phoneme may be calculated from the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of the phoneme using the formula of the normal distribution.
For example, referring to Fig. 6, let the average pronunciation duration of the xth phoneme be $t(x)$, the pronunciation duration variance be $\mathrm{std}^2(x)$, and the pronunciation duration distribution density be $p(x)$. Solving the formula $p(x) = N(t(x), \mathrm{std}^2(x))$ for x yields x1 and x2; if x1 is smaller than x2, the interval [x1, x2] can be taken as the duration range of the xth phoneme.
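One way to read the formula above is that the normal probability density with mean t(x) and variance std2(x) is set equal to the density value p(x), and the two durations at which the curve reaches that value are taken as the bounds of the range. The sketch below follows this reading, which is an interpretation rather than a statement of the application's exact computation; the function name and the units are assumptions.

```python
import math
from typing import Tuple


def duration_range(mean: float, variance: float, density: float) -> Tuple[float, float]:
    """Solve density = NormalPDF(d; mean, variance) for d and return (x1, x2).

    From density = peak * exp(-(d - mean)^2 / (2 * variance)), where
    peak = 1 / sqrt(2 * pi * variance), it follows that
    d = mean +/- sqrt(-2 * variance * ln(density / peak)).
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    std = math.sqrt(variance)
    peak = 1.0 / (std * math.sqrt(2.0 * math.pi))  # density at the mean
    if not 0.0 < density <= peak:
        raise ValueError("density must lie in (0, peak density]")
    half_width = std * math.sqrt(max(0.0, -2.0 * math.log(density / peak)))
    return mean - half_width, mean + half_width
```

A smaller density value gives a wider range around the average pronunciation duration, and a density equal to the peak collapses the range to the average itself.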
Step 402, determining any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme.
After obtaining the duration range of each phoneme, the terminal device may determine the phoneme duration of each phoneme based on the text semantic information of the phoneme in the text to be converted, based on the age and personality of the user, or at random, so that the terminal device may generate different speech data from the same text to be converted.
Optionally, for each phoneme, the terminal device may obtain text semantic information of the phoneme according to a position of the character corresponding to the phoneme in the text to be converted, and then determine the phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
In a possible implementation manner, for each phoneme, the terminal device may first determine a word corresponding to the phoneme, then determine a sentence in which the word is located in the text to be converted, analyze and determine the semantics of the sentence, and determine the text semantic information of the phoneme in combination with the semantics expressed by all sentences in the text to be converted. Then, the terminal device may select the phoneme duration of the phoneme from a plurality of durations corresponding to the duration range according to the text semantic information.
For example, if the text semantic information of a phoneme indicates a happy mood, a shorter duration may be selected from the duration range as the phoneme duration of the phoneme.
Alternatively, the terminal device may acquire user data, which may include age information and personality information of the user, and may determine the phoneme duration of each phoneme from the duration range of the phoneme according to the age and personality of the user, thereby generating speech data matching the user.
In a possible implementation manner, the terminal device may obtain pre-stored user data, or request the user data from the server, determine a voice type matched with the user according to the user data, and then select the phoneme duration of the phoneme from a plurality of durations corresponding to the duration range according to the voice type of the user.
For example, if the user data indicates that the user is middle-aged and has a steady personality, the speech type matching the user may be a slow and measured speech type, and accordingly a longer duration may be selected as the phoneme duration of the phoneme.
It should be noted that, in practical applications, the user data may further include other information indicating the type of the user speaking, for example, the user data may include search data indicating the emotion of the user, and may further include shopping data indicating whether the user has recently purchased a commodity, and the like, which is not limited in this embodiment of the present application.
In addition, if the terminal device can obtain both the text semantic information and the user data, the phoneme duration of each phoneme can be determined according to weights respectively corresponding to the text semantic information and the user data. However, if the terminal device can obtain neither the text semantic information nor the user data, the phoneme duration of each phoneme may be determined according to the rules of the normal distribution, on the basis that the duration of each phoneme in the text to be converted obeys a normal distribution. The manner of determining the phoneme duration is not limited in the embodiments of the present application.
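The selection logic of step 402 can be sketched as follows. This is a minimal illustration under stated assumptions: each cue (text semantics, user data) is encoded as a hypothetical position between 0 (shortest duration in the range) and 1 (longest), the default weights are arbitrary, and the fallback draws from a normal distribution and clips the draw to the range. None of these names or encodings are defined by the present application.

```python
import random
from typing import Optional, Tuple


def choose_duration(duration_range: Tuple[float, float],
                    mean: float,
                    std: float,
                    semantic_pos: Optional[float] = None,
                    user_pos: Optional[float] = None,
                    semantic_weight: float = 0.5,
                    user_weight: float = 0.5,
                    rng: Optional[random.Random] = None) -> float:
    """Pick one phoneme duration inside duration_range."""
    low, high = duration_range
    cues = [(semantic_pos, semantic_weight), (user_pos, user_weight)]
    available = [(pos, weight) for pos, weight in cues if pos is not None]
    if available:
        total = sum(weight for _, weight in available)
        if total <= 0:
            raise ValueError("weights of the available cues must be positive")
        # Weighted combination of the available cues, kept inside [0, 1].
        pos = sum(p * weight for p, weight in available) / total
        pos = min(1.0, max(0.0, pos))
        return low + pos * (high - low)
    # No cue available: sample from the normal distribution and clip to the range.
    if rng is None:
        rng = random.Random()
    return min(high, max(low, rng.gauss(mean, std)))
```

Because the fallback is random, repeated calls for the same phoneme can return different durations, which is what allows the same text to yield different speech data.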
Step 403, generating voice data according to the text to be converted and the phoneme duration of each phoneme.
In the process of synthesizing the speech data, the terminal device may generate the speech data based on the phoneme duration of each phoneme in different manners. For example, the terminal device may generate the speech data in a parametric method, a concatenation method, or an end-to-end method, and regardless of which method is used to generate the speech data, the phoneme duration of each phoneme corresponding to the text to be converted may be determined in the above manner.
Taking the parametric method as an example, the terminal device may determine the phoneme duration of each phoneme according to the above-mentioned manner, input the phoneme duration and the extracted language features of the text to be converted into the acoustic model to obtain parameters such as a fundamental frequency for generating the speech data, and generate the speech data according to the parameters such as the fundamental frequency through the vocoder.
The process of generating speech data by the concatenation method or the end-to-end method is similar to that of the parametric method, and is not described here again.
To sum up, in the speech synthesis method provided in the embodiment of the present application, the duration range of each phoneme corresponding to the text to be converted is determined, then any duration in the duration range of each phoneme is determined as the phoneme duration of the corresponding phoneme, and finally speech data is generated according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of speech data generated from the same text to be converted, the phoneme duration of the same phoneme may differ among them even though it is drawn from the same duration range, so that multiple different pieces of speech data can be synthesized. This avoids producing identical speech data every time the same text is synthesized, reduces the mechanical quality of the synthesized speech, and improves the naturalness and diversity of speech synthesis.
Moreover, because the duration range of each phoneme is determined and the phoneme duration is selected within that range, the value of the phoneme duration does not deviate greatly. This avoids abnormal speech data caused by a phoneme duration that is too large or too small, and improves the stability of speech synthesis.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 7 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application, which corresponds to the speech synthesis method described in the foregoing embodiment, and only shows portions related to the embodiment of the present application for convenience of description.
Referring to fig. 7, the apparatus includes:
a range determining module 701, configured to determine a duration range of each phoneme corresponding to the text to be converted;
a duration determining module 702, configured to determine any duration in the duration range as the phoneme duration of each phoneme;
a generating module 703, configured to generate speech data according to the text to be converted and the phoneme duration of each phoneme.
Optionally, the range determining module 701 is specifically configured to determine an average pronunciation duration, a pronunciation duration variance, and a pronunciation duration distribution density of each phoneme corresponding to the text to be converted; and determine the duration range of each phoneme according to the average pronunciation duration, the pronunciation duration variance and the pronunciation duration distribution density of each phoneme.
Optionally, the range determining module 701 is further specifically configured to input the text to be converted into a preset text analysis model, so as to obtain the pronunciation duration distribution density of each phoneme output by the text analysis model; and input the text to be converted into a preset duration model to obtain the average pronunciation duration and the pronunciation duration variance of each phoneme output by the duration model.
Optionally, the range determining module 701 is further specifically configured to determine the duration range of each phoneme through a normal distribution algorithm according to the average pronunciation duration, the pronunciation duration variance, and the pronunciation duration distribution density of each phoneme.
Optionally, the duration determining module 702 is specifically configured to, for each phoneme, obtain text semantic information of the phoneme according to a position of a character corresponding to the phoneme in the text to be converted; and determine a phoneme duration of the phoneme based on the duration range of the phoneme and the text semantic information of the phoneme.
Optionally, the duration determining module 702 is specifically configured to obtain user data, where the user data includes age information and personality information of the user; and determine a phoneme duration for each phoneme based on the duration range of the phoneme and the user data.
Optionally, the generating module 703 is specifically configured to generate the speech data through a preset acoustic model and vocoder according to the text to be converted and the phoneme duration of each phoneme.
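The cooperation of the three modules can be sketched as follows. The class name, field names, and callable signatures are hypothetical and only illustrate the data flow from module 701 through module 702 to module 703; the actual models behind each module are not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Range = Tuple[float, float]


@dataclass
class SpeechSynthesisApparatus:
    # Module 701: text -> ordered (phoneme, duration range) pairs.
    determine_ranges: Callable[[str], List[Tuple[str, Range]]]
    # Module 702: (phoneme, duration range) -> one duration within the range.
    determine_duration: Callable[[str, Range], float]
    # Module 703: (text, ordered (phoneme, duration) pairs) -> speech data.
    generate: Callable[[str, List[Tuple[str, float]]], bytes]

    def synthesize(self, text: str) -> bytes:
        """Run the three modules in sequence for one text to be converted."""
        ranges = self.determine_ranges(text)
        durations = [(phoneme, self.determine_duration(phoneme, rng))
                     for phoneme, rng in ranges]
        return self.generate(text, durations)
```

Keeping each module as a replaceable callable mirrors the statement that the division into modules is only one logical division.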
To sum up, the speech synthesis apparatus provided in the embodiment of the present application determines the duration range of each phoneme corresponding to the text to be converted, determines any duration in the duration range of each phoneme as the phoneme duration of the corresponding phoneme, and finally generates the speech data according to the text to be converted and the phoneme duration of each phoneme. For multiple pieces of speech data generated from the same text to be converted, the phoneme duration of the same phoneme may differ among them even though it is drawn from the same duration range, so that multiple different pieces of speech data can be synthesized. This avoids producing identical speech data every time the same text is synthesized, reduces the mechanical quality of the synthesized speech, and improves the naturalness and diversity of speech synthesis.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, can implement the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying computer program code to a terminal device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random-Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.