CN111462769B - End-to-end accent conversion method - Google Patents

End-to-end accent conversion method

Info

Publication number
CN111462769B
CN111462769B · CN202010239586.2A · CN202010239586A
Authority
CN
China
Prior art keywords
accent
speaker
channel
speech
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010239586.2A
Other languages
Chinese (zh)
Other versions
CN111462769A (en)
Inventor
刘颂湘
王迪松
曹悦雯
孙立发
吴锡欣
康世胤
吴志勇
刘循英
蒙美玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dadan Shusheng Technology Co ltd
Original Assignee
Shenzhen Dadan Shusheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dadan Shusheng Technology Co ltd
Priority to CN202010239586.2A
Publication of CN111462769A
Application granted
Publication of CN111462769B
Legal status: Active (current)
Anticipated expiration

Abstract

The invention discloses an end-to-end accent conversion method for converting non-native accented speech into native-accented speech, belonging to the technical field of speech processing; it can also be used to convert the speech of patients with dysarthria into standard speech. The signal parameters of the non-native accented speech and the speaker embedding vector are input to a speech synthesis module, and the output of the speech synthesis module is finally passed through a neural network vocoder to synthesize native-accented speech for the specific speaker. Beneficial effects: non-native accented speech can be converted into native-accented speech without any guidance from native-accented reference audio during conversion, while the original timbre of the speaker is preserved.

Description

End-to-end accent conversion method
Technical Field
The invention relates to the technical field of speech processing, and in particular to an end-to-end accent conversion method.
Background
Speech recognition technology is widely used. Existing speech recognition systems are basically built on standard, native-accented speech of the national language, and they convert a speaker's standard speech into text with high accuracy. However, most people in real life do not have standard pronunciation and carry local accents to a greater or lesser degree; to let them communicate better, their non-native accented speech needs to be converted into native-accented speech. In addition, many patients with dysarthria cannot communicate normally with others, and converting their non-standard pronunciation into standard pronunciation is particularly important for their daily communication. The traditional accent conversion method converts the speaker identity of native-accented source speech into the speaker identity of the non-native speaker, i.e. only the timbre is changed, while the content and pronunciation remain unchanged. Because native-accented reference audio is required during the conversion phase, such methods are prevented from being used in real life. Therefore, this is a problem to be solved at present.
Disclosure of Invention
The invention aims to provide an end-to-end accent conversion method, in order to solve the problem that conventional methods require native-accented reference audio when converting non-native accented speech into native-accented speech.
In order to solve the above technical problems, the technical scheme of the invention is as follows: an end-to-end accent conversion method comprises an accent conversion system for realizing the accent conversion method, wherein the accent conversion system comprises a speech recognition module, a speaker encoder, a speech synthesis module and a neural network vocoder; the speech recognition module is used for converting the acoustic features of the input non-native accented speech into signal parameters of native-accented speech, and the signal parameters are related only to the spoken content of the non-native accented speech; the signal parameters of the non-native accented speech and the speaker embedding vector are input to the speech synthesis module, and the output of the speech synthesis module is finally passed through the neural network vocoder to synthesize native-accented speech for the specific speaker.
As a preferred embodiment of the present invention, the speaker encoder is a scalable speaker-verification neural network that converts acoustic frames computed from input speech of arbitrary length into a fixed-dimensional speaker embedding vector that is associated only with the speaker.
As a preferred embodiment of the present invention, the speech synthesis module is trained with a mean square error loss L_TTS, conditioned on the speaker embedding vector, so that it generates only native-accented speech in the timbre corresponding to the speaker embedding.
As a preferred embodiment of the present invention, the speech recognition module is configured to convert the acoustic features of non-native accented speech into the signal parameters of native-accented speech.
As a preferred embodiment of the present invention, the neural network vocoder is a WaveRNN, LPCNet, or WaveNet network.
As a preferred embodiment of the present invention, the method comprises the steps of: a. collecting speech data and corresponding text data from a plurality of speakers, and training the speaker encoder and the speech synthesis module; b. selecting a speaker with a non-native accent as the target speaker; c. using text samples and speech samples of the target speaker to train the speech recognition module.
The beneficial effects of adopting the above technical scheme are: the end-to-end accent conversion method provided by the invention can convert non-native accented speech into native-accented speech without any guidance from native-accented reference audio during conversion, while keeping the original timbre of the speaker.
Drawings
FIG. 1 is a schematic diagram of a training phase of the present invention;
FIG. 2 is a schematic diagram of the conversion phase of the present invention;
FIG. 3 is a graph showing the mean opinion score results during the experimental stage of the present invention;
FIG. 4 is a schematic diagram of the test results of the present invention at the experimental stage.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings. The description of these embodiments is provided to assist understanding of the present invention, but is not intended to limit it. In addition, the technical features of the embodiments described below may be combined with each other as long as they do not conflict with each other.
This embodiment provides an end-to-end accent conversion method, which comprises an accent conversion system for realizing the accent conversion method. The accent conversion system comprises a speech recognition module, a speaker encoder, a speech synthesis module and a neural network vocoder. The speech recognition module is used for converting the acoustic features of the input non-native accented speech into signal parameters of native-accented speech, and these signal parameters are related only to the spoken content of the non-native accented speech. The signal parameters of the non-native accented speech and the speaker embedding vector are input into the speech synthesis module, and the output of the speech synthesis module finally passes through the neural network vocoder to synthesize native-accented speech for the specific speaker.
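As a rough illustration of the data flow just described, the following is a minimal sketch that treats the four modules as black boxes; the function names and signatures are assumptions made for illustration, and only the overall flow (acoustic features to content-only signal parameters and speaker embedding, then to mel-spectrogram, then to waveform) is taken from the text.

```python
def convert_accent(wav, speech_recognition, speaker_encoder, speech_synthesis, vocoder,
                   extract_features, accent_embedding):
    feats = extract_features(wav)                              # acoustic features of the non-native speech
    spk_emb = speaker_encoder(feats)                           # timbre of the original (non-native) speaker
    linguistic = speech_recognition(feats, accent_embedding)   # content-only signal parameters
    mel = speech_synthesis(linguistic, spk_emb)                # native-accented mel-spectrogram
    return vocoder(mel)                                        # native-accented waveform in the original timbre
```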
The accent conversion method mainly comprises the following steps: a. collecting speech data and corresponding text data from a plurality of speakers, and training the speaker encoder and the speech synthesis module; b. selecting a speaker with a non-native accent as the target speaker; c. using text samples and speech samples of the target speaker to train the speech recognition module.
The speaker encoder is a scalable speaker-verification neural network that generates variable-length acoustic features from input speech of arbitrary length and converts them into a fixed-dimensional speaker embedding vector. The speaker encoder model used here is a scalable and highly accurate speaker-verification neural network that computes a fixed-dimensional speaker embedding vector from a series of acoustic frames extracted from an utterance of arbitrary length. The speaker embedding is used to condition the TTS model on a reference speech signal of the target speaker, so that the generated speech carries the speaker identity of the target speaker. The speaker encoder is trained by optimizing a generalized end-to-end speaker verification loss, so that embeddings of utterances from the same speaker have high cosine similarity, while embeddings of utterances from different speakers are far apart in the embedding space. We expect the speaker encoder to learn a representation relevant to speech synthesis that captures the characteristics of non-native speakers not seen during training.
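The generalized end-to-end objective mentioned above can be sketched as follows. This is a minimal softmax-variant GE2E loss, assuming a batch of L2-normalized embeddings arranged as [speakers, utterances, dim]; the function name, the scale `w`, the bias `b`, and the batch layout are illustrative assumptions rather than details taken from the patent.

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(emb, w, b):
    # emb: [S, U, D] L2-normalized embeddings for S speakers with U utterances each
    S, U, D = emb.shape
    centroids = F.normalize(emb.mean(dim=1), dim=-1)                              # [S, D]
    # centroid of the true speaker with the current utterance excluded (stabilizes training)
    excl = F.normalize((emb.sum(dim=1, keepdim=True) - emb) / (U - 1), dim=-1)    # [S, U, D]

    sim = w * torch.einsum('sud,kd->suk', emb, centroids) + b                     # scaled cosine similarities [S, U, S]
    own = w * (emb * excl).sum(dim=-1) + b                                        # similarity to own (exclusive) centroid [S, U]
    mask = torch.eye(S, dtype=torch.bool, device=emb.device).unsqueeze(1).expand(S, U, S)
    sim = torch.where(mask, own.unsqueeze(-1).expand(S, U, S), sim)

    # each utterance should be most similar to its own speaker's centroid
    target = torch.arange(S, device=emb.device).repeat_interleave(U)
    return F.cross_entropy(sim.reshape(S * U, S), target)
```

In practice `w` and `b` would be learnable scalars (e.g. initialized around 10.0 and -5.0) so that the similarity scale adapts during training.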
The speech synthesis module is trained with a mean square error loss L_TTS, conditioned on the speaker embedding vector, so that it generates only native-accented speech in the corresponding timbre. The model is an attention-based encoder-decoder model that supports multiple speakers. As shown in fig. 1, the speaker embedding vector of the desired target speaker, computed by the speaker encoder, is concatenated at each time step and used by the attention-based TTS decoder as an additional input to generate the mel-spectrogram. Using the mean square error (MSE) loss L_TTS, the TTS model is trained to generate only native-accented speech whose identity/timbre is determined by the speaker embedding. We map the text transcript into a phoneme sequence as input to the TTS model, since it has been shown that using phoneme sequences speeds up convergence and improves the pronunciation of rare words and proper nouns.
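One common way to realize this conditioning, consistent with the description above but not necessarily the patent's exact wiring, is to concatenate the fixed speaker embedding with the decoder input at every time step; the class and argument names below are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerConditionedDecoderCell(nn.Module):
    def __init__(self, mel_dim=80, spk_dim=256, ctx_dim=512, hidden=1024):
        super().__init__()
        # decoder input = previous mel frame + attention context + speaker embedding
        self.rnn = nn.LSTMCell(mel_dim + ctx_dim + spk_dim, hidden)
        self.proj = nn.Linear(hidden, mel_dim)

    def forward(self, prev_mel, attn_context, spk_emb, state):
        x = torch.cat([prev_mel, attn_context, spk_emb], dim=-1)   # concatenation at each step
        h, c = self.rnn(x, state)
        return self.proj(h), (h, c)

def tts_loss(pred_mel, target_mel):
    # L_TTS: mean squared error between predicted and ground-truth mel-spectrograms
    return nn.functional.mse_loss(pred_mel, target_mel)
```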
The multi-task ASR model is also connected to a fully connected (FC) conversion layer used to compute a connectionist temporal classification (CTC) loss L_CTC. We learn an accent-independent linguistic representation from the acoustic features using the multi-task ASR model. The ASR model adopts an end-to-end attention-based encoder-decoder framework. Given paired audio and its phoneme transcription, the encoder and decoder compute acoustic features from the audio and a linguistic representation from the phoneme sequence, respectively. We add a fully connected (FC) conversion layer to the ASR encoder and compute the CTC loss L_CTC to stabilize the training process. Since the training data of the ASR model contains accented speech, we concatenate the accent embedding with the acoustic features of each frame as input to the ASR model and add an accent classifier on top of the ASR encoder, making the recognition more robust to accents. We assume that different accents are associated with different speakers; the accent embedding of a speaker is obtained by averaging all embeddings of that speaker. The output of the accent classifier is used to compute a cross-entropy loss L_ACC. The phoneme labels and the linguistic representations of the two streams are predicted using the attention-based ASR decoder. The cross-entropy loss L_CE is used for phoneme label prediction, and the MSE loss L_TTSE measures the linguistic difference between the TTS decoder output hl and the ASR decoder output Hl.
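The sketch below shows how the pieces described in this paragraph could fit together: the accent embedding is tiled over time and concatenated with the acoustic features, a fully connected layer on the encoder output feeds a CTC loss, and an accent classifier (simplified here to a single linear layer) gives a cross-entropy loss. Dimensions, names, and the omission of the attention decoder are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskASRHeads(nn.Module):
    def __init__(self, feat_dim=80, accent_dim=256, enc_dim=640, n_phonemes=70, n_accents=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim + accent_dim, enc_dim // 2, num_layers=5,
                               bidirectional=True, batch_first=True)
        self.ctc_fc = nn.Linear(enc_dim, n_phonemes + 1)     # FC conversion layer; +1 for the CTC blank
        self.accent_clf = nn.Linear(enc_dim, n_accents)      # simplified accent classifier
        self.ctc_loss = nn.CTCLoss(blank=n_phonemes, zero_infinity=True)

    def forward(self, feats, accent_emb, phonemes, feat_lens, phn_lens, accent_labels):
        # concatenate the accent embedding with the acoustic features of each frame
        acc = accent_emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        enc, _ = self.encoder(torch.cat([feats, acc], dim=-1))              # [B, T, enc_dim]

        log_probs = self.ctc_fc(enc).log_softmax(-1).transpose(0, 1)        # [T, B, C] for CTCLoss
        l_ctc = self.ctc_loss(log_probs, phonemes, feat_lens, phn_lens)     # phonemes: padded index targets

        l_acc = nn.functional.cross_entropy(self.accent_clf(enc.mean(dim=1)), accent_labels)
        return enc, l_ctc, l_acc
```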
The neural network vocoder can be a WaveRNN, LPCNet, or WaveNet network. WaveRNN is preferably adopted here as the neural network vocoder, and training is implemented with the open-source PyTorch framework. Since the mel-spectrogram captures all the relevant details required for high-quality speech synthesis, we train the WaveRNN using only mel-spectrograms from multiple speakers, without adding any speaker embedding.
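A minimal sketch of this vocoder training setup is shown below: only (mel-spectrogram, waveform) pairs are used, and no speaker embedding is passed in. Here `vocoder` stands for any autoregressive neural vocoder whose forward call takes the mel-spectrogram and the past quantized samples and returns per-sample logits; that interface, and mu-law quantization of the waveform, are assumptions, not an actual WaveRNN API.

```python
import torch

def train_vocoder(vocoder, dataloader, optimizer, device="cuda"):
    vocoder.train()
    for mel, wav in dataloader:                    # multi-speaker data, no speaker conditioning
        mel, wav = mel.to(device), wav.to(device)  # wav: mu-law quantized sample indices [B, T]
        logits = vocoder(mel, wav[:, :-1])         # predict each sample from mel + past samples
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), wav[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```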
In the training phase, as shown in fig. 1, we first train the speaker encoder model. Then, the multi-speaker TTS model is trained with the loss L_TTS using only native English speech data. After that, the multi-task ASR model is first pre-trained using speech data from a plurality of native speakers and the non-native target speaker, and is then fine-tuned using only speech data from the non-native target speaker. In both stages, the ASR model is trained with the multi-task loss in equation 2. The WaveRNN is trained using only the speech data of native speakers. In the accent conversion phase, as shown in fig. 2, acoustic features are first computed from the non-native accented speech. Then the speaker encoder receives the acoustic features and outputs a speaker embedding vector representing the identity of the non-native accented speaker. The accent embedding is the average embedding of the non-native accented speaker. We concatenate the accent embedding with the acoustic features of each frame and input them into the ASR model to generate the linguistic representation Hl, and then the attention-based TTS decoder combines the linguistic representation with the speaker embedding to generate native-accented acoustic features. Finally, we use the WaveRNN model to convert the acoustic features into a time-domain waveform, which is expected to render the accent more naturally.
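The two-stage ASR schedule described above can be written out as follows; `asr_step` is a placeholder for one optimization step with the multi-task loss L_ASR, and the step counts are the ones given later in the training details.

```python
def train_asr(asr_model, optimizer, native_plus_target_loader, target_only_loader,
              asr_step, pretrain_steps=160_000, finetune_steps=5_000):
    # stage 1: pre-train on speech from many native speakers plus the non-native target speaker
    step = 0
    while step < pretrain_steps:
        for batch in native_plus_target_loader:
            asr_step(asr_model, optimizer, batch)
            step += 1
            if step >= pretrain_steps:
                break
    # stage 2: fine-tune on the target speaker's data only
    step = 0
    while step < finetune_steps:
        for batch in target_only_loader:
            asr_step(asr_model, optimizer, batch)
            step += 1
            if step >= finetune_steps:
                break
```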
The accent conversion method comprises the following steps: a. collecting speech data from multiple speakers; b. selecting the speaker with the non-native (Indian-accented) speech as the target speaker; c. using one portion of the target speaker's speech samples for training the multi-speaker TTS model and the neural network vocoder, and another portion for training the multi-task ASR model.
During the experimental stage, we used an open-source speech corpus containing 109 speakers and about 44 hours of clear speech. In this study, we selected speaker p248 from this corpus as the target speaker. Speakers without the Indian accent are regarded as native English speakers, and all audio is resampled to 22.05 kHz. For training the TTS model and the WaveRNN model we use only the speech data of 105 native speakers. The mel-spectrogram has 80 channels and is computed with a 50 ms window width and a 12.5 ms frame shift. 1000 samples were randomly drawn as a validation set, and the remaining samples were used for training. For training the accented ASR model, the speech data of the native speakers and of p248 are used. 40-dimensional mel filterbank features, computed with a 25 ms window and a 10 ms shift, together with their delta features, are used as acoustic features. The numbers of p248 utterances used for training, validation, and testing are 326, 25, and 25, respectively. The SI-ASR model used to extract PPGs was trained on the TIMIT dataset.
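The two feature front-ends described above could be computed as in the sketch below, here using librosa; the FFT sizes, the log compression, and the choice of librosa itself are assumptions, while the channel counts, window widths, frame shifts, and the 22.05 kHz sampling rate come from the text.

```python
import librosa
import numpy as np

def tts_mel(wav, sr=22050):
    # 80-channel mel-spectrogram, 50 ms window, 12.5 ms frame shift (TTS and vocoder features)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=2048,
                                         win_length=int(0.050 * sr),
                                         hop_length=int(0.0125 * sr), n_mels=80)
    return np.log(mel + 1e-6)

def asr_feats(wav, sr=22050):
    # 40-channel mel filterbanks, 25 ms window, 10 ms shift, plus delta features (ASR features)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                         win_length=int(0.025 * sr),
                                         hop_length=int(0.010 * sr), n_mels=40)
    logmel = np.log(mel + 1e-6)
    delta = librosa.feature.delta(logmel)
    return np.concatenate([logmel, delta], axis=0)   # 80-dimensional acoustic features per frame
```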
Fig. 1 shows the training phase of the method. Hs is the speaker representation, hl is the linguistic representation output by the TTS decoder, and Hl is the linguistic representation output by the ASR decoder.
Fig. 2 shows the conversion phase of the method. Hs and Hl denote the speaker representation and the linguistic representation, respectively. The loss L_TTSE is defined as

L_TTSE = (1/N) Σ_{n=1}^{N} || hl^(n) − Hl^(n) ||²,

where N is the number of training samples and hl^(n) and Hl^(n) are the linguistic representations of the n-th sample from the TTS decoder and the ASR decoder, respectively. The ASR model is trained with the multi-task loss (equation 2):

L_ASR = λ1·L_CE + λ2·L_TTSE + λ3·L_CTC + λ4·L_ACC,

where the λ are hyper-parameters weighting the four losses.
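Written as code, the two loss definitions above amount to the following; `tts_hl` and `asr_hl` stand for the TTS-decoder and ASR-decoder linguistic representations of the same batch, and the λ values are the ones stated in the next paragraph.

```python
import torch.nn.functional as F

def l_ttse(tts_hl, asr_hl):
    # MSE between the two linguistic representations, averaged over the training samples
    return F.mse_loss(tts_hl, asr_hl)

def l_asr(l_ce, l_ttse_value, l_ctc, l_acc, lambdas=(0.5, 0.1, 0.5, 0.1)):
    l1, l2, l3, l4 = lambdas
    return l1 * l_ce + l2 * l_ttse_value + l3 * l_ctc + l4 * l_acc
```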
The speaker encoder is a 3-layer LSTM with 256 hidden nodes, followed by a projection layer with 256 units. The output is the L2-normalized hidden state of the last layer, a vector of 256 elements. The TTS model adopts the same architecture as in the referenced work. The ASR encoder is a 5-layer bidirectional LSTM (BLSTM) with 320 units per direction. 300-dimensional location-aware attention is used in the attention layer. The ASR decoder is a single-layer LSTM with 320 units. The accent classifier is a 2-layer 1-dimensional convolutional network with 128 channels and a kernel size of 3, followed by an average pooling layer and a final FC output layer. In equation 2, λ1 = 0.5, λ2 = 0.1, λ3 = 0.5, and λ4 = 0.1 are set heuristically so that the four loss terms have a similar numerical scale. Transform 1 and transform 2 in the baseline approach are 2-layer and 4-layer BLSTMs, respectively, with 128 units per direction.
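Two of the components above, written out with the stated sizes (a 3-layer LSTM speaker encoder with 256 units, a 256-unit projection and L2 normalization; a 2-layer 1-D convolutional accent classifier with 128 channels and kernel size 3, average pooling and an FC output layer). The input feature dimension, padding, and activation choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, frames):                        # frames: [B, T, feat_dim]
        _, (h, _) = self.lstm(frames)
        emb = self.proj(h[-1])                        # final hidden state of the last layer
        return F.normalize(emb, p=2, dim=-1)          # L2-normalized 256-element speaker embedding

class AccentClassifier(nn.Module):
    def __init__(self, enc_dim=640, n_accents=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(enc_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(128, n_accents)

    def forward(self, enc_out):                       # enc_out: [B, T, enc_dim] ASR encoder states
        x = self.conv(enc_out.transpose(1, 2))        # -> [B, 128, T]
        return self.fc(x.mean(dim=-1))                # average pooling, then the FC output layer
```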
The speaker encoder model was trained with the Adam optimizer for 1000k steps with a batch size of 640 and a learning rate of 0.0001. The TTS model was trained with the Adam optimizer for 100k steps with a batch size of 16 and a learning rate of 0.001. The accented ASR model was first pre-trained for 160k steps with an Adadelta optimizer and a batch size of 16, using the speech data of the native speakers and p248. Then, keeping the batch size and learning rate unchanged, it was fine-tuned for another 5k steps using only the speech data of p248.
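Collected as a configuration sketch, the optimizer settings above look as follows; grouping them in a dictionary is purely illustrative.

```python
import torch

def build_optimizers(speaker_encoder, tts_model, asr_model):
    return {
        # speaker encoder: Adam, 1000k steps, batch size 640
        "speaker_encoder": torch.optim.Adam(speaker_encoder.parameters(), lr=1e-4),
        # TTS model: Adam, 100k steps, batch size 16
        "tts": torch.optim.Adam(tts_model.parameters(), lr=1e-3),
        # ASR model: Adadelta, 160k pre-training steps + 5k fine-tuning steps, batch size 16
        "asr": torch.optim.Adadelta(asr_model.parameters()),
    }
```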
As shown in fig. 3, (a) gives the mean opinion score results with 95% confidence intervals, and (b) gives the speaker similarity preference test results, where "AB-BL", "P-BL" and "P-AB" denote the comparisons of the ablation method vs. the baseline method, the proposed method vs. the baseline method, and the proposed method vs. the ablation method, respectively.
As shown in fig. 4, the accent preference test results are given. "AB-BL", "P-BL", "P-AB", "P-L2" and "P-L1" denote the comparisons of the ablation method vs. the baseline method, the proposed method vs. the baseline method, the proposed method vs. the ablation method, the proposed method vs. the non-native accented recordings, and the proposed method vs. the native-accented recordings, respectively.
Three perceptual listening tests were used to evaluate the conversion performance of the baseline (BL), proposed (P) and ablation (AB) systems: a mean opinion score (MOS) test of audio naturalness, a speaker similarity XAB test, and an accent AB test. We randomly selected 20 utterances from the test utterances of speaker p248 for evaluation.
Audio naturalness. In the MOS test, audio naturalness is rated on a five-point scale (from 1, bad, to 5, excellent). The audio generated by the three systems and the reference recordings of the non-native accent ("L2-Ref") are randomly ordered before being presented to the listeners. Each group of audio corresponds to the same text content, and listeners can replay the audio as often as they wish. The proposed method obtains a MOS of 4.0 (L2-Ref receives a MOS of 4.4), which is statistically significantly higher than the baseline method. The MOS of the proposed method is also slightly higher than that of the ablation system, which shows that adding the accent embedding and the accent classifier helps extract accent-independent linguistic content that benefits speech synthesis. Since we use seq2seq-based ASR and TTS models, the converted speech has pronunciation patterns that are more like the native accent, for example in duration and speaking rate, which differ considerably from the source accented speech. We encourage the reader to listen to the audio samples.
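For reference, a mean opinion score with a 95% confidence interval, as reported above, can be computed from individual listener ratings as in the sketch below; the ratings shown are placeholders, not data from the experiments.

```python
import numpy as np

def mos_with_ci(scores, z=1.96):
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half_width = z * scores.std(ddof=1) / np.sqrt(len(scores))   # normal-approximation 95% CI
    return mean, (mean - half_width, mean + half_width)

ratings = [4, 5, 4, 3, 4, 5, 4, 4]   # hypothetical 1-5 naturalness ratings from listeners
print(mos_with_ci(ratings))
```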
Speaker similarity. We compare the speaker similarity between the converted speech and the non-native accented speech. In the XAB test, X is a non-native accented reference sample. We present pairs of speech samples (A and B) with the same text content as the reference and ask the listeners to decide which sample has a timbre closer to the reference. Listeners may replay the audio samples and may choose "no preference (NP)" if they cannot tell the difference. To avoid the influence of the content, the audio is played in reverse. We can see that the baseline system has better similarity performance than the proposed system. This result is reasonable, since the speech synthesis model of the proposed approach never sees speech data from the non-native accented speaker. We expect the synthesis model to infer the timbre of a speaker from the speaker embedding generated by the speaker encoder from only a single utterance (i.e., voice cloning). Access to speech data from more speakers would facilitate training a more general speaker encoder; by training the speaker encoder with speech data from 18K speakers, very good voice cloning performance can be achieved, but we cannot access such a large data corpus. We found no statistical difference between the proposed and ablation systems (p-value 0.36). In the accent AB test, we first let participants listen to reference audio of the native accent and of the non-native accent. Pairs of speech samples (A and B) with the same text content are then presented, and listeners are asked to select the sample that sounds closer to native-accented speech. The results are shown in fig. 4. According to the preference tests "P-BL" and "P-AB", the listeners are very confident that the proposed method produces more native-accented speech (p < 0.001) than the baseline and ablation methods, and the ablation system also obtains better accent performance (p < 0.001) than the baseline. The DTW process in the baseline approach may introduce alignment errors, and the neural-network mapping from L2-PPGs to L1-PPGs may be ineffective. From the results of "P-L2" and "P-L1", we can conclude that the proposed method removes the non-native accent from second-language speech, so that the converted speech is close to native-accented speech in terms of accent; the p-values are 2.3×10⁻⁸ and 0.06, respectively.
An end-to-end accent conversion method is presented herein; it is the first model that can convert non-native accented speech into native-accented speech without any guidance from native-accented reference audio during conversion. It consists of four independently trained neural networks: a speaker encoder, a multi-speaker TTS model, a multi-task ASR model, and a neural network vocoder. The experimental results show that the method can convert non-native accented English speech into speech that is hard to distinguish from native-accented speech in terms of accent. The synthesis model is expected to generate the desired timbre of the target speaker from the speaker embedding obtained from the speaker encoder. Although English accent conversion is taken as the example, the same framework can be used for accent conversion in any other language.
In addition, the end-to-end accent conversion method provided herein can also be used for pronunciation correction for patients with dysarthria; causes of dysarthria include brain injury, stroke, Parkinson's disease, and amyotrophic lateral sclerosis. Pronunciation correction is particularly important for such patients' daily communication. The implementation is consistent with the steps above: the non-native accent data only needs to be replaced with the patient's non-standard speech data, and the training and conversion procedures are otherwise the same as for non-native accent conversion.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and yet fall within the scope of the invention.

Claims (6)

1. An end-to-end accent conversion method, characterized by comprising an accent conversion system for realizing the accent conversion method, wherein the accent conversion system comprises a speech recognition module, a speaker encoder, a speech synthesis module and a neural network vocoder; the speech recognition module is used for converting the acoustic features of input non-native accented speech into signal parameters of native-accented speech, and the signal parameters are related only to the spoken content of the non-native accented speech; the signal parameters of the non-native accented speech and the speaker embedding vector are input to the speech synthesis module, and the output of the speech synthesis module finally passes through the neural network vocoder to synthesize native-accented speech for a specific speaker;
CN202010239586.2A · Priority date 2020-03-30 · Filing date 2020-03-30 · End-to-end accent conversion method · Active · CN111462769B (en)

Priority Applications (1)

Application Number · Priority Date · Filing Date · Title
CN202010239586.2A (CN111462769B, en) · 2020-03-30 · 2020-03-30 · End-to-end accent conversion method

Applications Claiming Priority (1)

Application Number · Priority Date · Filing Date · Title
CN202010239586.2A (CN111462769B, en) · 2020-03-30 · 2020-03-30 · End-to-end accent conversion method

Publications (2)

Publication Number · Publication Date
CN111462769A (en) · 2020-07-28
CN111462769B (en) · 2023-10-27

Family

ID=71681783

Family Applications (1)

Application Number · Title · Priority Date · Filing Date
CN202010239586.2A (Active; CN111462769B, en) · End-to-end accent conversion method · 2020-03-30 · 2020-03-30

Country Status (1)

Country · Link
CN (1) · CN111462769B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number · Priority date · Publication date · Assignee · Title
US11335324B2 (en) * · 2020-08-31 · 2022-05-17 · Google Llc · Synthesized data augmentation using voice conversion and speech recognition models
CN112233646B (en) * · 2020-10-20 · 2024-05-31 · 携程计算机技术(上海)有限公司 · Voice cloning method, system, equipment and storage medium based on neural network
CN112786052B (en) * · 2020-12-30 · 2024-05-31 · 科大讯飞股份有限公司 · Speech recognition method, electronic equipment and storage device
CN113223542B (en) * · 2021-04-26 · 2024-04-12 · 北京搜狗科技发展有限公司 · Audio conversion method and device, storage medium and electronic equipment
US11948550B2 (en) * · 2021-05-06 · 2024-04-02 · Sanas.ai Inc. · Real-time accent conversion model
CN113593534B (en) * · 2021-05-28 · 2023-07-14 · 思必驰科技股份有限公司 · Method and device for multi-accent speech recognition
CN113345431B (en) * · 2021-05-31 · 2024-06-07 · 平安科技(深圳)有限公司 · Cross-language voice conversion method, device, equipment and medium
CN113327575B (en) * · 2021-05-31 · 2024-03-01 · 广州虎牙科技有限公司 · Speech synthesis method, device, computer equipment and storage medium
CN113470622B (en) * · 2021-09-06 · 2021-11-19 · 成都启英泰伦科技有限公司 · Conversion method and device capable of converting any voice into multiple voices
CN114464160B (en) * · 2022-03-07 · 2025-03-14 · 云知声智能科技股份有限公司 · A speech synthesis method and system for abnormal speakers
CN116994553A (en) * · 2022-09-15 · 2023-11-03 · 腾讯科技(深圳)有限公司 · Training method of speech synthesis model, speech synthesis method, device and equipment
CN119741916A (en) * · 2025-03-04 · 2025-04-01 · 深圳市活力天汇科技股份有限公司 · A method, device, medium and electronic device for generating standard speech

Citations (6)

* Cited by examiner, † Cited by third party
Publication number · Priority date · Publication date · Assignee · Title
CN101359473A (en) * · 2007-07-30 · 2009-02-04 · 国际商业机器公司 · Auto speech conversion method and apparatus
CN101399044A (en) * · 2007-09-29 · 2009-04-01 · 国际商业机器公司 · Voice conversion method and system
CN102982809A (en) * · 2012-12-11 · 2013-03-20 · 中国科学技术大学 · Conversion method for sound of speaker
CN108108357A (en) * · 2018-01-12 · 2018-06-01 · 京东方科技集团股份有限公司 · Accent conversion method and device, electronic equipment
CN110335584A (en) * · 2018-03-29 · 2019-10-15 · 福特全球技术公司 · Neural Network Generative Modeling to Transform Speech Articulation and Augment Training Data
CN110600047A (en) * · 2019-09-17 · 2019-12-20 · 南京邮电大学 · Perceptual STARGAN-based many-to-many speaker conversion method


Also Published As

Publication number · Publication date
CN111462769A (en) · 2020-07-28

Similar Documents

Publication · Publication Date · Title
CN111462769B (en) · End-to-end accent conversion method
Liu et al. · End-to-end accent conversion without using native utterances
Geng et al. · Speaker adaptation using spectro-temporal deep features for dysarthric and elderly speech recognition
Kelly et al. · Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
Sarikaya et al. · High resolution speech feature parametrization for monophone-based stressed speech recognition
Womack et al. · N-channel hidden Markov models for combined stressed speech classification and recognition
Doshi et al. · Extending parrotron: An end-to-end, speech conversion and speech recognition model for atypical speech
Zhao et al. · Using phonetic posteriorgram based frame pairing for segmental accent conversion
CN115359775B (en) · An end-to-end Chinese speech cloning method with timbre and emotion transfer
CN112992118B (en) · Speech model training and synthesizing method with few linguistic data
Quamer et al. · Zero-shot foreign accent conversion without a native reference
CN110570842B (en) · Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree
CN114023343A (en) · Voice conversion method based on semi-supervised feature learning
Zheng et al. · CASIA voice conversion system for the voice conversion challenge 2020
Das et al. · Understanding the effect of voice quality and accent on talker similarity
Nazir et al. · Deep learning end to end speech synthesis: A review
CN116312466B (en) · Speaker adaptation method, voice translation method and system based on small amount of samples
Andra et al. · Improved transcription and speaker identification system for concurrent speech in Bahasa Indonesia using recurrent neural network
Leung et al. · Applying articulatory features to telephone-based speaker verification
Shinde et al. · Vowel Classification based on LPC and ANN
Zhe et al. · Incorporating Speaker's Speech Rate Features for Improved Voice Cloning
CN119049448B (en) · A Chinese syllable speech synthesis method and system based on improved Tacotron2 model
CN120496503B (en) · Non-parallel any-to-any speech conversion method based on attention feature fusion
Khan · Audio-visual speaker separation
Zhang · Whisper speech processing: Analysis, modeling, and detection with applications to keyword spotting

Legal Events

Date · Code · Title · Description
PB01 · Publication
SE01 · Entry into force of request for substantive examination
TA01 · Transfer of patent application right
Effective date of registration: 2022-07-28
Address after: 518000 Honglang North 408, Zhongli Chuangye community, No. 49, Dabao Road, Dalang community, Xin'an street, Bao'an District, Shenzhen, Guangdong Province
Applicant after: Shenzhen Dadan Shusheng Technology Co.,Ltd.
Address before: 518101 2710, building 2, huichuangxin Park, No. 2 Liuxian Avenue, Xingdong community, Xin'an street, Bao'an District, Shenzhen, Guangdong Province
Applicant before: SPEECHX LTD.
GR01 · Patent grant
