Technical Field
The present application relates to the field of intelligent speech technology, and in particular to a method, apparatus, device, and medium for acquiring speech training data.
Background Art
In human-computer interaction, voice interaction is the most natural modality: convenient and easy to understand. Artificial intelligence has advanced rapidly in recent years; machines can now recognize and understand the meaning of speech and respond promptly, for example by playing audio or invoking related skills. In such interactions, the quality of speech synthesis and voice cloning strongly affects the user experience, so support for multiple and mixed languages is required, as is control over the speed, emotion, timbre, and sound quality of the generated speech.
Current speech synthesis is typically built on the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) framework, e.g. Bert-VITS2, but such systems suffer from mechanical, unnatural pronunciation and cannot clone a timbre zero-shot. In recent years, large language models have developed rapidly, and fields such as image and speech processing have kept borrowing the large-model framework to improve their algorithms, including for speech synthesis and voice cloning. The latest speech generation approach is a large speech generation model based on quantized speech coding: a new technique that deeply integrates text understanding with speech generation, encodes speech into discrete tokens, and uses large-model technology to generate natural, fluent speech whose naturalness rivals a real human voice. It requires massive training data, however, typically tens of thousands of hours, a scale already comparable to that of speech recognition. Speech recognition aims at generalization and is therefore insensitive to background sounds, noise, multiple speakers, and varying sound quality in the data; speech generation, by contrast, must control speed, emotion, timbre, and sound quality, so its demands on the training corpus are much higher.
Currently, corpora are mostly obtained from audio platforms, video platforms, and open-source datasets, and their quality is often poor.
Summary of the Invention
The purpose of the present application is to provide a method, apparatus, device, and medium for acquiring speech training data that yield speech training data with high corpus quality.
To achieve the above objective, the present application provides the following solutions:
In a first aspect, the present application provides a method for acquiring speech training data, comprising:
acquiring massive audio data;
splitting multi-channel audio data in the audio data into single-channel audio data;
removing background music and background noise from all the single-channel audio data to obtain denoised single-channel audio data;
splitting multi-speaker conversation audio in the denoised single-channel audio data into single-speaker audio segments using voice activity detection, according to a speaker log that records each speaker's speaking time and speaking order;
filtering out the single-speaker audio segments that meet a preset duration;
performing language identification and speech recognition in sequence on the single-speaker audio segments that meet the preset duration to obtain recognized audio segment text;
adding punctuation to the recognized audio segment text to obtain punctuated text;
performing quality assessment on the audio segments corresponding to the punctuated text using an automatic synthetic speech quality assessment model;
performing sound quality enhancement on audio segments whose quality score is below a quality threshold to obtain enhanced audio segments, wherein the enhancement includes replacing timbre, repairing audio distortion, and extending audio bandwidth;
using the enhanced audio segments as speech training data.
In a second aspect, the present application provides an apparatus for acquiring speech training data, comprising:
a massive audio data acquisition module, configured to acquire massive audio data;
a first splitting module, configured to split multi-channel audio data in the audio data into single-channel audio data;
a denoising module, configured to remove background music and background noise from all the single-channel audio data to obtain denoised single-channel audio data;
a second splitting module, configured to split multi-speaker conversation audio in the denoised single-channel audio data into single-speaker audio segments using voice activity detection, according to a speaker log that records each speaker's speaking time and speaking order;
a filtering module, configured to filter out the single-speaker audio segments that meet a preset duration;
a recognition module, configured to perform language identification and speech recognition in sequence on the single-speaker audio segments that meet the preset duration to obtain recognized audio segment text;
a punctuation module, configured to add punctuation to the recognized audio segment text to obtain punctuated text;
a quality assessment module, configured to perform quality assessment on the audio segments corresponding to the punctuated text using an automatic synthetic speech quality assessment model;
a sound quality enhancement module, configured to perform sound quality enhancement on audio segments whose quality score is below a quality threshold to obtain enhanced audio segments, wherein the enhancement includes replacing timbre, repairing audio distortion, and extending audio bandwidth;
a speech training data acquisition module, configured to use the enhanced audio segments as speech training data.
In a third aspect, the present application provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method for acquiring speech training data described in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the method for acquiring speech training data described in the first aspect.
According to the specific embodiments provided herein, the present application achieves the following technical effects:
The present application provides a method, apparatus, device, and medium for acquiring speech training data. The method comprises: splitting multi-channel audio into single channels; removing background music and background noise; splitting multi-speaker conversation audio into single-speaker segments; adding punctuation; and enhancing the sound quality of audio with poor quality scores, thereby obtaining speech training data with high corpus quality.
Brief Description of the Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be derived from them without creative effort.
FIG. 1 is a flow chart of the method for acquiring speech training data provided in Embodiment 1 of the present application;
FIG. 2 is a schematic diagram of the structure of the computer device provided in Embodiment 3 of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments herein without creative effort fall within the scope of protection of the present application.
To make the above objects, features, and advantages of the present application more apparent and understandable, the present application is described in further detail below with reference to the accompanying drawings and specific implementations.
Embodiment 1
The massive audio data obtained from audio platforms, video platforms, open-source datasets, and similar sources comes from complex scenes and commonly contains background sounds, noise, multi-speaker conversations, and uneven sound quality.
To address this, this embodiment provides a method for acquiring speech training data, comprising:
S1: Acquire massive audio data.
S2: Split multi-channel audio data in the audio data into single-channel audio data.
S3: Remove background music and background noise from all the single-channel audio data to obtain denoised single-channel audio data.
S31: Use a noise evaluation model to evaluate whether the single-channel audio data needs noise reduction, obtaining an evaluation result.
S32: When the evaluation result is positive, use a speech denoising model to remove the background noise from the single-channel audio data.
S4: According to a speaker log that records each speaker's speaking time and speaking order, use voice activity detection to split multi-speaker conversation audio in the denoised single-channel audio data into single-speaker audio segments.
S5: Filter out the single-speaker audio segments that meet a preset duration.
S6: Perform language identification and speech recognition in sequence on the single-speaker audio segments that meet the preset duration to obtain recognized audio segment text.
S7: Add punctuation to the recognized audio segment text to obtain punctuated text.
S8: Perform quality assessment on the audio segments corresponding to the punctuated text using an automatic synthetic speech quality assessment model.
S9: Perform sound quality enhancement on audio segments whose quality score is below a quality threshold to obtain enhanced audio segments, wherein the enhancement includes replacing timbre, repairing audio distortion, and extending audio bandwidth.
S10: Use the enhanced audio segments as speech training data.
The design of the method for acquiring speech training data provided in this embodiment is explained in detail below with reference to FIG. 1.
The processing flow of the method is as follows:
1) Convert multi-channel audio into single-channel audio.
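The channel split in step 1) amounts to separating interleaved samples into independent mono streams. A minimal sketch, using plain Python lists for illustration (a real pipeline would operate on arrays loaded with an audio I/O library):

```python
def split_channels(frames):
    """Split interleaved multi-channel frames into per-channel mono streams.

    `frames` is a list of per-sample tuples, e.g. [(left, right), ...] for
    stereo; a mono input (a flat list of samples) is returned as one stream.
    """
    if not frames:
        return []
    if not isinstance(frames[0], (tuple, list)):
        return [list(frames)]  # already single-channel
    return [list(channel) for channel in zip(*frames)]

# Example: a 4-sample stereo clip becomes two mono clips
stereo = [(0.2, 0.4), (0.0, 0.0), (-0.2, 0.2), (1.0, 0.0)]
left, right = split_channels(stereo)
```

Each resulting mono stream then flows through the rest of the pipeline independently.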
2) Remove background music.
3) Remove background noise. In theory, an ideal speech denoising model should identify and remove background noise while preserving the human voice; in practice, the voice is inevitably damaged to some degree, making it impossible to obtain high-quality speech. By first using a noise evaluation model to decide whether an audio clip needs denoising at all, and only then applying the speech denoising model to identify and remove the background noise, the proportion of high-quality speech in the corpus can be increased.
4) Based on the speaker log, use VAD (Voice Activity Detection) to split multi-speaker conversation audio into single-speaker segments.
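Once a speaker log is available, the splitting in step 4) reduces to slicing the waveform at the logged boundaries. A sketch, assuming the log is a list of (start_sec, end_sec, speaker_id) entries; the concrete log format is not prescribed by the method:

```python
def split_by_speaker(samples, sample_rate, speaker_log):
    """Slice a mono waveform into single-speaker segments.

    `samples` is the audio as a flat sequence of samples; `speaker_log` is an
    assumed representation of the diarization output: (start_sec, end_sec,
    speaker_id) tuples in utterance order. Returns (speaker_id, segment)
    pairs preserving the speaking order.
    """
    segments = []
    for start, end, speaker in speaker_log:
        lo = int(start * sample_rate)
        hi = int(end * sample_rate)
        segments.append((speaker, samples[lo:hi]))
    return segments

# Toy example: 4 "seconds" of audio at a 100 Hz illustration rate
audio = list(range(400))
log = [(0.0, 1.5, "A"), (1.5, 3.0, "B"), (3.0, 4.0, "A")]
pieces = split_by_speaker(audio, 100, log)
```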
5) Perform language identification and speech recognition on the audio segments of qualified duration, obtain the timestamp of each character or syllable in the ASR (Automatic Speech Recognition) result, and filter the audio segments according to the ASR result.
However, ASR itself makes errors, so improving the accuracy of the transcript is essential.
a) Recognition quality assessment for audio segments with an existing transcript: apply text normalization to such segments, e.g. converting "2019" to its spoken form "two zero one nine"; this embodiment may use grammar-rule-based WFSTs (Weighted Finite-State Transducers) or neural-network-based models for the normalization, and then compute the word error rate or character error rate via edit distance. Since mainstream Chinese recognition models model Chinese characters directly, homophone substitutions occur in the output; this embodiment therefore computes the word and character error rates using the edit distance over pinyin.
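The pinyin-based error rate in step a) can be sketched as follows. The tiny character-to-pinyin table below is purely illustrative (a real system would use a full lexicon, e.g. the pypinyin library); the point is that homophone substitutions, which a character-level edit distance would count as errors, cost nothing at the pinyin level:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (x != y))  # substitution (free if equal)
            prev = cur
    return dp[-1]

# Illustrative toy lexicon only; NOT a complete pinyin table.
PINYIN = {"知": "zhi", "至": "zhi", "道": "dao", "时": "shi"}

def pinyin_cer(reference, hypothesis):
    """Character error rate computed over pinyin syllables, so that
    homophone substitutions by the recognizer are not counted as errors."""
    ref = [PINYIN.get(c, c) for c in reference]
    hyp = [PINYIN.get(c, c) for c in hypothesis]
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

For example, pinyin_cer("知道", "至道") is 0.0 because 知 and 至 share the pinyin "zhi", whereas a raw character-level error rate would report 0.5.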
b) Recognition quality assessment for audio segments without a transcript: this embodiment judges the reliability of the recognition result via ASR confidence estimation. The confidence estimate may use acoustic information, such as an entropy-based character-level ASR confidence estimation method, or a language model may be used to score the ASR result, or the two approaches may be combined.
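An illustrative sketch of the entropy-based confidence in step b): per-token posterior distributions are assumed to be exposed by the ASR decoder (that interface is hypothetical here), and a hypothesis is flagged as unreliable when any token's posterior is too flat:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's posterior distribution;
    0 means the decoder was certain, log(K) means maximally unsure."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def is_reliable(posteriors, entropy_threshold=0.5):
    """Accept an ASR hypothesis only if every token's posterior entropy is
    below the threshold. `posteriors` (one distribution per decoded token)
    and the threshold value are illustrative assumptions."""
    return all(token_entropy(p) <= entropy_threshold for p in posteriors)

confident = [[0.99, 0.01], [0.95, 0.03, 0.02]]
uncertain = [[0.99, 0.01], [0.5, 0.5]]  # second token: coin-flip posterior
```

As the embodiment notes, such an acoustic confidence can also be combined with a language-model score over the hypothesis.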
c) For audio segments whose recognition quality exceeds the quality threshold, compute the average pronunciation duration per character in the segment, judge whether it is reasonable, and filter out unreasonable segments.
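The per-character duration check in step c) can be sketched as below; the (start_sec, end_sec)-per-character timestamp format and the plausibility bounds (roughly 80 to 600 ms per character on average) are illustrative assumptions, not values fixed by the method:

```python
def average_char_duration(timestamps):
    """Mean speaking duration per recognized character, given one
    (start_sec, end_sec) pair per character from the ASR timestamps."""
    total = sum(end - start for start, end in timestamps)
    return total / len(timestamps)

def duration_plausible(timestamps, min_avg=0.08, max_avg=0.60):
    """Filter out segments whose per-character pace is implausibly fast or
    slow, which usually indicates an alignment or recognition error."""
    if not timestamps:
        return False
    return min_avg <= average_char_duration(timestamps) <= max_avg
```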
6) Add punctuation to the ASR results. During speech generation, pauses are usually expected to be controlled by punctuation, but text punctuation models typically do not consult the acoustic output of the ASR system, so the punctuation and the actual pauses can disagree. This embodiment uses the ASR acoustic information, namely the timestamp interval between adjacent characters, to decide whether punctuation should be added or removed.
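For the insertion direction of step 6), a sketch: a comma is added wherever the silence between adjacent characters exceeds a gap threshold. The 0.3 s value and the per-character timestamp format are illustrative assumptions; the symmetric case, deleting punctuation that has no matching pause, uses the same gap test:

```python
def punctuate_by_gaps(chars, timestamps, gap_threshold=0.30):
    """Insert a comma between characters separated by a long silence.

    `chars` is the raw recognized character sequence (no punctuation yet);
    `timestamps` holds one (start_sec, end_sec) pair per character from the
    ASR output, so punctuation ends up consistent with the actual pauses.
    """
    if not chars:
        return ""
    out = [chars[0]]
    for i in range(1, len(chars)):
        gap = timestamps[i][0] - timestamps[i - 1][1]
        if gap > gap_threshold:
            out.append(",")
        out.append(chars[i])
    return "".join(out)

ts = [(0.0, 0.1), (0.1, 0.2), (0.6, 0.7), (0.7, 0.8)]
text = punctuate_by_gaps("abcd", ts)  # long pause between "b" and "c"
```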
7) Filter by MOS (Mean Opinion Score). MOS is usually collected by crowdsourcing, which is too costly for a massive training set. This embodiment instead estimates the MOS value with an automatic synthetic speech quality assessment model, further guaranteeing audio quality.
8) To avoid discarding a large amount of audio with substandard sound quality, this embodiment enhances the sound quality of audio with poor quality scores, including substituting a high-quality timbre, repairing audio distortion, and extending audio bandwidth to improve perceived audio quality.
In summary, the above method denoises the acquired massive speech data and separates it into a high-quality single-speaker corpus. The method provided in this embodiment can: 1. reduce the damage denoising causes to the human voice; 2. judge whether the ASR result is accurate; 3. make punctuation consistent with the pauses in the audio; 4. obtain MOS scores with a neural network; 5. improve the sound quality of poor-quality audio, thereby obtaining speech training data with high corpus quality.
Embodiment 2
This embodiment provides an apparatus for acquiring speech training data, comprising:
a massive audio data acquisition module, configured to acquire massive audio data;
a first splitting module, configured to split multi-channel audio data in the audio data into single-channel audio data;
a denoising module, configured to remove background music and background noise from all the single-channel audio data to obtain denoised single-channel audio data;
a second splitting module, configured to split multi-speaker conversation audio in the denoised single-channel audio data into single-speaker audio segments using voice activity detection, according to a speaker log that records each speaker's speaking time and speaking order;
a filtering module, configured to filter out the single-speaker audio segments that meet a preset duration;
a recognition module, configured to perform language identification and speech recognition in sequence on the single-speaker audio segments that meet the preset duration to obtain recognized audio segment text;
a punctuation module, configured to add punctuation to the recognized audio segment text to obtain punctuated text;
a quality assessment module, configured to perform quality assessment on the audio segments corresponding to the punctuated text using an automatic synthetic speech quality assessment model;
a sound quality enhancement module, configured to perform sound quality enhancement on audio segments whose quality score is below a quality threshold to obtain enhanced audio segments, wherein the enhancement includes replacing timbre, repairing audio distortion, and extending audio bandwidth;
a speech training data acquisition module, configured to use the enhanced audio segments as speech training data.
Embodiment 3
This embodiment provides a computer device, which may be a server or a terminal; its internal structure may be as shown in FIG. 2. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the input/output interface are connected via a system bus, and the communication interface is connected to the system bus via the input/output interface. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database stores the speech training data. The input/output interface exchanges information between the processor and external devices. The communication interface communicates with external terminals over a network connection. When executed by the processor, the computer program implements the method for acquiring speech training data provided in Embodiment 1.
Those skilled in the art will understand that the structure shown in FIG. 2 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or arrange the components differently. In an exemplary embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the above method embodiments when executing the computer program.
Embodiment 4
This embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for acquiring speech training data provided in Embodiment 1.
Embodiment 5
This embodiment provides a computer program product comprising a computer program which, when executed by a processor, implements the method for acquiring speech training data provided in Embodiment 1.
It should be noted that the user information (including but not limited to user device information and personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in the present application are all information and data authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the applicable regulations.
Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be completed by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, database, or other media used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive RAM (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, or data processing logic devices based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
Specific examples are used herein to illustrate the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, those of ordinary skill in the art may make changes to the specific implementation and application scope in accordance with the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510147634.8A | 2025-02-11 | 2025-02-11 | Voice training data acquisition method, device, equipment and medium |
| Publication Number | Publication Date |
|---|---|
| CN119993196A | 2025-05-13 |
| CN119993196B | 2025-07-04 |