CN112562736B


Info

Publication number: CN112562736B
Application number: CN202011459130.3A
Authority: CN
Inventors: 李荪; 张蔚敏; 刘硕
Original assignee: China Academy of Information and Communications Technology CAICT
Current assignee: China Academy of Information and Communications Technology CAICT
Priority date: 2020-12-11
Filing date: 2020-12-11
Publication date: 2024-06-21
Anticipated expiration: 2040-12-11
Also published as: CN112562736A

Abstract

The application provides a method and device for evaluating the quality of a speech data set. The method comprises: acquiring a speech data set to be evaluated and the application scenario corresponding to the speech data set; acquiring an evaluation value of the speech data set based on a quality evaluation model corresponding to the application scenario; and evaluating the speech quality of the speech data set based on the evaluation value. The quality evaluation model calculates the language element coverage, random information amount, and signal validity of the input speech data set, the feature matching degree between the features of the speech data set and the features required by the application scenario, and the content similarity between the speech data set and a data set preset for the application scenario. It then performs a weighted summation of the language element coverage, random information amount, signal validity, feature matching degree, and content similarity according to the configured weights to obtain the evaluation value of the speech data set. The method can evaluate the quality of a speech data set comprehensively, objectively, and quantitatively.

Description

Translated from Chinese

A method and device for evaluating the quality of speech data sets

Technical Field

The present invention relates to the field of speech processing technology, and in particular to a method and device for evaluating the quality of a speech data set.

Background

At present, in the field of artificial intelligence, general-purpose models have reached high accuracy on certain tasks. When technology is turned into engineering applications, small-volume, high-quality data is a key factor in deploying algorithm models. How to evaluate the quality of a speech data set comprehensively and along multiple dimensions is therefore crucial for machine learning.

Intelligent speech products are applied in a wide and complex range of scenarios, involving many kinds of real environments and industries. When evaluating the quality of speech data, it is therefore necessary to consider not only the linguistic phenomena and information quality of the speech data itself, but also the special requirements that technologies such as speech recognition and voiceprint recognition impose in real application scenarios.

How to evaluate the quality of a speech data set comprehensively and objectively is a technical problem in urgent need of a solution.

Summary of the Invention

In view of this, the present application provides a method and device for evaluating the quality of a speech data set, which can evaluate the quality of a speech data set comprehensively, objectively, and quantitatively.

To solve the above technical problem, the technical solution of the present application is implemented as follows:

In one embodiment, a method for evaluating the quality of a speech data set is provided, the method comprising:

acquiring a speech data set to be evaluated and the application scenario corresponding to the speech data set;

acquiring an evaluation value of the speech data set based on a quality evaluation model corresponding to the application scenario;

evaluating the speech quality of the speech data set according to the evaluation value;

wherein the quality evaluation model is used to calculate the language element coverage, random information amount, and signal validity of the input speech data set, the feature matching degree between the features of the speech data set and the features required by the application scenario, and the content similarity between the speech data set and a data set preset for the application scenario; and to perform a weighted summation of the language element coverage, random information amount, signal validity, feature matching degree, and content similarity according to the configured weights to obtain the evaluation value of the speech data set.

In another embodiment, a device for evaluating the quality of a speech data set is provided, the device comprising: a storage unit, a first acquisition unit, a second acquisition unit, and an evaluation unit;

the storage unit is configured to store the quality evaluation model corresponding to each application scenario; wherein the quality evaluation model is used to calculate the language element coverage, random information amount, and signal validity of the input speech data set, the feature matching degree between the features of the speech data set and the features required by the application scenario, and the content similarity between the speech data set and a data set preset for the application scenario; and to perform a weighted summation of the language element coverage, random information amount, signal validity, feature matching degree, and content similarity according to the configured weights to obtain the evaluation value of the speech data set;

the first acquisition unit is configured to acquire a speech data set to be evaluated and the application scenario corresponding to the speech data set;

the second acquisition unit is configured to acquire the evaluation value of the speech data set based on the quality evaluation model in the storage unit that corresponds to the application scenario acquired by the first acquisition unit;

the evaluation unit is configured to evaluate the speech quality of the speech data set according to the evaluation value acquired by the second acquisition unit.

In another embodiment, an electronic device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the speech quality evaluation method when executing the program.

In another embodiment, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program implements the steps of the speech quality evaluation method when executed by a processor.

As can be seen from the above technical solution, the above embodiments obtain the evaluation value of the speech data set to be evaluated based on a stored quality evaluation model and evaluate the speech quality of the speech data set according to that value. When the quality evaluation model is established, the evaluation value is determined from multi-dimensional indicator values: coverage, random information amount, matching degree, similarity, and signal validity. This solution can evaluate the quality of a speech data set comprehensively, objectively, and quantitatively.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of speech data set quality evaluation in an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a device applying the above technique in an embodiment of the present application;

FIG. 3 is a schematic diagram of the physical structure of an electronic device provided in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the present application without creative effort fall within the scope of protection of the present application.

The terms "first", "second", "third", "fourth", and so on (if any) in the specification, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data so designated are interchangeable where appropriate, so that the embodiments described herein can, for example, be implemented in an order other than the one illustrated or described. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.

The technical solution of the present invention is described in detail below through specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments.

An embodiment of the present application provides a method for evaluating the quality of a speech data set, applied on a speech data set quality evaluation device.

In a specific implementation, a quality evaluation model for speech data sets can be established; the speech data set is input into the model, and the evaluation value is output directly.

Alternatively, no model is established. When a speech data set is obtained, the language element coverage, random information amount, and signal validity of the speech data set, the feature matching degree between the features of the speech data set and the features required by the application scenario, and the content similarity between the speech data set and the data set preset for the application scenario are calculated directly; the language element coverage, random information amount, signal validity, feature matching degree, and content similarity are then weighted and summed according to the configured weights to obtain the evaluation value of the speech data set.

Whichever approach is used, the application scenario corresponding to the speech data set must be determined. The evaluation value is calculated according to the features required by the application scenario and the standard data set corresponding to the application scenario.

For convenience across repeated evaluations, the embodiments of the present application evaluate speech data sets by means of a quality evaluation model.

In the embodiments of the present application, a quality evaluation model is established in advance for each application scenario. The model is as follows:

The quality evaluation model is used to calculate the language element coverage, random information amount, and signal validity of the input speech data set, the feature matching degree between the features of the speech data set and the features required by the application scenario, and the content similarity between the speech data set and the data set preset for the application scenario; and to perform a weighted summation of the coverage, random information amount, feature matching degree, content similarity, and signal validity according to the configured weights to obtain the evaluation value of the speech data set.

The individual indicators are calculated as follows.

Speech technology in machine learning depends on linguistic knowledge. Model training requires comprehensive coverage of that knowledge so that the basic rules of language use are satisfied. The specific indicator items refer to the Modern Chinese Dictionary (现代汉语词典) and the relevant documents of the State Language Commission (国家语委).

The specific steps for calculating the language element coverage of the speech data set are as follows:

Step 1: Extract the language elements of the speech data set.

The language elements include: monosyllabic characters, diphones, triphones, initials (声母), finals (韵母), neutral tones, and erhua (rhotacized) sounds.

Step 2: For each extracted language element, calculate the ratio of its number of items to the number of items of the corresponding basic feature for the application scenario.

Take finals as an example. If the basic feature for the application scenario contains 24 finals and the speech data set contains 12, the ratio is 50%; if the speech data set contains all 24, the ratio is 100%.

Step 3: Calculate the average of the ratios over all language elements and take it as the language element coverage of the speech data set.

A larger value indicates that the speech data set covers more of the required language elements.
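The three steps above can be sketched in Python. This is a minimal illustration, not the patent's implementation: the function name `language_element_coverage`, the dictionary layout, and the toy inventories are all hypothetical, and counting only items that also appear in the reference inventory is an assumption the source does not state.

```python
from typing import Dict, Set


def language_element_coverage(observed: Dict[str, Set[str]],
                              reference: Dict[str, Set[str]]) -> float:
    """Average, over element types, of (items found) / (items required)."""
    ratios = []
    for element_type, required in reference.items():
        if not required:
            continue  # skip element types with an empty reference inventory
        # Assumption: only items that belong to the reference inventory count.
        found = observed.get(element_type, set()) & required
        ratios.append(len(found) / len(required))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Toy inventories (hypothetical, far smaller than real phone sets).
reference = {"finals": {"a", "o", "e", "i"}, "initials": {"b", "p"}}
observed = {"finals": {"a", "o"}, "initials": {"b", "p"}}

# finals: 2/4 = 0.5, initials: 2/2 = 1.0, so the average is 0.75
coverage = language_element_coverage(observed, reference)
```

Averaging the per-type ratios gives every language element type the same influence regardless of inventory size, which matches the unweighted mean described in Step 3.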

For data, the most important thing is to accumulate diverse data sources. The more types there are, the more complex the data and the better it covers uncertain feature information, so the model can accommodate more variability and product applications become more robust. In the embodiments of the present application, the random information amount of the data is calculated by applying the maximum entropy principle.

The specific steps for calculating the random information amount of the speech data set are as follows:

Step 1: Count the data features of the speech data set according to its labels.

Each utterance in the speech data set carries labels, such as speaker gender and recording device.

Step 2: Calculate the information entropy of the combined data features and sum the terms.

The sum of all information entropy terms is calculated by the following formula:

$$S = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)$$

where $p(x_i)$ is the probability that the combined data feature is $x_i$; $n$ is the number of data features; and $S \in [0, \log_2(n)]$. The larger the value of $S$, the higher the information entropy.

Step 3: Take the sum as the random information amount of the speech data set.

The larger the sum, the greater the random information amount.
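As a rough sketch of this calculation, the snippet below treats each utterance's label tuple as one combined data feature and computes the Shannon entropy of the observed combinations. The function name `random_information` and the example labels are hypothetical; estimating probabilities as relative frequencies is an assumption.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def random_information(label_rows: Iterable[Sequence[str]]) -> float:
    """Shannon entropy (base 2) of the combined label values."""
    combos = Counter(tuple(row) for row in label_rows)
    total = sum(combos.values())
    if total == 0:
        return 0.0
    # p(x_i) is estimated as the relative frequency of each combination.
    return -sum((c / total) * math.log2(c / total) for c in combos.values())


# Four utterances with four distinct, equally frequent combinations.
balanced = [("female", "phone"), ("female", "headset"),
            ("male", "phone"), ("male", "headset")]
# Four utterances that all share one combination.
skewed = [("male", "phone")] * 4

s_balanced = random_information(balanced)  # reaches the maximum log2(4) = 2
s_skewed = random_information(skewed)      # a single combination gives 0
```

The two extremes illustrate the stated range: a perfectly balanced data set reaches the upper bound, while a data set with a single label combination carries no random information.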

A speech data set for machine learning needs to be consistent with the feature distribution of the application scenario data. The embodiments of the present application quantify the feature labels of the speech data set to be evaluated and then calculate and compare feature values and distribution distances, which gives an objective, quantitative description of how well the data set to be evaluated fits the actual scenario.

The matching degree is obtained by adding the numerical-type and distribution-type distances computed in the third step of the procedure described below, which gives the feature matching degree between the speech data set to be evaluated and the application scenario:

$$T(Data_1, Data_2) = S(Data_1, Data_2) + E(Data_1, Data_2)$$

The specific steps for calculating the feature matching degree between the features of the speech data set and the features required by the application scenario are as follows:

Step 1: Obtain the features of the speech data set according to the features required by the application scenario.

Feature dimension construction: the feature dimensions are built from the feature indicators, values, and distribution requirements of the application scenario corresponding to the speech data set to be evaluated.

Indicator quantification: for example, extracting speech duration, sampling rate, gender distribution, and age distribution.

Step 2: Determine the feature types of the speech data set.

Step 3: Calculate, according to feature type, the distance between the features of the speech data set and the features required by the application scenario.

For features of numerical type, the Euclidean distance between the feature of the speech data set and the feature required by the application scenario is calculated.

Single-valued numerical features such as speech duration and sampling rate fall into this category, and the Euclidean distance represents the distance between the two features.

Here $i \in [1, N]$ indexes the $N$ feature dimensions; the distance compares the $i$-th feature of the data set to be evaluated with the $i$-th feature of the application scenario data; $\omega_i$ is the weight of the corresponding feature; and the index range $[1, n]$ covers the first type of feature, that is, the numerical features.

For features of distribution type, the KL divergence between the feature of the speech data set and the feature required by the application scenario is calculated.

Distribution-type features include, for example, frequency distribution, gender distribution, and age distribution; the KL divergence represents the distance between the two distributions.

Here $i \in [1, N]$ again indexes the $N$ feature dimensions; the divergence compares the $i$-th feature of the data set to be evaluated with the $i$-th feature of the application scenario data; $\omega_i$ is the weight of the corresponding feature; and the index range $[n+1, N]$ covers the distribution-type features. Each distribution-type feature is itself a distribution with $M^{(i)}$ dimensions, each denoted $x_j$.

Step 4: Take the sum of the distances for the feature types contained in the speech data set to be evaluated as the matching degree between the features of the speech data set and the features required by the application scenario.
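The sketch below combines a weighted Euclidean distance over numerical features with a weighted sum of KL divergences over distribution features. It is only one plausible reading of the steps above, since the source formulas are not reproduced in this text. Placing the weights inside the square root, multiplying each KL term by its weight, and pre-scaling numerical features to comparable ranges are all assumptions. All function and variable names are hypothetical.

```python
import math
from typing import Sequence


def weighted_euclidean(x: Sequence[float], y: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Assumed form: sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions; eps guards against q_j = 0."""
    return sum(pj * math.log((pj + eps) / (qj + eps))
               for pj, qj in zip(p, q) if pj > 0)


def feature_matching(num_eval, num_ref, num_weights,
                     dist_eval, dist_ref, dist_weights) -> float:
    """Numerical distance plus weighted distribution distances."""
    s = weighted_euclidean(num_eval, num_ref, num_weights)
    e = sum(w * kl_divergence(p, q)
            for p, q, w in zip(dist_eval, dist_ref, dist_weights))
    return s + e


# Numerical features, assumed pre-scaled to [0, 1]: duration and sampling rate.
# Distribution feature: gender split (female, male).
t_same = feature_matching([0.5, 1.0], [0.5, 1.0], [1.0, 1.0],
                          [[0.5, 0.5]], [[0.5, 0.5]], [1.0])
t_diff = feature_matching([0.2, 1.0], [0.5, 1.0], [1.0, 1.0],
                          [[0.9, 0.1]], [[0.5, 0.5]], [1.0])
```

Both terms are distances, so identical features yield 0 and the value grows as the data set drifts away from the scenario requirements.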

Wide-domain data helps improve the generality of a machine learning model, but it does little for the model's performance in a particular application domain. To test performance in a given domain, data must be accumulated specifically for that domain, and how closely the data fits the domain content becomes an important quality indicator. In the embodiments of the present application, the word frequencies of the data to be evaluated are extracted to calculate the content similarity with a standard data set, as follows.

The specific steps for calculating the content similarity between the speech data set and the data set preset for the application scenario are as follows:

Step 1: Extract the feature word frequency vectors of the speech data set and of the data set preset for the application scenario.

In the embodiments of the present application, the data set preset for an application scenario can be obtained as follows:

In the real scenario where the speech data set is applied, a standard set of real data from actual use is collected. The speech data must be annotated and transcribed into text files, and the annotation quality must be checked against a target accuracy of 99%. To measure the accuracy, sentences are selected at random and experts check and proofread them according to the transcription specification. Their result serves as the reference correct result (ref). The character accuracy of each sentence is then calculated, and the average character accuracy is used as the evaluation metric for the data set.

Here $cer_i$ denotes the character accuracy of the $i$-th sentence, and the evaluation metric is the mean of $cer_i$ over all sampled sentences.

A data set with a high evaluation metric is used as the preset data set for the application scenario.
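The source does not spell out how the per-sentence character accuracy is computed. The sketch below assumes one common choice: one minus the character-level edit distance divided by the reference length, clipped at zero. The function names and sample sentences are hypothetical.

```python
from typing import Iterable, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def char_accuracy(ref: str, hyp: str) -> float:
    """Assumed definition: 1 - edit_distance / len(ref), clipped at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def mean_char_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average per-sentence character accuracy over (ref, annotation) pairs."""
    scores = [char_accuracy(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# (expert reference, annotation under review) for two sampled sentences
sampled = [("今天天气", "今天天气"), ("打开空调", "打开空条")]
metric = mean_char_accuracy(sampled)  # (1.0 + 0.75) / 2
```

Under this definition, a data set would pass the stated quality bar when `metric` is at least 0.99.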

The word frequency vector of the speech data set to be evaluated and that of the preset data set are calculated in the same way. The process is described below for the speech data set.

The word frequency vector $u$ of speech data set $x$ is computed, word by word, from the following quantities: $f_n$ is the number of times a given word appears in the data set; $f$ is the number of occurrences of the most frequent word in the data set; $N$ is the number of articles in a general corpus; and $p_n$ is the number of documents that contain the word. These correspond to the term frequency and inverse document frequency parts of a TF-IDF style weight.

For example, a search on a mainstream search engine such as Google finds 25 billion web pages containing the character "的"; assume this is the total number of Chinese web pages. There are 6.23 billion pages containing "中国" (China), 48.4 million containing "蜜蜂" (bee), and 97.3 million containing "养殖" (farming).
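To make the example concrete, the sketch below plugs the page counts quoted above into a TF-IDF style weight. The exact formula is an assumption: term frequency is taken as `f_n / f` and inverse document frequency as `log(N / p_n)` with the natural logarithm and no smoothing. The in-data-set word counts are invented for illustration, and `tfidf_vector` is a hypothetical name.

```python
import math
from typing import Dict


def tfidf_vector(counts: Dict[str, int], doc_freq: Dict[str, float],
                 n_docs: float) -> Dict[str, float]:
    """Assumed weight per word: (f_n / f) * log(N / p_n)."""
    if not counts:
        return {}
    max_count = max(counts.values())  # f: count of the most frequent word
    return {word: (c / max_count) * math.log(n_docs / doc_freq[word])
            for word, c in counts.items()}


N_PAGES = 25e9  # pages containing "的", taken as the corpus size
PAGE_COUNTS = {"中国": 6.23e9, "蜜蜂": 48.4e6, "养殖": 97.3e6}

# Hypothetical counts: each word appears 20 times in the data set.
word_counts = {"中国": 20, "蜜蜂": 20, "养殖": 20}
u = tfidf_vector(word_counts, PAGE_COUNTS, N_PAGES)
ranked = sorted(u, key=u.get, reverse=True)
```

With equal in-data-set counts, the rarer a word is on the web, the larger its weight, so "蜜蜂" ranks above "养殖", and both rank well above the very common "中国".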

Step 2: Calculate the similarity between the feature word frequency vectors of the speech data set and of the data set preset for the application scenario, based on the cosine similarity algorithm.

The similarity of the two vectors is calculated from the feature word frequency vectors of the standard data set and the data set to be evaluated. Let $P$ and $Q$ be the two $n$-dimensional vectors, $P = [P_1, P_2, \ldots, P_n]$ and $Q = [Q_1, Q_2, \ldots, Q_n]$. The cosine of the angle $\theta$ between $P$ and $Q$ is:

$$\cos\theta = \frac{\sum_{i=1}^{n} P_i Q_i}{\sqrt{\sum_{i=1}^{n} P_i^2}\,\sqrt{\sum_{i=1}^{n} Q_i^2}}$$

where $P$ is the word frequency vector of the data set being evaluated and $Q$ is the word frequency vector of the reference data set. The resulting similarity $N$ satisfies $N \in [0, 1]$: the larger the value of $N$, the closer the two word frequency vectors and the higher the text similarity.

The calculated similarity between the feature word frequency vectors of the speech data set to be evaluated and of the data set preset for the application scenario is taken as the content similarity between the two data sets.
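A minimal sketch of the cosine similarity step follows. It assumes both word frequency vectors have already been aligned on a shared vocabulary so that position `i` refers to the same word in each. The function name and the sample vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """cos(theta) = (P . Q) / (|P| * |Q|); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


evaluated = [1.0, 2.0, 3.0]   # weights of the data set under evaluation
reference = [2.0, 4.0, 6.0]   # same direction, different scale
unrelated = [0.0, 0.0, 0.0]

n_close = cosine_similarity(evaluated, reference)
n_empty = cosine_similarity(evaluated, unrelated)
```

Because cosine similarity ignores vector length, a small data set and a large one with the same relative word usage score as fully similar. Word frequency weights are non-negative, which is why the result stays within [0, 1].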

During collection of a speech data set, the recording equipment and the speakers can cause problems such as missing signal and incomplete recordings, which produce a large amount of invalid information. The embodiments of the present application therefore check the signal quality of the speech data set, using speech boundary detection to count the invalid signal segments in the data set, and calculate the signal validity as follows.

The specific steps for calculating the signal validity of the speech data set are as follows:

Step 1: Obtain the invalid signal segments in the speech data set based on a speech boundary detection method.

Invalid segments are of two kinds. Signal anomalies: speech data normally consists of continuous sampling points, but the recording equipment or environmental noise can produce segments in which the signal is discontinuous. Silent segments: problems with the speaker or the recording equipment can produce segments that contain only silence. Both kinds are invalid segments of the speech data set.

Step 2: Count the duration of the invalid signal segments obtained and the total duration of the speech data set.

Step 3: Calculate the duration of the valid segments of the speech data set, and calculate the ratio of the valid duration to the total duration.

If the total duration is $T$ and the total duration of the invalid segments is $T_1$, then the total duration of the valid segments is $T - T_1$, and the ratio of valid duration to total duration is $(T - T_1)/T$.

Step 4: Take the ratio as the signal validity of the speech data set.

The larger the ratio, the higher the signal validity of the speech data set.
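The ratio above can be sketched as follows, assuming the speech boundary detector has already produced the invalid segments as `(start, end)` pairs in seconds. Detection itself is out of scope here. Merging overlapping segments so that they are not counted twice is an added assumption, and the function name is hypothetical.

```python
from typing import List, Tuple


def signal_validity(total_duration: float,
                    invalid_segments: List[Tuple[float, float]]) -> float:
    """(T - T1) / T, where T1 is the length of the union of invalid segments."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    invalid = 0.0
    last_end = 0.0
    for start, end in sorted(invalid_segments):
        start = max(start, last_end, 0.0)  # skip any part already counted
        end = min(end, total_duration)     # clip to the recording length
        if end > start:
            invalid += end - start
            last_end = end
    return (total_duration - invalid) / total_duration


# 100 s of audio: 5 s of leading silence plus two overlapping dropouts.
segments = [(0.0, 5.0), (40.0, 50.0), (45.0, 55.0)]
validity = signal_validity(100.0, segments)  # invalid union is 20 s
```

In this example the two dropouts overlap by 5 s, so the invalid total is 20 s rather than 25 s and the validity is 0.8.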

In machine learning applications, the emphasis placed on different aspects of speech data set quality varies with the application type and training objective. The quality evaluation indicators should therefore be selected so that they cover the evaluation objectives sufficiently and remain highly consistent with them. After all selected evaluation metrics have been calculated, the data subset receives a score on a 100-point scale for each metric. All scores are then aggregated into a final data quality score, and the data sets are ranked by that score. For example, the scores can be aggregated as follows:

According to the weight set for each indicator, the indicators are weighted and summed to obtain the score of the speech data set:

$$M = w_1 \times Y + w_2 \times S + w_3 \times T + w_4 \times N + w_5 \times B$$

where $w_1$ to $w_5$ are the weights of the indicators, $Y$ is the language element coverage, $S$ is the random information amount, $T$ is the feature matching degree, $N$ is the content similarity, and $B$ is the signal validity.
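The weighted summation can be sketched as below. The weights and indicator scores are invented for illustration; the source only says that weights are configured per application scenario and that each indicator is scored on a 100-point scale. The function name `evaluation_value` is hypothetical, and the keys follow the symbols in the formula for M.

```python
from typing import Dict

# Y: coverage, S: random information, T: feature matching,
# N: content similarity, B: signal validity.
INDICATORS = ("Y", "S", "T", "N", "B")


def evaluation_value(scores: Dict[str, float],
                     weights: Dict[str, float]) -> float:
    """M = w1*Y + w2*S + w3*T + w4*N + w5*B."""
    missing = [k for k in INDICATORS if k not in scores or k not in weights]
    if missing:
        raise KeyError(f"missing indicators: {missing}")
    return sum(weights[k] * scores[k] for k in INDICATORS)


# Hypothetical 100-point scores and scenario weights that sum to 1.
scores = {"Y": 80.0, "S": 60.0, "T": 70.0, "N": 90.0, "B": 100.0}
weights = {"Y": 0.2, "S": 0.2, "T": 0.2, "N": 0.2, "B": 0.2}
m = evaluation_value(scores, weights)
```

Keeping the weights in a per-scenario dictionary mirrors the idea of one quality evaluation model per application scenario: switching scenarios only means loading a different `weights` mapping.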

下面结合附图，详细描述本申请实施例中实现语音数据集质量评估过程。The following describes in detail the process of implementing the speech data set quality assessment in the embodiment of the present application in conjunction with the accompanying drawings.

参见图1，图1为本申请实施例中语音数据集质量评估流程示意图。具体步骤为：See Figure 1, which is a schematic diagram of the speech data set quality assessment process in an embodiment of the present application. The specific steps are:

步骤101，获取待评估的语音数据集，以及所述语音数据集对应的应用场景。Step 101: Obtain a speech data set to be evaluated and an application scenario corresponding to the speech data set.

本实施例中获取待评估的语音数据集，以及所述语音数据集对应的应用场景，包括：In this embodiment, the speech data set to be evaluated and the application scenario corresponding to the speech data set are obtained, including:

First approach:

Receive a speech data set sent by a user together with the application scenario corresponding to the speech data set, thereby acquiring the speech data set and its corresponding application scenario.

In this approach, a user who needs a quality assessment of a speech data set sends the speech data set to be evaluated, together with its corresponding application scenario, to the speech data set quality assessment device.

Second approach:

Collect a speech data set in the application scenario, thereby acquiring the speech data set and its corresponding application scenario.

In this approach, the speech data set quality assessment device specifies an application scenario and then collects the corresponding speech data set in that scenario.

The collection process may be implemented as follows:

A collection plan is designed and carried out according to the technical application field and the data feature requirements. The information capacity of the corpus should be as large as possible, and the distribution of the text features of the corpus should match the distribution found in practice as closely as possible. The main process includes:

Text design and screening

Establish the text content for voice recording: with reference to the phoneme set of the language concerned, build an original vocabulary (containing the required common words and proper nouns), and design the recording text set based on the original vocabulary.

Voice recording and collection

Organize voice recording according to the technical application feature requirements of the speech data set and the recording text content. The recording environment and the speakers must meet the feature requirements, and the feature information of each piece of speech data must be recorded.

Step 102: obtain an evaluation value of the speech data set based on the quality evaluation model corresponding to the application scenario.

Step 103: evaluate the speech quality of the speech data set according to the evaluation value.

In a specific implementation, the larger the evaluation value, the higher the speech quality of the speech data set to be evaluated.

In a specific implementation, multiple quality levels may also be set, each corresponding to a configured range of evaluation values, in order to determine the quality level of the data to be evaluated.
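Mapping an evaluation value to a quality level can be sketched as a range lookup. The level names and boundaries below are hypothetical; the source only states that each level corresponds to a configured range of evaluation values.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the second, third, and fourth levels.
LEVEL_BOUNDS = [60.0, 75.0, 90.0]
# Hypothetical level names, ordered from lowest to highest quality.
LEVEL_NAMES = ["poor", "fair", "good", "excellent"]


def quality_level(evaluation_value: float) -> str:
    """Return the level whose value range contains the evaluation value."""
    return LEVEL_NAMES[bisect_right(LEVEL_BOUNDS, evaluation_value)]
```

With these assumed boundaries, a value of 82 falls in the range from 75 up to (but not including) 90 and is labeled "good".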

In the embodiments of the present application, the requirements analysis of the technical application scenario, combined with the corresponding quantitative quality score of the speech data set, directly reflects the value of the speech data set for technical research and development and for scenario applications. Defects and problems in the speech data set can be found in advance and the data adjusted and repaired in time, which improves the application level of the algorithm models of R&D institutions and enterprises and saves computing resources.

Based on the same inventive concept, an embodiment of the present application further provides a speech data set quality assessment device. Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a device to which the above technique is applied in an embodiment of the present application. The device includes a storage unit 201, a first acquisition unit 202, a second acquisition unit 203, and an evaluation unit 204.

The storage unit 201 is configured to store the quality evaluation model corresponding to an application scenario. The quality evaluation model is used to calculate the language element coverage, the random information amount, and the signal validity of an input speech data set, the feature matching degree between the features of the speech data set and the features required by the application scenario, and the content similarity between the speech data set and a data set preset for the application scenario; and to perform a weighted summation of the language element coverage, random information amount, signal validity, feature matching degree, and content similarity according to the configured weights to obtain the evaluation value of the speech data set.

The first acquisition unit 202 is configured to acquire a speech data set to be evaluated and the application scenario corresponding to the speech data set.

The second acquisition unit 203 is configured to obtain the evaluation value of the speech data set based on the quality evaluation model in the storage unit 201 that corresponds to the application scenario acquired by the first acquisition unit 202.

The evaluation unit 204 is configured to evaluate the speech quality of the speech data set according to the evaluation value obtained by the second acquisition unit 203.
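The cooperation of the four units can be sketched as a small Python class. This is only one possible arrangement under stated assumptions: the class name `AssessmentDevice`, the method names, the pass threshold, and the toy model are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

# A quality evaluation model maps a data set to an evaluation value.
QualityModel = Callable[[Sequence[str]], float]


@dataclass
class AssessmentDevice:
    # Storage unit 201: one quality evaluation model per application scenario.
    models: Dict[str, QualityModel] = field(default_factory=dict)
    pass_threshold: float = 60.0  # hypothetical threshold

    def acquire(self, dataset: Sequence[str], scenario: str) -> Tuple[List[str], str]:
        """First acquisition unit 202: take in the data set and its scenario."""
        if scenario not in self.models:
            raise KeyError(f"no quality evaluation model for scenario {scenario!r}")
        return list(dataset), scenario

    def evaluation_value(self, dataset: Sequence[str], scenario: str) -> float:
        """Second acquisition unit 203: apply the model stored for the scenario."""
        return self.models[scenario](dataset)

    def evaluate(self, value: float) -> str:
        """Evaluation unit 204: judge quality from the evaluation value."""
        return "acceptable" if value >= self.pass_threshold else "needs repair"


def toy_model(dataset: Sequence[str]) -> float:
    """Hypothetical stand-in for a real quality evaluation model."""
    return min(100.0, 10.0 * len(dataset))


device = AssessmentDevice(models={"voice_assistant": toy_model})
dataset, scenario = device.acquire(["a.wav", "b.wav", "c.wav"], "voice_assistant")
value = device.evaluation_value(dataset, scenario)
verdict = device.evaluate(value)
```

Keeping the models in a dictionary keyed by scenario mirrors the idea that the storage unit holds a separate model for each application scenario.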

Preferably,

the first acquisition unit 202 is specifically configured, when acquiring the speech data set to be evaluated and its corresponding application scenario, to: receive a speech data set sent by a user together with its corresponding application scenario, thereby acquiring the speech data set and the application scenario; or collect a speech data set in the application scenario, thereby acquiring the speech data set and the application scenario.

Preferably,

the storage unit 201 is specifically configured, when calculating the language element coverage of the speech data set, to: extract the language elements of the speech data set; calculate, for each extracted language element, the ratio of its number of elements to the number of elements of the corresponding basic feature for the application scenario; and calculate the average of the ratios over all language elements as the language element coverage of the speech data set.
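The coverage calculation can be illustrated with sets. In this sketch the element types ("phonemes", "tones"), the reference inventories, and the function name are hypothetical, and intersecting the extracted items with the reference set is an added assumption that keeps each ratio at or below 1.

```python
from typing import Mapping, Set


def language_element_coverage(
    extracted: Mapping[str, Set[str]],
    reference: Mapping[str, Set[str]],
) -> float:
    """Average, over element types, of (items found) / (items in the reference set)."""
    ratios = []
    for element_type, ref_items in reference.items():
        if not ref_items:
            continue  # skip element types with an empty reference inventory
        found = extracted.get(element_type, set()) & ref_items
        ratios.append(len(found) / len(ref_items))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical inventories: 3 of 4 phonemes and 2 of 4 tones are covered.
extracted = {"phonemes": {"a", "o", "e"}, "tones": {"1", "2"}}
reference = {"phonemes": {"a", "o", "e", "i"}, "tones": {"1", "2", "3", "4"}}
coverage = language_element_coverage(extracted, reference)
```

Here the two ratios are 0.75 and 0.5, so the coverage is their average, 0.625.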

Preferably,

the storage unit 201 is specifically configured, when calculating the random information amount of the speech data set, to: compile statistics on the data features of the speech data set according to its labels; calculate the information entropy of each data feature and sum the results; and take the sum as the random information amount of the speech data set.
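The entropy-based calculation can be sketched as follows. The label fields ("gender", "accent") are hypothetical, and the base-2 logarithm is an assumption, since the source does not state the base.

```python
import math
from collections import Counter
from typing import Iterable, Mapping, Sequence


def shannon_entropy(values: Iterable[str]) -> float:
    """Entropy in bits of the empirical distribution of the given values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def random_information(labels: Sequence[Mapping[str, str]], features: Sequence[str]) -> float:
    """Sum of the per-feature entropies over the labeled utterances."""
    return sum(
        shannon_entropy(record[f] for record in labels if f in record)
        for f in features
    )


# Hypothetical labels for four utterances.
labels = [
    {"gender": "f", "accent": "north"},
    {"gender": "m", "accent": "north"},
    {"gender": "f", "accent": "south"},
    {"gender": "m", "accent": "north"},
]
info = random_information(labels, ["gender", "accent"])
```

A feature whose values are spread evenly contributes more entropy than one dominated by a single value, so a more varied data set obtains a larger random information amount.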

Preferably,

the storage unit 201 is specifically configured, when calculating the matching degree between the features of the speech data set and the features required by the application scenario, to: determine the numerical type of the speech data set; for a speech data set of a non-discrete numerical type, calculate the Euclidean distance between the features of the speech data set and the features required by the application scenario; for a speech data set of a discrete numerical type, calculate the KL divergence between the features of the speech data set and the features required by the application scenario; and take the sum of the Euclidean distance and the KL divergence as the matching degree between the features of the speech data set and the features required by the application scenario.
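The two distance measures and their sum can be sketched as below. The direction of the KL divergence (data set relative to requirement), the natural logarithm, the `eps` floor for categories missing from the requirement, and all names are assumptions. Both quantities are distances, so smaller values indicate a closer match; how that is turned into a score is not specified in this passage.

```python
import math
from typing import Mapping, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally long numeric feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kl_divergence(p: Mapping[str, float], q: Mapping[str, float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(
        pi * math.log(pi / max(q.get(key, 0.0), eps))
        for key, pi in p.items()
        if pi > 0
    )


def feature_matching(
    continuous_actual: Sequence[float],
    continuous_required: Sequence[float],
    discrete_actual: Mapping[str, float],
    discrete_required: Mapping[str, float],
) -> float:
    """Sum of the Euclidean distance (non-discrete features) and KL divergence (discrete features)."""
    return (
        euclidean_distance(continuous_actual, continuous_required)
        + kl_divergence(discrete_actual, discrete_required)
    )
```

For example, a hypothetical call with continuous features such as a sampling rate in kHz and a signal-to-noise ratio, plus a gender distribution as the discrete feature, returns 0 only when the data set matches the requirement exactly.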

Preferably,

the storage unit 201 is specifically configured, when calculating the content similarity between the speech data set and the data set preset for the application scenario, to: extract the feature word frequency vectors of the speech data set and of the data set preset for the application scenario; and calculate the similarity between the two feature word frequency vectors using the cosine similarity algorithm.
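Cosine similarity over word frequency vectors can be sketched with `collections.Counter`. Tokenization is assumed to have been done already; the sample transcripts and function name are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Cosine of the angle between the word frequency vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as having no similarity
    return dot / (norm_a * norm_b)


# Hypothetical tokenized transcripts: the data set under test and the preset data set.
dataset_tokens = "turn on the light turn off the light".split()
preset_tokens = "turn on the fan".split()
similarity = cosine_similarity(dataset_tokens, preset_tokens)
```

Because word frequencies are non-negative, the result lies between 0 (no shared words) and 1 (identical relative frequencies).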

Preferably,

the storage unit 201 is specifically configured, when calculating the signal validity of the speech data set, to: obtain the invalid signal segments in the speech data set using a speech boundary detection method; determine the duration of the invalid signal segments obtained and the total duration of the speech data set; calculate the duration of the valid segments of the speech data set and the ratio of that duration to the total duration; and take the ratio as the signal validity of the speech data set.
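Given the invalid segments, the ratio itself is simple arithmetic. The sketch below assumes the segments are supplied as non-overlapping (start, end) pairs in seconds by some speech boundary detector, which is not implemented here; the function name is hypothetical.

```python
from typing import Iterable, Tuple


def signal_validity(invalid_segments: Iterable[Tuple[float, float]], total_duration: float) -> float:
    """Ratio of valid duration to total duration, given non-overlapping invalid segments."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    invalid = sum(end - start for start, end in invalid_segments)
    valid = max(total_duration - invalid, 0.0)
    return valid / total_duration


# Hypothetical detector output: 0.5 s and 1.0 s of invalid signal in a 10 s recording.
validity = signal_validity([(0.0, 0.5), (4.0, 5.0)], 10.0)
```

In this example 8.5 of the 10 seconds are valid, giving a signal validity of 0.85.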

The units in the above embodiments may be integrated together or deployed separately; they may be combined into a single unit or further divided into multiple sub-units.

Another embodiment provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the speech data set quality assessment method when executing the program.

Another embodiment provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the speech data set quality assessment method.

FIG. 3 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present invention. As shown in FIG. 3, the electronic device may include a processor 310, a communications interface 320, a memory 330, and a communication bus 340, wherein the processor 310, the communications interface 320, and the memory 330 communicate with one another through the communication bus 340. The processor 310 may call the logic instructions in the memory 330 to execute the following method: acquiring a speech data set to be evaluated and the application scenario corresponding to the speech data set; obtaining an evaluation value of the speech data set based on the quality evaluation model corresponding to the application scenario; and evaluating the speech quality of the speech data set according to the evaluation value;

wherein the quality evaluation model is used to calculate the language element coverage, the random information amount, and the signal validity of the input speech data set, the feature matching degree between the features of the speech data set and the features required by the application scenario, and the content similarity between the speech data set and the data set preset for the application scenario; and to perform a weighted summation of the language element coverage, random information amount, signal validity, feature matching degree, and content similarity according to the configured weights to obtain the evaluation value of the speech data set.

In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement the embodiments without creative effort.

From the description of the above implementations, those skilled in the art can clearly understand that each implementation may be realized by software together with a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the above technical solution in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.