CN113901992B - Training data screening method, system, device and medium - Google Patents

Training data screening method, system, device and medium

Info

Publication number
CN113901992B
CN113901992B
Authority
CN
China
Prior art keywords
decoding
audio data
data
accuracy
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111109966.5A
Other languages
Chinese (zh)
Other versions
CN113901992A (en)
Inventor
袁正鹏
王强强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baige Feichi Technology Co ltd
Original Assignee
Beijing Baige Feichi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baige Feichi Technology Co ltd
Priority to CN202111109966.5A
Publication of CN113901992A
Application granted
Publication of CN113901992B
Status: Active
Anticipated expiration

Abstract

(Translated from Chinese)

The present invention relates to the field of speech recognition processing, and is particularly suited to obtaining training data for the machine learning models used in speech recognition and transcription. Training models for different scenarios requires large volumes of data and extensive manual labeling, which makes data acquisition costly and resource-intensive, yields data of uneven quality and accuracy, and leaves existing pseudo-label screening methods ineffective. To address these drawbacks, the present invention proposes a training data screening method, system, device, and medium, aimed at solving the technical problem of how to screen high-quality training data for speech recognition, search, and transcription models based on the pseudo-label accuracy of semi-supervised learning. To this end, the method of the present invention ranks decoding results by the mean number of node links along the decoding paths produced during decoding, and selects the top-ranked pseudo-labeled speech data as model training data. This improves screening efficiency and data quality while reducing cost and consumption.

Description

Training data screening method, system, device and medium
Technical Field
The invention belongs to the technical field of speech recognition processing, and particularly relates to a method, system, device, and medium for screening training data, especially suited to acquiring training data for the machine learning models used in speech recognition and transcription.
Background
Speech recognition, speech conversion, speech-recognition search, and similar processing are used in intelligent voice interaction, human-machine interaction, and related contexts, and may be implemented with machine learning, for example via various speech/audio recognition models. As speech recognition technology has progressed, models trained on large volumes of data have been able to surpass humans in certain specific scenarios. However, training a speech model requires a large amount of manually labeled data to improve its predictive recognition performance (precision and accuracy); that is, supervised learning is needed to achieve good recognition and interaction, and without sufficient data even a good model architecture cannot learn enough. Manually labeled data is high in quality but expensive, costly to produce, and slow to obtain, so it is preferable to also improve model performance using unlabeled data. Training a model with the unlabeled portion of the available data is called semi-supervised learning: for example, an existing speech model is used directly to recognize audio data, producing a "pseudo label" for that audio; the relatively high-quality pseudo-labeled data is then added to the labeled data, and the model is retrained.
However, whether the resulting pseudo-labeled audio data meets the quality requirements of training data is uncertain. The accuracy of the pseudo-labeled data, i.e., the recognition/transcription accuracy of the audio, therefore has to be assessed by some index; the pseudo-labeled data judged to be more accurate is then screened out, added to the training data, and fed back to the speech recognition model for retraining, ensuring an improvement in the model's recognition efficiency and recognition quality.
Conventional training data screening schemes either filter by sentence-level confidence or compute word-level confidence with MBR (Minimum Bayes Risk).
The former approach is often used in online systems: it measures sentence confidence from the difference in total cost between the best and second-best paths in each sentence's lattice
sentence_confidence=1-exp(-(best_cost-second_best_cost))
The confidence computed this way is coarse and inaccurate and does not reflect the actual quality of sentence recognition. In particular, for long sentences the language model does not score long sequences well, and the criterion breaks down entirely.
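A minimal sketch of this sentence-confidence computation (the function and variable names are illustrative, not from any production system). With costs taken as negative log-probabilities, the best path has the lower cost, so the exponent is the cost gap between the runner-up and the best path:

```python
import math

def sentence_confidence(best_cost: float, second_best_cost: float) -> float:
    """Sentence confidence from the gap between the best and
    second-best lattice paths. Costs are negative log-probabilities,
    so second_best_cost >= best_cost; a zero gap gives confidence 0,
    and a large gap pushes confidence toward 1."""
    return 1.0 - math.exp(-(second_best_cost - best_cost))
```

As the text notes, this collapses for long utterances: the cost gap stops tracking actual recognition quality once the language model scores long sequences poorly.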
The latter scheme decodes with a decoding algorithm, computes word-level confidence from the MBR value of the decoding result, and uses that confidence as the screening condition. The drawback of filtering data by MBR value is that the MBR computation is based on the result of the full decoding pipeline, including the language model, so it is affected by the acoustic-scale and by LM rescoring, and different decoding strategies yield different results; screening by an index such as the MBR value is therefore not stable. For the specifics of MBR-based screening, see Xu H., Povey D., Mangu L., et al., "Minimum Bayes Risk decoding and system combination based on a recursion for edit distance", Computer Speech & Language, 2011, 25(4): 802-828.
The existing approach to screening pseudo-labeled data therefore needs improvement, so that screening is subject to less interference, more stable, and more efficient, determines the accuracy of pseudo-labeled data more precisely and quickly, and efficiently selects valuable high-quality pseudo-labeled data, yielding more and better training data.
Disclosure of Invention
First, the technical problem to be solved
The invention aims to solve the technical problem of how to screen training data based on the pseudo-label accuracy of semi-supervised learning; further, how to effectively utilize/screen existing data for semi-supervised training in a K12-scenario AI teaching system to improve the recognition quality of an intelligent speech model; and further, how to screen high-quality training data for speech models in different scenarios.
(II) technical scheme
To solve the above technical problems, a first aspect of the present invention provides a training data screening method. The method takes, over all decoding paths produced during speech recognition, the mean number of nodes that each node of each decoding path may link to as the index for judging accuracy; labels the audio data corresponding to high-accuracy decoding results with a pseudo tag; and uses the pseudo-tagged audio data as the selected training data.
According to a preferred embodiment of the invention, the speech recognition over all decoding paths specifically comprises: performing speech recognition on the input unlabeled audio data with a speech recognition model from the Kaldi toolkit, carrying out beam-search decoding with a preset beam width (beam size) during recognition, and storing all candidate decoding paths searched during decoding.
According to a preferred embodiment of the invention, storing all candidate decoding paths searched during decoding specifically comprises: searching decoding paths in the WFST state network during decoding to obtain a decoding graph, storing all decoding paths searched in the decoding graph in a lattice, and outputting the optimal decoding path as the decoding result corresponding to the audio data.
According to a preferred embodiment of the invention, taking the mean number of nodes that each node of each decoding path may link to, over all decoding paths, as the index for judging accuracy specifically comprises: obtaining, from all decoding paths stored in the lattice, the number of paths each node can link to in each decoding path, computing the mean, using the mean as the accuracy index LATTICE DEPTH, and sorting and comparing the LATTICE DEPTH indexes obtained when the multiple audio data in the dataset are decoded by speech recognition, in order to judge accuracy.
According to a preferred embodiment of the invention, sorting and comparing the LATTICE DEPTH indexes obtained when the multiple audio data in the dataset are decoded specifically comprises: sorting the LATTICE DEPTH indexes corresponding to the respective audio data and comparing them with a preset threshold; if an index LATTICE DEPTH is below the preset threshold, the decoding result of the corresponding audio data is more accurate.
According to a preferred embodiment of the invention, labeling the audio data corresponding to a high-accuracy decoding result with a pseudo tag specifically comprises: obtaining the unlabeled audio data corresponding to the high-accuracy decoding result, and labeling that audio data with the pseudo tag according to the text label in the decoding result output by speech recognition.
According to a preferred embodiment of the invention, using the pseudo-tagged audio data as the selected training data specifically comprises: screening out the unlabeled audio data corresponding to all high-accuracy decoding results, labeling it with pseudo tags, adding the pseudo-tagged audio data to the speech recognition model's training set, and retraining the speech recognition model.
According to a preferred embodiment of the invention, the method further comprises: based on a semi-supervised learning mode, a speech recognition model trained with a small amount of manually labeled data, or a speech recognition model initialized with preset text labels, directly takes multiple unlabeled audio data from the dataset as input and outputs the corresponding decoding results; when one or more high-accuracy decoding results are screened out, the unlabeled audio data corresponding to those decoding results are labeled directly with the text labels in the decoding results.
To solve the above technical problems, a second aspect of the present invention provides a training data screening system comprising an index setting module, a data screening module, and a training data selection module. The index setting module takes the mean number of nodes that each node in a decoding path may link to during speech recognition as the index for judging accuracy; the data screening module labels the audio data corresponding to high-accuracy decoding results with pseudo tags; and the training data selection module uses the pseudo-tagged audio data as the selected training data.
To solve the above technical problem, a third aspect of the present invention proposes an electronic device comprising a processor and a memory for storing a computer executable program, which when executed by the processor performs the method according to any of the first aspects.
To solve the above technical problem, a fourth aspect of the present invention proposes a computer-readable medium storing a computer-executable program that, when executed, implements the method according to any one of the first aspects.
To solve the above technical problem, a fifth aspect of the present invention proposes a computer executable program that, when executed, implements the method according to any one of the first aspects.
(III) beneficial effects
Aiming at the technical problem of how to screen high-quality training data for models such as speech recognition, search, and transcription based on the pseudo-label accuracy of semi-supervised learning, and further at how to use (screen) existing transcription data for semi-supervised training when labeled training data is insufficient in the intelligent speech recognition of a K12-scenario AI teaching system, a new screening index is adopted: decoding results are ranked by the mean link count of the nodes along their decoding paths to judge pseudo-label accuracy, and the top-ranked pseudo-labeled speech data is taken as model training data. This improves screening efficiency and data quality, reduces cost and consumption, and, since model training proceeds from high-quality training data, improves training efficiency and recognition quality.
Furthermore, the index LATTICE DEPTH is determined from the nodes of the decoding graph obtained while searching decoding paths in the state network of a speech recognition system such as the Kaldi toolkit. It can serve as a screening index for reflowed data with essentially no added compute, so high-quality unlabeled data (such as unlabeled audio) can be screened out efficiently and cheaply, and the model retrained to improve its performance. The method introduces no extra parameters, classes, or other processing, and offers simplicity, ease of use, high audio quality, and low noise.
Furthermore, the index-based screening method adopted by embodiments of the invention can be used in any speech recognition scenario: any speech recognition decoder that builds its decoding graph with an FST can produce such a graph, so this is a fairly general screening scheme.
Drawings
FIG. 1 is a main flow diagram of one embodiment of a method of screening training data according to the present invention;
FIG. 2 is a block diagram of the main structure of one embodiment of a screening system for training data in accordance with the present invention;
FIG. 3 is a block diagram of the primary structure of one embodiment of an electronic device in accordance with the present invention;
FIG. 4 is a schematic diagram of the main structure of one embodiment of a more specific electronic device according to the present invention;
fig. 5 is a main structural diagram of one embodiment of a computer-readable medium according to the present invention.
Detailed Description
In describing particular embodiments, specific details of construction, performance, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by those skilled in the art. It is not excluded that one skilled in the art may implement the present invention in a particular case in a solution that does not include the structures, properties, effects, or other characteristics described above.
The flow diagrams in the figures are merely exemplary flow illustrations and do not represent that all of the elements, operations, and steps in the flow diagrams must be included in the aspects of the invention, nor that the steps must be performed in the order shown in the figures. For example, some operations/steps in the flowcharts may be decomposed, some operations/steps may be combined or partially combined, etc., and the order of execution shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different network and/or processing unit means and/or microcontroller means.
The same reference numerals in the drawings denote the same or similar elements, components or portions, and thus repeated descriptions of the same or similar elements, components or portions may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or portions, these devices, elements, components or portions should not be limited by these terms. That is, these phrases are merely intended to distinguish one from the other. For example, a first device may also be referred to as a second device without departing from the spirit of the invention. Furthermore, the term "and/or" is meant to include all combinations of any one or more of the items listed.
Some technical terms that may be involved in the related content of the present invention are described below:
ASR, Automatic Speech Recognition: an interdisciplinary subfield of computer science and computational linguistics that develops methods and technologies enabling computers to recognize and translate spoken language into text.
Kaldi: an open-source speech recognition toolkit originating from the 2009 summer workshop at Johns Hopkins University; Kaldi is one of the most popular speech technology toolkits of recent years.
FA, Finite Automata: composed of a finite set of states and transitions between states, each transition carrying at least one label. The most basic FA is the finite state acceptor (Finite State Acceptor, FSA). For a given input sequence, an FSA returns either an "accept" or a "reject" state.
FST, Finite State Transducer: an extension of the FSA in which each state transition carries an output label in addition to its input label, forming an input-output label pair. Through these pairs, an FST can describe a regular mapping from one set of symbol sequences to another.
Lattice: a form of weighted finite state transducer (WFST) whose input and output labels can be those of any FST (usually transition ids and words), and whose weights combine acoustic-model, language-model, and transition weights. During decoding, the WFST state network formed from the FST-based decoding graph is searched for the paths (e.g., sentences or words) that best match the sound.
CER, Character Error Rate: the edit distance between the recognition result of the audio and its reference label, divided by the number of characters in the reference label and expressed as a percentage.
Acoustic scale: the scaling factor applied to the log-probability score of the acoustic model. In Kaldi speech recognition, audio may be recognized jointly by an acoustic model and a language model, so that the recognized output better conforms to the habits and logic of human speech. The scaling factor scales the probabilistic output of the acoustic model in order to balance the acoustic and language models.
LM rescore: language model re-scoring. After a speech recognition model is obtained, if one wants it to recognize data in a particular domain better, or to combine the strengths of different language models (such as n-gram + RNNLM), the recognition result is often scored a second time using data from different domains or language models trained with different structures, and the second-pass score is taken as the final result.
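The CER defined above can be sketched in a few lines of self-contained Python (a character-level Levenshtein distance; the function names are illustrative):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance: minimum number of
    substitutions, insertions, and deletions turning ref into hyp."""
    dp = list(range(len(hyp) + 1))          # distances for the empty ref prefix
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i              # dp[0]: hyp prefix is empty
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # delete r
                        dp[j - 1] + 1,      # insert h
                        prev + (r != h))    # substitute (free if chars match)
            prev = cur
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance as a percentage of
    the reference-label length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

For example, a four-character reference with one substituted character gives a CER of 25%.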
The method mainly achieves training data acquisition or screening through a pseudo-label accuracy screening scheme based on semi-supervised learning. The flow is: S1, take the mean number of nodes that each node of each decoding path may link to, over all decoding paths in speech recognition decoding, as the index for judging accuracy; S2, label the audio data corresponding to high-accuracy decoding results with pseudo tags, and use the pseudo-tagged audio data as the selected training data.
[ Example 1]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the implementation of the method of the present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings.
The main flow of one embodiment of the method of the present invention, shown in fig. 1, is described here. The method is based on semi-supervised learning: a speech recognition model trained with a small amount of manually labeled data, or initialized with preset text labels, directly takes multiple unlabeled audio data from a dataset as input and outputs the corresponding decoding results, so that when one or more high-accuracy decoding results are screened out, the corresponding unlabeled audio data are labeled directly with the text labels in those results. The main flow comprises the following steps:
Step S110, perform speech recognition on the audio data with the speech recognition model trained by semi-supervised learning.
In one embodiment, the pseudo-label accuracy of the training data is used in model training for speech recognition in a semi-supervised learning mode. In particular, speech recognition models are constructed, including acoustic models and language models, such as the speech recognition tools implemented in the Kaldi toolkit.
Further, speech recognition models in different application scenarios often require training data from the specific domain to achieve the best results. For example, in bullet-screen speech recognition and quality-inspection speech transcription tasks, the models need targeted enhancement training with data reflowed from each scenario to meet its demands. In one embodiment, semi-supervised training of the speech recognition model may proceed as follows: after the model is trained on a very small amount of audio with manually labeled text, it outputs the corresponding text labels together with its decoding/recognition results, so the text labels in the screened high-accuracy results can be attached directly to the corresponding audio as reflowed training data for retraining. Alternatively, text labels can be preset for the model, for example according to the characteristics of scenarios in actual applications such as bullet-screen speech recognition, quality-inspection speech transcription, and spoken search-query recognition; unlabeled audio is fed directly into the model, its accuracy is evaluated, the data is screened, and the predicted text labels in the decoding results are used for labeling.
Therefore, when the labeled data available to the speech recognition model is insufficient in the intelligent speech recognition of the AI teaching system under the K12 scenario, the existing transcription data (audio data with corresponding text labels) can be effectively utilized (screened) for semi-supervised training, yielding training data to train the model and improving its recognition performance, quality, and effect. To avoid the influence of noise, the existing transcription data must also meet certain quality requirements before it can be used to expand the training data: high-quality data must be screened out, ensuring that the transcription accuracy of the selected data is high enough and its noise small enough that training on it genuinely helps model performance.
Step S120, take the mean number of nodes that each node of each decoding path may link to, over all decoding paths during speech recognition, as the index for judging accuracy.
In one embodiment, a speech recognition model from the Kaldi toolkit performs speech recognition on the input unlabeled audio data, carries out beam-search decoding with a preset beam width (beam size) during recognition, and stores all candidate decoding paths searched during decoding.
Further, storing all candidate decoding paths searched during decoding specifically comprises: searching decoding paths in the WFST state network during decoding to obtain a decoding graph, storing all decoding paths searched in the decoding graph in a lattice, and outputting the optimal decoding path as the decoding result corresponding to the audio data.
Further, taking the mean number of nodes that each node of each decoding path may link to, over all decoding paths, as the index for judging accuracy specifically comprises: obtaining, from all decoding paths stored in the lattice, the number of paths each node can link to in each decoding path, computing the mean, using the mean as the accuracy index LATTICE DEPTH, and sorting and comparing the LATTICE DEPTH indexes obtained when the multiple audio data in the dataset are decoded by speech recognition, in order to judge accuracy.
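A minimal sketch of the LATTICE DEPTH computation under a simplifying assumption: the stored lattice is represented as a plain adjacency list mapping each node id to the nodes it links to (a hypothetical stand-in for a real Kaldi lattice, not the toolkit's API):

```python
def lattice_depth(links: dict[int, list[int]]) -> float:
    """Mean number of linked nodes per node, over every node of the
    stored decoding paths. A low value means few competing paths,
    i.e. a less ambiguous (likely more accurate) decode."""
    if not links:
        return 0.0
    return sum(len(succ) for succ in links.values()) / len(links)

# A nearly linear lattice (one dominant path) vs. an ambiguous one.
linear = {0: [1], 1: [2], 2: [3], 3: []}
branchy = {0: [1, 2], 1: [3, 4], 2: [3, 4], 3: [5], 4: [5], 5: []}
```

Here `lattice_depth(linear)` is 0.75 while `lattice_depth(branchy)` is higher, matching the intuition that ambiguous decodes expand more paths per node.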
Further, sorting and comparing the LATTICE DEPTH indexes obtained when the multiple audio data in the dataset are decoded specifically comprises: sorting the LATTICE DEPTH indexes corresponding to the respective audio data and comparing them with a preset threshold; if an index LATTICE DEPTH is below the preset threshold, the decoding result of the corresponding audio data is more accurate.
Specifically, a speech recognition system implemented with the Kaldi toolkit, e.g., an HMM model, constructs a WFST state network (a state network based on FSTs and lattices) and decodes by searching that network for the path that best matches the sound to be recognized. For example, a beam search (for a globally optimal path) can be run with a given beam size during decoding, producing a decoding graph that contains many possible decoding paths, from which the optimal, best-matching path is taken as the decoding result. All candidate decoding paths can be stored in a lattice, which amounts to storing the decoding graph. The optimal path is the decoding result, i.e., the model's recognition of the audio, which can then be output.
The mean number of nodes that each node in the decoding path may link to is taken as the ranking or judging index LATTICE DEPTH for evaluating pseudo-label accuracy. Computing this mean only averages the number of possibly linked nodes, i.e., the number of paths that may unfold; the algorithm is simple, efficient, and light on computing resources, and the quantity is obtained directly during decoding without introducing any interference or uncertainty.
When recognizing and training with the semi-supervised speech recognition model, a dataset of multiple unlabeled audio inputs is fed to the model for decoding, yielding the decoding results and the corresponding LATTICE DEPTH indexes. The indexes are ranked from high to low or from low to high, preferably low to high, and a suitable threshold is preset, e.g., a threshold of 5 in a bullet-screen speech recognition scenario. All audio whose LATTICE DEPTH value is below the threshold 5 is then reflowed; this is high-quality audio, and the decoding results whose LATTICE DEPTH index is below the threshold 5 are the high-accuracy results.
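The rank-and-threshold step can be sketched as follows (the utterance ids and the default threshold of 5 follow the bullet-screen example above; the function name is illustrative):

```python
def select_reflow_utts(depths: dict[str, float],
                       threshold: float = 5.0) -> list[str]:
    """Keep utterance ids whose LATTICE DEPTH is below the threshold,
    ranked from lowest (least ambiguous decode) upward; these are the
    high-accuracy decodes whose audio flows back into training."""
    kept = [utt for utt, depth in depths.items() if depth < threshold]
    return sorted(kept, key=lambda utt: depths[utt])

depths = {"utt1": 2.1, "utt2": 7.8, "utt3": 4.9, "utt4": 1.3}
```

With these (made-up) depths, `utt2` is dropped and the rest are returned from lowest depth upward.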
Thus, using LATTICE DEPTH as the accuracy ranking index for audio decoding results requires no additional operations or parameter handling: no acoustic-scale scaling factor, no LM rescoring, no confidence score computation, no coarse sentence-level confidence, and no interference from different decoding strategies. Unaffected by the acoustic-scale and LM rescore, it is more stable than indexes such as MBR, so accurately decoded audio can be picked out of the full set more reliably. Moreover, in speech recognition, audio with little noise that is easy to recognize does not tend to produce many decoding paths during decoding; a small LATTICE DEPTH for a decoded recording therefore indicates high audio quality, i.e., recognizing audio A as language A is very accurate, the noise is small, and the model is unlikely to be negatively affected. In practice, the index LATTICE DEPTH discriminates recognized sentences very well. Further, obtaining the index consumes very little computing power, effectively avoiding redundant resource consumption without adding to the decoding burden.
Step S130: label the audio data corresponding to the high-accuracy decoding results with pseudo tags, and take the pseudo-tagged audio data as the selected training data.
In one embodiment, the unlabeled audio data corresponding to a high-accuracy decoding result is obtained, and the audio data is pseudo-tagged with the text label contained in the decoding result output by speech recognition.
In one embodiment, the pseudo-tagged audio data is used as the selected training data. For example, the unlabeled audio data corresponding to all high-accuracy decoding results is screened out, pseudo-tagged, added to the training set of the speech recognition model, and the speech recognition model is retrained.
Specifically, after the audio is decoded by the speech recognition model, a decoding result is output, for example the recognized text data. The text label/text data of a decoding result judged accurate directly by the index LATTICE DEPTH is taken as the pseudo label of the corresponding audio; that is, the model's recognition result becomes the pseudo label of the high-quality audio. The reflowed audio data (i.e. the audio marked with pseudo labels) is then added to the labeled data as training data in the training set, and the model is retrained. In this way, LATTICE DEPTH-based screening of ASR semi-supervised learning reflow corpora selects suitable training data and continuously supplies large amounts of data for improving the recognition performance, effect and quality of the model.
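The reflow step can be sketched as follows; `decoded`, the pair layout and the example texts are hypothetical stand-ins for the recognizer's output, not an API from the patent:

```python
def reflow_pseudo_labels(decoded, selected):
    """Attach the decoder's text hypothesis to each selected utterance as a
    pseudo label, producing (audio_id, label) pairs ready for the training set."""
    return [(utt, decoded[utt]) for utt in selected]

labeled_set = [("utt0", "manually labeled text")]      # small seed set
decoded = {"utt1": "good morning", "utt2": "see you"}  # model hypotheses
labeled_set += reflow_pseudo_labels(decoded, ["utt2"]) # only high-quality audio
# labeled_set now also contains ("utt2", "see you"); the model is retrained on it
```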
The pseudo label is thus obtained directly and the scheme is simple to implement, while the screened training data still guarantees sufficiently high transcription accuracy and low noise. In addition, the data selected by LATTICE DEPTH can also be used to enhance model training in scenarios such as speech synthesis and noise elimination. More broadly, the index LATTICE DEPTH indicates how ambiguous pattern recognition is, i.e. how cleanly the data can be classified, and in other artificial intelligence fields (such as image classification in computer vision) similar indices can be designed along the same lines to screen high-quality data for training and enhancing existing models, so the index may find wider application in the future.
[ Example 2]
For the purposes of promoting an understanding of the principles and technical aspects of the invention, reference will now be made in detail to the embodiments, examples of which are illustrated in the accompanying drawings.
The main structural block diagram of one embodiment of the system of the present invention is described here with reference to fig. 2. The system is likewise based mainly on a semi-supervised learning mode: with a speech recognition model trained on a small number of manually labeled tags, or a speech recognition model initialized with text labels, a plurality of unlabeled audio data in a data set is input directly into the model for speech recognition and a plurality of corresponding decoding results is output; when one or more high-accuracy decoding results are screened out, the unlabeled audio data corresponding to those decoding results is labeled directly with the text labels in the decoding results. This embodiment mainly includes:
The training recognition module B210 trains a model of speech recognition based on semi-supervised learning, and performs speech recognition on the audio data.
In one embodiment, the pseudo-tag accuracy of the training data is used in semi-supervised model training for speech recognition. In particular, a speech recognition model is constructed, including an acoustic model and a language model, for example a speech recognition tool implemented with the Kaldi toolkit.
Further, speech recognition models in different application scenarios often require training data from the specific field to achieve the best results. For example, in bullet-screen speech recognition and quality-inspection speech transcription tasks, the models must be enhanced with data reflowed from the respective scenario to adapt to the requirements of each scene. In one embodiment of training the speech recognition model with semi-supervised learning, a very small amount of audio data with manually labeled text labels is first used to train the speech recognition model; the model then outputs the corresponding text label with each decoding/recognition result, so the text label in the result can be attached directly to the audio data of a screened high-accuracy decoding result, which is reflowed as training data to retrain the model. Alternatively, text labels can be set for the speech recognition model in advance, for example according to the characteristics of different practical application scenarios such as speech bullet-screen recognition, quality-inspection speech transcription and spoken search-question recognition; the unlabeled audio data is input directly into the model, its accuracy is evaluated, the data is screened, and the corresponding text labels predicted in the decoding results are used for labeling.
Thus, when the labeling data used by the speech recognition model is insufficient, as in intelligent speech recognition for an AI teaching system in a K12 scenario, the existing transcription data (audio data with corresponding text labels) can be screened effectively for semi-supervised training, yielding training data that improves the recognition performance, effect and quality of the model. To avoid the influence of noise, the existing transcription data must also meet certain quality requirements before it can be used to expand the training data: high-quality data must be screened out so that the transcription accuracy of the screened data is high enough and its noise small enough, and training with this data genuinely helps improve the performance and effect of the model.
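One reflow round of the workflow described above could be sketched as below; `decode` and `train` are caller-supplied stand-ins for the real Kaldi decoding and training pipeline, not functions defined in the patent:

```python
def semi_supervised_round(decode, train, labeled, unlabeled, threshold):
    """decode(utt) -> (hypothesis_text, lattice_depth); train(pairs) retrains
    the model on the enlarged labeled set and returns it."""
    kept = []
    for utt in unlabeled:
        text, depth = decode(utt)
        if depth < threshold:          # small depth => confident decoding
            kept.append((utt, text))   # pseudo-label with the decoded text
    return train(labeled + kept)
```

Repeating such rounds as new unlabeled audio reflows from production is what keeps supplying the model with fresh training data.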
The index module B220 uses, as the index for judging accuracy, the mean number of nodes that each node of each decoding path is expected to link to, taken over all decoding paths stored during decoding in speech recognition.
In one embodiment, the speech recognition model of the Kaldi toolkit performs speech recognition on the input unlabeled audio data, carries out beam search decoding according to a preset beam width (beam size), and stores all decoding paths found to be possible during decoding.

Further, storing all possible decoding paths found during decoding specifically includes: searching decoding paths in the WFST state network during decoding to obtain a decoding graph, storing all decoding paths of the decoding graph in a lattice, and outputting the optimal decoding path as the decoding result corresponding to the audio data.

Further, using the mean number of nodes each node of each decoding path is expected to link to, over all decoding paths, as the index for judging accuracy specifically includes: obtaining, from all decoding paths stored in the lattice, the number of paths each node can link to backward in each decoding path; calculating the mean value and using it as the index LATTICE DEPTH for judging accuracy; and judging accuracy by sorting and comparing the plurality of indices LATTICE DEPTH obtained when the plurality of audio data in the data set is decoded by speech recognition.

Further, judging accuracy by sorting the indices LATTICE DEPTH corresponding to the respective audio data and comparing them with a preset threshold specifically includes: if an index LATTICE DEPTH ranks below the preset threshold, the accuracy of the decoding result of the corresponding audio data is higher.
Specifically, a speech recognition tool implemented with the Kaldi toolkit, such as an HMM model, constructs a WFST state network, i.e. a state network based on FSTs and lattices. Decoding searches this state network for the path that best matches the sound to be recognized; for example, a beam search (a search for the globally optimal path) can be performed with a given beam width during decoding, producing a decoding graph with many possible decoding paths, from which the optimal, best-matching path is taken as the decoding result. All possible decoding paths can be stored in a lattice at this point, which is equivalent to storing the decoding graph. The optimal path is the decoding result, and the model's decoding result, i.e. the recognition result of the audio, can then be output.

The mean number of nodes each node in the decoding paths may link to is taken as the sorting or judging index LATTICE DEPTH for evaluating pseudo-tag accuracy. Computing the mean only averages the number of possibly linked nodes or possibly expanded paths; the algorithm is simple, efficient and light on computational resources, and the value is obtained directly during decoding without any processing that would introduce interference or uncertainty.
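Treating the stored lattice as a successor mapping (an illustrative layout, not Kaldi's actual lattice format), the index could be computed as:

```python
def lattice_depth(successors):
    """Mean number of nodes each lattice node can link to, over all stored
    decoding paths. `successors` maps node -> list of reachable next nodes."""
    counts = [len(nexts) for nexts in successors.values()]
    return sum(counts) / len(counts)

# A tiny lattice with a single branch point (A splits to B and C):
print(lattice_depth({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}))  # 1.0
```

Note the computation is a single pass over the stored arcs, consistent with the text's claim that the index adds almost no decoding overhead.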
When performing recognition and training based on semi-supervised learning of a speech recognition model, a data set of unlabeled audio is fed into the model for decoding, yielding a decoding result and an index LATTICE DEPTH for each audio input. The indices LATTICE DEPTH are ranked from high to low or from low to high, preferably low to high, and a suitable threshold is preset, for example a threshold of 5 in a bullet-screen speech recognition scenario. All audio whose LATTICE DEPTH value is below the threshold of 5 is then recovered; this audio is high-quality audio, and the decoding results whose index LATTICE DEPTH is below the threshold of 5 are the high-accuracy decoding results.
Using LATTICE DEPTH as the accuracy-ranking index of an audio decoding result therefore requires no additional operation steps or parameter processing: no acoustic-scale factor, no LM rescoring, no (coarse, sentence-level) confidence computation, and no interference from particular decoding strategies. The index is unaffected by the acoustic scale and LM rescoring, and it is more stable than indices such as MBR, so the accurately decoded portion of the audio can be picked out of the whole set more reliably. Moreover, in speech recognition, audio with little noise that is easy to recognize tends not to spawn many decoding paths during decoding, so a small LATTICE DEPTH for the decoded audio indicates high-quality audio: audio A is recognized as utterance A very accurately with little noise, and such data is unlikely to affect the model negatively. When the index LATTICE DEPTH is inspected in practice, it discriminates well between recognized sentences. Finally, computing LATTICE DEPTH consumes very little computational power, so it effectively avoids redundant resource consumption and adds no extra burden to the computer's decoding.
The screening data module B230 labels the audio data corresponding to the high-accuracy decoding results with pseudo tags and takes the pseudo-tagged audio data as the selected training data.
In one embodiment, the unlabeled audio data corresponding to a high-accuracy decoding result is obtained, and the audio data is pseudo-tagged with the text label contained in the decoding result output by speech recognition.

In one embodiment, the pseudo-tagged audio data is used as the selected training data. For example, the unlabeled audio data corresponding to all high-accuracy decoding results is screened out, pseudo-tagged, added to the training set of the speech recognition model, and the speech recognition model is retrained.
Specifically, after the audio is decoded by the speech recognition model, a decoding result is output, for example the recognized text data. The text label/text data of a decoding result judged accurate directly by the index LATTICE DEPTH is taken as the pseudo label of the corresponding audio; that is, the model's recognition result becomes the pseudo label of the high-quality audio. The reflowed audio data (i.e. the audio marked with pseudo labels) is then added to the labeled data as training data in the training set, and the model is retrained. In this way, LATTICE DEPTH-based screening of ASR semi-supervised learning reflow corpora selects suitable training data and continuously supplies large amounts of data for improving the recognition performance, effect and quality of the model.

The pseudo label is thus obtained directly and the scheme is simple to implement, while the screened training data still guarantees sufficiently high transcription accuracy and low noise. In addition, the data selected by LATTICE DEPTH can also be used to enhance model training in scenarios such as speech synthesis and noise elimination. More broadly, the index LATTICE DEPTH indicates how ambiguous pattern recognition is, i.e. how cleanly the data can be classified, and in other artificial intelligence fields (such as image classification in computer vision) similar indices can be designed along the same lines to screen high-quality data for training and enhancing existing models, so the index may find wider application in the future.
[ Example 3]
The implementation of the present invention is further described below with reference to embodiments 1 and 2 in an overall application scenario:

Speech recognition is implemented with the Kaldi toolkit. The speech recognition model configured in Kaldi is trained based on semi-supervised learning. Training can proceed directly on unlabeled audio, combined with autonomous labeling, screening of the training data (audio) and retraining, so that labeled training data is obtained in a simple, feasible way with high efficiency, low cost and low resource consumption, and the approach extends to models in a variety of scenarios to improve model performance and enhance the recognition effect and quality of the model.
The speech recognition model of the Kaldi toolkit can use an acoustic model such as an HMM together with a language model to recognize input audio: the input audio is split into frames, states and state combinations are recognized, phonemes are formed, and phonemes are composed into words. Kaldi's HMM supports FSTs and lattices, so when decoding the recognized states with the HMM, a WFST state network is built from FSTs and, with a given beam width (beam size), a beam search finds the best path in the WFST, i.e. the decoding path that best matches the sound. During the search, all possible decoding paths are stored in a lattice (see the explanation of the lattice/FST relationship above). The number of nodes each node in the decoding paths may link to, or the number of paths that may be expanded, is obtained and its mean is calculated as the accuracy index LATTICE DEPTH (lattice depth) of the decoding result. For example, node A links 1 node in path 1, node B links 2 nodes and node C links 1 node; node A links 1 node in path 2, node C links 2 nodes and node D links 1 node; the mean is 8/4 = 2, i.e. the index value is 2 (this is a conceptual illustration only, not an actual calculation).
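The conceptual example above can be checked numerically; the per-path link counts below are copied from the text, and the totals are pooled over the four distinct nodes A, B, C, D:

```python
from collections import defaultdict

paths = [{"A": 1, "B": 2, "C": 1},   # path 1: A links 1 node, B links 2, C links 1
         {"A": 1, "C": 2, "D": 1}]   # path 2: A links 1 node, C links 2, D links 1

totals = defaultdict(int)            # total links per distinct node
for path in paths:
    for node, n_links in path.items():
        totals[node] += n_links

depth = sum(totals.values()) / len(totals)  # 8 links over 4 distinct nodes
print(depth)  # 2.0
```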
The smaller the value of the index, the higher the accuracy and the higher the quality of the corresponding audio as training data; that is, the recognition accuracy of the audio better meets the actual requirements of the current application scenario.
The LATTICE DEPTH indices of the decoding results of the input unlabeled audios are ranked from small to large and compared with a preset threshold, for example a threshold of 5 for bullet-screen speech recognition. The threshold setting likewise follows the requirements of the practical application field, so the same index can be used in many different scenarios that need speech recognition.
Suppose the LATTICE DEPTH index values for the three audio inputs A, B and C are 6, 5.5 and 4 respectively; among 4, 5.5 and 6, only the index of audio C is below the threshold of 5. The text label in the decoding result of C can therefore be assigned directly to C as its label: C is screened out, the labeled data is added to the training data set, and when the semi-supervised model (i.e. the speech recognition model configured in Kaldi) is trained, training continues with the new training data C.
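The A/B/C walkthrough reduces to a few lines (the index values and threshold are the ones from the example):

```python
depths = {"A": 6.0, "B": 5.5, "C": 4.0}   # LATTICE DEPTH values from the example
threshold = 5.0                            # bullet-screen scenario threshold
selected = [utt for utt, d in sorted(depths.items(), key=lambda kv: kv[1])
            if d < threshold]
print(selected)  # ['C'] -- only audio C is reflowed into the training set
```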
[ Example 4]
Fig. 3 is a structural block diagram of an electronic device according to an embodiment of the invention, comprising a processor and a memory for storing a computer executable program; when the computer program is executed by the processor, the processor performs the steps of the method described in the preceding embodiments 1 and 3.
As shown in fig. 3, the electronic apparatus is in the form of a general purpose computing device. The processor may be one or a plurality of processors and work cooperatively. The invention does not exclude that the distributed processing is performed, i.e. the processor may be distributed among different physical devices. The electronic device of the present invention is not limited to a single entity, and may be a sum of a plurality of entity devices.
The memory stores a computer executable program, typically machine readable code. The computer readable program may be executable by the processor to enable an electronic device to perform the method of the present invention, or at least some of the steps of the method.
The memory includes volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may be non-volatile memory, such as Read Only Memory (ROM).
Optionally, in this embodiment, the electronic device further includes an I/O interface, which is used for exchanging data between the electronic device and an external device. The I/O interface may be a bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
More specifically, referring to a block diagram of a more specific example of the electronic apparatus according to the embodiment shown in fig. 4. The electronic apparatus 200 of this exemplary embodiment is in the form of a general-purpose data processing device. The components of the electronic device 200 may include, but are not limited to, at least one processing unit 210, at least one memory unit 220, a bus 230 connecting the different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
The storage unit 220 stores therein a computer readable program, which may be a source program or code of a program that is read only. The program may be executed by the processing unit 210 such that the processing unit 210 performs the steps of various embodiments of the present invention. For example, the processing unit 210 may perform the respective steps of the methods of the foregoing embodiments 2 to 5.
The memory unit 220 may include readable media in the form of volatile memory units, such as Random Access Memory (RAM) 2201 and/or cache memory 2202, and may further include Read Only Memory (ROM) 2203. The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 230 may be a bus representing one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic apparatus 200 may also be in communication with one or more external devices 300 (e.g., a keyboard, a display, a network device, a bluetooth device, etc.), such that a user can interact with the electronic apparatus 200 via the external devices 300, and/or such that the electronic apparatus 200 can communicate with one or more other data processing devices (e.g., a router, a modem, etc.). Such communication may occur through an input/output (I/O) interface 250, and may also occur through a network adapter 260 to one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet. Network adapter 260 may communicate with other modules of electronic device 200 via bus 230. It should be appreciated that although not shown, other hardware and/or software modules may be used in the electronic apparatus 200, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be understood that the electronic device shown in fig. 3 and 4 is only one example of the present invention, and the electronic device of the present invention may further include elements or components not shown in the above examples. For example, some electronic devices further include a display unit such as a display screen, and some electronic devices further include a man-machine interaction element such as a button, a keyboard, and the like. The electronic device may be considered as covered by the invention as long as the electronic device is capable of executing a computer readable program in a memory for carrying out the method or at least part of the steps of the method.
[ Example 5]
Fig. 5 is a schematic diagram of a computer-readable recording medium of an embodiment of the present invention. As shown in fig. 5, the computer-readable recording medium stores a computer-executable program that, when executed, implements the training data screening method described above. The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A readable storage medium may also be any readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
From the above description of embodiments, those skilled in the art will readily appreciate that the present invention may be implemented by hardware capable of executing a specific computer program, such as the system of the present invention, as well as the electronic processing units, servers, clients, handsets, control units, processors, etc. included in the system, or by a device comprising at least a portion of the above system or components. The invention may also be implemented by computer software executing the method of the invention, for example by control software executed by a microprocessor, an electronic control unit, a client, a server, etc. It should be noted that the computer software for performing the method of the present invention is not limited to execution by one specific hardware entity but may be implemented in a distributed manner by unspecified hardware; for example, some method steps executed by the computer program may run on a server while another part runs on a mobile terminal or another device. For computer software, the software product may be stored on a computer readable storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or distributed over a network, as long as it enables the electronic device to perform the method according to the invention.
From the above description of embodiments, those skilled in the art will readily appreciate that the exemplary embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware.
The foregoing description of the specific embodiments provides further details of the objects, aspects and advantages of the present invention, and it should be understood that the present invention is not inherently related to any particular computer, virtual device or electronic device, and that various general purpose devices may also implement the present invention. The foregoing description of the embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

The method comprises: obtaining, from all decoding paths stored in a lattice, the number of paths each node in each decoding path can link to backward; calculating the mean value and using it as the index LATTICE DEPTH for judging accuracy; and judging accuracy by sorting and comparing a plurality of indices LATTICE DEPTH obtained when a plurality of audio data in a data set is decoded by speech recognition, wherein sorting the indices LATTICE DEPTH corresponding to the respective audio data and comparing them with a preset threshold specifically comprises: if an index LATTICE DEPTH ranks below the preset threshold, the accuracy of the decoding result of the corresponding audio data is higher;
CN202111109966.5A | 2021-09-17 | 2021-09-17 | Training data screening method, system, device and medium | Active | CN113901992B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111109966.5A | CN113901992B (en) | 2021-09-17 | 2021-09-17 | Training data screening method, system, device and medium


Publications (2)

Publication Number | Publication Date
CN113901992A (en) | 2022-01-07
CN113901992B (en) | 2025-02-25

Family

ID=79028780

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111109966.5A | Active | CN113901992B (en) | 2021-09-17 | 2021-09-17 | Training data screening method, system, device and medium

Country Status (1)

Country | Link
CN (1) | CN113901992B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114399995B (en)* | 2022-01-20 | 2025-07-15 | 腾讯科技(深圳)有限公司 | Speech model training method, device, equipment and computer-readable storage medium
CN116705001B (en)* | 2023-05-04 | 2024-06-14 | 内蒙古工业大学 | Mongolian voice data selection method and system
CN118447828B (en)* | 2024-07-08 | 2024-09-20 | 上海弋途科技有限公司 | Vehicle-mounted human-computer interaction model optimization method and system based on voice data reflow
CN118711573B (en)* | 2024-07-19 | 2025-08-29 | 摩尔线程智能科技(北京)股份有限公司 | Speech recognition model training method, speech recognition method, device and storage medium
CN119993196B (en)* | 2025-02-11 | 2025-07-04 | 北京云上曲率科技有限公司 | Voice training data acquisition method, device, equipment and medium

Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
CN108831439A (en)* | 2018-06-27 | 2018-11-16 | 广州视源电子科技股份有限公司 | Voice recognition method, device, equipment and system
CN111883110A (en)* | 2020-07-30 | 2020-11-03 | 上海携旅信息技术有限公司 | Acoustic model training method, system, device and medium for speech recognition

Family Cites Families (3)

Publication number | Priority date | Publication date | Assignee | Title
CN112802461B (en)* | 2020-12-30 | 2023-10-24 | 深圳追一科技有限公司 | Speech recognition method and device, server and computer readable storage medium
CN113066480B (en)* | 2021-03-26 | 2023-02-17 | 北京达佳互联信息技术有限公司 | Voice recognition method and device, electronic equipment and storage medium
CN113327597B (en)* | 2021-06-23 | 2023-08-22 | 网易(杭州)网络有限公司 | Speech recognition method, medium, device and computing equipment



Similar Documents

Publication | Publication Date | Title
CN113901992B (en) Training data screening method, system, device and medium
CN108170749B (en) Dialog method, device and computer readable medium based on artificial intelligence
CN108763510B (en) Intention recognition method, device, equipment and storage medium
CN1667699B (en) Generating large units of graphonemes with mutual information criterion for letter to sound conversion
JP5901001B1 (en) Method and device for acoustic language model training
CN108710704B (en) Method, device, electronic device and storage medium for determining dialog state
CN108052499B (en) Text error correction method and device based on artificial intelligence and computer readable medium
CN112711948A (en) Named entity recognition method and device for Chinese sentences
CN112883193A (en) Training method, device and equipment of text classification model and readable medium
CN1571013A (en) Method and device for predicting word error rate from text
CN114091595B (en) Sample processing method, device and computer readable storage medium
CN115293139A (en) Training method of voice transcription text error correction model and computer equipment
CN110751234B (en) OCR error correction method, device and equipment
CN113191133B (en) A Doc2Vec-based audio-text alignment method and system
CN115274086B (en) Intelligent diagnosis guiding method and system
CN112908359A (en) Voice evaluation method and device, electronic equipment and computer readable medium
CN115563959A (en) Self-supervised pre-training method, system and medium for Chinese Pinyin spelling error correction
JP2006113570A (en) Hidden conditional random field model for phonetic classification and speech recognition
CN113378569A (en) Model generation method, entity identification method, model generation device, entity identification device, electronic equipment and storage medium
Luo et al. Loss prediction: End-to-end active learning approach for speech recognition
CN118378623A (en) A global vision-guided image description generation method based on a cross-modal large model
CN116561592A (en) Training method of text emotion recognition model, text emotion recognition method and device
CN114091449A (en) A Chinese word segmentation method and Chinese word segmentation device in the medical field
CN119312819A (en) A method, device and storage medium for translating entries
CN113627563A (en) Label labeling method, device and medium

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
TA01 | Transfer of patent application right

Effective date of registration: 2023-07-10

Address after: 6001, 6th Floor, No. 1 Kaifeng Road, Shangdi Information Industry Base, Haidian District, Beijing, 100085

Applicant after: Beijing Baige Feichi Technology Co., Ltd.

Address before: 100085 4002, 4th floor, No. 1 Kaifa Road, Shangdi Information Industry Base, Haidian District, Beijing

Applicant before: ZUOYEBANG EDUCATION TECHNOLOGY (BEIJING) CO., LTD.

GR01 | Patent grant
