CN110490237B

Info

Publication number: CN110490237B
Application number: CN201910713784.5A
Authority: CN
Inventors: 罗彤
Original assignee: Shanghai Jinsheng Communication Technology Co ltd; Guangdong Oppo Mobile Telecommunications Corp Ltd
Current assignee: Shanghai Jinsheng Communication Technology Co ltd; Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date: 2019-08-02
Filing date: 2019-08-02
Publication date: 2022-05-17
Anticipated expiration: 2039-08-02
Also published as: CN110490237A

Abstract

The application discloses a data processing method, a data processing device, a storage medium and an electronic device. The method comprises: acquiring a plurality of data items that carry the same category label; dividing the plurality of data items into a first data set and a second data set; extracting features of each data item in the first data set and the second data set; acquiring correctness information of the category label of each data item in the second data set; training a preset binary classification model according to the correctness information of the category label of each data item in the second data set and the features of each data item, to obtain a target model; using the target model and the features of each data item in the first data set to obtain first target data, namely the data in the first data set whose category labels are judged to be correct; and obtaining second target data from the first target data and the data with correct category labels in the second data set. The method can improve the efficiency of data cleaning.

Description

Translated from Chinese

Data processing method, device, storage medium and electronic device

Technical Field

The present application belongs to the field of data technology, and in particular relates to a data processing method, device, storage medium and electronic device.

Background

Data cleaning is the process of re-examining and verifying data in order to remove erroneous information from a data set. For classified pictures, for example, data cleaning mainly consists of checking whether the classification label of each picture is correct and deleting the pictures whose labels are wrong. In the related art, however, data cleaning is inefficient.

Summary of the Invention

Embodiments of the present application provide a data processing method, device, storage medium and electronic device that can improve the efficiency of data cleaning.

An embodiment of the present application provides a data processing method, including:

acquiring a plurality of data items, the plurality of data items carrying the same category label;

dividing the plurality of data items into a first data set and a second data set;

extracting features of each data item in the first data set and the second data set;

acquiring correctness information of the category label of each data item in the second data set;

training a preset binary classification model according to the correctness information of the category label of each data item in the second data set and the features of each data item, to obtain a target model;

using the target model and the features of each data item in the first data set to obtain first target data, namely the data in the first data set whose category labels are judged to be correct;

obtaining second target data according to the first target data and the data with correct category labels in the second data set.

An embodiment of the present application provides a data processing device, including:

a first acquisition module, configured to acquire a plurality of data items, the plurality of data items carrying the same category label;

a dividing module, configured to divide the plurality of data items into a first data set and a second data set;

an extraction module, configured to extract features of each data item in the first data set and the second data set;

a second acquisition module, configured to acquire correctness information of the category label of each data item in the second data set;

a training module, configured to train a preset binary classification model according to the correctness information of the category label of each data item in the second data set and the features of each data item, to obtain a target model;

a third acquisition module, configured to use the target model and the features of each data item in the first data set to obtain first target data, namely the data in the first data set whose category labels are judged to be correct;

a processing module, configured to obtain second target data according to the first target data and the data with correct category labels in the second data set.

An embodiment of the present application provides a storage medium on which a computer program is stored. When the computer program is executed on a computer, it causes the computer to execute the process of the data processing method provided by the embodiments of the present application.

An embodiment of the present application further provides an electronic device including a memory and a processor, where the processor executes the process of the data processing method provided by the embodiments of the present application by invoking a computer program stored in the memory.

In the embodiments of the present application, the electronic device can use a trained binary classification model to perform data cleaning. Because the trained binary classification model can quickly determine which data items have correct category labels, clean data can be obtained quickly. Compared with the data cleaning approach in the related art, in which a person browses the data one item at a time to check whether the label information is wrong, the embodiments can improve the efficiency of data cleaning.

Description of Drawings

The technical solutions of the present application and their beneficial effects will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawings.

FIG. 1 is a first schematic flowchart of the data processing method provided by an embodiment of the present application.

FIG. 2 is a second schematic flowchart of the data processing method provided by an embodiment of the present application.

FIG. 3 is a third schematic flowchart of the data processing method provided by an embodiment of the present application.

FIG. 4 is a schematic structural diagram of the fourth model provided by an embodiment of the present application.

FIG. 5 to FIG. 7 are schematic diagrams of scenarios of the data processing method provided by an embodiment of the present application.

FIG. 8 is a fourth schematic flowchart of the data processing method provided by an embodiment of the present application.

FIG. 9 is a schematic structural diagram of the data processing device provided by an embodiment of the present application.

FIG. 10 is a schematic structural diagram of the electronic device provided by an embodiment of the present application.

FIG. 11 is another schematic structural diagram of the electronic device provided by an embodiment of the present application.

Detailed Description

Referring to the drawings, in which the same reference symbols denote the same components, the principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be construed as limiting other specific embodiments not detailed herein.

It can be understood that the embodiments of the present application may be executed by an electronic device such as a smartphone, a tablet computer, a desktop computer or a server.

Referring to FIG. 1 and FIG. 2, FIG. 1 is a first schematic flowchart and FIG. 2 is a second schematic flowchart of the data processing method provided by an embodiment of the present application. The process may include:

101. Acquire a plurality of data items, where the plurality of data items carry the same category label.

Data cleaning is the process of re-examining and verifying data in order to remove erroneous information from a data set. Taking classified pictures as an example, data cleaning in the related art is mainly performed by manual inspection: a person checks whether the classification label of each picture is correct and deletes the pictures whose labels are wrong. Data cleaning in the related art is therefore inefficient.

In this embodiment of the present application, the electronic device may first acquire a plurality of data items that carry the same category label. These data items are the data that need to be cleaned. For example, the electronic device may acquire a data set that needs to be cleaned.

For example, the data to be cleaned may be a picture set whose pictures all carry the same category label, such as a flower category.

102. Divide the plurality of data items into a first data set and a second data set.

For example, after acquiring the data that needs to be cleaned, the electronic device may divide the data into a first data set and a second data set.

For example, if the data to be cleaned is 1000 pictures, the electronic device may divide these 1000 pictures into a first data set (e.g., a first picture set) and a second data set (e.g., a second picture set).

103. Extract features of each data item in the first data set and the second data set.

For example, after obtaining the first data set and the second data set, the electronic device may extract the features of each data item in the first data set and the features of each data item in the second data set.

For example, the electronic device may extract the features of each picture in the first data set and the features of each picture in the second data set.

104. Acquire correctness information of the category label of each data item in the second data set.

For example, after obtaining the second data set, the electronic device may acquire information on whether the category label of each data item in the second data set is correct. For instance, the electronic device may acquire information on whether the category label of each picture in the second data set is correct.

For example, if the second data set contains 200 pictures, manual inspection can first be used to determine whether the category label of each of these 200 pictures is correct. If the category label of a picture is correct, the inspector can use the electronic device to mark the picture with information indicating that the label is correct, such as the digit "1" or the letter "T". If the category label of a picture is wrong, the inspector can use the electronic device to mark the picture with information indicating that the label is wrong, such as the digit "0" or the letter "F". In this way, the electronic device obtains the correctness information of the category labels of the 200 pictures.

105. Train a preset binary classification model according to the correctness information of the category label of each data item in the second data set and the features of each data item, to obtain a target model.

For example, after acquiring the correctness information of the category label of each data item in the second data set, the electronic device may train the preset binary classification model on that correctness information together with the features of each data item in the second data set, thereby obtaining a trained model, namely the target model.

For example, after acquiring the correctness information of the category labels of the 200 pictures in the second data set, the electronic device may use that information and the features of the 200 pictures as input data for training the preset binary classification model, thereby obtaining the target model.

In one implementation, for example, if picture P_i is a picture in the second data set, the feature f_i of picture P_i and the information b_i on whether the category label of picture P_i is correct can be expressed in the form <f_i, b_i>, and <f_i, b_i> can then serve as one training sample for the binary classification model.
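The <f_i, b_i> sample form can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionaries keyed by picture id, the function names `parse_mark` and `build_samples`, and the encoding of b_i as 1 or 0 are all assumptions made for the example; only the marks "1"/"T" and "0"/"F" come from the text.

```python
from typing import Dict, List, Tuple

# Marks an inspector might enter, following the "1"/"T" and "0"/"F" examples in the text.
CORRECT_MARKS = {"1", "T"}
WRONG_MARKS = {"0", "F"}


def parse_mark(mark: str) -> int:
    """Convert an inspector's mark into b_i: 1 if the label is correct, 0 if it is wrong."""
    value = mark.strip().upper()
    if value in CORRECT_MARKS:
        return 1
    if value in WRONG_MARKS:
        return 0
    raise ValueError(f"unrecognized mark: {mark!r}")


def build_samples(
    features: Dict[str, List[float]],  # hypothetical: picture id -> feature vector f_i
    marks: Dict[str, str],             # hypothetical: picture id -> inspector mark
) -> List[Tuple[List[float], int]]:
    """Pair each feature vector f_i with its correctness flag b_i as <f_i, b_i>."""
    return [(features[pid], parse_mark(marks[pid])) for pid in sorted(marks)]


samples = build_samples(
    {"P1": [0.2, 0.9], "P2": [0.8, 0.1]},
    {"P1": "T", "P2": "0"},
)
```

Rejecting unknown marks with an error, rather than silently treating them as wrong, keeps mislabeled inspection results out of the training samples.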

It can be understood that the trained target model is a model that, given the features of a picture, outputs information on whether the category label of that picture is correct.

106. Using the target model and the features of each data item in the first data set, obtain first target data, namely the data in the first data set whose category labels are judged to be correct.

For example, after obtaining the target model, the electronic device may use the target model and the features of each data item in the first data set to obtain the data in the first data set whose category labels are judged to be correct, namely the first target data.

For example, suppose the first data set contains 800 pictures. After obtaining the target model, the electronic device may input the features of each of these 800 pictures into the target model, which outputs, based on those features, information on whether the category label of each picture is correct. The pictures whose category labels are judged to be correct are determined as the first target data. It can be understood that the first target data may include multiple pictures.

107. Obtain second target data according to the first target data and the data with correct category labels in the second data set.

For example, after obtaining the first target data, the electronic device may also obtain the data with correct category labels in the second data set, and merge it with the first target data to obtain the second target data. It can be understood that the second target data is the clean data obtained after data cleaning.

For example, in 104 the electronic device acquires the correctness information of the category labels of the 200 pictures in the second data set, and 190 of these 200 pictures have correct category labels. In 106, the electronic device uses the target model to determine that 790 of the 800 pictures in the first data set have correct category labels. The electronic device can then merge the 190 pictures from the second data set with the 790 pictures from the first data set to obtain 980 pictures. These 980 pictures can be regarded as the clean data obtained after data cleaning.
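The filtering in 106 and the merge in 107 can be sketched as follows. This is an illustrative sketch under stated assumptions: `clean_dataset`, the dictionaries keyed by picture id, and the `is_label_correct` callable (a stand-in for the trained target model) are hypothetical names, and the toy one-dimensional features exist only to reproduce the 790 + 190 = 980 count from the example.

```python
from typing import Callable, Dict, List, Sequence


def clean_dataset(
    first_features: Dict[str, Sequence[float]],            # picture id -> features (first data set)
    second_marks: Dict[str, int],                          # picture id -> 1 (correct) or 0 (wrong)
    is_label_correct: Callable[[Sequence[float]], bool],   # stand-in for the target model
) -> List[str]:
    """Return the ids of the second target data (the clean data)."""
    # 106: keep first-set items whose labels the model judges to be correct.
    first_target = [pid for pid, f in first_features.items() if is_label_correct(f)]
    # 107: add second-set items already verified as correct, then merge.
    second_correct = [pid for pid, b in second_marks.items() if b == 1]
    return first_target + second_correct


# Toy data: 10 of 800 first-set pictures and 10 of 200 second-set pictures are mislabeled.
first_features = {f"A{i}": [0.0 if i < 10 else 1.0] for i in range(800)}
second_marks = {f"B{i}": 0 if i < 10 else 1 for i in range(200)}

clean = clean_dataset(first_features, second_marks, lambda f: f[0] > 0.5)
print(len(clean))  # 790 + 190 = 980
```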

It can be understood that, in this embodiment, the electronic device can use a trained binary classification model to perform data cleaning. Because the trained binary classification model can quickly determine which data items have correct category labels, clean data can be obtained quickly. Compared with the data cleaning approach in the related art, in which a person browses the data one item at a time to check whether the label information is wrong, this embodiment can improve the efficiency of data cleaning.

In the embodiments of the present application, a binary classification model is introduced into the cleaning of classified data. The known correctness information of the category labels in the second data set, together with the features of the data in the second data set, is used to train the binary classification model and obtain the target model. The target model is then used to output whether the category label of each data item in the first data set is correct. Finally, all data items with correct category labels are collected as the clean data.

Referring to FIG. 3, FIG. 3 is a third schematic flowchart of the data processing method provided by an embodiment of the present application. The process may include:

201. The electronic device acquires a plurality of data items, where the plurality of data items carry the same category label.

For example, the electronic device may acquire 1000 pictures that need to be cleaned. These 1000 pictures carry the same category label; for instance, they all carry the same manually annotated flower category label. The 1000 pictures are denoted P₁, P₂, P₃, ..., P₁₀₀₀.

202. The electronic device divides the plurality of data items into a first data set and a second data set.

For example, after acquiring the 1000 pictures, the electronic device may divide them into a first data set and a second data set. For instance, the electronic device may randomly select 800 of the 1000 pictures as the first data set and assign the remaining 200 pictures to the second data set.
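The random 800/200 division can be sketched as follows. This is a minimal sketch: the function name `split_dataset`, the fixed seed (used here only to make the split reproducible), and the picture identifiers are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T], first_size: int, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly assign first_size items to the first data set and the rest to the second."""
    if not 0 <= first_size <= len(items):
        raise ValueError("first_size must be between 0 and len(items)")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split can be repeated
    return shuffled[:first_size], shuffled[first_size:]


pictures = [f"P{i}" for i in range(1, 1001)]  # P1 ... P1000
first_set, second_set = split_dataset(pictures, first_size=800)
```

Because the second data set is the part that is inspected by hand, its size sets the manual labeling cost, while the first data set is the part checked by the model.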

After dividing the 1000 pictures into the first data set and the second data set, the electronic device may check whether its current computing capability is below a preset threshold.

If the current computing capability is below the preset threshold, the current computing capability of the electronic device can be considered weak. In this case, the process proceeds to 203.

If the current computing capability is not below the preset threshold, the current computing capability of the electronic device can be considered strong. In this case, the process proceeds to 204.

203. When the computing capability of the electronic device is below the preset threshold, the electronic device uses a preset feature extraction model to extract the features of each data item in the first data set and the second data set.

For example, if the electronic device detects that its current computing capability is below the preset threshold, it may obtain the preset feature extraction model and use it to extract the features of each picture in the first data set and the second data set.

In one implementation, the preset feature extraction model may be obtained as follows:

the electronic device obtains a first model, where the first model is a ResNet model trained on ImageNet;

the electronic device trains the ResNet model with the plurality of data items to obtain a second model;

the electronic device removes the fully connected layer that forms the last layer of the second model to obtain a third model, and determines the third model as the preset feature extraction model.

For example, when the plurality of data items are pictures, that is, when the data to be cleaned is pictures, the electronic device may first obtain the first model, which is a ResNet model trained on ImageNet.

It should be noted that the ImageNet project is a large visual database for research on visual object recognition software. More than 14 million image URLs have been manually annotated by ImageNet to indicate the objects in the pictures. Since 2010, the ImageNet project has held an annual software competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which software programs compete to correctly classify and detect objects and scenes.

ResNet (Residual Neural Network) successfully trained a 152-layer neural network by using ResNet units and won the ILSVRC 2015 competition. The ResNet structure greatly speeds up the training of neural networks and also considerably improves model accuracy.

In other words, ImageNet is an open, free, large picture database containing classified pictures in 22,000 categories, and ResNet is a picture classification model trained with data from ImageNet.

For example, after obtaining the ResNet model, the electronic device may first train the ResNet model with the pictures that need to be cleaned (such as the 1000 pictures above) to obtain the second model. The electronic device may then remove the fully connected layer that forms the last layer of the second model to obtain the third model, and determine the third model as the preset feature extraction model. It should be noted that the last layer of the ResNet model is a fully connected layer whose role is to classify pictures, while the other neural network layers of the ResNet model extract features. The layers that remain after the last fully connected layer of the second model is removed can therefore serve as a feature extraction model. The reason for training the ResNet model once more with the pictures to be cleaned is that ResNet is a fairly general classification model. Training it again with the pictures to be cleaned makes the second model's classification better targeted at those pictures, which in turn makes the third model's feature extraction for those pictures more accurate.
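The idea of turning a classifier into a feature extractor by dropping its final fully connected layer can be illustrated with a toy model. This sketch is not a ResNet and performs no training: the tiny hand-written layers, the weights, and the names `make_linear`, `relu`, and `run` are all assumptions chosen so the example runs with the standard library alone.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Sequence[float]], Vector]


def make_linear(weights: List[List[float]]) -> Layer:
    """Return a fully connected layer computing weights @ x (bias omitted for brevity)."""
    def layer(x: Sequence[float]) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return layer


def relu(x: Sequence[float]) -> Vector:
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]


def run(layers: Sequence[Layer], x: Sequence[float]) -> Vector:
    """Apply the layers in order to the input."""
    for layer in layers:
        x = layer(x)
    return list(x)


# Toy "second model": feature layers followed by a final fully connected classifier.
second_model: List[Layer] = [
    make_linear([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]),  # 2 inputs -> 3 features
    relu,
    make_linear([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]),      # final FC: 3 features -> 2 class scores
]

# "Third model": the same layers with the last fully connected layer removed.
third_model = second_model[:-1]

features = run(third_model, [2.0, 1.0])  # intermediate feature vector
scores = run(second_model, [2.0, 1.0])   # class scores from the full model
```

The third model reuses the second model's layers unchanged, so its output is exactly the representation the classifier head would have received.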

204. When the computing capability of the electronic device is not below the preset threshold, the electronic device obtains a fourth model and uses it to extract the features of each data item in the first data set and the second data set, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.

For example, if the electronic device detects that its current computing capability is not below the preset threshold, it may obtain the fourth model and use it to extract the features of each picture in the first data set and the second data set, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.

For example, compared with the ResNet model used in this embodiment, the fourth model may be a single model with a more complex structure, such as Inception-ResNet-v2. Alternatively, the fourth model may be a fusion (stacking) of multiple models. For example, the structure of the fourth model may be as shown in FIG. 4: the picture data is input to multiple first-level models (Level 1) at the same time, the features extracted by the first-level models are used as the input of the second-level model, and the output of the second-level model is used as the output feature. Model 1, Model 2 and Model 3 may be commonly used deep learning models, such as ResNet, Inception or MobileNet, while Model 4 may be a simpler traditional machine learning model, such as linear regression. Fusing multiple models combines their respective strengths and extracts features more effectively, which improves the subsequent cleaning, but it also consumes more resources and is therefore suited to cases where the electronic device has sufficient computing capability.
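The two-level data flow of the stacked fourth model can be sketched as follows. This is a structural sketch only: the three first-level "models" are trivial stand-in functions rather than deep networks, concatenating their outputs before the second level is an assumption (the text does not say how they are combined), and the linear weights of the second-level model are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Extractor = Callable[[Sequence[float]], Vector]


def stacked_features(
    level1: Sequence[Extractor],
    level2: Extractor,
    picture: Sequence[float],
) -> Vector:
    """Feed the picture to every level-1 model, then pass their combined output to level 2."""
    level1_output: Vector = []
    for model in level1:
        level1_output.extend(model(picture))  # assumption: outputs are concatenated
    return level2(level1_output)


# Trivial stand-ins for Model 1, Model 2 and Model 3 (deep models in the text).
def model_1(p: Sequence[float]) -> Vector:
    return [sum(p)]


def model_2(p: Sequence[float]) -> Vector:
    return [max(p)]


def model_3(p: Sequence[float]) -> Vector:
    return [min(p)]


# Stand-in for Model 4: a single linear combination with hypothetical weights.
def model_4(x: Sequence[float]) -> Vector:
    weights = [0.5, 0.25, 0.25]
    return [sum(w * v for w, v in zip(weights, x))]


output_feature = stacked_features([model_1, model_2, model_3], model_4, [1.0, 2.0, 3.0])
```

Every first-level model processes each picture, so the cost grows with the number of models, which matches the text's note that stacking suits devices with sufficient computing capability.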

In one implementation, the computing capability of the electronic device may be measured by, for example, the CPU usage, and/or the amount of free memory, and/or the ratio of free memory to total memory.
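The capability check that selects between 203 and 204 can be sketched as follows. The text names the metrics but not how they are combined or what the threshold is, so the `DeviceStatus` class, the rule that both conditions must hold, and the limits 0.5 and 0.3 are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    cpu_usage: float           # fraction of CPU in use, 0.0 to 1.0
    free_memory_bytes: int
    total_memory_bytes: int

    @property
    def free_memory_ratio(self) -> float:
        """Ratio of free memory to total memory."""
        return self.free_memory_bytes / self.total_memory_bytes


def has_sufficient_capability(
    status: DeviceStatus,
    max_cpu_usage: float = 0.5,          # hypothetical threshold
    min_free_memory_ratio: float = 0.3,  # hypothetical threshold
) -> bool:
    """Assumed rule: capability is sufficient only if both metrics pass."""
    return (
        status.cpu_usage <= max_cpu_usage
        and status.free_memory_ratio >= min_free_memory_ratio
    )


def choose_step(status: DeviceStatus) -> int:
    """Return 204 (fourth model) when capability is sufficient, otherwise 203 (preset model)."""
    return 204 if has_sufficient_capability(status) else 203


GIB = 1024 ** 3
busy = DeviceStatus(cpu_usage=0.9, free_memory_bytes=1 * GIB, total_memory_bytes=8 * GIB)
idle = DeviceStatus(cpu_usage=0.1, free_memory_bytes=6 * GIB, total_memory_bytes=8 * GIB)
```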

205. The electronic device acquires correctness information of the category label of each data item in the second data set.

For example, after dividing the data into the first data set and the second data set, the electronic device may acquire information on whether the category label of each picture in the second data set is correct.

206. According to the correctness information of the category label of each data item in the second data set and the features of each data item, the electronic device trains a preset binary classification model to obtain a target model.

For example, after acquiring the correctness information of the category label of each picture in the second data set, the electronic device may train the preset binary classification model on that correctness information together with the features of each picture in the second data set, thereby obtaining a trained model, namely the target model.

For example, the preset binary classification model may be a support vector machine (SVM). After acquiring the correctness information of the category labels of the 200 pictures in the second data set, the electronic device may use that information and the features of the 200 pictures as input data for training the preset SVM model, thereby obtaining the target model.

In one implementation, for example, if picture P_i is a picture in the second data set, the feature f_i of picture P_i and the information b_i on whether the category label of picture P_i is correct can be expressed in the form <f_i, b_i>, and <f_i, b_i> can then serve as one training sample for the SVM model.
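Training on <f_i, b_i> samples can be sketched with a small linear SVM. This is an illustration under stated assumptions, not the patent's implementation: it trains a linear SVM by stochastic subgradient descent on the L2-regularized hinge loss, the hyperparameters are hypothetical, and the two-dimensional features are toy values rather than real picture features.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]   # <f_i, b_i> with b_i in {1, 0}
Model = Tuple[List[float], float]      # (weights, bias)


def score(w: Sequence[float], bias: float, f: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, f)) + bias


def train_linear_svm(
    samples: Sequence[Sample],
    epochs: int = 200,           # hypothetical hyperparameters
    learning_rate: float = 0.1,
    reg: float = 0.01,
) -> Model:
    """Minimize the regularized hinge loss with per-sample subgradient steps."""
    w = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for f, b in samples:
            y = 1.0 if b == 1 else -1.0          # map b_i to the SVM labels +1 / -1
            margin = y * score(w, bias, f)
            # Regularization step: shrink the weights slightly.
            w = [wi * (1.0 - learning_rate * reg) for wi in w]
            if margin < 1.0:                     # sample violates the margin
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, f)]
                bias += learning_rate * y
    return w, bias


def predict(model: Model, f: Sequence[float]) -> int:
    """Return 1 if the category label is judged correct, otherwise 0."""
    w, bias = model
    return 1 if score(w, bias, f) >= 0.0 else 0


# Toy samples: label-correct pictures cluster at positive feature values.
training_samples: List[Sample] = [
    ([2.0, 2.0], 1), ([1.5, 2.5], 1), ([2.5, 1.5], 1),
    ([-2.0, -2.0], 0), ([-1.5, -2.5], 0), ([-2.5, -1.5], 0),
]
target_model = train_linear_svm(training_samples)
```

Calling `predict(target_model, features)` on each item of the first data set then yields the correct-or-wrong judgment used in 207.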

在一些实施方式中，预设的二分类模型还可以是诸如多层感知机(Multi-LayerPerception)、决策树(Decision Tree)或随机森林(Random Forest)等模型。In some embodiments, the preset binary classification model may also be a model such as a multi-layer perceptron (Multi-Layer Perception), a decision tree (Decision Tree) or a random forest (Random Forest).
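The training in step 206 can be illustrated with a small sketch. To stay free of third-party dependencies it swaps in a plain perceptron as a stand-in for the SVM; the names `train_perceptron` and `predict` and the 2-D feature vectors are hypothetical and are not taken from the application. Each training sample is a pair <f_i, b_i> as described above.

```python
def train_perceptron(samples, max_epochs=100):
    """Fit a linear binary classifier on (f_i, b_i) pairs.

    f_i is a feature vector; b_i is 1 if the category label is correct, 0 if wrong.
    Returns (weights, bias).
    """
    dim = len(samples[0][0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for f, b in samples:
            y = 1.0 if b == 1 else -1.0  # map {1, 0} to {+1, -1}
            score = sum(w * x for w, x in zip(weights, f)) + bias
            if y * score <= 0:  # misclassified or on the boundary: move the boundary
                weights = [w + y * x for w, x in zip(weights, f)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training sample is now classified correctly
            break
    return weights, bias


def predict(model, f):
    """Return 1 if the model judges the category label correct, else 0."""
    weights, bias = model
    score = sum(w * x for w, x in zip(weights, f)) + bias
    return 1 if score > 0 else 0


# Hypothetical 2-D features standing in for extracted picture features.
second_set = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-1.0, -2.0], 0), ([-2.0, -1.0], 0)]
target_model = train_perceptron(second_set)
```

In practice the classifier would more likely come from a machine learning library; the point here is only the shape of the data: features in, a correct-or-wrong verdict out.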

207. The electronic device uses the target model and the features of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose category labels are determined to be correct.

For example, after obtaining the target model, the electronic device may use the target model and the features of each picture in the first data set to obtain the pictures in the first data set whose category labels are determined to be correct, that is, the first target data.

208. The electronic device obtains second target data according to the first target data and the data in the second data set whose category labels are correct.

For example, after obtaining the first target data, the electronic device may also obtain the pictures in the second data set whose category labels are correct, and obtain the second target data from the first target data and those pictures. It can be understood that the second target data are the clean pictures obtained after data cleaning.

For example, in 104 the electronic device obtains the information on whether the category labels of the 200 pictures in the second data set are correct, and 190 of these 200 pictures have correct category labels. In 106, the electronic device uses the target model to determine that 790 of the 800 pictures in the first data set have correct category labels. The electronic device can then merge the 190 pictures from the second data set with the 790 pictures from the first data set to obtain 980 pictures. These 980 pictures can be regarded as the clean pictures obtained after data cleaning.
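The merge in step 208 can be sketched as follows. The function name `build_second_target_data` and the item identifiers are hypothetical; the counts reproduce the 190 + 790 = 980 example above.

```python
def build_second_target_data(second_set_labels, first_set_verdicts):
    """Merge manually verified items and model-approved items into the cleaned set.

    Both arguments map an item id to 1 (label correct) or 0 (label wrong).
    """
    verified = [item for item, ok in second_set_labels.items() if ok == 1]
    first_target = [item for item, ok in first_set_verdicts.items() if ok == 1]
    return verified + first_target


# 190 of 200 manually checked pictures and 790 of 800 model-checked pictures are kept.
second_set_labels = {f"s{i}": (1 if i < 190 else 0) for i in range(200)}
first_set_verdicts = {f"f{i}": (1 if i < 790 else 0) for i in range(800)}
clean = build_second_target_data(second_set_labels, first_set_verdicts)
print(len(clean))  # 980
```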

In some embodiments, the embodiments of the present application may further include:

The electronic device uses the target model and the features of each piece of data in the first data set to obtain third target data, namely the data in the first data set whose category labels are determined to be wrong;

The electronic device deletes the third target data and the data in the second data set whose category labels are wrong.

For example, after obtaining the target model, the electronic device may input the features of each of the 800 pictures in the first data set into the target model. The target model outputs, according to the features of each picture, information on whether the category label of that picture is correct, and the pictures whose category labels are determined to be wrong are taken as the third target data. It can be understood that the third target data may include multiple pictures.

After obtaining the third target data, the electronic device may also obtain the pictures in the second data set whose category labels are wrong. The electronic device may then delete the third target data and the pictures in the second data set whose category labels are wrong. It can be understood that the third target data and the wrongly labeled pictures in the second data set are the "dirty data" removed by the data cleaning process, that is, data whose carried category label differs from its actual category. For example, if a picture of a tree is wrongly labeled as belonging to the flower category, that picture is dirty data.
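A minimal sketch of this deletion step, assuming verdicts are already available as lists of 1 (label correct) and 0 (label wrong). The function name `remove_dirty` and the file names are hypothetical; the pictures are imagined to carry the "flower" label, so the tree pictures are the dirty data.

```python
def remove_dirty(first_set, model_verdicts, second_set, manual_labels):
    """Drop items judged wrong by the model (first set) or by inspection (second set)."""
    third_target = [x for x, v in zip(first_set, model_verdicts) if v == 0]
    manual_wrong = [x for x, v in zip(second_set, manual_labels) if v == 0]
    dirty = set(third_target) | set(manual_wrong)
    remaining = [x for x in first_set + second_set if x not in dirty]
    return remaining, sorted(dirty)


first_set = ["rose.jpg", "oak.jpg", "tulip.jpg"]  # verdicts come from the target model
second_set = ["lily.jpg", "pine.jpg"]             # verdicts come from manual inspection
remaining, dirty = remove_dirty(first_set, [1, 0, 1], second_set, [1, 0])
```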

In some embodiments, when the first data set and the second data set are divided, they may satisfy the following condition:

The ratio of the amount of data in the first data set to the amount of data in the second data set is a preset ratio, and the first data set contains more data than the second data set.

For example, the ratio of the amount of data in the first data set to that in the second data set may be a preset ratio such as 8:2, 9:1, or 7.5:2.5, with the first data set containing more data than the second data set.

In another embodiment, the amount of correctly labeled data and the amount of wrongly labeled data in the second data set may each satisfy a numerical condition, namely that each is greater than or equal to 100. For example, the second data set contains no fewer than 100 pictures with correct category labels and no fewer than 100 pictures with wrong category labels.
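The two conditions above can be sketched in a few lines. The function names `split_dataset` and `has_enough_examples` are hypothetical, and the fixed seed is only there to make the sketch reproducible; the shuffle reflects the random division used in the 1000-picture scenario.

```python
import random


def split_dataset(items, ratio=(8, 2), seed=0):
    """Randomly split items into (first_set, second_set) by a preset ratio."""
    first_part, second_part = ratio
    if first_part <= second_part:
        raise ValueError("the first data set must be larger than the second")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_second = round(len(shuffled) * second_part / (first_part + second_part))
    return shuffled[n_second:], shuffled[:n_second]


def has_enough_examples(labels, minimum=100):
    """Check that the second set holds at least `minimum` correct and wrong labels."""
    correct = sum(1 for b in labels if b == 1)
    wrong = sum(1 for b in labels if b == 0)
    return correct >= minimum and wrong >= minimum


first_set, second_set = split_dataset(range(1000), ratio=(8, 2))
print(len(first_set), len(second_set))  # 800 200
```

Note that a second set with 190 correct and 10 wrong labels would not pass `has_enough_examples` with the default minimum; the count condition belongs to a separate embodiment from the 190-picture example.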

Please refer to FIG. 5 to FIG. 7, which are schematic diagrams of scenarios of the data processing method provided by the embodiments of the present application.

For example, as shown in FIG. 5, the user needs to perform data cleaning on the 1000 pictures in picture set P, all of which are annotated with the same category label. The electronic device may first acquire these 1000 pictures.

The electronic device may then randomly divide the 1000 pictures into a first picture set and a second picture set, as shown in FIG. 6. The first picture set contains 800 pictures and the second picture set contains 200 pictures.

Next, the electronic device may use the preset feature extraction model to extract features from every picture in the first picture set and the second picture set. For example, feature f_i is the feature of picture P_i, where i is an integer greater than or equal to 1.

After the features of the pictures have been extracted, manual inspection can be used to determine whether the category label of each of the 200 pictures in the second picture set is correct. If the category label of a picture is correct, the inspector can use the electronic device to mark the picture with information indicating that the label is correct, for example the number "1". If the category label is wrong, the inspector can mark the picture with information indicating that the label is wrong, for example the number "0". In this way, the electronic device obtains the information on whether the category labels of the 200 pictures are correct. For example, 190 of the 200 pictures are marked with the number "1", meaning that manual inspection found 190 pictures in the second picture set to have correct category labels.

After obtaining the information on whether the category labels of the 200 pictures in the second picture set are correct, the electronic device may use this information together with the features of the 200 pictures as input data and feed them to the preset SVM model to train it, thereby obtaining the target model. For example, if picture P_i is a picture in the second picture set, then the feature f_i of picture P_i and the information b_i on whether the category label of picture P_i is correct can be expressed in the form <f_i, b_i>, and <f_i, b_i> can serve as one training sample for the SVM model.

After obtaining the target model, the electronic device may input the features of each of the 800 pictures in the first picture set into the target model. The target model outputs, according to the features of each picture, information on whether the category label of that picture is correct, and the pictures whose category labels are determined to be correct are taken as the first target data. For example, the electronic device finally determines that 790 pictures in the first picture set have correct category labels.

The electronic device may then merge the 190 correctly labeled pictures in the second picture set with the 790 pictures in the first picture set whose category labels were determined to be correct, obtaining 980 pictures. These 980 pictures can be regarded as the clean pictures obtained after data cleaning. For example, as shown in FIG. 7, the electronic device combines these 980 pictures into one picture set.

Please also refer to FIG. 8, which is a fourth schematic flowchart of the data processing method provided by this embodiment.

In this embodiment, the electronic device can use a trained binary classification model to perform data cleaning. Because the trained binary classification model can quickly identify the data whose category labels are correct, this embodiment can obtain clean data quickly. Compared with the data cleaning approach in the related art, in which a person browses the data one by one to check whether the label information is wrong, this embodiment removes a large amount of manual work, which improves the efficiency of data cleaning and reduces its cost.

In addition, this embodiment performs data cleaning by binary classification and can achieve an accuracy close to that of manual cleaning. Moreover, the data cleaning process of this embodiment is traceable, so other personnel can check the quality of the cleaning by reviewing the cleaning process.
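The application does not say how traceability is achieved. One possible reading, offered purely as an assumption, is to keep a per-item record of who made each verdict (manual inspection or the target model). The record layout and the name `audit_record` below are hypothetical.

```python
import json


def audit_record(item_id, source, verdict):
    """Describe one cleaning decision; source is 'manual' or 'model' (assumed values)."""
    return {"item": item_id, "source": source, "label_correct": bool(verdict)}


records = [
    audit_record("p001.jpg", "manual", 1),  # second data set, checked by an inspector
    audit_record("p201.jpg", "model", 0),   # first data set, judged by the target model
]
# One JSON object per line, so a reviewer can replay or spot-check the decisions.
log_text = "\n".join(json.dumps(r, sort_keys=True) for r in records)
```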

Please refer to FIG. 9, which is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application. The data processing apparatus 300 may include: a first acquisition module 301, a division module 302, an extraction module 303, a second acquisition module 304, a training module 305, a third acquisition module 306, and a processing module 307.

The first acquisition module 301 is configured to acquire a plurality of data, the plurality of data carrying the same category label;

The division module 302 is configured to divide the plurality of data into a first data set and a second data set;

The extraction module 303 is configured to extract the features of each piece of data in the first data set and the second data set;

The second acquisition module 304 is configured to obtain the correctness information of the category label of each piece of data in the second data set;

The training module 305 is configured to train a preset binary classification model according to the correctness information of the category label of each piece of data in the second data set and the features of each piece of data, to obtain a target model;

The third acquisition module 306 is configured to use the target model and the features of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose category labels are determined to be correct;

The processing module 307 is configured to obtain second target data according to the first target data and the data in the second data set whose category labels are correct.

In one embodiment, the first acquisition module 301 may be configured to:

when the plurality of data are pictures, obtain a first model, the first model being a ResNet model trained on ImageNet;

train the ResNet model using the plurality of data to obtain a second model;

remove the fully connected layer that forms the last layer of the second model to obtain a third model, and determine the third model as the preset feature extraction model;

In that case, the extraction module 303 may be configured to: extract the features of each piece of data in the first data set and the second data set using the preset feature extraction model.
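The idea of turning a classifier into a feature extractor by dropping its final fully connected layer can be shown on a toy model. This is not a ResNet: the two-layer network, its weights, and the helper names `dense`, `relu`, and `forward` are hypothetical stand-ins chosen so the sketch runs without a deep learning library.

```python
def relu(value):
    """Rectified linear activation."""
    return max(0.0, value)


def dense(weights, activation=None):
    """Build a fully connected layer from a weight matrix (one row per output unit)."""
    def layer(x):
        out = [sum(w * v for w, v in zip(row, x)) for row in weights]
        return [activation(o) for o in out] if activation else out
    return layer


def forward(layers, x):
    """Run the input through each layer in order."""
    for layer in layers:
        x = layer(x)
    return x


second_model = [
    dense([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], activation=relu),  # stand-in backbone
    dense([[1.0, -1.0, 0.5]]),  # final fully connected (classification) layer
]
# The "third model": everything except the last fully connected layer.
third_model = second_model[:-1]
features = forward(third_model, [1.0, -2.0])
print(features)  # [1.0, 0.0, 0.0]
```

The output of `third_model` is the activation that would otherwise feed the classification layer, which is what the downstream binary classifier consumes as the feature of a picture.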

In one embodiment, the extraction module 303 may be configured to:

when the computing capability of the electronic device is lower than a preset threshold, extract the features of each piece of data in the first data set and the second data set using the preset feature extraction model;

when the computing capability of the electronic device is not lower than the preset threshold, obtain a fourth model and use the fourth model to extract the features of each piece of data in the first data set and the second data set, wherein the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
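The selection rule above reduces to a single comparison. How computing capability is measured is not specified in the application, so the numeric score, the threshold value, and the name `choose_feature_extractor` below are assumptions made for illustration.

```python
def choose_feature_extractor(compute_capability, threshold, preset_model, fourth_model):
    """Pick the lighter preset model on weak devices, the more accurate one otherwise."""
    if compute_capability < threshold:
        return preset_model
    return fourth_model  # capability is not lower than the threshold


# Strings stand in for real model objects.
weak_choice = choose_feature_extractor(3.0, 5.0, "preset_model", "fourth_model")
strong_choice = choose_feature_extractor(5.0, 5.0, "preset_model", "fourth_model")
```

A device whose score exactly equals the threshold gets the fourth model, matching the "not lower than" wording.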

In one embodiment, the ratio of the amount of data in the first data set to the amount of data in the second data set is a preset ratio, and the first data set contains more data than the second data set.

In one embodiment, the processing module 307 may be further configured to:

use the target model and the features of each piece of data in the first data set to obtain third target data, namely the data in the first data set whose category labels are determined to be wrong;

delete the third target data and the data in the second data set whose category labels are wrong.

In one embodiment, the preset binary classification model includes at least a support vector machine, a multilayer perceptron, a decision tree, or a random forest.

An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed on a computer, it causes the computer to execute the flow of the data processing method provided by this embodiment.

An embodiment of the present application further provides an electronic device including a memory and a processor, the processor being configured to execute the flow of the data processing method provided by this embodiment by invoking a computer program stored in the memory.

For example, the electronic device may be a mobile terminal such as a tablet computer or a smartphone. Please refer to FIG. 10, which is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

The electronic device 400 may include components such as a display screen 401, a memory 402, and a processor 403. Those skilled in the art will understand that the structure shown in FIG. 10 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

The display screen 401 can be used to display information such as images and text.

The memory 402 can be used to store application programs and data. The application programs stored in the memory 402 contain executable code and can form various functional modules. The processor 403 executes various functional applications and performs data processing by running the application programs stored in the memory 402.

The processor 403 is the control center of the electronic device. It connects the parts of the entire electronic device through various interfaces and lines, and performs the functions of the electronic device and processes data by running or executing the application programs stored in the memory 402 and invoking the data stored in the memory 402, thereby monitoring the electronic device as a whole.

In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 403 runs the application programs stored in the memory 402, so as to perform:

Referring to FIG. 11, the electronic device 400 may include components such as a display screen 401, a memory 402, a processor 403, an input unit 404, and a power supply 405.

The input unit 404 can be used to receive input numbers, character information, or user characteristic information (such as a fingerprint), and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.

The power supply 405 can be used to supply power to the components.

In one embodiment, the processor 403 may be further configured to: when the plurality of data are pictures, obtain a first model, the first model being a ResNet model trained on ImageNet; train the ResNet model using the plurality of data to obtain a second model; remove the fully connected layer that forms the last layer of the second model to obtain a third model, and determine the third model as the preset feature extraction model;

In that case, when extracting the features of each piece of data in the first data set and the second data set, the processor 403 may perform: extracting the features of each piece of data in the first data set and the second data set using the preset feature extraction model.

In one embodiment, when extracting the features of each piece of data in the first data set and the second data set using the preset feature extraction model, the processor 403 may perform: when the computing capability of the electronic device is lower than a preset threshold, extracting the features of each piece of data in the first data set and the second data set using the preset feature extraction model.

In one embodiment, the processor 403 may further perform: when the computing capability of the electronic device is not lower than the preset threshold, obtaining a fourth model and using the fourth model to extract the features of each piece of data in the first data set and the second data set, wherein the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.

In one embodiment, the processor 403 may further perform: using the target model and the features of each piece of data in the first data set to obtain third target data, namely the data in the first data set whose category labels are determined to be wrong; and deleting the third target data and the data in the second data set whose category labels are wrong.

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in a given embodiment, refer to the detailed description of the data processing method above; they are not repeated here.

The data processing apparatus provided by the embodiments of the present application belongs to the same concept as the data processing method in the above embodiments. Any method provided in the data processing method embodiments can run on the data processing apparatus; for the specific implementation process, see the data processing method embodiments, which are not repeated here.

It should be noted that, for the data processing method described in the embodiments of the present application, a person of ordinary skill in the art will understand that all or part of the flow of the method can be carried out by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium, for example in a memory, and executed by at least one processor, and its execution may include the flow of the embodiments of the data processing method. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

For the data processing apparatus of the embodiments of the present application, the functional modules may be integrated in one processing chip, each module may exist physically on its own, or two or more modules may be integrated in one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc.

The data processing method, apparatus, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. A person skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A data processing method, comprising:

when it is determined that the plurality of data are pictures, obtaining a first model, the first model being a ResNet model trained on ImageNet; training the ResNet model using the plurality of data to obtain a second model; removing the fully connected layer that forms the last layer of the second model to obtain a third model, and determining the third model as a preset feature extraction model;

extracting the features of each piece of data in the first data set and the second data set using the preset feature extraction model;

inputting the features of each piece of data in the first data set into the target model, outputting, by the target model, the correctness information of the category label of each piece of data in the first data set to obtain the data in the first data set whose category labels are determined to be correct, and determining the data in the first data set whose category labels are determined to be correct as first target data;

2. The data processing method according to claim 1, wherein extracting the features of each piece of data in the first data set and the second data set using the preset feature extraction model comprises:

3. The data processing method according to claim 2, wherein the method further comprises:

4. The data processing method according to claim 1, wherein the ratio of the amount of data in the first data set to the amount of data in the second data set is a preset ratio, and the first data set contains more data than the second data set.

5. The data processing method according to claim 1, wherein the method further comprises:

6. The data processing method according to claim 1, wherein the preset binary classification model comprises at least a support vector machine, a multilayer perceptron, a decision tree, or a random forest.

7. A data processing apparatus, comprising:

a first acquisition module, configured to acquire a plurality of data, the plurality of data carrying the same category label; wherein, when it is determined that the plurality of data are pictures, a first model is obtained, the first model being a ResNet model trained on ImageNet; the ResNet model is trained using the plurality of data to obtain a second model; the fully connected layer that forms the last layer of the second model is removed to obtain a third model, and the third model is determined as a preset feature extraction model;

an extraction module, configured to extract the features of each piece of data in the first data set and the second data set using the preset feature extraction model;

a third acquisition module, configured to input the features of each piece of data in the first data set into the target model, output, by the target model, the correctness information of the category label of each piece of data in the first data set to obtain the data in the first data set whose category labels are determined to be correct, and determine the data in the first data set whose category labels are determined to be correct as first target data;

8. A storage medium on which a computer program is stored, wherein, when the computer program is executed on a computer, the computer is caused to execute the method according to any one of claims 1 to 6.

9. An electronic device, comprising a memory and a processor, wherein the processor is configured to execute the method according to any one of claims 1 to 6 by invoking a computer program stored in the memory.