
Data processing method, device, electronic equipment and storage medium

Info

Publication number
CN117669759A
Authority
CN
China
Prior art keywords
data
labeling
result
acquisition
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211028230.XA
Other languages
Chinese (zh)
Inventor
郭思敏
刘权
陈志刚
王士进
刘聪
胡国平
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202211028230.XA
Publication of CN117669759A
Status: Pending


Abstract

The invention provides a data processing method, a data processing device, electronic equipment and a storage medium, wherein the method comprises the following steps: S11, acquiring first data; S12, preprocessing the first data to obtain second data; S13, sending the second data to a labeling end and receiving third data returned by the labeling end, wherein the third data carries labeling results; S14, sampling the third data to obtain fourth data; S15, sending the fourth data to the checking end and receiving a labeling check result returned by the checking end; S16, if the labeling check result is that the labeling is qualified, performing at least one of data filtering, data enhancement and labeling modification on the third data to obtain target data; and if the labeling check result is that the labeling is unqualified, repeating S13-S15. The invention improves the effectiveness of data processing and ensures data quality.

Description

Data processing method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, an electronic device, and a storage medium.
Background
Data is an important part of an artificial intelligence (AI) system; it is used in the training stage before an AI system is deployed, or in dynamic training while the system is in use.
For training methods such as supervised, semi-supervised, unsupervised and reinforcement learning, data quality may be a primary consideration when creating and using data to train and evaluate artificial intelligence systems. It has been shown that the results of analysis and machine learning (ML) are more practical and reliable when the data is more accurate and richer. Therefore, how to ensure data quality has become an urgent problem.
Disclosure of Invention
The invention provides a data processing method, a data processing device, electronic equipment and a storage medium, which are used to address the difficulty, in the prior art, of guaranteeing the quality of data applied to an artificial intelligence system.
The invention provides a data processing method, which comprises the following steps:
S11, acquiring first data, wherein the first data comprises at least one of audio, images and texts;
S12, preprocessing the first data to obtain second data, wherein the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling;
S13, the second data are sent to a labeling end, and third data returned by the labeling end are received, wherein the third data carry labeling results;
S14, sampling the third data to obtain fourth data;
S15, the fourth data are sent to the checking end, and a labeling checking result returned by the checking end is received;
S16, under the condition that the labeling inspection result is qualified, performing at least one of data filtering, data enhancement and labeling modification on the third data to obtain target data;
and if the labeling check result is that the labeling is unqualified, repeating S13-S15.
According to the data processing method provided by the invention, the preprocessing of the first data comprises the following steps:
in the case where the first data includes audio, data combining the first data based on a sampling rate;
in the case where the first data includes an image, the first data is data-combined based on at least one of an image capturing time, an image capturing position, image capturing apparatus information, and a timing relationship of a plurality of images captured for the same subject.
According to the data processing method provided by the invention, when the first data comprises audio, the labeling result comprises at least one of a speaker role labeling result, an environment scene labeling result, a language labeling result, a rhythm labeling result, a system labeling result, an emotion labeling result and a noise labeling result;
when the first data comprises an image, the labeling result comprises at least one of a point labeling result, a frame labeling result, a region labeling result, a 3D labeling result and a classification labeling result;
and when the first data comprises a text, the labeling result comprises at least one of a sentence segmentation labeling result, a semantic judgment labeling result, a text translation labeling result, an emotion labeling result, a polyphonic word labeling result and a digital character labeling result.
According to the data processing method provided by the invention, after the fourth data is sent to the checking end, the method further comprises:
receiving an acquisition check result returned by the checking end;
and the performing of at least one of data filtering, data enhancement and labeling modification on the third data in the case that the labeling check result is that the labeling is qualified comprises:
performing at least one of data filtering, data enhancement and labeling modification on the third data in the case that the labeling check result is that the labeling is qualified and the acquisition check result is that the acquisition is qualified.
According to the data processing method provided by the invention, the acquisition check result is determined based on at least one of the speech clarity and background noise ratio of audio in the fourth data, and the resolution, clarity, lighting and color of images in the fourth data.
According to the data processing method provided by the invention, in the case that the first data comprises audio, the data enhancement comprises at least one of noise superposition, voice rate interference and reverberation;
in the case where the first data comprises an image, the data enhancement comprises at least one of scaling, cropping, flipping, rotation;
where the first data comprises text, the data enhancements include at least one of entity substitution, reverse translation, synonym substitution, random insertion, random exchange, random deletion, scrambling sentence order, generating sentences using a generative model.
The invention provides a data processing method, which comprises the following steps:
S21, acquiring first data, wherein the first data comprises at least one of audio, images and texts;
S22, preprocessing the first data to obtain second data, wherein the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling;
S23, sampling the second data to obtain third data;
S24, sending the third data to the checking end, and receiving the acquisition check result returned by the checking end;
S25, in the case that the acquisition check result is that the acquisition is qualified, performing at least one of data filtering and data enhancement on the second data to obtain target data;
and if the acquisition check result is that the acquisition is unqualified, repeating S21-S24.
The invention provides a data processing method, which comprises the following steps:
S31, acquiring first data, wherein the first data comprises behaviors of an intelligent agent, an input data format, a data acquisition mode and an evaluation rule;
S32, inputting the behaviors and the second data in the input data format to the model corresponding to the intelligent agent, and obtaining feedback data corresponding to the behaviors output by the model corresponding to the intelligent agent;
S33, determining reward data corresponding to the behavior based on the feedback data and the evaluation rule;
and S34, determining target data based on the behavior, the second data, the feedback data and the rewards data.
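By way of illustration only, the following minimal Python sketch shows how the S31-S34 interaction loop could be organized; the callables agent_model and evaluation_rule, and all other names, are hypothetical and not part of the disclosure:

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class InteractionRecord:
    """One target-data record assembled in S34: behavior, input, feedback and reward."""
    behavior: Any
    input_data: Any
    feedback: Any
    reward: float


def collect_rl_data(behaviors: List[Any],
                    inputs: List[Any],
                    agent_model: Callable[[Any, Any], Any],
                    evaluation_rule: Callable[[Any], float]) -> List[InteractionRecord]:
    records = []
    for behavior, second_data in zip(behaviors, inputs):
        # S32: feed the behavior and the formatted input to the agent's model
        feedback = agent_model(behavior, second_data)
        # S33: score the feedback with the evaluation rule to obtain the reward
        reward = evaluation_rule(feedback)
        # S34: assemble the target-data record
        records.append(InteractionRecord(behavior, second_data, feedback, reward))
    return records
```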
The invention also provides a data processing device, comprising:
a data acquisition unit configured to acquire first data including at least one of audio, image, and text;
the preprocessing unit is used for preprocessing the first data to obtain second data, and the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling;
the marking unit is used for sending the second data to a marking end and receiving third data returned by the marking end, wherein the third data carries a marking result;
the sampling unit is used for sampling the third data to obtain fourth data;
the checking unit is used for sending the fourth data to a checking end and receiving a labeling checking result returned by the checking end;
the optimizing unit is used for carrying out at least one of data filtering, data enhancement and labeling modification on the third data to obtain target data under the condition that the labeling inspection result is qualified; and repeatedly executing the labeling unit, the sampling unit and the checking unit under the condition that the labeling check result is that the labeling is unqualified.
The invention also provides a data processing device, comprising:
a data acquisition unit configured to acquire first data including at least one of audio, image, and text;
the preprocessing unit is used for preprocessing the first data to obtain second data, and the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling;
the sampling unit is used for sampling the second data to obtain third data;
the checking unit is used for sending the third data to a checking end and receiving an acquisition check result returned by the checking end;
the optimizing unit is used for performing at least one of data filtering, data enhancement and labeling modification on the second data to obtain target data in the case that the acquisition check result is that the acquisition is qualified; and repeatedly executing the data acquisition unit, the preprocessing unit, the sampling unit and the checking unit in the case that the acquisition check result is that the acquisition is unqualified.
The invention also provides a data processing device, comprising:
the data acquisition unit is used for acquiring first data, wherein the first data comprises behaviors of an intelligent agent, an input data format, a data acquisition mode and an evaluation rule;
The interaction unit is used for inputting the behaviors and the second data in the input data format to the model corresponding to the intelligent agent to obtain feedback data corresponding to the behaviors output by the model corresponding to the intelligent agent;
a reward determining unit, configured to determine reward data corresponding to the behavior based on the feedback data and the evaluation rule;
and a data determining unit configured to determine target data based on the behavior, the second data, the feedback data, and the bonus data.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a data processing method as described in any of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a data processing method as described in any of the above.
With the data processing method, the device, the electronic equipment and the storage medium provided by the invention, a data processing flow including preprocessing, labeling, sampling, checking and data optimization is executed for various types of data, and effective quality control is carried out at each stage of the data lifecycle, which improves the effectiveness of data processing and guarantees data quality.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a data processing method according to the present invention;
FIG. 2 is one of the schematic frame diagrams of the data processing flow of supervised learning provided by the present invention;
FIG. 3 is a schematic flow chart of the data annotation provided by the invention;
FIG. 4 is a second flow chart of the data processing method according to the present invention;
FIG. 5 is a third flow chart of the data processing method according to the present invention;
FIG. 6 is a flow chart of a data processing method according to the present invention;
FIG. 7 is a schematic diagram of mapping a data lifecycle framework to a framework of a data processing flow provided by the present invention;
FIG. 8 is a schematic diagram of a data processing flow provided by the present invention;
FIG. 9 is a schematic diagram of data quality optimization and data quality process verification provided by the present invention;
FIG. 10 is a diagram illustrating a relationship between a data lifecycle framework and a data processing flow provided by the present invention;
FIG. 11 is a second schematic diagram of a data processing flow for supervised learning provided by the present invention;
FIG. 12 is a schematic diagram of a data processing apparatus according to the present invention;
FIG. 13 is a second schematic diagram of a data processing apparatus according to the present invention;
FIG. 14 is a third schematic diagram of a data processing apparatus according to the present invention;
fig. 15 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In machine learning based artificial intelligence systems, for training methods such as supervised, semi-supervised, unsupervised and reinforcement learning, as well as for data analysis performed by the system, data quality may be a primary consideration when creating and using data to train and evaluate artificial intelligence systems. High-quality data is one of the most important resources in the artificial intelligence industry, and how to acquire high-quality data has become a problem the industry needs to solve.
In this regard, the present invention provides a data processing method for ensuring the quality of the data it produces by performing data quality control within a data quality lifecycle framework (Data Quality Life Cycle Framework, DQLCF). The data obtained here is data that meets the requirements of a specific environment. Applying the data processing method can ensure that the data used by the artificial intelligence system is reliable and that the related machine learning model can meet requirements such as reliability and relevance to the intended use. In particular, the effectiveness of a machine learning model depends primarily on the quality of the training data; if the training data is unreliable or even invalid, any predictions made by the machine learning model are suspect or even potentially misleading. Regardless of the type of data application to which the data corresponds, data processing in a machine learning scenario needs to comply with common rules within the data lifecycle framework, including:
ensuring the data is applicable to the specified machine learning or analysis task;
using a data processing flow based on data quality characteristics;
validating the data processing flow using quantifiable data quality assessment indicators, where a data quality assessment indicator can be implemented as a target set for a particular correlation metric;
verifying at each stage whether the process meets the above objectives and other requirements;
ensuring the correctness and robustness of testing, including techniques such as adversarial testing intended to identify errors;
meeting the organization's requirements on security, privacy, fairness and ethics;
protecting the health and well-being of annotators and other personnel participating in the data processing flow;
recording progress and complying with regulatory guidelines and requirements.
It should be noted that the above general rule is applicable not only to training data but also to test data and verification data.
Fig. 1 is a schematic flow chart of a data processing method provided by the present invention. As shown in Fig. 1, the data processed by the method can be applied to a subsequent supervised learning training stage, where supervised learning (supervised machine learning) is a machine learning task of inferring a function from labeled training data. The training data for supervised learning comprises one or more sets of training examples, where each example consists of an input object and a desired output value; the input object can be obtained through data acquisition, and the desired output value, i.e. the label, is produced through manual or automatic annotation. For example, supervised learning for face recognition may be performed with face images labeled with person names or person numbers, and supervised learning for text translation may be performed with texts expressing the same semantics in two languages.
The method comprises the following steps:
s11, acquiring first data, wherein the first data comprises at least one of audio, images and texts.
Here, the first data may be obtained through data acquisition, and the acquired first data may include any one of audio, image and text, or a combination of several types; the specific data type of the first data is related to the supervised learning task to which the data processing method is actually applied. For example, for a supervised learning task of face recognition, data acquisition may consist of collecting face images from the Internet or capturing face images from pre-recorded videos; for a supervised learning task of language classification, data acquisition may consist of collecting texts of the corresponding languages from websites in various languages; and for a supervised learning task of speaker recognition, data acquisition may consist of cutting out audio of different speakers from conversation audio or pre-recorded conference audio.
S12, preprocessing the first data to obtain second data, wherein the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling.
Specifically, after the first data is obtained at the end of the data preparation stage, the first data can be preprocessed. The data preprocessing for the first data may include at least one of data combination, data filtering, normalization, scaling and interpolation, data cleaning, data de-identification, and data sampling. It will be appreciated that the preprocessing methods employed may differ for different supervised learning tasks and different types of first data.
The data combination is to combine first data from different sources, and perform data inspection on the combined first data, where the data inspection may be a data integrity rule and/or a data accuracy rule.
Here, the data integrity rule is used to check whether the key element of the data has missing data content, i.e. check whether the data value and the data content are missing; for example, environmental data at one time is collected, where the environmental data needs to include temperature, illumination, and humidity, and it needs to be determined whether the environmental data at one time includes three elements of temperature, illumination, and humidity during integrity check; the data accuracy rules are used to check whether the data is in compliance with objective reality, error free, and to ensure that the data does not deviate from the context and field in which the actual data is intended to be used. For example, the human body temperature is detected at intervals, and in consideration of the actual situation, if the detected human body temperature is 25 ℃, the detected human body temperature deviates significantly from objective reality, and it is necessary to delete data and adjust the acquisition device.
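Purely as an illustration of these two rules, a minimal Python sketch follows; the field names and the 34-43 °C plausibility bounds are assumptions taken from the examples above:

```python
REQUIRED_FIELDS = {"temperature", "illumination", "humidity"}


def check_integrity(record: dict) -> bool:
    """Data integrity rule: every key element must be present with a non-empty value."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def check_body_temperature_accuracy(value: float) -> bool:
    """Data accuracy rule: a reading far from objective reality (e.g. a human body
    temperature of 25 degrees C) is rejected; the 34-43 degree bounds are assumptions."""
    return 34.0 <= value <= 43.0
```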
Data combination allows heterogeneous first data collected from various sources to be integrated and unified into second data of the same specification. For example, for a supervised learning task of face recognition, where the first data are face images of various formats collected directly, the combined second data can be face images unified to the JPG format with a unified resolution of 300 dpi; in this process, low-resolution images and images that do not contain faces are screened out.
The data filtering is used for filtering non-compliant or useless data in the first data, for example, the first data used for voice separation should be audio containing human voice, and a supervised learning model can be trained in advance for judging whether the audio contains human voice or not, and accordingly the audio containing no human voice in the first data is filtered.
The scaling means scaling the image outwards and inwards to create a new image, and by scaling the image in the first data, the size of each image in the second data obtained by the scaling process can be a predetermined size, which provides convenience for the subsequent application stage.
Wherein the data cleansing is used to identify incomplete, incorrect, inaccurate or irrelevant portions of the first data, and then replace, modify or delete dirty data, coarse data, incomplete and useless data objects;
The data cleaning method may include:
deleting duplicate or incomplete cases: if the data comes from multiple sources, or errors occur when retrieving the data, the acquired data may contain duplicate cases. In addition, the data may lack some values, so these duplicate or incomplete cases need to be deleted;
deleting meaningless answers and unreadable data: the data may include meaningless answers, such as answers containing invalid values that are meaningless in the ML context. Importing data from multiple files may also result in some data being unreadable. These cases should be removed.
Since the acquired second data may contain personally identifiable information (PII), the privacy of the data subject may be compromised. Examples of personally identifiable information include: name, address, IP address, location, biometric information, demographic information, etc. PII should be eliminated from training, validation and test data as far as possible while still achieving the goals of the machine learning project; PII may need to be used when production data is used to make inferences about an individual. Data de-identification is used to identify sensitive information in the first data, and sensitive attributes including personally identifiable information in the second data are de-sensitized or de-identified by substitution, filtering, encryption, masking or deletion.
For example, a retail organization may wish to offer three new stores in a metropolitan area based on the location of its current customers. The organization has a customer database of thousands of customers and plans to use machine learning clustering algorithms to determine the best locations of stores. The organization runs the query on the customer database, selecting only the zip code, and not any other information available in the database.
Data de-identification refers to the process of deleting, altering, or restricting access to personally identifying information in the acquired data so that it cannot be linked to one or more persons. Examples of data de-identification methods include anonymization, pseudonymization, unlinking, aggregation, differential privacy, etc. For more information on data de-identification, see ISO/IEC 27559. Data sampling is used to select a smaller but representative data sample from the first data when the first data is very large, so that the model can be built and run faster while still producing accurate results.
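A minimal sketch of de-identification by pseudonymization (salted hashing of fields known to contain PII) is given below; the field list and salt are illustrative assumptions, not part of the disclosure:

```python
import hashlib

PII_FIELDS = {"name", "address", "ip_address", "location"}


def de_identify(record: dict, salt: str = "project-salt") -> dict:
    """Replace direct identifiers with a short salted hash (pseudonymization)
    and leave the non-sensitive attributes untouched."""
    cleaned = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            cleaned[key] = digest[:12]
        else:
            cleaned[key] = value
    return cleaned
```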
Through the above data preprocessing, damaged or inaccurate data in the first data can be detected and corrected, thereby obtaining the second data.
S13, the second data are sent to the labeling end, and third data returned by the labeling end are received, wherein the third data carry labeling results.
Specifically, after the second data is obtained, the second data can be sent to the labeling end, the labeling end performs manual or automatic labeling on the second data, so that third data carrying a labeling result is generated, and the third data is returned. For example, under the supervised learning task applied to face recognition, the second data is a face image, and the labeling for the second data may be labeling each face image with a name or number of the face contained therein.
It will be appreciated that the labeling method applied by the labeling side depends on the type of data contained in the first data, and may include object labeling, bounding boxes, keypoint labeling, instance segmentation, and semantic segmentation. The data annotation object comprises:
image marking: image annotation is currently widely used. The main labeling method comprises point labeling, frame labeling, region labeling, 3D labeling and classification labeling.
Image annotation is applied in many scenarios such as security, education, autopilot;
audio annotation: audio annotation is used for speech recognition, voiceprint recognition, and speech synthesis. The data may be annotated with speaker roles, environmental scenes, multilingual annotations, prosody annotations, system annotations, emotion annotations, noise annotations, and so on.
Text labeling: text labeling is an important aspect of natural language processing. In order to provide high-precision prediction, text data may be labeled by sentence segmentation labeling, semantic judgment labeling, text translation labeling, emotion labeling, polyphonic word labeling, digital character labeling, and the like.
For example, for data labeling of audio data, three labeling layers may be divided, a first layer may be used to label events, space-time, speaker information, etc., a second layer may be used to label the content of speech signals, and a third layer may be used to label invalid speech signals.
For another example, for the data annotation of the video data, three annotation layers can be divided, and the object and scheme of each layer of annotation can be different; wherein, one layer can mark people and objects in the video through key frames; a layer can provide labeling attributes for behaviors, object types, sexes and the like in the video data; yet another layer may mark the start and end times for standard behavior.
In addition, a training scheme for data annotation can be created prior to data annotation so that both manual annotators and automatic annotation systems can learn how to apply the scheme with acceptable accuracy. The concepts and terms used for data annotation in the training scheme should be consistent.
Moreover, the labels should have well-defined semantics so that both manual and automatic labeling system labels can be understood.
In the labeling process, the data can be manually labeled by a manual labeling person or can be automatically labeled by an automatic labeling system, and the automatic labeling system can realize automatic labeling in a pseudo labeling mode. The pseudo labeling is a process of predicting labeling of unlabeled data by using a model trained on labeling data, so that automatic labeling is realized simply and quickly, and the pseudo labeled data can be used as a reference of a manual labeling person, thereby being beneficial to improving the efficiency of labeling tasks.
S14, sampling the third data to obtain fourth data.
Specifically, after the third data returned by the labeling end is obtained, the data quality inspection needs to be performed on the third data, but considering that the data size of the third data is huge, a large amount of computing resources are required for performing the data quality inspection on the whole third data, so that the third data needs to be sampled, and the sampled data is fourth data. The sampling may be random sampling or hierarchical sampling, where the chance of each sample in the third data being selected is equal under random sampling, and hierarchical sampling is to divide the third data into sub-groups according to correlation properties (such as gender and age range), where the sampling is to ensure that each sub-group is properly represented.
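For illustration, a minimal Python sketch of the two sampling strategies follows, assuming each example is a dict carrying the stratification attribute (e.g. gender) as a key:

```python
import random
from collections import defaultdict


def random_sample(data: list, k: int) -> list:
    """Simple random sampling: every example has the same chance of being selected."""
    return random.sample(data, min(k, len(data)))


def stratified_sample(data: list, key: str, fraction: float) -> list:
    """Stratified sampling: split into sub-groups by `key` and sample the same
    fraction from each group so that every group remains properly represented."""
    groups = defaultdict(list)
    for item in data:
        groups[item[key]].append(item)
    sample = []
    for items in groups.values():
        n = max(1, round(len(items) * fraction))
        sample.extend(random.sample(items, min(n, len(items))))
    return sample
```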
S15, the fourth data are sent to the checking end, and a labeling checking result returned by the checking end is received;
specifically, after the sampled fourth data is obtained, the fourth data can be sent to the inspection end, and the inspection end carries out labeling quality evaluation on the fourth data, so that a labeling inspection result is generated and returned.
Here, the labeling quality evaluation is used to evaluate the quality of the part of the fourth data obtained through data labeling, and is aimed at ensuring that the result of the data labeling is valuable and meets the specific objectives of machine learning. In the labeling quality evaluation process, the accuracy, precision, integrity and consistency of the labeling should be considered.
The labeling quality assessment may further include labeling outcome verification and/or labeling process control:
the labeling result inspection is quality inspection of the data labeling result, and specifically may be comprehensive quality inspection of the result or quality inspection of the sample of the result. The quality inspection can inspect the result of the data annotation, and also can inspect the annotation process of the result backtracking of the data annotation. The labeling result inspection may include peer review and administrator review, and specifically, peer review may be performed first, i.e., by multiple inspectors, and then administrator review, i.e., by an administrator evaluating a sample of the result.
In particular, labeling results from crowd-sourced labeling require comprehensive quality inspection.
The quality evaluation of the labeling process control can be divided into two steps: initial data labeling and data quality inspection. During initial data labeling, the annotator should label the data strictly according to the labeling description and submit a quality report after completing the evaluation of their own labeling results. During data quality inspection, an inspector checks the quality of the labeled data; if the quality of any labeled data is found to be unsatisfactory, that labeled data can be returned to the annotator for modification.
S16, under the condition that the labeling inspection result is qualified, performing at least one of data filtering, data enhancement and labeling modification on the third data to obtain target data;
and if the labeling check result is that the labeling is unqualified, repeating S13-S15.
Specifically, after the labeling inspection result is obtained, whether the labeling is qualified or not can be judged through the labeling inspection result, and the subsequent operation is executed.
In the case of qualified annotation, the third data may be subjected to data optimization, where specific operations of data optimization may include at least one of data filtering, data enhancement, and annotation modification, thereby obtaining target data that is ultimately applied to supervised learning.
Wherein data filtering may be achieved by scoring the third data by a supervised learning model and by eliminating portions of the third data with lower confidence. For example, the third data for speech separation should be audio containing human voice, and the supervised learning model may be trained in advance to determine whether the audio contains human voice, and filter the audio containing no human voice in the third data accordingly.
Wherein the data enhancement aims at increasing the total amount of data by introducing new annotation data into existing annotation data, i.e. achieving an expansion of the third data size. Different data enhancement methods exist for different machine learning tasks and different data types.
For example, where the first data includes text, the method of data augmentation for text may include entity substitution, reverse translation, synonym substitution, random insertion, random exchange, random deletion, scrambling sentence order, generating sentences by using a generative model, and so forth.
Wherein entity replacement refers to replacing entities in sentences with entities of the same class; reverse translation refers to translating a sentence into another language and then translating the sentence in the other language back into the original language; synonym replacement refers to replacing words in a sentence with synonyms; random insertion means that a synonym of a random word is found in a sentence, and the synonym is inserted into a random position in the sentence; random exchange refers to randomly selecting two words in a sentence and exchanging positions; randomly deleting words in the sentences by probability; disturbing sentence order refers to changing the order of sentences in a paragraph to create a new paragraph; generating sentences by using the generation model refers to generating new sentences using the text generation model.
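As an illustration, the following minimal sketch implements three of the listed operations that need no external resources (random exchange, random deletion, scrambled sentence order); synonym replacement and reverse translation would additionally require a dictionary or a translation model:

```python
import random


def random_swap(words: list, n: int = 1) -> list:
    """Random exchange: pick two positions in the sentence and swap them, n times."""
    if len(words) < 2:
        return words[:]
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_delete(words: list, p: float = 0.1) -> list:
    """Random deletion: drop each word with probability p (keep at least one word)."""
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(words)]


def shuffle_sentences(paragraph: list) -> list:
    """Scrambled sentence order: reorder the sentences of a paragraph to create a new one."""
    shuffled = paragraph[:]
    random.shuffle(shuffled)
    return shuffled
```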
When the first data includes an image, the method of data enhancement for the image may include: scaling, clipping, flipping, rotating, etc.
Wherein scaling refers to scaling the image outward and inward to create a new image; the cropping refers to selecting a part of an image, cropping, and then adjusting the size of the image to the original image size; the turning refers to horizontally and vertically turning the image; rotation refers to rotating the image between 0 and 360 degrees.
When the first data includes audio, a method of data enhancement for audio may include: noise superposition, speech rate interference, reverberation, etc.
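For illustration, a minimal sketch of noise superposition and a crude speech-rate perturbation on a raw waveform array follows (reverberation would additionally require an impulse response); the target SNR value is an assumption:

```python
import numpy as np


def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Noise superposition: scale the noise so the mixture reaches the target SNR."""
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


def change_speed(wave: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Crude speech-rate perturbation by linear resampling of the time axis."""
    idx = np.arange(0, len(wave), rate)
    return np.interp(idx, np.arange(len(wave)), wave)
```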
The annotation modification aims at correcting quality problems that may exist in the annotated part of the third data. The detection of mislabeled examples in the third data may be carried out during labeling quality evaluation, and there are various ways to detect mislabeled examples, for example through cross-validation.
The first step of cross-validation is to divide the sampled fourth data into n parts. For each of the n parts, a model can be trained on the other n-1 parts and then applied to predict labels for the held-out part; if the predicted result differs from the existing label of that part, the probability that the label is wrong is higher. By performing this operation on each of the n parts, it can be determined which parts contain labeling errors, so that the mislabeled examples are filtered out and returned to the annotators for inspection and modification.
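A minimal sketch of this cross-validation check, using scikit-learn, is shown below; the choice of logistic regression is an assumption, and any classifier suited to the task fits the same pattern:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def suspect_label_errors(X: np.ndarray, y: np.ndarray, n_splits: int = 5) -> np.ndarray:
    """Return the indices whose held-out prediction disagrees with the stored label;
    these examples are candidates to send back to the annotators for modification."""
    suspects = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        suspects.extend(test_idx[preds != y[test_idx]])
    return np.sort(np.array(suspects))
```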
In addition, when modifying the annotation data, both the original data and the modified data should be stored.
If the labeling is unqualified, the second data needs to be sent to the labeling end again; the labeling end performs self-checking and re-labeling and returns the re-labeled third data, and sampling and labeling inspection are performed again until the labeling inspection result from the checking end is qualified, after which data optimization is performed to obtain the target data.
The method provided by the embodiment of the invention executes a data processing flow including preprocessing, labeling, sampling, checking and data optimization for various types of data, and performs effective quality control at each stage of the data lifecycle, thereby improving the effectiveness of data processing and guaranteeing the data quality for supervised learning.
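The overall S11-S16 control flow can be summarized by the following minimal sketch; the callables passed in stand for the labeling end, the checking end and the optimization step, and all names are hypothetical:

```python
def process_supervised_data(first_data,
                            preprocess,          # S12
                            send_to_annotators,  # S13: returns labeled third data
                            sample,              # S14
                            inspect_labels,      # S15: returns True if labeling is qualified
                            optimize,            # S16: filter / augment / fix labels
                            max_rounds: int = 5):
    second_data = preprocess(first_data)                 # S12
    for _ in range(max_rounds):
        third_data = send_to_annotators(second_data)     # S13
        fourth_data = sample(third_data)                 # S14
        if inspect_labels(fourth_data):                  # S15: labeling qualified
            return optimize(third_data)                  # S16: target data
        # labeling unqualified: repeat S13-S15 with re-annotation
    raise RuntimeError("labeling still unqualified after max_rounds iterations")
```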
Based on any of the above embodiments, S12 includes:
in the case where the first data includes audio, data combining the first data based on a sampling rate;
in the case where the first data includes an image, the first data is data-combined based on at least one of an image capturing time, an image capturing position, image capturing apparatus information, and a timing relationship of a plurality of images captured for the same subject.
Specifically, in S12, preprocessing is performed for the first data, including data combining is performed for the first data. Here, the data combination may be performed according to a type, an attribute, and a relationship between data in the first data, and the type, the attribute, and the relationship between data in the first data may be determined based on at least one of an acquisition environment, an acquisition object, and an acquisition requirement of the data acquisition.
The first data may be obtained through data collection, the data collection is performed according to a preset data source, the source can be various, and various sources can correspond to various types of data in various formats, for example, the data types of supervised learning can be text, images, audio and the like.
Before data acquisition, an acquisition scheme can be preset, wherein the acquisition scheme can be determined based on the application field of supervised learning, analysis targets, problems, learning tasks, use modes and the like, and the use modes can be model training or model evaluation. By the acquisition scheme, the data type, coverage, magnitude, format, acquisition method and the like during data acquisition can be determined.
In the process of data acquisition, the reliability and the validity of the acquired first data are ensured, and meanwhile, the data consistency of the first data needs to be ensured, wherein the data consistency can be verified by whether the value of each attribute in the first data is recorded in the same way or not, and in addition, the data consistency can be used for verifying the attribute represented by the data range in the first data. For example, when the first data of the image type is acquired, it may be checked whether the shooting date of the image is unified in the form of "year/month/day".
After the data acquisition preparation is completed to obtain the first data, the data combination can be performed to obtain combined data, wherein the data combination can be to combine the first data and related information thereof based on the first data and related information of the first data, the related information of the first data can be data type, attribute of the data, relationship among the data and the like, for example, the first data is an image itself, the related information of the first data can include image type, image shooting time, image shooting position, image shooting equipment information, time sequence relationship of a plurality of images shot for the same target and the like.
Further, data combination can be performed based on the acquisition environment, the acquisition object and the acquisition requirements. The combined data can reflect the data under each specific acquisition environment; for example, considering the influence of complex environments on the quality of images collected by a camera, data collected in rainy, snowy and heavy-fog environments can be combined separately. Data can also be combined based on the acquisition plan, so that the combined data reflects the progress of different periods in the acquisition plan. Data combination can likewise be performed based on the acquisition requirements, so that the combined data reflects different data under different acquisition requirements; for example, when audio data is collected, audio can be combined according to its own sampling rate: if part of the audio in the first data is acquired at a sampling rate of 16 kHz and another part at 8 kHz, the 16 kHz audio can be subsampled to 8 kHz and then combined with the audio originally sampled at 8 kHz.
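As an illustration of combining audio collected at mixed sampling rates, a minimal sketch follows; it uses simple decimation without an anti-aliasing filter, which a production pipeline would normally add:

```python
import numpy as np


def downsample(wave: np.ndarray, src_rate: int = 16000, dst_rate: int = 8000) -> np.ndarray:
    """Crude down-sampling by keeping every (src_rate // dst_rate)-th sample."""
    factor = src_rate // dst_rate
    return wave[::factor]


def combine_by_sampling_rate(clips_16k: list, clips_8k: list) -> list:
    """Bring all clips to 8 kHz so they can be merged into one second-data set."""
    return [downsample(c) for c in clips_16k] + list(clips_8k)
```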
Further, the data combination may include various forms, such as:
-supplementing data from multiple sources;
-selecting data based on specific properties in the acquired data;
splitting attributes (e.g., split date attributes into day, month, and year attributes);
-supplementing the artificially synthesized data;
subsampling the acquired data (e.g., subsampling a 16kHz audio stream to 8 kHz);
random sampling, where each sample in the first data is equally opportunistically selected;
hierarchical sampling, in particular the data may be divided into sub-groups according to relevant properties (such as gender and age range), the sampling being to ensure that each group is appropriately representative.
The second data, after the data combination, may have a different format determined by the data provider. To ensure that a certain machine learning tool can consistently process first data in different formats, the first data can be converted, serialized, and stored in the form of data packets, where well organized metadata, data samples, and tags can improve the quality of the data when used. The common components and their semantics can then be abstracted and specified.
By the method, the reuse, exchange, storage, access and comparison of the first data can be facilitated. First data in the form of data packets, common components of which include:
-index: components that can be used to facilitate data browsing and access, including information related to the catalogs, names and offsets of sample and tag files;
-a data header: a component operable to record organization information in the data, including information related to volumes and locations of scalar and block data, and information related to correspondence, statistics, partitioning, data types, and dimensions of the block data;
-page: a page is a segment of a data file that stores the actual scalar data (e.g., data of primitive types such as integer, string, floating point, etc.) or block data (e.g., image, video, and audio) of samples and tags.
After the data combination is completed, the combined data may be used directly as the second data, or the combined data may be subjected to a data preparation quality inspection, with the combined data that passes the inspection used as the second data.
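The packet layout described above (index, data header, pages) could be expressed, for illustration only, with the following minimal sketch; the field names are assumptions chosen to mirror the description:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    """A segment of the data file holding scalar data or block data (image/audio/...)."""
    page_id: int
    payload: bytes


@dataclass
class DataHeader:
    """Organisation information: data types, dimensions and the locations of pages."""
    data_types: Dict[str, str]
    dimensions: Dict[str, List[int]]
    page_offsets: List[int]


@dataclass
class Index:
    """Catalogue, names and offsets of sample and tag files for fast browsing."""
    sample_offsets: Dict[str, int]
    label_offsets: Dict[str, int]


@dataclass
class DataPacket:
    index: Index
    header: DataHeader
    pages: List[Page] = field(default_factory=list)
```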
Based on any of the above embodiments, the third data received in S13 carries the labeling result.
And when the first data comprises audio, the labeling result comprises at least one of a speaker role labeling result, an environment scene labeling result, a language labeling result, a rhythm labeling result, a system labeling result, an emotion labeling result and a noise labeling result.
When the first data comprises an image, the labeling result comprises at least one of a point labeling result, a frame labeling result, a region labeling result, a 3D labeling result and a classification labeling result;
and when the first data comprises a text, the labeling result comprises at least one of a sentence segmentation labeling result, a semantic judgment labeling result, a text translation labeling result, an emotion labeling result, a polyphonic word labeling result and a digital character labeling result.
Based on any of the above embodiments, after the second data is sent to the labeling end in S13, the labeling end performs data labeling on the second data. In the process of data annotation, a plurality of data annotations with different contents can be deployed in different annotation layers, and each annotation layer can be realized by an independent representation format. For the same data object, multiple annotation layers corresponding to the data object can coexist in a unified cross-reference mode among layers. That is, the data annotation process allows not only the use of mutually inconsistent layers, but also the use of alternative annotations that employ different annotation schemes for the same data object.
Fig. 2 is a schematic diagram of a framework of the data processing flow for supervised learning provided by the present invention. In Fig. 2, solid arrows connect the flow of data through the processing pipeline, i.e. the data flow, and dashed arrows represent the flow of labeling work to be executed in the data processing flow, i.e. the workflow. Because the data required for supervised learning needs to be labeled, human participation, i.e. human-resource cooperation, may be required in the labeling work, and the human resources can be assigned to three roles: administrator, annotator and inspector. The annotators carry out the actual data annotation; the inspector verifies all labels, or labeled samples, of each batch of data labeled by the annotators; the administrator is responsible for assigning labeling jobs, managing the roles of annotators and inspectors, and designating a quality team consisting of inspectors and responsible persons.
In fig. 2, in the data preparation stage, in consideration of the need for labeling of data required for supervised learning, labeling work may be performed synchronously, and specifically, labeling task establishment and labeling task allocation may be included.
Here, for different machine learning tasks and model training requirements, the data annotation task may include classification, recognition and segmentation, such as image classification, face recognition, object segmentation, and so forth. Before starting the data annotation, the main task of the data annotation, namely the annotation task establishment, should be defined explicitly, and the annotation task may include scene description, machine learning model and its use, data information, etc. By building labeling tasks, the performance of generating a trained machine learning model therefrom can be ensured.
After the labeling task is established, task allocation can be performed, where the tasks can be allocated to different teams according to different organization methods. Generally, the method of organizing data annotation is chosen depending on the complexity and size of the data, and on how well the organization carrying out the data annotation understands the machine learning task. Depending on their nature, different organization methods may be suitable for different data annotation tasks, as shown in Table 1.
Table 1: Applicability of different organization methods
Among the above organization methods, crowdsourced labeling provides incentives for workers, and quality control is particularly important. Crowdsourcing can deliver a large number of labels in a short time and is therefore suitable for tasks with a low level of confidentiality.
After the second data is obtained through data preprocessing, the second data can be subjected to data marking, and marking can be performed by referring to the data marking related information in the process of data marking.
Here, in the data annotation related information, the metadata may include at least one of a training task, a source, a data format, a date and time, and information about the data collector.
The data annotation method may include at least one of object annotation, bounding box, keypoint annotation, instance segmentation, and semantic segmentation.
The data annotation task may comprise at least one of classification, identification and segmentation, and the allocation policy of the data annotation task may follow any of the organization methods described above.
The annotation specification requirements may include an annotation specification and annotation requirements, wherein the annotation specification may include three parts: a description, correct examples, and annotation notes:
the description is used to clarify the annotation definitions for the data objects and to specify the annotation components, annotation types, annotation platform and all operations used;
the correct examples demonstrate correct labeling methods or results by example, and should cover special or difficult cases;
the annotation notes detail the errors to be avoided, the labeling method and additional processing methods, where errors can be classified into different types, such as errors where an object that should be labeled is not detected during inspection, errors where an object that should be excluded is labeled, and errors where the label range falls short of or exceeds the object during labeling;
the annotation requirements may include at least one of the stakeholders and their responsibilities, the expected lead time, accuracy requirements, and data security policy requirements.
The annotation platform tools may comprise an annotation platform and/or an annotation tool; their selection is typically made before machine learning model training and should comply with data security requirements. Further, a basic function of the labeling platform and labeling tool is to reduce performance bias between the wide variety of labeling platforms and tools, as well as human error by labeling personnel. In addition, the basic functions of the labeling platform and labeling tool may also include managing the labeling records of each annotator and tracing labels back to the corresponding annotator, providing relevant user manuals and guidelines, and taking into account the ability to use cloud services (PaaS, SaaS).
Based on any of the above embodiments, fig. 3 is a schematic flow chart of the data labeling provided by the present invention, and as shown in fig. 3, the data labeling can be divided into three stages of labeling preparation, labeling execution and label output.
In the annotation preparation stage, an annotation specification needs to be created, and the annotation specification includes a description, a label example and a label annotation. Wherein the description is used to clarify the tag definition, specify the tag component, tag type, and all operations used in the tag tool or platform. The label examples are used to demonstrate the correct label method or result by way of example. Such label examples should cover special or difficult situations. Tag notes are used to list errors that should be avoided, details in the tag method, and other processing methods. Errors can be classified into different types, for example:
errors where an object that should be labeled is not detected during inspection (missed detection);
errors where an object that should be excluded is labeled (over-detection);
errors where the label range falls short of or exceeds the object during labeling (bad label range).
In addition, the labeling specification should also include defining the relevant roles and their responsibilities, expected lead time, accuracy requirements, data security policy requirements.
During the annotation preparation phase, the participant roles and annotation tools need to be determined as well. Here, roles that participate in data annotation include administrator, annotator, and inspector. The selection of the labeling tool or platform is typically performed prior to machine learning model training and should meet data security requirements. The labeling tool or platform should support the basic functionality of labeling to reduce performance bias between the various platforms and tools and human error by the labeling personnel. These functions also include managing the records and traceability of each annotator, providing related user manuals and guidelines, and accounting for the use of cloud services.
In the annotation execution stage, annotation task establishment, annotation task allocation and annotation process control are needed to be sequentially carried out, wherein for different machine learning tasks and model training requirements, the data annotation tasks can comprise classification, identification and segmentation. Before starting the data annotation, main tasks of the data annotation should be defined explicitly, including scene description, machine learning model and application mode thereof, data information and the like. The goal of the build and manage data annotation task is to ensure the performance of the machine learning model that is generated.
The data annotation tasks can then be assigned to different teams according to different organization methods. In general, the method of organizing data annotations is selected depending on the complexity and size of the training data, and the degree of understanding of the machine learning task by the team designated to conduct the data annotations. Different organization methods can be used for different data marking tasks, and specific organization methods are shown in table 1, and are not described herein.
The labeling process control can be divided into two steps: initial data annotation and data quality inspection. In the initial data labeling step, the labeling personnel should label the data strictly according to the labeling specification. The annotator should submit a quality report after completing the evaluation of its annotation result. In the data quality checking step, the inspector can check the quality of the labeling data. If the annotation data is found to be unsatisfactory, the inspector can return the batch of annotation data for revision.
In the label output stage, quality inspection and revision of the labeling result need to be carried out in sequence. The quality inspection of the labeling results aims to ensure that the data labeling results are valuable and meet the specific aim of machine learning training. The accuracy, precision, integrity, and consistency of the labels should be considered in the inspection. The labeling result quality check may comprise a label check.
The quality inspection of the labeling result can be a complete inspection or an inspection of a sample of the results. Peer and manager reviews may be included: peer review by multiple inspectors may be performed first, after which a sample of the results may be reviewed by a manager.
The method of quality inspection of the labeling result is determined in correspondence with the organization methods in table 1:
For internal labeling, label quality is generally assured because the data is labeled by internal professional labeling personnel. Thus, quality checks are typically performed by a manager through sampling checks. The sampling rate is determined based on the volume and quality requirements of the labeled data, and is generally 5%-10%.
For outsourced labeling, the data is outsourced to an external entity for labeling, so quality control over the labeling process itself is difficult. Therefore, the data inspection method before acceptance is relatively strict. There are generally three inspection methods:
-random sampling check: the inspector randomly extracts a certain proportion of the labeled data and then checks it. The sampling rate may refer to the AQL table defined in ISO 2859-1:1999.
-stratified sampling inspection: the inspector randomly extracts a proportion of the labeled data from each of the annotators. The sampling scale should also refer to the AQL table.
-comprehensive inspection: for tasks where the data quality requirements are very high, the inspector should examine all of the annotated data.
For crowd-sourced annotations, since the annotators are external part-time personnel, quality inspection of the crowd-sourced annotations should also be relatively strict in order to guarantee the quality of the data annotations. Typically, the inspector should fully examine the annotation data, and the manager should examine a proportion of the annotation data returned by the inspector, where the proportion may also refer to the AQL table.
Whichever inspection method is employed, the inspection criteria should be kept consistent. The inspection should be performed strictly in accordance with the established labeling specifications and requirements. The quality check index may measure the accuracy of the tag, the tag format, the data integrity, etc.
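As an illustration of the sampling-based inspections described above, the following is a minimal sketch, in Python, of how labeled data might be drawn for a random sampling check or a stratified (per-annotator) sampling check. The item structure, the default sampling rate, and the helper names are assumptions made for illustration; the actual sampling rate should follow the labeling specification or the applicable AQL table.

import random
from collections import defaultdict

def random_sampling_check(labeled_items, sampling_rate=0.05, seed=0):
    # Randomly draw a fraction of the labeled items for quality inspection.
    rng = random.Random(seed)
    k = min(len(labeled_items), max(1, int(len(labeled_items) * sampling_rate)))
    return rng.sample(labeled_items, k)

def stratified_sampling_check(labeled_items, sampling_rate=0.05, seed=0):
    # Draw the same fraction from each annotator's output (stratified sampling).
    rng = random.Random(seed)
    by_annotator = defaultdict(list)
    for item in labeled_items:
        by_annotator[item["annotator"]].append(item)
    sample = []
    for items in by_annotator.values():
        k = max(1, int(len(items) * sampling_rate))
        sample.extend(rng.sample(items, k))
    return sample

# Example item: {"id": 3, "annotator": "A", "label": "cat"}

Either function returns the subset that the inspector (or manager) then reviews against the labeling specification.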
Due to quality problems, the labeling data provided by the labeling personnel may need to be modified, i.e., the labeling results are revised. There are many ways in which a labeling error in a dataset may be detected; a general approach is cross-validation. In order to manage changes in the data, both the original data and the revised data should be stored while the label data is revised. In addition, the time stamp and the name of the person who made each revision should be recorded.
Based on any of the above embodiments, S14 further comprises:
and receiving an acquisition and test result returned by the test end.
Specifically, the inspection end is used not only for evaluating the labeling quality of the fourth data but also for evaluating the acquisition quality of the fourth data, generating an acquisition inspection result and returning it. The acquisition quality evaluation is used for evaluating the quality of the part of the fourth data obtained through data acquisition, and its evaluation content is related to the data attributes of the fourth data. For example, when the fourth data comprises an image, the acquisition quality evaluation can evaluate the resolution, sharpness, lighting, color, and the like of the image; when the fourth data comprises audio, the acquisition quality evaluation can evaluate the background noise contained in the audio.
Accordingly, S15 includes:
and under the condition that the labeling inspection result is qualified in labeling and the acquisition inspection result is qualified in acquisition, performing at least one of data filtering, data enhancement and labeling modification on the third data.
Specifically, for data processing for supervised learning, the inspection end performs both labeling quality evaluation and acquisition quality evaluation, and accordingly, labeling inspection results and acquisition inspection results returned by the inspection end can be obtained.
It can be understood that only if the labeling inspection result and the acquisition inspection result are both qualified is the third data confirmed to be compliant; further data optimization can then be performed on the third data to obtain the target data.
Based on any of the above embodiments, the acquisition test result is determined based on at least one of the speech intelligibility and background noise ratio of audio in the fourth data, and the resolution, sharpness, lighting, and color of an image in the fourth data.
For example, under the supervised learning task of speech recognition, the speech intelligibility of the audio in the fourth data may directly affect the accuracy of speech recognition, so that the speech intelligibility may be used as a key data attribute for the acquisition quality evaluation, and if the speech intelligibility of the audio in the fourth data sent to the inspection end is lower than a preset threshold, an unqualified acquisition inspection result is generated by the inspection end.
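A minimal sketch of how such an acquisition check might be expressed is given below; it assumes that each sampled item already carries pre-computed attributes such as speech intelligibility, background-noise ratio, or image resolution, and the threshold values are illustrative assumptions rather than values defined by this embodiment.

def acquisition_check(sample, min_intelligibility=0.8, max_noise_ratio=0.2,
                      min_resolution=(224, 224)):
    # Return "qualified" or "unqualified" for one sampled item.
    if sample.get("type") == "audio":
        ok = (sample["intelligibility"] >= min_intelligibility
              and sample["noise_ratio"] <= max_noise_ratio)
    elif sample.get("type") == "image":
        width, height = sample["resolution"]
        ok = width >= min_resolution[0] and height >= min_resolution[1]
    else:
        ok = True  # other modalities are checked by their own criteria
    return "qualified" if ok else "unqualified"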
Based on any of the above embodiments, in the case where the first data comprises audio, the data enhancement comprises at least one of noise superposition, speech rate disturbance, and reverberation;
in the case where the first data comprises an image, the data enhancement comprises at least one of scaling, cropping, flipping, rotation;
where the first data comprises text, the data enhancements include at least one of entity substitution, reverse translation, synonym substitution, random insertion, random exchange, random deletion, scrambling sentence order, generating sentences using a generative model.
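By way of illustration, a minimal sketch of some of these data enhancement operations is given below; it uses plain Python lists for audio and images and a user-supplied synonym table for text, all of which are simplifying assumptions rather than the specific enhancement implementation of this embodiment.

def augment_audio(waveform, noise, noise_scale=0.05):
    # Noise superposition: mix a scaled noise signal into the waveform.
    return [s + noise_scale * n for s, n in zip(waveform, noise)]

def augment_image(image):
    # Flipping: horizontally flip an image given as a list of pixel rows.
    return [list(reversed(row)) for row in image]

def augment_text(sentence, synonyms):
    # Synonym substitution: replace words that have a known synonym.
    return " ".join(synonyms.get(word, word) for word in sentence.split())

# Example: augment_text("the car is fast", {"fast": "quick"}) returns "the car is quick"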
Based on any of the above embodiments, fig. 4 is a second schematic flow chart of the data processing method according to the present invention. In the flow chart of the data processing method shown in fig. 4, the solid arrows are used to indicate the data flow direction, i.e. the data flow, in the data processing flow, and the dotted arrows are used to represent the flow direction of quality control, i.e. the quality flow, in the data processing flow.
The data processing flow can be divided into a data preparation stage, a data processing stage, a data quality evaluation stage, and a data optimization stage. The data preparation stage is used for planning data quality; the data processing stage is used for controlling data quality and covers data preprocessing and data labeling; the data quality evaluation stage is used for guaranteeing data quality through the corresponding quality evaluations, namely acquisition quality evaluation and labeling quality evaluation; and the data optimization stage is used for improving data quality through the corresponding data optimization. The data preparation stage starts after the various sources of data are determined, and after the data optimization stage is completed, that is, after the data processing flow is completed, the data obtained by the data processing can be used for training or evaluation of the machine learning model.
Based on any of the above embodiments, in the case where the data application type includes supervised learning, the first step in establishing the data processing flow is to learn about the problem to be solved. Both the machine learning task and the model should be defined based on the problem to be solved. Then, in order to collect training data, a scene of data collection should be determined according to the application field of the problem. It should be ensured that the collected data meets the requirements of the machine learning model. After determining the data acquisition scenario, the data processing flow may begin.
Taking the face recognition task as an example, in order to solve the problem that face recognition fails when a person wears various kinds of face occlusions, the scene of data acquisition can be set as scenes in which a person wears a mask, a hat, glasses, earrings, or other objects that may occlude the face.
Corresponding to the data preparation stage, images of different people wearing different face occlusions can be collected as the first data;
in the data processing stage, the acquired images can be combined according to the person or the type of face occlusion, and a preprocessing mode such as scaling can be applied to obtain the second data. The second data can then be sent to the labeling end, and the labeling end labels the person identifier corresponding to each image in the second data, thereby obtaining the third data.
In the data quality evaluation stage, the third data can be sampled, and the sampled fourth data is sent to the inspection end. The inspection end evaluates the image definition, face shooting angle, lighting, and the like in the fourth data, evaluates the correctness of the person identifiers labeled in the fourth data, and returns an acquisition inspection result and a labeling inspection result;
in the data optimization stage, if the acquisition inspection result and the labeling inspection result are both qualified, the third data can be expanded in scale through data enhancement, and the incorrectly labeled person identifiers in the third data can be corrected through label modification, so that the optimized data is used as the finally obtained data and applied to training of the face recognition model.
Based on any of the above embodiments, fig. 5 is a third flow chart of the data processing method provided by the present invention. As shown in fig. 5, the data processed by the method can be applied to a training link of subsequent unsupervised learning, where unsupervised learning (unsupervised machine learning) is a machine learning task that performs inference from data composed of input data without labels. The most common unsupervised learning methods include cluster analysis, principal component analysis, and the like. Unlike supervised learning, the training data for unsupervised learning includes training examples that contain only input objects and no expected output values, i.e., the training data for unsupervised learning is unlabeled. For example, unsupervised learning for speaker separation may be performed on audio containing interleaved speech from multiple speakers, and user clustering may be performed on the shopping records of a large number of users.
The method comprises the following steps:
s21, acquiring first data, wherein the first data comprises at least one of audio, images and texts;
s22, preprocessing the first data to obtain second data, wherein the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling;
s23, sampling the second data to obtain third data;
s24, the third data are sent to the checking end, and the acquisition checking result returned by the checking end is received;
s25, under the condition that the acquisition and test result is that the acquisition is qualified, carrying out at least one of data filtering and data enhancement on the second data to obtain target data;
and if the acquisition test result is that the acquisition is not qualified, repeating S21-S24.
Specifically, since the training data required for the unsupervised learning is unlabeled, the data processing flow of the unsupervised learning does not need to perform the related operation of the data annotation, compared to the data processing flow of the supervised learning. The data processing flow of the unsupervised learning also includes a data preparation stage, a data processing stage, a data quality evaluation stage, and a data optimization stage, i.e., the data processing flow applied to the unsupervised learning does not need to execute S13, as compared with the data processing flow applied to the supervised learning including S11-S16.
It is understood that S21 and S22 are identical to S11 and S12 in the above embodiments, and are not described herein again. In S23, the second data is directly sampled to obtain the third data, which does not carry labeling results. Since the third data does not carry labeling results, after the third data is sent to the inspection end, the inspection end does not perform labeling quality evaluation on the third data, but only performs acquisition quality evaluation on it and returns only the acquisition inspection result. Therefore, when the acquisition inspection result is determined to be qualified, data optimization can be performed on the second data to obtain target data that can be used for unsupervised learning; when the acquisition inspection result is determined to be unqualified, the flow must return to S21 to perform acquisition of the first data and the subsequent preprocessing again, so that qualified acquisition quality can be ensured.
Note that, since the data of the unsupervised learning is not required to be labeled, the label modification is not required to be performed in the data optimization stage, that is, in S25.
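A minimal sketch of the S21-S25 control loop described above is given below; the helper callables (acquire, preprocess, sample, inspect, optimize) are hypothetical placeholders for the operations defined in this embodiment, and the retry limit is an assumption for illustration.

def unsupervised_data_pipeline(acquire, preprocess, sample, inspect, optimize,
                               max_rounds=3):
    # Repeat acquisition and preprocessing until the acquisition check passes.
    for _ in range(max_rounds):
        first_data = acquire()                # S21: acquire first data
        second_data = preprocess(first_data)  # S22: preprocess to obtain second data
        third_data = sample(second_data)      # S23: sample to obtain third data
        result = inspect(third_data)          # S24: acquisition check only
        if result == "qualified":
            return optimize(second_data)      # S25: data filtering / enhancement
    raise RuntimeError("acquisition quality was not reached within max_rounds")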
The method provided by the embodiment of the invention executes the data processing flow, including preprocessing, sampling, inspection, and data optimization, for data of various types, and performs effective quality control on each link in the data life cycle, thereby improving the effectiveness of data processing and guaranteeing the data quality for unsupervised learning.
Based on any of the above embodiments, when performing user classification by using unsupervised learning, a large amount of user information in text form may be collected as first data, such as account information registered by the user, purchase records of the user account, and browse records of the user account, during the data preparation stage;
in the data processing stage, the collected information can be subjected to grouping preprocessing according to the user account, so that second data can be obtained.
In the data quality evaluation stage, the second data can be sampled to obtain third data, the third data is sent to the inspection end, the inspection end evaluates the acquisition path of the user information and the acquisition time of the user information in the third data, and an acquisition inspection result is generated and returned;
in the data optimization stage, when the received acquisition inspection result is qualified, the second data can be expanded in scale through data enhancement, and data that do not meet the requirements can be filtered out through data filtering, so that the optimized data can be applied, as the finally obtained data, to the training of an unsupervised model for user classification.
Based on any of the above embodiments, the data processing methods including S11 to S16 and the target data generated by the data processing methods including S21 to S25 in the above embodiments may be applied to the training link of semi-supervised learning. Here, semi-supervised learning (semi-supervised machine learning) is a machine learning task in which supervised learning is combined with unsupervised learning. Semi-supervised learning uses a large amount of unlabeled data, and simultaneously uses the labeled data to perform pattern recognition tasks. That is, the training data of the semi-supervised learning includes training data of the supervised learning and training data of the unsupervised learning. For example, semi-supervised learning of authors for distinguishing handwriting images may be performed by a part of handwriting images with authors annotated and a part of handwriting images without authors annotated.
It will be appreciated that the labeled data required for semi-supervised learning may be obtained by the data processing method including S11-S16 in the above embodiment, and the unlabeled data required for semi-supervised learning may be obtained by the data processing method including S21-S25 in the above embodiment.
Based on any of the above embodiments, training data for supervised learning requires labeling, which can be a costly process when processing large amounts of training data. The disadvantage of unsupervised learning is its limited range of application. Semi-supervised learning does not suffer from these drawbacks. The data required for semi-supervised learning includes data required for supervised learning and data required for unsupervised learning; that is, the data required for semi-supervised learning can be divided into two parts, one part being unlabeled data and the other part being labeled data, and semi-supervised learning generally requires a small amount of labeled data and a large amount of unlabeled data. The data plan for semi-supervised learning should include annotated data and unannotated data. The data requirement process of semi-supervised learning should include determining the proportions and characteristics of the annotated and unannotated data and, in addition to the standard process for the annotated data, should pay attention to the quality of the annotated data, the label balance between different categories, and the distribution of the unannotated data.
In addition, in the process of data planning, attention should be paid to planning the data acquisition sources, planning the inspection of annotated data quality, and planning the annotated and unannotated data.
Therefore, compared with the data processing flows of supervised learning and unsupervised learning, the data processing flow of semi-supervised learning needs to include the operations performed in both. The data processing flow of semi-supervised learning likewise comprises a data preparation stage, a data processing stage, a data quality evaluation stage, and a data optimization stage.
In addition, under semi-supervised learning, it is very important to control the amount of unlabeled data; if the amount of unlabeled data is too large, it is necessary to sample the unlabeled data and select a more representative subset of the unlabeled samples, as illustrated by the sketch below.
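One possible notion of a representative subset is a diverse subset in feature space; the following sketch uses greedy farthest-point selection over plain Python feature vectors as an illustrative assumption, not as the selection criterion mandated by this embodiment.

import random

def select_representative_unlabeled(features, budget, seed=0):
    # Greedily pick a diverse subset of unlabeled samples by farthest-point selection.
    rng = random.Random(seed)
    selected = [rng.randrange(len(features))]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    while len(selected) < min(budget, len(features)):
        best_index, best_distance = None, -1.0
        for i, feature in enumerate(features):
            if i in selected:
                continue
            d = min(squared_distance(feature, features[j]) for j in selected)
            if d > best_distance:
                best_index, best_distance = i, d
        selected.append(best_index)
    return selected  # indices of the retained unlabeled samples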
Based on any of the above embodiments, when the semi-supervised learning is applied to perform language classification, audio of each language may be collected as the first data.
On this basis, the first data can be processed. Specifically, the audio of each language can be converted to a unified audio format, audio whose duration exceeds a cut-off length can be screened out, and the noise in the audio of each language in the first data can be reduced; the first data is then divided into two parts of audio, yielding second data divided into two parts. For the second data of one part of the audio, the specific language to which that part of the audio belongs is manually labeled; for the second data of the other part of the audio, whether pieces of audio belong to the same language or to different languages can only be classified by whether the audio was intercepted from the same long audio or whether the acquisition sources are the same. The third data is thereby obtained.
After the third data is obtained, the third data can be sampled to obtain fourth data, and the fourth data is sent to the inspection end. The inspection end evaluates the speech clarity of all the audio in the fourth data, i.e., evaluates the acquisition quality, so as to obtain an acquisition inspection result; in addition, labeling quality evaluation can be performed on the part of the audio in the fourth data labeled with a specific language, i.e., judging whether the specific language labeled for that part of the audio is accurate, so as to obtain a labeling inspection result.
Under the condition that the acquisition inspection result and the labeling inspection result are both qualified, data optimization can be performed on the third data; for example, the parts whose language labels the labeling inspection result indicates to be inaccurate can be adjusted through label modification.
After the execution of the flow is completed, the third data after data optimization can be applied to semi-supervised learning of language classification.
Based on any of the above embodiments, fig. 6 is a schematic flow chart of the data processing method provided by the present invention, as shown in fig. 6, the data processed by the method may be applied to a training link of subsequent reinforcement learning, where reinforcement learning (reinforcement machine learning) is used to describe and solve a problem that an agent (agent) achieves maximum return or a specific objective through a learning strategy in an interaction process with an environment, and the reinforcement learning task generally does not need to acquire data in advance, but dynamically generates data for training through continuous interaction between the agent and the environment in the training process. For example, the behavior of straight running, steering and the like in the automatic driving scene can be set, and rewards can be set on the principle of reaching the destination in the shortest time, so that the reinforcement learning of the automatic driving scene is realized.
Reinforcement learning is a training approach for describing and solving the problem of an agent maximizing its return or achieving a specific goal through a learned strategy during interaction with an environment. The data used for training is dynamically generated through continuous interaction between the agent and the environment during training, and the data obtained in this way includes the state changes of the environment, the set of actions taken by the agent, and the corresponding reward values, so that no additional manual labeling is needed.
The data acquired in the reinforcement learning process is considered to be real data generated by interaction between the agent and the environment, so that no additional data quality evaluation is needed. Moreover, since the data acquired by the reinforcement learning task generally does not need manual labeling, data optimization may not be needed either. Correspondingly, the data processing flow corresponding to reinforcement learning can include only a data preparation stage and a data processing stage, without a data quality evaluation stage or a data optimization stage; alternatively, the data processing flow corresponding to reinforcement learning may include a data preparation stage, a data processing stage, and a data optimization stage, excluding the data quality evaluation stage.
The data requirements for reinforcement learning should include: the learning object of the reinforcement learning, the amount of data to be generated, and the data generation rules. For the data planning of reinforcement learning, it is important to plan the environment that generates the data, such as the rules, the agents, the features of the environment, and the reward function. In addition, once reinforcement learning is being performed, there is no separate data planning process, because the data is generated through the learning process itself.
The method comprises the following steps:
s31, acquiring first data, wherein the first data comprise behaviors of an agent, input data formats, data acquisition modes and evaluation rules.
Specifically, the data preparation of reinforcement learning may include four operations, which are respectively the behavior of the agent, the input data format, the data acquisition mode, and the design of the evaluation rule, and through the operations, the first data covering various data required for the execution of the reinforcement learning task can be obtained. The above-mentioned agent's behavior, input data format, data acquisition means, and evaluation rules may differ when directed to different reinforcement learning tasks.
The behavior of the agent, i.e. the interaction behavior between the agent and the environment, may be designed to build a behavior set into the first data. For example, where the reinforcement learning task is autopilot, the set of behaviors of the agent, i.e., the autopilot, may include acceleration, deceleration, braking, steering, and other vehicle control behaviors.
The input data format is for the input data of the agent, and the input data format may be put into the first data. The input of agents may be different for different tasks, as may the format of the input data. For example, for an autonomous car, the input is a time-varying signal acquired by a sensor on the car, such as a camera image signal, a lidar signal, and the like.
The data acquisition mode is designed for sequentially acquired data, and specifically may relate to the length of a sequence formed by sequentially acquired data and the occurrence times of various events in the sequence, and the data acquisition mode may be set in the first data. For example, the data acquisition mode may include that the automatic driving automobile needs the longest acquisition time.
The evaluation rules are used to describe how an agent is evaluated for taking a series of actions and to determine the rewards received from that set of actions, and the evaluation rules may be placed into the first data. For example, for an autonomous car, the agent may receive rewards based on whether it arrives safely, the amount of time used, whether the selected route is optimal, and whether an accident has occurred.
Since the first data obtained through data preparation may not be formatted according to a preset rule, after the first data is obtained, each part contained in the first data (the behavior of the agent, the input data format, the data acquisition mode, and the evaluation rule) may be subjected to format conversion, thereby obtaining first data in a regular and readable format.
S32, inputting the behaviors and the second data in the input data format to the model corresponding to the intelligent agent, and obtaining feedback data corresponding to the behaviors output by the model corresponding to the intelligent agent;
S33, determining reward data corresponding to the behavior based on the feedback data and the evaluation rule;
and S34, determining target data based on the behavior, the second data, the feedback data and the rewards data.
Specifically, in the data processing stage of reinforcement learning, according to the behavior, the input data format, the data acquisition mode and the evaluation rule of the intelligent agent, which are determined in the data preparation stage, the intelligent agent is controlled to continuously interact with the environment, and the complete interaction process is recorded, so that target data, namely, data for reinforcement learning, is formed.
In this process, the behavior currently input into the model corresponding to the agent may be selected, as the interaction, from the behaviors contained in the first data, and the surrounding environment data observed when the agent executes the behavior, namely the second data conforming to the input data format, may be determined based on the input data format stored in the first data. Accordingly, the feedback data output by the model corresponding to the agent can be obtained. On this basis, by applying the evaluation rule stored in the first data, the reward data corresponding to the behavior can be determined based on the feedback data, and the target data is thus formed.
In the target data, behaviors executed by the intelligent agent in the interaction process are recorded, and feedback data received for each behavior, for example, in an automatic driving automobile scene, feedback is the change of road environment perceived by a sensor; in addition, rewards data for each of the activities are also recorded. It should be noted that there may be a large difference in the total rewards that an agent obtains from the environment for different types of reinforcement learning tasks.
The data recorded in the interaction process can be stored as target data in a sequence form for subsequent model training, and each interaction process in the sequence needs to contain three elements of behavior, feedback and rewarding.
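A minimal sketch of recording such interaction sequences is given below; the environment step function, the behavior-selection policy, and the evaluation rule are hypothetical placeholders for the corresponding parts of the first data in this embodiment.

from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Transition:
    behavior: Any   # action taken by the agent
    feedback: Any   # feedback data returned for the behavior
    reward: float   # reward data computed from the evaluation rule

@dataclass
class Episode:
    transitions: List[Transition] = field(default_factory=list)

def collect_episode(env_step, policy, evaluate, initial_observation, max_steps=100):
    # Interact with the environment and record (behavior, feedback, reward) triples.
    episode = Episode()
    observation = initial_observation
    for _ in range(max_steps):
        behavior = policy(observation)
        feedback, done = env_step(behavior)
        reward = evaluate(behavior, feedback)
        episode.transitions.append(Transition(behavior, feedback, reward))
        observation = feedback
        if done:
            break
    return episode  # one interaction sequence usable as target data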
In addition, in the above process, the recorded data may be subjected to data preprocessing, where the operation of data preprocessing is consistent with the data preprocessing operation in the above embodiment, which is not described herein.
The method provided by the embodiment of the invention has the advantages that the effective quality control is carried out on each link in the data life cycle, the effectiveness of data processing is improved, and the data quality of reinforcement learning is ensured.
Based on any of the above embodiments, after obtaining the target data for reinforcement learning, further comprising:
And carrying out data correction on the target data for reinforcement learning.
Here, it can be understood that data correction is performed on the data for reinforcement learning in the data optimization stage. Under reinforcement learning, for some specific tasks, the agent may take unreasonable actions, which may lead to accidents and undesirable results. Therefore, data corrections are required; for example, the behavior of the agent can be corrected through manual intervention to prevent such incidents.
During the interaction with the environment, the agent needs to continually explore new behaviors in order to find the best decision strategy for these behaviors. Thus, for each step of interaction, the agent needs to explore some behavior randomly, in addition to the learned strategy. In addition, as agents continue to iterate, agents' behavior decision strategies need to be updated regularly.
The data resulting from the data processing may also be applied to data analysis based on any of the embodiments described above. Data analysis refers to systematic computational analysis of data to discover, interpret, and communicate meaningful information in the data. The data used for data analysis needs to meet various requirements of data analysis, where the requirements may depend on the object, purpose and analysis mode of data analysis, and the embodiment of the present invention is not limited in particular. For example, the user representation may be performed by listening to song data of the user, so that song recommendation is performed based on the user representation.
Based on any of the above embodiments, the data processing flow for data analysis also includes a data preparation phase, a data processing phase, a data quality evaluation phase, and a data optimization phase. The data processing flow for data analysis may be used to discover useful information, provide conclusions, and support decision making.
The data requirements under data analysis include at least the data source and business scenario used for analysis, and the method of deleting personal identification information. And, the data planning under the data analysis should also pay attention to the planning and deleting of the personal identification information. Data collection under data analysis should define data analysis task targets to determine the source of the data collection and the attributes that need to be recorded. The data collection may include data loading and data storage. Data of the business scenario activities should be collected, the collection process should be standardized, and the data should be properly stored for subsequent analysis. The data processing method for data analysis comprises the following steps:
based on the data analysis corresponding scenes, loading and storing the acquired data to obtain fifth data;
performing at least one of data cleaning, data conversion and data aggregation on the fifth data to obtain sixth data;
Performing source quality evaluation and storage quality evaluation on the sixth data and/or performing data enhancement and data mining to obtain data for data analysis;
in particular, the operations of the data preparation phase for data analysis may include data loading and data storage, and the data after completing the data loading and data storage is denoted as fifth data.
Wherein the type and amount of data loaded may differ greatly for different data analysis scenarios. Therefore, when preparing multi-modal data, it is necessary to consider in advance which types of data may be involved and whether the accuracy of image data in different formats may affect the subsequent data processing. For example, for recommending similar questions, only the text may be required, the text containing a question may be required, or the figure corresponding to the question may also be required.
Also, for data loading, the data analysis target, i.e. the scenario, needs to be determined; data loading is thus a basis for ensuring data analysis quality. The data loading mode can be event-tracking ("buried point") collection through the Internet, or data collection through hardware such as a camera and a microphone. The data loading mode should be normalized to reduce differences among different batches of data and improve the reusability of the data.
The storage file format also needs to be planned in advance for different data types; for example, whether audio data is stored per frame or per phoneme can affect the data analysis effect. In addition, when the data amount is excessively large, various file compression methods that do not lose data quality need to be considered.
The operations of the data processing stage for data analysis may include at least one of data cleansing, data conversion and data aggregation, whereby normalization of the fifth data is achieved, resulting in sixth data.
In this process, there may be an error or missing value of the fifth data acquired due to the difference in data source and the difference in data quality level. In this case, the data cleansing may check the consistency of the fifth data according to a unified standard format, process invalid data and lost data during storage, and correct identifiable errors in the data storage file. And under the condition of large data volume, sampling and cleaning are carried out, so that the requirement of data analysis can be better met. For example, text data may be stripped of nonsensical stop words based on analysis requirements, and image data may be cropped and stored in a standard size and format.
As the amount of data increases, the type of data acquired may vary greatly. As the variety of data increases, the original storage structure and storage size may become unable to meet the requirements of various tasks. To cope with more data analysis scenarios, it is necessary to convert the data into a unified data storage form. For example, for acquired image data of different resolutions, a scale transformation may be applied to change all images to a uniform scale. Correctly converting data is critical to the performance of data analysis. For example, if the analysis model cannot directly process the string "dog" or "cat", the string needs to be converted into values 0 and 1 in order to perform the numerical calculation.
When the amount of data acquired becomes very large, the data may be grouped prior to data analysis. Data aggregation incorporates strongly correlated data that can be dispersed among different data sets to obtain a more complete data description. Data packets are a form of data aggregation that may divide raw data into different groups based on certain characteristics to meet analysis requirements. The main purpose of the data packet is to observe the distribution characteristics of the data. After data grouping, the frequency distribution table is plotted by calculating the frequency of each set of data to help observe the underlying distribution of data.
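The following is a minimal sketch of the data conversion and data grouping steps described above, using plain Python; the bin boundaries and category mapping are illustrative assumptions.

from collections import Counter

def encode_categories(values, mapping=None):
    # Data conversion: turn string categories such as "dog"/"cat" into numeric codes.
    if mapping is None:
        mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def frequency_table(values, bins):
    # Data grouping: place numeric values into (low, high] bins and count each group.
    table = Counter()
    for v in values:
        for low, high in bins:
            if low < v <= high:
                table[(low, high)] += 1
                break
    return dict(table)

# Example: encode_categories(["dog", "cat", "dog"]) returns ([1, 0, 1], {"cat": 0, "dog": 1})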
Operations of the data quality assessment phase for data analysis may include source quality assessment and storage quality assessment.
In the data quality evaluation, the source quality evaluation is used for investigation and evaluation of acquisition of sixth data, and the storage quality evaluation is used for investigation and evaluation of sixth data storage, so that a real-time monitoring and evaluation system is established. In this process, once any data quality problems are identified, they can be improved over time to reduce the risk of data abuse.
In order to evaluate the data quality, an appropriate evaluation index should be selected. The data quality assessment indicators may include data integrity, accuracy, validity, and timeliness. The data visualization may be used to examine data in a graphical format to obtain additional insight regarding the data.
Further, operations of the data optimization phase for data analysis may include data enhancement and data mining.
The data enhancement aims at increasing the total data amount by introducing new annotation data into the existing annotation data, and the data enhancement under the data analysis is consistent with the data enhancement operations under the supervised learning, the unsupervised learning and the semi-supervised learning in the above embodiments, and is not repeated here.
Data mining can discover potential knowledge and reveal previously unknown data patterns. Data mining may include correlation analysis, time series analysis, and cluster analysis. Moreover, the data quality can also be verified by its implementation in data mining and whether conclusions can be interpreted in the traffic scenario. Continuously correcting the direction of data mining also helps to improve data quality. The results of the data mining may be displayed by a data visualization, which may help evaluate the quality of the data. Thus, the visual display of the data also helps to identify problems within the data and provides guidance for improving the data quality process.
Wherein correlation analysis is used to determine the correlation between different events, i.e., whether, when one event occurs, another event also occurs frequently; for example, correlation analysis of multiple parameters from a climate dataset. Time series analysis determines the change of acquired data over time through a series of data point records with consistent time intervals, for example, data-driven analysis of time series from a cyber-physical system. Cluster analysis is a process of grouping data items into different clusters such that items in each cluster are more similar to each other than to items in different clusters, for example, clustering documents into documents under various topics.
In the case where the data application type includes data analysis, both quality evaluation and data optimization may be performed for the sixth data, and in the case where the quality evaluation and data optimization are sequentially performed, the sixth data after the quality evaluation may be the processing target of the data optimization or the sixth data after the data optimization may be the processing target of the quality evaluation, which is not particularly limited in the embodiment of the present invention.
Based on any of the above embodiments, when the correlation analysis between each index of the device and the device fault is performed by applying the data analysis, index values of each index in the normal state and the fault state of the device may be collected to construct initial data. After the initial data is obtained, the initial data may be loaded and stored in a preset form, thereby obtaining fifth data conforming to a preset format.
After the fifth data is obtained, the values of the indexes in the fifth data can be subjected to data cleaning to filter out outlier values that, due to acquisition errors, do not accord with the actual situation; in addition, data aggregation can be performed to integrate values of the same index obtained through different acquisition channels, so that sixth data reflecting the values of the indexes of the equipment in the normal state and the abnormal state can be obtained.
Then, the source quality evaluation can be carried out on the sixth data so as to ensure that the value sources of all indexes in the sixth data are reliable; in addition, the storage quality evaluation can be performed on the sixth data so as to ensure that the values of all indexes in the sixth data are not tampered.
After the above operation is completed, correlation analysis between various indexes of the apparatus and the apparatus failure can be performed based on the sixth data.
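As one possible illustration of this correlation analysis, the sketch below computes the Pearson correlation between each device index and a binary fault flag; the index names and values are hypothetical.

import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length numeric sequences.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0

def index_fault_correlation(index_values, fault_flags):
    # Correlate each index with the fault flag (1 = fault state, 0 = normal state).
    return {name: pearson(values, fault_flags)
            for name, values in index_values.items()}

# Example: index_fault_correlation({"temperature": [50, 80, 90]}, [0, 1, 1])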
Based on any of the above embodiments, fig. 7 is a schematic diagram of the mapping between the data lifecycle framework and the framework of the data processing flow provided by the present invention. In fig. 7, the portion below the dotted line is the data quality control flow framework (data quality process framework, DQPF), i.e., the data processing flow framework, and the portion above the dotted line is the data lifecycle framework (data life cycle framework, DLCF). The various sources in the data processing flow framework correspond to data collection in the data lifecycle framework; the data preparation stage in the data processing flow framework corresponds to data preparation in the data lifecycle framework; the data processing stage, the data quality evaluation stage, and the data optimization stage in the data processing flow framework correspond to the construction of the model and the evaluation and verification of the model in the data lifecycle framework; and the data usage phase and the machine learning model in the data processing flow framework correspond to system deployment, system operation, and data deletion or archiving.
It should be noted that the above embodiments are applicable to training and evaluation of data from different sources, including data acquisition, data preprocessing, data labeling, evaluation, and data use. Moreover, embodiments of the present invention are not limited to a particular implementation of a service, platform, or tool.
Further, the effectiveness of machine learning models depends to a large extent on the quality of the training data. To improve training data quality, two principles should be followed in the data quality flow:
ensuring the reliability of the training data (i.e., the training data should be as close as possible to the actual situation);
the validity of the training data is ensured by ensuring that the collected data is sufficient to complete the machine learning task.
Any predictions made by the machine learning model for the target phenomenon may be suspicious and misleading if the training data is unreliable or invalid.
Based on any of the above embodiments, fig. 8 is a schematic diagram of a framework of a data processing flow provided in the present invention, as shown in fig. 8, a data processing flow, i.e. a data quality control flow framework DQPF can provide a quality management mechanism to ensure validity targets of a machine learning model.
The results of DQPF are intended to guarantee the quality of data, which is performed under the data lifecycle DLC model according to the standard. The results of the DQPF may include an organized data quality strategy, a data quality implementation plan, or a series of rules that may affect the quality of the data in the DLC model.
As shown in fig. 8, the DQPF includes:
planning data quality: collecting a data quality management method by analyzing data quality requirements and a data life cycle, and making a data quality management plan;
checking data quality: measuring and monitoring data quality in the data lifecycle (DLC) model and providing results for the data quality plan;
data quality optimization: improving training data quality by correcting errors or risks in the data;
and (3) verifying a data quality process: evaluate and verify the data quality plan and provide feedback for improvement.
Further, in the data quality control flow, the data quality plan may implement the following activities and provide the following results:
a) Activity:
-analyzing data quality requirements of stakeholders in the data lifecycle model;
-establishing a data quality model described in ISO/IEC 5259-1;
-quality measurements according to ISO/IEC 5259-2;
reflecting the requirements described in ISO/IEC 5259-3.
b) The achievement is as follows:
-data quality model described in ISO/IEC 5259-1;
-a quality measurement method as described in ISO/IEC 5259-2;
-a data quality management procedure as described in ISO/IEC 5259-3;
the products required in ISO/IEC 5259-3.
The data quality check may perform the following activities and provide the following results:
a) Activity:
-measuring data quality according to ISO/IEC 5259-2;
-implementing a data quality management flow.
b) The achievement is as follows:
-a measurement of data quality;
-risk or error in the data;
-monitoring records of the data quality process.
Data quality optimization may implement the following activities and provide the following results:
a) Activity:
-processing data, improving data quality;
-analyzing risk or error in the data;
-improving data quality by data optimization.
b) The achievement is as follows:
-a data quality improvement method;
-a method of handling or controlling data errors.
The data quality process verification may implement the following activities and provide the following results:
a) Activity:
-validating the data quality management flow;
-validating the data quality model and the measurement method.
b) The achievement is as follows:
-fault reporting of the data quality measurement procedure;
-guidance for improving the data quality management flow;
data quality management procedures approved with stakeholders.
Note that: data quality process verification is validated or carried out by data quality stakeholders (experts) through consultation or organizational experience.
FIG. 9 is a schematic diagram of data quality optimization and data quality process verification provided by the present invention, wherein as shown in FIG. 9, data quality optimization is performed on bad data, so that data quality can be improved in a quality control flow; and (3) verifying the data quality control flow in the quality control flow, namely improving the quality and the credibility of the flow in the quality control flow. The optimized data quality control result obtained by the method is the target 'result' of ISO/IEC 5259-4.
FIG. 10 is a schematic diagram showing the relationship between a data lifecycle framework and a data processing flow, as shown in FIG. 10, in which the data processing flow, i.e., the data quality control flow DQPF, can be used throughout the data lifecycle DLC framework to manage data quality.
Based on any of the above embodiments, the data processing flow is intended to provide guidance and good practice for achieving machine-learned data quality. The data processing flow is based on the DQPF implementation described above, where the particular data processing flow used depends on machine learning tasks, applications, and methods.
The elements involved in the data processing flow may include: data requirements, data planning, data acquisition, data preparation, data provisioning, and data deactivation.
Fig. 11 is a second schematic diagram of a framework of a data processing flow of supervised learning provided by the present invention, as shown in fig. 11, in the DLC of supervised learning, data requirements are developed for the context of machine learning tasks, applications, and methods, and lay a foundation for the data processing flow, i.e., the rest of the data quality control flow. To achieve data quality, the data requirements should include at least the following aspects of determining and recording:
-necessary features in the data;
-the necessary data volume;
-absence of bias;
-statistical properties;
-representativeness of the subjects of the machine learning model in terms of behaviour, demographics and geographic location;
-a data quality model based on the selected data quality features;
-appropriate data quality measures;
-a data quality metric target;
legal requirements.
It should be noted that the selection of the data quality features and the data quality metrics establishes an additional level of detail for the data requirements.
The data planning is established on the basis of data requirements, and the planning and the resources can be ensured to be in place so as to successfully execute the data quality control flow. The data planning process should take into account at least the following elements:
-a data model or data architecture required to fulfill data requirements;
-obtaining a plan of necessary data for which the data requirements determine;
-roles, skills and personnel required to perform a data quality control procedure;
-IT and other resources required to perform a data quality control procedure;
-time and budget required for performing a data quality control procedure;
-making a data acquisition plan based on data requirements;
-performing a plan of data quality measures according to a data quality control model;
-a program meeting legal requirements;
-a plan meeting other data requirements;
-a plan that obeys the data quality flow principle.
The data used to develop the machine learning model may come from different sources (e.g., internet of things systems, transactions, surveys, still images, video, sound, web forms, etc.), have different data types (e.g., digital, text, binary, etc.), data formats (e.g., XML, JSON, separator values, JPEG, MPEG, etc.), and schemas. The organization can acquire data from existing data according to data requirements, and can collect new data. In some cases, the data may come from streaming or near real-time sources (e.g., social media sources, search engines) and may be used to continually refine the artificial intelligence model.
The data acquisition process should take at least the following factors into account:
-adhering to elements determined in the data planning process;
-obeying data quality flow principles;
-key data attributes determined by the data requirement flow, for example: provenance, bias, consistency, reliability, validity period, data type, schema, and format;
data context in machine learning model development, for example: training, verifying, testing, producing and covering;
wherein, for still images and videos, the key data attributes include: resolution, sharpness, brightness, color, background noise.
It will be appreciated that once the data is obtained, the quality of the data should be further assessed.
The goal of data preparation is to bring the data to a state that can be successfully used to develop a machine learning model, and the performance of the model meets the requirements of the organization.
The data preparation should take into account at least the following elements:
-data combination;
-data labeling;
-data annotation;
-data quality assessment in relation to a data quality metric target established in the data requirements;
-data quality improvement, comprising: data cleaning, data standardization, data interpolation, data de-identification and data encoding.
Data annotations are typically represented as metadata of data or annotations of data, and may include, in particular, a series of information that introduces the status and usage of the data. The user, manager, annotator and inspector can all select metadata and annotation data according to business logic needs. Wherein the metadata may include:
-compiling relevant metadata including data source, human resource information for data annotation, date and time of relevant operations and transmissions in the data sharing process;
metadata related to content, including business and technical fields, data formats, data volumes, data category numbers, examples, data characteristics, and statistics related to data distribution, bounding boxes, segments, keypoints, and files;
-quality related metadata comprising data quality measurements as defined in ISO/IEC 5259-2.
Metadata can be used for data introduction, searching, recommendation, tracking, and sharing. Before selecting and using the data, if the user is unfamiliar with the data, the metadata of the data should be checked.
For example, to train a machine learning model for vehicle identification, a user needs to find appropriate data from several available example pictures. This is a common situation where a user has only taken a few pictures of the vehicle in the real world, requiring detailed inspection of each candidate data using limited resources. In this case, the metadata may be used in combination with a feature similarity-based comparison method for data selection for machine learning model training. The comparison of metadata (e.g., business fields, data formats, data characteristics) can help the user quickly determine the most appropriate data.
An important use of metadata is to aggregate the content of each sample in the data. In this case, the metadata may be presented in some form, such as a schema, to describe the meta-information and corresponding tags of the example file.
For example, the schema of the image classification data includes:
-a "filename" field of the string type;
-a "tag" field of integer type;
-a "data" field of byte type.
For a complete definition, metadata may be provided for each field to specify its semantics. In this case, the most basic semantics can be considered, for example (see the sketch after this list):
-the name of the field;
-data type of field (e.g. float 32);
-dimensions of the fields.
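A minimal sketch of how such a schema and its field-level metadata might be recorded is shown below; the concrete keys and values are illustrative assumptions that mirror the image-classification example above.

image_classification_schema = {
    "fields": [
        {"name": "filename", "type": "string", "dimensions": []},
        {"name": "tag",      "type": "int64",  "dimensions": []},
        {"name": "data",     "type": "bytes",  "dimensions": []},
    ],
    "metadata": {
        "source": "example image dataset",   # process-related metadata
        "num_classes": 10,                    # content-related metadata
        "quality_measurements": {},           # quality-related metadata (ISO/IEC 5259-2)
    },
}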
The purpose of the data quality assessment is to determine whether the data meets the data quality requirements. It is necessary to repeat the data quality assessment whenever data is converted.
The elements of the data quality assessment should include at least the following:
-applying data quality features and data quality metrics for the target determined in the data requirements process. Data quality features and data quality metrics are described in ISO/IEC 5259-2;
-a file of evaluation processes and results;
-if the data does not meet the data requirements, the organization should consider the following possible action schemes: improving the data; stopping using the data; alternatively, new data is acquired.
The results of the data quality assessment often indicate that the data does not meet the data quality requirements. In many cases, data may be modified to meet data quality requirements. The data quality improvement should include at least the following processes: cleaning, filtering, normalizing, scaling, interpolating and enhancing.
Wherein information about normalization, scaling and interpolation of data can be found in ISO/IEC 5259-2, clause 6. For information on data encoding see ISO/IEC 5259-2.
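The sketch below illustrates, under simplifying assumptions, a few of the data quality improvement steps named above (cleaning, normalization, interpolation) on plain Python records and sequences; it is not the specific improvement procedure defined by the referenced standards.

def clean(records):
    # Cleaning: drop records whose fields contain missing values.
    return [r for r in records if all(v is not None for v in r.values())]

def normalize(values):
    # Normalization: min-max scale a numeric sequence to [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def interpolate(values):
    # Interpolation: fill isolated missing values with the mean of their neighbours.
    out = list(values)
    for i, v in enumerate(out):
        if (v is None and 0 < i < len(out) - 1
                and out[i - 1] is not None and out[i + 1] is not None):
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out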
Data provision refers to presenting prepared data in a machine-learned form for use. The organization should ensure that the data meets all specified requirements before being supplied. The data provision may include transferring or moving data from one system to another. If such transmission or movement is necessary, the organization should ensure that the quality of the data is maintained.
Under supervised learning, training data is applied to machine learning algorithms to create machine learning models. The production data is then applied to the machine learning model to create the inference.
Under unsupervised learning, while some unsupervised learning methods include training (e.g., the K-means clustering algorithm), models are normally created from production data, from which inferences are then drawn (e.g., data records assigned to a cluster center).
Semi-supervised learning is a mix of supervised and unsupervised learning that will use unlabeled training data in addition to labeled training data.
When the machine learning no longer requires data, it may be deactivated, i.e., data deactivation occurs. If the data is intended to be reused in the future, the organization should ensure that the quality of the data is maintained, including its security and the privacy of any data bodies.
In addition, a full backup should be created and verified before deactivation. A deactivation plan should then be formulated to determine the people involved in the deactivation and their roles and responsibilities. Before the deactivation plan is carried out, it should be reviewed with the relevant stakeholders.
Based on any of the above embodiments, the implementation of the data quality control procedure requires participants with multiple roles, specifically including:
Data planner:
a data planner is an organization or entity that formulates rules or oversees the behavior of data processing, data usage, and data sharing. Representative data planners may include:
- data quality standard formulator: an entity that standardizes the overall data lifecycle process and provides quality metrics and assessment means;
- data legal requirements formulator: an entity that provides the legal requirements governing the operations that may be performed in the data lifecycle. In particular, legal requirements regarding data usage and data sharing may affect data ownership and benefits. A regulator may enforce such legal requirements.
Data collector:
A data collector is an organization or entity that collects data from a set of designated data sources and collates the collected data into an accessible form. Data collection may be performed in accordance with legal agreements.
Data processor:
A data processor is an organization or entity that processes a particular piece of data according to specific requirements, including data quality, data usage, data sharing, and security requirements. The data processor does not necessarily own the data. A data processor may use data processing tools, including hardware and software, and may integrate multiple related tools into one platform.
In the context of machine learning, a data processor may perform data labeling, data cleaning, data normalization, data enhancement, and any other operations required by a machine learning method.
Data provider:
A data provider is an organization or entity that provides data for a particular use or for further processing. The data provider may own the data it provides or may act on behalf of the actual owner of the data it provides.
It should be noted that the data owner and the data provider should be distinguished as two roles, namely ownership of the data versus possession of it. While the data owner has the right to control who can access, use, modify and share particular data, a party that merely possesses the data only has the right to use and share it. The data owner is also responsible for the quality and protection of the data. The data owner may delegate data processing tasks to the data processor.
Data user:
A data user is an organization or entity that uses a particular piece of data for a particular purpose. The data user does not necessarily process the data.
In the machine learning context, the data user may be an organization or entity that trains a machine learning model, tests a machine learning application, or the like.
Based on any of the above embodiments, fig. 12 is a first schematic structural diagram of a data processing apparatus according to the present invention. As shown in fig. 12, the apparatus includes:
a data acquisition unit 1210 for acquiring first data including at least one of audio, image, and text;
a preprocessing unit 1220, configured to preprocess the first data to obtain second data, where the preprocessing includes at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification, and data sampling;
the labeling unit 1230 is configured to send the second data to a labeling end, and receive third data returned by the labeling end, where the third data carries a labeling result;
a sampling unit 1240, configured to sample the third data to obtain fourth data;
a checking unit 1250, configured to send the fourth data to a checking end, and receive a labeling check result returned by the checking end;
an optimizing unit 1260, configured to perform at least one of data filtering, data enhancement, and labeling modification on the third data to obtain target data when the labeling check result is that the labeling is qualified; and to repeatedly execute the labeling unit, the sampling unit and the checking unit when the labeling check result is that the labeling is unqualified.
The data processing device provided by the invention performs data processing flows including preprocessing, labeling, sampling, checking and data optimization on various types of data, performs effective quality control on each link in the data life cycle, improves the effectiveness of data processing, and ensures the data quality of supervised learning.
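The flow carried out by units 1210-1260 can be pictured, purely as a non-authoritative sketch, as the loop below; the callables preprocess, annotate, inspect and optimize stand in for the preprocessing step, the labeling end, the checking end and the optimization step respectively, and their names and signatures are assumptions.

```python
# Illustrative sketch of the supervised data processing loop of figure 12.
import random
from typing import Callable, List, Tuple

Labeled = Tuple[str, str]  # (sample, labeling result)

def run_pipeline(first_data: List[str],
                 preprocess: Callable[[List[str]], List[str]],
                 annotate: Callable[[List[str]], List[Labeled]],
                 inspect: Callable[[List[Labeled]], bool],
                 optimize: Callable[[List[Labeled]], List[Labeled]],
                 sample_size: int = 10,
                 max_rounds: int = 3) -> List[Labeled]:
    second_data = preprocess(first_data)                          # preprocessing unit
    for _ in range(max_rounds):
        third_data = annotate(second_data)                        # labeling unit
        fourth_data = random.sample(
            third_data, min(sample_size, len(third_data)))        # sampling unit
        if inspect(fourth_data):                                  # checking unit
            return optimize(third_data)                           # optimizing unit
        # labeling check failed: repeat labeling, sampling and checking
    raise RuntimeError("labeling did not pass the check within max_rounds")
```

For instance, annotate could forward the second data to a human labeling end and block until labeled results are returned, while inspect could compare the sampled labels against a small gold subset.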
Based on any of the above embodiments, fig. 13 is a second schematic structural diagram of a data processing apparatus according to the present invention. As shown in fig. 13, the apparatus includes:
a data acquisition unit 1310 for acquiring first data including at least one of audio, image, and text;
a preprocessing unit 1320, configured to preprocess the first data to obtain second data, where the preprocessing includes at least one of data combining, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification, and data sampling;
A sampling unit 1330, configured to sample the second data to obtain third data;
a checking unit 1340, configured to send the third data to a checking end, and receive an acquisition check result returned by the checking end;
an optimizing unit 1350, configured to perform at least one of data filtering and data enhancement on the second data to obtain target data when the acquisition check result is that the acquisition is qualified; and to repeatedly execute the data acquisition unit, the preprocessing unit, the sampling unit and the checking unit when the acquisition check result is that the acquisition is unqualified.
The data processing device provided by the invention executes data processing flows including preprocessing, sampling, checking and data optimization aiming at data under various types, and performs effective quality control on each link in the data life cycle, so that the effectiveness of data processing is improved, and the data quality of unsupervised learning is ensured.
Based on any of the above embodiments, fig. 14 is a third schematic structural diagram of a data processing apparatus according to the present invention. As shown in fig. 14, the apparatus includes:
a data obtaining unit 1410, configured to obtain first data, where the first data includes a behavior of an agent, an input data format, a data acquisition mode, and an evaluation rule;
An interaction unit 1420, configured to input the behavior and the second data in the input data format to a model corresponding to the agent, and obtain feedback data corresponding to the behavior output by the model corresponding to the agent;
a reward determination unit 1430 configured to determine reward data corresponding to the behavior based on the feedback data and the evaluation rule;
a data determining unit 1440, configured to determine target data based on the behavior, the second data, the feedback data, and the reward data.
The data processing device provided by the invention has the advantages that the effective quality control is carried out on each link in the data life cycle, the effectiveness of data processing is improved, and the data quality of reinforcement learning is ensured.
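A non-authoritative sketch of the reinforcement-learning data flow of figure 14 is given below; agent_model and evaluation_rule are assumed callables standing in for the model corresponding to the agent and the evaluation rule carried in the first data.

```python
# Illustrative sketch of the reinforcement-learning target-data assembly of figure 14.
from typing import Any, Callable, Dict, List

def collect_rl_samples(behaviors: List[Any],
                       second_data: List[Any],
                       agent_model: Callable[[Any, Any], Any],
                       evaluation_rule: Callable[[Any], float]) -> List[Dict[str, Any]]:
    target_data = []
    for behavior, state in zip(behaviors, second_data):
        feedback = agent_model(behavior, state)   # feedback output by the agent's model
        reward = evaluation_rule(feedback)        # reward data from the evaluation rule
        target_data.append({                      # assemble one target-data record
            "behavior": behavior,
            "input": state,
            "feedback": feedback,
            "reward": reward,
        })
    return target_data

# Example with trivial stand-ins for the model and evaluation rule
samples = collect_rl_samples(
    behaviors=["turn_left", "turn_right"],
    second_data=[{"speed": 10}, {"speed": 20}],
    agent_model=lambda behavior, state: {"ok": behavior == "turn_left"},
    evaluation_rule=lambda feedback: 1.0 if feedback["ok"] else 0.0,
)
print(samples)
```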
Fig. 15 illustrates a physical structure diagram of an electronic device, as shown in fig. 15, which may include: a processor 1510, a communication interface (Communications Interface) 1520, a memory 1530, and a communication bus 1540, wherein the processor 1510, the communication interface 1520, and the memory 1530 communicate with each other via the communication bus 1540. The processor 1510 may invoke logic instructions in the memory 1530 to perform data processing methods comprising: s11, acquiring first data, wherein the first data comprises at least one of audio, images and texts; s12, preprocessing the first data to obtain second data, wherein the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling; s13, the second data are sent to a labeling end, and third data returned by the labeling end are received, wherein the third data carry labeling results; s14, sampling the third data to obtain fourth data; s15, the fourth data are sent to the checking end, and a labeling checking result returned by the checking end is received; s16, under the condition that the labeling inspection result is qualified, performing at least one of data filtering, data enhancement and labeling modification on the third data to obtain target data; and if the labeling check result is that the labeling is unqualified, repeating S13-S15.
The processor 1510 may also invoke logic instructions in the memory 1530 to perform data processing methods including: s21, acquiring first data, wherein the first data comprises at least one of audio, images and texts; s22, preprocessing the first data to obtain second data, wherein the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling; s23, sampling the second data to obtain third data; s24, the third data are sent to the checking end, and the acquisition checking result returned by the checking end is received; s25, under the condition that the acquisition and test result is that the acquisition is qualified, carrying out at least one of data filtering and data enhancement on the second data to obtain target data; and if the acquisition test result is that the acquisition is not qualified, repeating S21-S24.
The processor 1510 may also invoke logic instructions in the memory 1530 to perform data processing methods including: s31, acquiring first data, wherein the first data comprises behaviors of an intelligent agent, an input data format, a data acquisition mode and an evaluation rule; s32, inputting the behaviors and the second data in the input data format to the model corresponding to the intelligent agent, and obtaining feedback data corresponding to the behaviors output by the model corresponding to the intelligent agent; s33, determining reward data corresponding to the behavior based on the feedback data and the evaluation rule; and S34, determining target data based on the behavior, the second data, the feedback data and the rewards data.
Further, the logic instructions in the memory 1530 described above may be implemented in the form of software functional units and may be stored on a computer readable storage medium when sold or used as a stand-alone product. Based on this understanding, the part of the technical solution of the present invention that in essence contributes to the prior art may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the data processing method provided by the methods described above, the method comprising: s11, acquiring first data, wherein the first data comprises at least one of audio, images and texts; s12, preprocessing the first data to obtain second data, wherein the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling; s13, the second data are sent to a labeling end, and third data returned by the labeling end are received, wherein the third data carry labeling results; s14, sampling the third data to obtain fourth data; s15, the fourth data are sent to the checking end, and a labeling checking result returned by the checking end is received; s16, under the condition that the labeling inspection result is qualified, performing at least one of data filtering, data enhancement and labeling modification on the third data to obtain target data; and if the labeling check result is that the labeling is unqualified, repeating S13-S15.
The computer program, when executed by a processor, is also capable of performing the data processing method provided by the methods described above, the method comprising: S21, acquiring first data, wherein the first data comprises at least one of audio, images and texts; S22, preprocessing the first data to obtain second data, wherein the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling; S23, sampling the second data to obtain third data; S24, the third data are sent to the checking end, and the acquisition checking result returned by the checking end is received; S25, under the condition that the acquisition and test result is that the acquisition is qualified, carrying out at least one of data filtering and data enhancement on the second data to obtain target data; and if the acquisition test result is that the acquisition is not qualified, repeating S21-S24.
The computer program, when executed by a processor, is also capable of performing the data processing method provided by the methods described above, the method comprising: S31, acquiring first data, wherein the first data comprises behaviors of an intelligent agent, an input data format, a data acquisition mode and an evaluation rule; S32, inputting the behaviors and the second data in the input data format to the model corresponding to the intelligent agent, and obtaining feedback data corresponding to the behaviors output by the model corresponding to the intelligent agent; S33, determining reward data corresponding to the behavior based on the feedback data and the evaluation rule; and S34, determining target data based on the behavior, the second data, the feedback data and the rewards data.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the data processing method provided by the above methods, the method comprising: s11, acquiring first data, wherein the first data comprises at least one of audio, images and texts; s12, preprocessing the first data to obtain second data, wherein the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling; s13, the second data are sent to a labeling end, and third data returned by the labeling end are received, wherein the third data carry labeling results; s14, sampling the third data to obtain fourth data; s15, the fourth data are sent to the checking end, and a labeling checking result returned by the checking end is received; s16, under the condition that the labeling inspection result is qualified, performing at least one of data filtering, data enhancement and labeling modification on the third data to obtain target data; and if the labeling check result is that the labeling is unqualified, repeating S13-S15.
When executed by a processor, the computer program also performs the data processing method provided by the methods described above, the method comprising: S21, acquiring first data, wherein the first data comprises at least one of audio, images and texts; S22, preprocessing the first data to obtain second data, wherein the preprocessing comprises at least one of data combination, data filtering, normalization, scaling, interpolation, data cleaning, data de-identification and data sampling; S23, sampling the second data to obtain third data; S24, the third data are sent to the checking end, and the acquisition checking result returned by the checking end is received; S25, under the condition that the acquisition and test result is that the acquisition is qualified, carrying out at least one of data filtering and data enhancement on the second data to obtain target data; and if the acquisition test result is that the acquisition is not qualified, repeating S21-S24.
When executed by a processor, the computer program also performs the data processing method provided by the methods described above, the method comprising: S31, acquiring first data, wherein the first data comprises behaviors of an intelligent agent, an input data format, a data acquisition mode and an evaluation rule; S32, inputting the behaviors and the second data in the input data format to the model corresponding to the intelligent agent, and obtaining feedback data corresponding to the behaviors output by the model corresponding to the intelligent agent; S33, determining reward data corresponding to the behavior based on the feedback data and the evaluation rule; and S34, determining target data based on the behavior, the second data, the feedback data and the rewards data.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the part of the foregoing technical solution that in essence contributes to the prior art may be embodied in the form of a software product, which may be stored in a computer readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
