CN110991657A - Abnormal sample detection method based on machine learning - Google Patents


Info

Publication number
CN110991657A
Authority
CN
China
Prior art keywords
sample
samples
abnormal
machine learning
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911157400.2A
Other languages
Chinese (zh)
Inventor
柴磊
许靖
尹帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Magic Digital Intelligent Artificial Intelligence Co Ltd
Original Assignee
Shenzhen Magic Digital Intelligent Artificial Intelligence Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-11-22
Filing date
2019-11-22
Publication date
2020-04-10
Application filed by Shenzhen Magic Digital Intelligent Artificial Intelligence Co Ltd
Priority to CN201911157400.2A
Publication of CN110991657A
Pending

Abstract

The invention relates to an abnormal sample detection method based on machine learning. Rather than applying probability statistics directly to the original data to find outliers, the method targets labeled two-class sample data: it uses supervised machine learning to repeatedly build different classification models on the labeled data, has each model predict the original sample data, and takes the variance or standard deviation of the predicted probability values as the criterion for abnormal sample detection. The application thus adds a new approach to the existing abnormal sample detection methods. It can, to a certain extent, detect abnormal samples quickly and effectively, remove their adverse effect on normal samples, and help improve the stability and prediction accuracy of the model.

Description

Abnormal sample detection method based on machine learning
Technical Field
The invention belongs to the technical field of abnormal sample detection, and particularly relates to an abnormal sample detection method based on machine learning.
Background
With the development of emerging technologies such as the internet, the mobile internet and cloud computing, a big data era has arrived. Large amounts of customer behavior data are generated on the network almost every day, and in many scenarios these data must be analyzed and used for prediction. In internet finance, for example, evaluating a customer's credit risk means using the customer's behavior data to judge whether the customer is normal or abnormal, so as to better guard against financial risk and reduce investors' losses. Because a customer's sample data may have many dimensions, it is difficult to tell good customers from bad by manual experience alone, and more and more business scenarios are introducing machine learning models to predict and classify samples.
Most machine learning in practice is supervised learning: a classifier is trained on labeled samples and then used to classify unknown samples. During sample preparation, however, a certain number of abnormal samples arise, either from data quality problems caused by input or labeling errors or from abnormal information present in the data itself. Abnormal samples are usually few in number and are often well hidden; an abnormal credit application, for instance, is usually hard to notice. This makes abnormal samples harder to detect and hampers supervised learning. The invention mainly addresses binary classification problems in supervised machine learning. It aims to provide an effective method for detecting abnormal samples so that the model can be trained after the detected samples are filtered out, improving the model's stability and prediction accuracy.
Disclosure of Invention
To solve the above problems, the present invention provides an abnormal sample detection method based on machine learning. It is aimed mainly at binary classification problems, uses machine learning to detect abnormal samples, and is intended to detect them accurately and efficiently.
Another object of the invention is to provide an abnormal sample detection method based on machine learning whose detection process is simple and efficient and whose judgments rest on a sound, well-reasoned basis.
To achieve the above objects, the technical solution of the present invention is as follows.
The abnormal sample detection method based on machine learning comprises the following specific steps:
S1: sample the original samples to generate training sets and test sets, with more than one training set;
S2: build a model with a machine learning algorithm, train it on each training set and test it on the corresponding test set, so that each training set yields one trained classification model;
S3: have each classification model in turn predict the original samples and obtain a prediction score for every original sample, the prediction score being the probability that the sample is a normal sample;
S4: group together the prediction scores that the classification models give to the same sample, and calculate the variance or standard deviation of each group;
S5: sort the samples by variance or standard deviation in ascending order, set an initial threshold, and treat the samples whose variance or standard deviation exceeds the initial threshold as abnormal samples;
S6: remove the abnormal samples from the original samples, retrain on the remaining samples, compare the performance evaluation indexes of the resulting models, and determine an optimal threshold;
S7: take the optimal threshold as the cut-off point and treat the original samples whose variance or standard deviation exceeds it as the final abnormal samples.
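Steps S3 to S5 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names `score_spread` and `flag_abnormal` are hypothetical, the three lambda "models" are toy stand-ins for trained classifiers, and the population standard deviation and the threshold of 0.2 are assumptions chosen for the example.

```python
from statistics import pstdev

def score_spread(models, samples):
    """For every sample, return the standard deviation of the scores
    that all models assign to it (steps S3 and S4)."""
    spreads = []
    for x in samples:
        scores = [model(x) for model in models]  # one score per model
        spreads.append(pstdev(scores))
    return spreads

def flag_abnormal(spreads, threshold):
    """Return sample indices in ascending order of spread, plus the
    indices whose spread exceeds the threshold (step S5)."""
    order = sorted(range(len(spreads)), key=lambda i: spreads[i])
    flagged = [i for i in order if spreads[i] > threshold]
    return order, flagged

# Toy stand-ins for trained classifiers: each maps a sample to P(class 1).
models = [
    lambda x: 0.9 if x > 0 else 0.1,
    lambda x: 0.8 if x > 0 else 0.2,
    lambda x: 0.9 if x > 1 else 0.1,  # disagrees with the others for 0 < x <= 1
]
samples = [-2.0, 0.5, 3.0]

spreads = score_spread(models, samples)
order, flagged = flag_abnormal(spreads, threshold=0.2)
```

Only the middle sample, on which the toy models disagree, ends up in `flagged`; the other two receive nearly identical scores from every model.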
Dividing the original samples into more than one training set, and training on each to produce more than one classification model, supports the effectiveness and accuracy of the classification models. The prediction scores that the classification models give to the same original sample are placed in one group, and the variance or standard deviation of each group indicates how reliably that sample is predicted. The smaller the value, the more consistently the models predict the sample and the easier the sample is to predict; otherwise the sample's information is hard to capture and the sample is more likely to be noise or abnormal. The initial threshold removes a portion of the samples, namely those that are clearly hard to predict, so that the data obtained in subsequent training are more accurate. The final threshold is then determined by evaluating the performance evaluation indexes in subsequent training, and the original samples, including those already removed at the initial threshold, are compared against it to remove the final abnormal samples.
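The threshold search in steps S6 and S7 might look like the following sketch. The names `select_threshold`, `final_abnormal` and `toy_evaluate` are hypothetical. In practice `evaluate` would retrain a model on the kept samples and return an index such as AUC or KS; here it is a toy scoring function, an assumption made only so that the example runs on its own.

```python
def select_threshold(spreads, candidates, evaluate):
    """Try each candidate threshold, keep the samples whose spread does not
    exceed it, and return the threshold with the best evaluation score."""
    best_threshold, best_score = None, float("-inf")
    for t in candidates:
        kept = [i for i, s in enumerate(spreads) if s <= t]
        if not kept:
            continue  # nothing left to train on at this threshold
        score = evaluate(kept)  # e.g. retrain on kept samples, return AUC or KS
        if score > best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score

def final_abnormal(spreads, threshold):
    """Apply the chosen threshold to all original samples (step S7)."""
    return [i for i, s in enumerate(spreads) if s > threshold]

# Per-sample standard deviations of prediction scores (illustrative values).
spreads = [0.02, 0.05, 0.31, 0.04, 0.40]

def toy_evaluate(kept):
    """Stand-in for 'retrain and measure': rewards more data,
    penalizes keeping the two noisy samples 2 and 4."""
    noisy = {2, 4}
    return len(kept) - 3 * sum(1 for i in kept if i in noisy)

best_threshold, best_score = select_threshold(spreads, [0.1, 0.35, 0.5], toy_evaluate)
abnormal = final_abnormal(spreads, best_threshold)
```

Passing the evaluation step in as a callable keeps the search independent of the classifier and of the index used to compare models.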
The variance or standard deviation of the prediction scores measures how stable the different models' predictions are for each sample. The smaller it is, the more the models agree on the sample's predicted class and the better the prediction. The larger it is, the less effectively the models can capture true information from the sample, so the sample is hard to predict accurately and is more likely to be noise or an abnormal sample.
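Written out, the criterion is the ordinary variance of one sample's scores across the models. The notation below is an assumption introduced for illustration: m is the number of classification models, p_ij is the prediction score that model j gives to sample i, and the population form (division by m) is used because the text does not say whether the population or the sample form is meant.

```latex
\bar{p}_i = \frac{1}{m}\sum_{j=1}^{m} p_{ij}, \qquad
\sigma_i^2 = \frac{1}{m}\sum_{j=1}^{m}\left(p_{ij}-\bar{p}_i\right)^2, \qquad
\sigma_i = \sqrt{\sigma_i^2}
```

For example, three models that score a sample 0.9, 0.8 and 0.1 give a mean of 0.6, a variance of (0.09 + 0.04 + 0.25) / 3 ≈ 0.127 and a standard deviation of about 0.356. Scores of 0.9, 0.8 and 0.9 give a standard deviation of about 0.047, so the first sample is far more likely to be treated as abnormal.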
Further, the original samples in step S1 are a labeled sample set containing normal samples and abnormal samples. Using labeled samples facilitates machine learning training and also provides a basis for the subsequent sorting of samples.
Further, the sampling methods for the original samples include random sampling and k-fold cross validation. Random sampling divides the original samples at random, in a given proportion, into a training set and a test set. K-fold cross validation divides the samples into k equal parts, using one part as the test set and the rest as the training set. Sampling methods such as these ensure that the samples in the training and test sets are random and representative.
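Both sampling schemes can be illustrated with the standard library alone. This is a sketch under stated assumptions: the function names `random_split` and `k_fold_splits`, the 70/30 proportion, the value k = 5 and the fixed seeds are all chosen for the example and do not come from the text.

```python
import random

def random_split(n_samples, train_fraction, seed):
    """Shuffle indices 0..n-1 and split them into (train, test) by proportion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]

def k_fold_splits(n_samples, k, seed):
    """Yield k (train, test) index pairs; each fold serves once as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

train, test = random_split(10, train_fraction=0.7, seed=0)
splits = list(k_fold_splits(10, k=5, seed=0))
```

Either scheme yields more than one training set, as step S1 requires: k-fold does so directly, while random sampling does so when repeated with different seeds.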
Further, the machine learning algorithm in step S2 includes logistic regression, GBDT or a support vector machine. These are classification algorithms commonly used in machine learning; a classification model can be trained iteratively by calling the corresponding algorithm in an open-source Python machine learning library such as Sklearn and setting empirical parameters.
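To show what training one classification model per training set involves, here is a dependency-free sketch of logistic regression fitted by batch gradient descent. It stands in for the library call the text mentions and is not the authors' code: `train_logistic` and `predict_proba` are hypothetical helpers, and the synthetic one-feature data, learning rate, epoch count and sample sizes are assumptions chosen for the example.

```python
import math
import random

def predict_proba(model, x):
    """Probability that x belongs to class 1 (the 'normal' class in the text)."""
    w, b = model
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=200, lr=0.1):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba((w, b), xi) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Synthetic data: class 1 clusters around +1, class 0 around -1.
rng = random.Random(0)
X = [[rng.gauss(1 if i % 2 else -1, 0.5)] for i in range(40)]
y = [i % 2 for i in range(40)]

# Three training sets drawn by random sampling, one model per set (step S2).
models = []
for seed in range(3):
    idx = random.Random(seed).sample(range(40), 30)
    models.append(train_logistic([X[i] for i in idx], [y[i] for i in idx]))

# Each model scores every original sample (step S3).
scores = [[predict_proba(m, x) for m in models] for x in X]
```

Each row of `scores` holds one sample's scores from the three models, which is the grouping that step S4 takes the variance or standard deviation of.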
Further, the prediction score in step S3 is the probability that the sample is a normal sample. For example, if sample labels take the binary values 0 and 1 and class 1 denotes a normal sample, the prediction score is the probability that the sample belongs to class 1, not the probability that it is abnormal.
Further, the performance evaluation indexes in step S6 include the AUC value, the KS value and the Lift value. The AUC value is a common index of a binary classifier's performance: the larger it is, the more likely the current model is to rank positive samples ahead of negative samples, that is, the better the classification. The KS value evaluates the model's ability to discriminate risk by measuring the difference between the cumulative distributions of good and bad samples; the larger that difference, the larger the KS value and the stronger the discrimination. The Lift value measures how much better prediction is with the model than without it; the larger the Lift, the better the model performs. After the abnormal samples are removed and a binary classification model is retrained, evaluation indexes such as AUC or KS improve to some extent compared with the model trained before removal.
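The three indexes can be computed directly from labels and prediction scores. The sketch below uses common textbook definitions, which the text does not spell out, so the details are assumptions: label 1 is treated as the positive class, ties in AUC count as one half, KS is the maximum gap between the two classes' cumulative score distributions, and Lift is the positive rate in the top-scored fraction divided by the overall positive rate. The function names and sample data are illustrative.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks(labels, scores):
    """Maximum gap between the cumulative score distributions of the two classes."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    cum_pos = cum_neg = 0
    best = 0.0
    ranked = sorted(zip(scores, labels))
    for i, (s, l) in enumerate(ranked):
        cum_pos += l
        cum_neg += 1 - l
        # Compare the distributions only at the last of a run of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != s:
            best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best

def lift(labels, scores, top_fraction=0.1):
    """Positive rate among the top-scored fraction divided by the overall rate."""
    k = max(1, round(len(labels) * top_fraction))
    top = sorted(zip(scores, labels), reverse=True)[:k]
    top_rate = sum(l for _, l in top) / k
    return top_rate / (sum(labels) / len(labels))

labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

auc_value = auc(labels, scores)                       # 15 of 16 pairs ordered correctly
ks_value = ks(labels, scores)
lift_value = lift(labels, scores, top_fraction=0.25)  # top 2 of 8 samples
```

For step S6, these values would be computed once for the model trained on all samples and once for each model retrained after removal, and then compared.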
In summary, the method trains more than one classification model on differently sampled training sets, uses the variance or standard deviation of each sample's prediction scores across those models as the criterion for abnormality, removes the clearly hard-to-predict samples at an initial threshold, and fixes the final threshold by comparing performance evaluation indexes in subsequent training. After the abnormal samples are removed and a binary classification model is retrained, evaluation indexes such as AUC or KS improve to some extent compared with the model trained before removal.
Drawings
Fig. 1 is a flowchart of an abnormal sample detection method based on machine learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to Fig. 1, the method of this embodiment comprises steps S1 to S7 as set out under Disclosure of Invention above; the rationale given there for the multiple training sets, the variance or standard deviation criterion and the initial and optimal thresholds applies unchanged.
In this embodiment, the original samples in step S1 are a labeled sample set containing normal samples and abnormal samples. Using labeled samples facilitates machine learning training and also provides a basis for the subsequent sorting of samples.
In this embodiment, the sampling methods for the original samples include random sampling and k-fold cross validation. Random sampling divides the original samples at random, in a given proportion, into a training set and a test set. K-fold cross validation divides the samples into k equal parts, using one part as the test set and the rest as the training set. Sampling methods such as these ensure that the samples in the training and test sets are random and representative.
In this embodiment, the machine learning algorithm in step S2 includes logistic regression, GBDT or a support vector machine. These are classification algorithms commonly used in machine learning; a classification model can be trained iteratively by calling the corresponding algorithm in an open-source Python machine learning library such as Sklearn and setting empirical parameters.
In this embodiment, the prediction score in step S3 is the probability that the sample is a normal sample. For example, if sample labels take the binary values 0 and 1 and class 1 denotes a normal sample, the prediction score is the probability that the sample belongs to class 1, not the probability that it is abnormal.
In this embodiment, the performance evaluation indexes in step S6 include the AUC value, the KS value and the Lift value. The AUC value is a common index of a binary classifier's performance: the larger it is, the more likely the current model is to rank positive samples ahead of negative samples, that is, the better the classification. The KS value evaluates the model's ability to discriminate risk by measuring the difference between the cumulative distributions of good and bad samples; the larger that difference, the larger the KS value and the stronger the discrimination. The Lift value measures how much better prediction is with the model than without it; the larger the Lift, the better the model performs. After the abnormal samples are removed and a binary classification model is retrained, evaluation indexes such as AUC or KS improve to some extent compared with the model trained before removal.
The above describes only preferred embodiments of the present invention and is not to be construed as limiting it; any modifications, equivalents and improvements made within the spirit and principle of the invention are intended to fall within its scope of protection.

Claims (6)

CN201911157400.2A | 2019-11-22 | 2019-11-22 | Abnormal sample detection method based on machine learning | Pending | CN110991657A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911157400.2A, CN110991657A (en) | 2019-11-22 | 2019-11-22 | Abnormal sample detection method based on machine learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201911157400.2A, CN110991657A (en) | 2019-11-22 | 2019-11-22 | Abnormal sample detection method based on machine learning

Publications (1)

Publication Number | Publication Date
CN110991657A (en) | 2020-04-10

Family

ID=70085962

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201911157400.2A, Pending, CN110991657A (en) | Abnormal sample detection method based on machine learning | 2019-11-22 | 2019-11-22

Country Status (1)

Country | Link
CN (1) | CN110991657A (en)

Legal Events

Code | Title | Description
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2020-04-10
