Abnormal sample detection method based on machine learning
Technical Field
The invention belongs to the technical field of abnormal sample detection, and particularly relates to an abnormal sample detection method based on machine learning.
Background
With the development of emerging technologies such as the internet, mobile internet and cloud computing, society has gradually entered the era of big data, and a large amount of customer behavior data is generated on the network almost every day. In many scenarios these data need to be analyzed and used for prediction. For example, in the internet finance field, evaluating the credit risk of a customer requires judging, from the customer's behavior data, whether the customer is a normal customer or an abnormal customer, so as to better prevent financial risk and reduce losses to investors. Because the dimension of a customer's sample data may be large, it is difficult to distinguish customer quality by manual experience, and more and more business scenarios have begun to introduce machine learning models to predict and classify samples.
In general, most machine learning is supervised learning: a classifier is trained with labeled samples and then used to classify unknown samples. However, during sample data preparation, a certain number of abnormal samples arise, either from data quality problems caused by errors in input, labeling and the like, or from abnormal information already present in the data. In many cases the number of abnormal samples is small; moreover, abnormal samples are often well hidden and difficult to discover, and data such as an abnormal credit application is usually hard to perceive. This increases the difficulty of detecting abnormal samples and makes supervised learning hard to carry out. The invention is mainly directed to the binary classification problem in supervised machine learning and aims to provide an effective method for detecting abnormal samples, so that the model is trained after the detected abnormal samples are filtered out, thereby improving the stability of the model and the accuracy of model prediction.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method for detecting abnormal samples based on machine learning, which is mainly directed to binary classification problems, adopts machine learning to detect abnormal samples, and achieves an accurate and efficient detection effect.
Another object of the invention is to provide an abnormal sample detection method based on machine learning whose detection process is simple and efficient, whose judgment is accurate, and whose basis for judgment is scientific and reasonable.
In order to achieve the above object, the technical solution of the present invention is as follows.
The invention relates to an abnormal sample detection method based on machine learning, which is characterized by comprising the following specific steps:
S1: sampling the original samples to generate training sets and test sets, wherein more than one training set is generated;
S2: constructing a model with a machine learning algorithm, training the model on each training set and evaluating it on the corresponding test set, so that one classification model is generated for each training set after training;
S3: each classification model predicts all of the original samples in turn, and a prediction score is obtained for every sample in the original set, the prediction score being the probability that the sample belongs to the normal class;
S4: the prediction scores given to the same sample by the different classification models are grouped together, and the variance or standard deviation of each group of prediction scores is calculated;
S5: sorting the samples by the variance or standard deviation from small to large, setting an initial threshold, and regarding samples whose variance or standard deviation exceeds the initial threshold as abnormal samples;
S6: removing the abnormal samples from the original samples, repeating the training on the remaining samples, comparing the performance evaluation indexes of the resulting models, and determining an optimal threshold;
S7: taking the optimal threshold as the cut-off point, original samples whose variance or standard deviation is above the optimal threshold are regarded as the final abnormal samples.
The original samples are divided into more than one training set, and each training set is used to train a classification model, which guarantees the effectiveness and accuracy of the classification models. The prediction scores that the classification models give to the same original sample are grouped together, and the variance or standard deviation of each group is used to judge how reliably that sample is predicted: the smaller the variance or standard deviation, the more consistently the classification models predict the sample and the easier the sample is to predict; conversely, the information in the sample is difficult to capture and the probability that it is noise or an abnormal sample is higher. By setting an initial threshold, part of the samples are removed; the purpose is to eliminate samples that are obviously difficult to predict and to ensure the accuracy of the data used in the subsequent training process. A final threshold is then determined by evaluating the performance evaluation indexes during the subsequent training, and all of the original samples, including those removed in the previous step, are compared with this threshold to remove the final abnormal samples.
The variance or standard deviation of the prediction scores measures how stable the prediction results for each sample are across the different models. The smaller the variance or standard deviation, the more consistent the class predicted for the sample by the models and the better the prediction effect; the larger the variance or standard deviation, the less effectively the models capture the true information in the sample, so the sample is difficult to predict accurately and has a higher probability of being noise or an abnormal sample.
Further, the original samples in step S1 are a labeled sample set including normal samples and abnormal samples. Using labeled samples facilitates the training of the machine learning models and also provides the basis for the subsequent sorting of the samples.
Further, the sampling methods for the original samples include random sampling and k-fold cross validation. Random sampling means dividing the original samples into a training set and a test set in a given proportion by a random method; k-fold cross validation means dividing the samples into k equal parts, using one part as the test set and the remaining parts as the training set. Choosing sampling methods such as random sampling and k-fold cross validation ensures the randomness and representativeness of the samples in the training and test sets.
Further, the machine learning algorithm in step S2 includes logistic regression, GBDT or a support vector machine. Logistic regression, GBDT and support vector machines are classification algorithms commonly used in the field of machine learning; they can be realized by calling the corresponding algorithms in an open-source Python machine learning library such as scikit-learn (sklearn) and iteratively training a classification model with empirical parameter settings.
Further, the prediction score in step S3 is the probability that the sample belongs to the normal class. For example, the sample labels are represented by binary values 0 and 1, with label 1 denoting a normal sample; the prediction score is then the probability that the sample belongs to class 1, i.e. the probability that it is not an abnormal sample.
Further, the performance evaluation indexes in step S6 include the AUC value, the KS value and the Lift value. The AUC value is a common index for evaluating the classification performance of a binary classifier: the larger the AUC, the more likely the current classification model is to rank positive samples ahead of negative samples, i.e. the better the classification. The KS value evaluates the risk discrimination capability of the model; it measures the maximum difference between the cumulative distributions of good and bad samples, and the larger this cumulative difference, the larger the KS value and the stronger the risk discrimination capability of the model. The Lift value measures how much better the model's prediction is compared with not using the model; the larger the Lift value, the better the model performs. After the abnormal samples are removed and the binary classification model is retrained, evaluation indexes such as AUC and KS are improved to a certain extent compared with the model before removal.
In summary, the present invention provides an abnormal sample detection method based on machine learning. The original samples are divided into more than one training set, and each training set is used to train a classification model, which guarantees the effectiveness and accuracy of the classification models. The prediction scores that the classification models give to the same original sample are grouped together, and the variance or standard deviation of each group is used to judge how reliably that sample is predicted: the smaller the variance or standard deviation, the more consistently the classification models predict the sample and the easier it is to predict; conversely, the information in the sample is difficult to capture and the probability that it is noise or an abnormal sample is higher. By setting an initial threshold, samples that are obviously difficult to predict are removed, which ensures the accuracy of the data used in the subsequent training. A final threshold is then determined by evaluating the performance evaluation indexes during the subsequent training, and all of the original samples, including those removed earlier, are compared with this threshold to remove the final abnormal samples. After the abnormal samples are removed and the binary classification model is retrained, evaluation indexes such as AUC and KS are improved to a certain extent compared with the model before removal.
Drawings
Fig. 1 is a flowchart of an abnormal sample detection method based on machine learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to achieve the above object, the technical solution of the present invention is as follows.
Referring to Fig. 1, the present invention is an abnormal sample detection method based on machine learning, which comprises the following specific steps:
S1: sampling the original samples to generate training sets and test sets, wherein more than one training set is generated;
S2: constructing a model with a machine learning algorithm, training the model on each training set and evaluating it on the corresponding test set, so that one classification model is generated for each training set after training;
S3: each classification model predicts all of the original samples in turn, and a prediction score is obtained for every sample in the original set, the prediction score being the probability that the sample belongs to the normal class;
S4: the prediction scores given to the same sample by the different classification models are grouped together, and the variance or standard deviation of each group of prediction scores is calculated;
S5: sorting the samples by the variance or standard deviation from small to large, setting an initial threshold, and regarding samples whose variance or standard deviation exceeds the initial threshold as abnormal samples;
S6: removing the abnormal samples from the original samples, repeating the training on the remaining samples, comparing the performance evaluation indexes of the resulting models, and determining an optimal threshold;
S7: taking the optimal threshold as the cut-off point, original samples whose variance or standard deviation is above the optimal threshold are regarded as the final abnormal samples.
The original samples are divided into more than one training set, and each training set is used to train a classification model, which guarantees the effectiveness and accuracy of the classification models. The prediction scores that the classification models give to the same original sample are grouped together, and the variance or standard deviation of each group is used to judge how reliably that sample is predicted: the smaller the variance or standard deviation, the more consistently the classification models predict the sample and the easier the sample is to predict; conversely, the information in the sample is difficult to capture and the probability that it is noise or an abnormal sample is higher. By setting an initial threshold, part of the samples are removed; the purpose is to eliminate samples that are obviously difficult to predict and to ensure the accuracy of the data used in the subsequent training process. A final threshold is then determined by evaluating the performance evaluation indexes during the subsequent training, and all of the original samples, including those removed in the previous step, are compared with this threshold to remove the final abnormal samples.
The variance or standard deviation of the prediction scores measures how stable the prediction results for each sample are across the different models. The smaller the variance or standard deviation, the more consistent the class predicted for the sample by the models and the better the prediction effect; the larger the variance or standard deviation, the less effectively the models capture the true information in the sample, so the sample is difficult to predict accurately and has a higher probability of being noise or an abnormal sample.
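As a concrete, non-limiting illustration of steps S1 to S5, the following Python sketch assumes scikit-learn and NumPy are available, that X and y are the feature matrix and labels of the original samples, and that the function name and the initial threshold value are hypothetical placeholders. It trains one classifier per fold, groups the prediction scores of each original sample across the classifiers, and flags the samples whose standard deviation exceeds the initial threshold.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def detect_abnormal_samples(X, y, n_splits=5, init_threshold=0.15):
    """Sketch of steps S1-S5: flag samples whose prediction scores
    vary strongly across the classification models."""
    # S1: sample the original set into several training sets (k-fold sampling here)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    models = []
    for train_idx, _ in kf.split(X):
        # S2: train one classification model per training set
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        models.append(clf)

    # S3: each model predicts every original sample; the score is the
    # probability of belonging to the normal class (label 1)
    scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])

    # S4: group the scores of the same sample and take the standard deviation
    score_std = scores.std(axis=1)

    # S5: sort by the deviation and flag samples above the initial threshold
    order = np.argsort(score_std)
    abnormal_idx = np.where(score_std > init_threshold)[0]
    return abnormal_idx, score_std, order
```

Steps S6 and S7 would then retrain on the remaining samples under several candidate thresholds, compare the performance evaluation indexes of the retrained models, and keep the threshold that yields the best indexes as the final cut-off point.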
In this embodiment, the original samples in step S1 are a labeled sample set including normal samples and abnormal samples. Using labeled samples facilitates the training of the machine learning models and also provides the basis for the subsequent sorting of the samples.
In this embodiment, the sampling methods for the original samples include random sampling and k-fold cross validation. Random sampling means dividing the original samples into a training set and a test set in a given proportion by a random method; k-fold cross validation means dividing the samples into k equal parts, using one part as the test set and the remaining parts as the training set. Choosing sampling methods such as random sampling and k-fold cross validation ensures the randomness and representativeness of the samples in the training and test sets.
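As a non-limiting illustration, both sampling schemes can be realized with scikit-learn; X and y are assumed to be the feature matrix and labels of the original samples, and the split proportion and the value of k below are example settings only.

```python
from sklearn.model_selection import train_test_split, KFold

# Random sampling: split the original samples into training and test sets by proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # e.g. 70% training, 30% test

# k-fold cross validation: divide the samples into k equal parts; each part serves
# once as the test set while the remaining k-1 parts form the training set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```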
In this embodiment, the machine learning algorithm in step S2 includes logistic regression, GBDT or a support vector machine. Logistic regression, GBDT and support vector machines are classification algorithms commonly used in the field of machine learning; they can be realized by calling the corresponding algorithms in an open-source Python machine learning library such as scikit-learn (sklearn) and iteratively training a classification model with empirical parameter settings.
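For example, the three algorithms can be instantiated from scikit-learn as shown below; the parameter values are empirical examples only, and X_train, y_train, X_test and y_test are assumed to come from the sampling step above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

candidate_models = {
    "logistic_regression": LogisticRegression(C=1.0, max_iter=1000),
    "gbdt": GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                       max_depth=3),
    # probability=True is needed so that predict_proba is available for scoring
    "svm": SVC(kernel="rbf", C=1.0, probability=True),
}

for name, clf in candidate_models.items():
    clf.fit(X_train, y_train)               # train on a training set
    print(name, clf.score(X_test, y_test))  # accuracy on the corresponding test set
```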
In this embodiment, the prediction score in step S3 is the probability that the sample belongs to the normal class. For example, the sample labels are represented by binary values 0 and 1, with label 1 denoting a normal sample; the prediction score is then the probability that the sample belongs to class 1, i.e. the probability that it is not an abnormal sample.
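With scikit-learn, for instance, this score corresponds to the column of predict_proba associated with class 1 (a brief sketch; clf is assumed to be any of the trained classifiers above):

```python
# Labels: 1 = normal sample, 0 = abnormal sample
proba = clf.predict_proba(X)     # shape (n_samples, 2): columns are [P(y=0), P(y=1)]
prediction_score = proba[:, 1]   # probability that each sample is normal (class 1)
```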
In this embodiment, the performance evaluation indexes in step S6 include the AUC value, the KS value and the Lift value. The AUC value is a common index for evaluating the classification performance of a binary classifier: the larger the AUC, the more likely the current classification model is to rank positive samples ahead of negative samples, i.e. the better the classification. The KS value evaluates the risk discrimination capability of the model; it measures the maximum difference between the cumulative distributions of good and bad samples, and the larger this cumulative difference, the larger the KS value and the stronger the risk discrimination capability of the model. The Lift value measures how much better the model's prediction is compared with not using the model; the larger the Lift value, the better the model performs. After the abnormal samples are removed and the binary classification model is retrained, evaluation indexes such as AUC and KS are improved to a certain extent compared with the model before removal.
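The three indexes can be computed, for example, as in the sketch below; the AUC comes from scikit-learn, while the KS and Lift functions are illustrative implementations under the stated assumptions (y_test and y_score are assumed to be the test labels and the corresponding prediction scores, and the top 10% cut-off used for Lift is an assumed example).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ks_statistic(y_true, y_score):
    """KS value: maximum gap between the cumulative score distributions
    of good (label 1) and bad (label 0) samples."""
    order = np.argsort(y_score)
    y_sorted = np.asarray(y_true)[order]
    cum_good = np.cumsum(y_sorted == 1) / max((y_sorted == 1).sum(), 1)
    cum_bad = np.cumsum(y_sorted == 0) / max((y_sorted == 0).sum(), 1)
    return float(np.max(np.abs(cum_good - cum_bad)))

def lift_at(y_true, y_score, top_frac=0.1):
    """Lift: positive rate among the top-scored fraction of samples
    divided by the overall positive rate."""
    y_true = np.asarray(y_true)
    n_top = max(int(len(y_score) * top_frac), 1)
    top_idx = np.argsort(y_score)[::-1][:n_top]
    return float(y_true[top_idx].mean() / y_true.mean())

auc = roc_auc_score(y_test, y_score)           # larger AUC: positives ranked higher
ks = ks_statistic(y_test, y_score)             # larger KS: stronger discrimination
lift = lift_at(y_test, y_score, top_frac=0.1)  # larger Lift: better than no model
```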
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.