Disclosure of Invention
In view of the defects in the prior art, the present invention provides a data processing method and a data processing device for optimizing a credit evaluation model, which can optimize the credit evaluation model and improve the evaluation accuracy.
In a first aspect, the present invention provides a data processing method for optimizing a credit assessment model, comprising:
acquiring relevant information of a borrower as sample data;
dividing the sample data into a training set and a test set;
carrying out data modeling by using the training set to obtain a preliminary evaluation model;
testing the preliminary evaluation model by using the test set;
if the test result does not meet the evaluation standard, re-dividing the sample data into a training set and a test set, and performing data modeling and testing again using the re-divided training set and test set;
and if the test result meets the evaluation standard, finishing the training and determining a final evaluation model.
The data processing method for optimizing a credit evaluation model provided by the invention divides the sample data into a training set and a test set, constructs the evaluation model from the training set, and tests the predictive ability of the evaluation model on the test set. When the test fails, the training set and the test set are re-divided, so that the variables are reclassified and new model characteristic values are obtained. The evaluation model is thereby optimized by cross-validation, and the evaluation accuracy is improved. In addition, cross-validation makes effective use of all the information in the sample data, mines the characteristics of the borrower in depth, improves the evaluation accuracy of the model, and addresses the over-fitting problem.
Preferably, performing data modeling by using the training set to obtain a preliminary evaluation model comprises:
performing segmentation processing on the continuous variable in the training set by adopting a decision tree algorithm, and converting the continuous variable into a discrete variable;
classifying the discrete variables in the training set by adopting a clustering algorithm;
combining the variables according to the classification result, and determining a preliminary model characteristic value;
and performing logistic regression on the sample data of the model characteristic value to establish a preliminary evaluation model.
Preferably, before performing the logistic regression, the method further comprises:
if the model characteristic value of the borrower lacks data, completing the data of the model characteristic value.
Preferably, completing the data of the model characteristic value if the model characteristic value of the borrower lacks data comprises:
if the model characteristic value of the borrower lacks data, searching a replacement variable of the model characteristic value;
and completing the data of the model characteristic value according to the searched data of the replacement variable.
Preferably, the method of determining the replacement variable comprises:
calculating Euclidean distances between variables;
determining two variables whose Euclidean distance is smaller than a threshold to be replacement variables for each other.
Preferably, completing the data of the model characteristic value if the model characteristic value of the borrower lacks data comprises:
if the model characteristic values of the borrowers lack data, calculating the mean value or the median value of the model characteristic values of all the borrowers;
and completing the model characteristic value of the missing data of the borrower according to the calculated mean value or the calculated median value.
Preferably, the method further comprises: acquiring external statistical data;
in which case completing the data of the model characteristic value if the model characteristic value of the borrower lacks data comprises:
if the model characteristic value of the borrower lacks data, completing the missing model characteristic value of the borrower according to the external statistical data.
Preferably, before performing the logistic regression, the method further comprises:
calculating the information value of each variable;
checking according to a preset value threshold value, and judging whether the variable is effective or not;
excluding invalid variables from the logistic regression.
In a second aspect, the present invention provides a data processing apparatus for optimizing a credit assessment model, comprising:
the data acquisition module is used for acquiring the related information of the borrower as sample data;
the sample dividing module is used for dividing the sample data into a training set and a test set;
the model training module is used for carrying out data modeling by utilizing the training set to obtain a preliminary evaluation model;
the model testing module is used for testing the preliminary evaluation model by utilizing the test set; if the test result does not meet the evaluation standard, the sample data is re-divided into a training set and a test set, and data modeling and testing are performed again using the re-divided training set and test set; and if the test result meets the evaluation standard, the training is finished and a final evaluation model is determined.
In a third aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs any of the methods described above in the first aspect.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
As shown in fig. 1, the present embodiment provides a data processing method for optimizing a credit evaluation model, including:
Step S1, acquiring information related to the borrower as sample data.
The sample data includes continuous variables and discrete variables. The borrower-related information is all information that may reveal specific behavioral characteristics of the borrower, and may include, but is not limited to: age, wage income, marital status, home purchase status, employment status, insurance purchase status, education, and the like. These items may affect the borrower's ability to repay the loan and therefore serve as variables for the loan assessment. The sample data can be divided by type into continuous variables and discrete variables. For example, data that take specific numerical values over a continuous range, such as age and wage income, are continuous variables; data that do not take specific numerical values or that are discretely distributed, such as education, are discrete variables.
The sample data of each borrower also includes the borrower's default status: a borrower who has defaulted is a "bad" customer, and a borrower who has not defaulted is a "good" customer.
Step S2, dividing the sample data into a training set and a test set.
Preferably, the sample data may be divided into the training set and the test set at a ratio of 7:3.
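By way of illustration only, the 7:3 division of step S2 might be sketched as follows. The function name `split_samples`, the `seed` parameter, and the use of a random shuffle are assumptions made for the example; the embodiment does not prescribe how the samples are assigned to the two sets.

```python
import random

def split_samples(samples, train_ratio=0.7, seed=None):
    """Shuffle a copy of `samples` and split it into (training set, test set)."""
    rng = random.Random(seed)          # seeded generator so a split can be reproduced
    shuffled = list(samples)           # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sample data: ten borrower records represented by their indices.
samples = list(range(10))
train_set, test_set = split_samples(samples, train_ratio=0.7, seed=0)
```

Calling the function again with a different seed yields a different division, which is what the re-division in step S5 relies on.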
Step S3, performing data modeling by using the training set to obtain a preliminary evaluation model.
Step S4, testing the preliminary evaluation model by using the test set.
The sample data in the test set is input into the preliminary evaluation model to predict whether each borrower is a good customer or a bad customer.
Step S5, if the test result does not meet the evaluation standard, re-dividing the sample data into a training set and a test set, and performing data modeling and testing again using the re-divided training set and test set.
Step S6, if the test result meets the evaluation standard, ending the training and determining the final evaluation model.
The test result is evaluated as follows: the credit prediction for each borrower output in step S4 is compared with the borrower's default status recorded in the sample data to determine whether the prediction is correct; the accuracy over the test set is then calculated, and it is determined whether the accuracy reaches the evaluation standard.
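The loop formed by steps S2 to S6 can be sketched roughly as below. This is a minimal sketch under stated assumptions: `train_until_qualified`, `train_threshold_model`, the `max_rounds` cap, the accuracy threshold of 0.8, and the toy income data are all hypothetical, and the single-threshold model merely stands in for the preliminary evaluation model of step S3.

```python
import random

def accuracy(model, test_set):
    """Fraction of test samples whose predicted label equals the recorded default label."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

def train_until_qualified(samples, train_fn, min_accuracy, max_rounds=20, seed=0):
    """Re-divide, retrain and retest until the accuracy meets `min_accuracy`.

    Returns (model, accuracy, met). `max_rounds` is an assumed safety cap; if it is
    reached, the last model is returned with met=False.
    """
    rng = random.Random(seed)
    for _ in range(max_rounds):
        shuffled = list(samples)
        rng.shuffle(shuffled)                      # step S2 / S5: (re-)divide 7:3
        cut = int(round(len(shuffled) * 0.7))
        train_set, test_set = shuffled[:cut], shuffled[cut:]
        model = train_fn(train_set)                # step S3: data modeling
        acc = accuracy(model, test_set)            # step S4: testing
        if acc >= min_accuracy:                    # step S6: standard met
            return model, acc, True
    return model, acc, False

def train_threshold_model(train_set):
    """Toy stand-in model: predict 'bad' (1) below the midpoint of the class means."""
    bad = [f[0] for f, label in train_set if label == 1]
    good = [f[0] for f, label in train_set if label == 0]
    threshold = (sum(bad) / len(bad) + sum(good) / len(good)) / 2
    return lambda features: 1 if features[0] < threshold else 0

# Hypothetical data: one feature (income); label 1 = bad customer, 0 = good customer.
samples = [((income,), 1 if income < 50 else 0) for income in range(100)]
model, acc, met = train_until_qualified(samples, train_threshold_model, min_accuracy=0.8)
```

Passing the training routine in as `train_fn` keeps the loop independent of how the preliminary evaluation model is built.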
The data processing method for optimizing a credit evaluation model provided by this embodiment divides the sample data into a training set and a test set, constructs the evaluation model from the training set, and tests the predictive ability of the evaluation model on the test set. When the test fails, the training set and the test set are re-divided, so that the variables are reclassified and new model characteristic values are obtained. The evaluation model is thereby optimized by cross-validation, and the evaluation accuracy is improved. In addition, cross-validation makes effective use of all the information in the sample data, mines the characteristics of the borrower in depth, improves the evaluation accuracy of the model, and addresses the over-fitting problem.
A preferred embodiment of step S3 includes:
Step S401, performing segmentation processing on the continuous variables in the training set by using a decision tree algorithm, so as to convert the continuous variables into discrete variables.
When the predicted likelihood of default differs greatly between subdivisions of a borrower characteristic, it is more suitable to divide the variable into several segments and to analyze and count each segment separately than to analyze the characteristic as a single variable, so that the categories of the borrower characteristic are optimized. Segmenting a continuous variable with a decision tree algorithm discretizes the variable and divides the borrowers into different homogeneous subgroups, which improves the performance of the logistic regression. Any existing decision tree algorithm may be used, and the details are not repeated here. This embodiment preferably adopts chi-squared automatic interaction detection (CHAID), a non-parametric decision tree method that has been applied effectively in various research fields, such as customer consumption trends in marketing, human behavior in psychology, and landslides in geology. CHAID segments continuous variables well, so that the categories of borrower characteristics are optimized, and when CHAID is used together with logistic regression, the limitation of logistic regression with respect to nonlinearity can be overcome.
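To make the idea of chi-square-driven segmentation concrete, the following is a deliberately simplified sketch and not a full CHAID implementation: it only searches for binary cut points on one continuous variable and omits CHAID's category merging and adjusted significance tests. The function names, the depth limit, and the toy age data are assumptions; 3.84 is the 5% critical value of the chi-square distribution with one degree of freedom.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def best_cut(values, labels):
    """Return (cut_point, chi_square) of the binary split most associated with default.

    labels: 1 = bad customer, 0 = good customer. Returns (None, 0.0) if no split helps.
    """
    pairs = sorted(zip(values, labels))
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    best = (None, 0.0)
    left_bad = left_good = 0
    for i in range(len(pairs) - 1):
        value, label = pairs[i]
        left_bad += label
        left_good += 1 - label
        nxt = pairs[i + 1][0]
        if nxt == value:
            continue  # only cut between distinct values
        chi2 = chi_square_2x2(left_bad, left_good,
                              total_bad - left_bad, total_good - left_good)
        if chi2 > best[1]:
            best = ((value + nxt) / 2, chi2)
    return best

def segment(values, labels, min_chi2=3.84, max_depth=2):
    """Recursively collect cut points whose chi-square reaches `min_chi2`."""
    if max_depth == 0 or len(set(values)) < 2:
        return []
    cut, chi2 = best_cut(values, labels)
    if cut is None or chi2 < min_chi2:
        return []
    left = [(v, l) for v, l in zip(values, labels) if v < cut]
    right = [(v, l) for v, l in zip(values, labels) if v >= cut]
    cuts = segment([v for v, _ in left], [l for _, l in left], min_chi2, max_depth - 1)
    cuts.append(cut)
    cuts += segment([v for v, _ in right], [l for _, l in right], min_chi2, max_depth - 1)
    return cuts

# Hypothetical training data: borrower ages and default labels.
ages = [22, 25, 28, 31, 35, 40, 45, 50, 55, 60]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cut_points = segment(ages, labels)   # the continuous variable becomes "below 33" / "33 and above"
```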
Step S402, classifying the discrete variables in the training set by using a clustering algorithm.
The discrete variables in step S402 include the original discrete variables in the sample data and the discrete variables obtained through the conversion in step S401.
Clustering is an unsupervised learning technique that groups data with similar features into clusters; it can associate identical features in the sample data and thereby reduce the effect of misclassification between variables. The clustering in this embodiment refers to variable clustering (also referred to as R-type clustering), which classifies the variables according to the sample data of each borrower and finds a representative element (i.e., a model characteristic value) in each class. Because heterogeneous borrowers are separated, the clustered variables can improve the prediction efficiency. Therefore, in this embodiment the variables are classified and merged by clustering, which improves the feature partition of the variables so that they are better suited to logistic regression and the performance of credit default prediction is improved. Any existing clustering algorithm may be used, and the details are not repeated here. In this embodiment, clustering is performed by Ward's minimum-variance hierarchical method: correlations among small-sample variables are found according to the minimum variance, and the small-sample variables are merged into one class, which solves the problem that small-sample variables can hardly participate in the statistical calculation of a regression. For example, small-sample categories of educational background, such as "master's degree" and "doctorate", are merged into a new category, "bachelor's degree or above".
Step S403, merging the variables according to the classification result, and determining preliminary model characteristic values.
The variables may be merged according to the classification result as follows: the correlations among the variables in the same class are calculated, the variable having the greatest correlation with the other variables is found, and that variable is used as the model characteristic value of the class in place of the other variables in the same class, thereby simplifying the input variables of the evaluation model.
A model characteristic value is thus an important characteristic of the borrower that has been found to be likely to lead to loan default.
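One possible reading of the merging in step S403 is sketched below. It assumes that "greatest correlation with the other variables" means the largest mean absolute Pearson correlation with the other variables of the same class; that interpretation, the function names, and the variable names in the toy cluster are assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def representative(cluster):
    """Pick the variable with the largest mean |correlation| to the others in its class.

    cluster: dict mapping variable name -> list of values (one per borrower).
    """
    def score(name):
        others = [other for other in cluster if other != name]
        if not others:
            return 0.0
        return sum(abs(pearson(cluster[name], cluster[o])) for o in others) / len(others)
    return max(cluster, key=score)

# Hypothetical class produced by the clustering of step S402.
cluster = {
    "monthly_income": [1, 2, 3, 4, 5],
    "annual_bonus":   [2, 1, 3, 4, 5],
    "card_limit":     [1, 2, 3, 5, 4],
}
feature = representative(cluster)   # stands in for the whole class in the regression
```

In this toy class, "monthly_income" correlates at 0.9 with each of the other two variables, while those two correlate at 0.8 with each other, so it is selected.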
Step S404, performing logistic regression on the sample data of the model characteristic values to establish a preliminary evaluation model.
Logistic regression has strong predictive ability and is simple to operate, so the prediction target can be achieved conveniently. The independent variables of the logistic regression are the model characteristic values, and the binary dependent variable is the borrower's default status, namely "good customer" or "bad customer". The evaluation model is obtained by finding the relationship between the independent variables and the dependent variable through logistic regression; this is the ordinary training process of logistic regression and is not repeated here.
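For readers unfamiliar with the training process referred to above, a minimal sketch is given here. It fits a logistic regression by plain batch gradient descent on the log-loss; the learning rate, the number of epochs, the two indicator features, and the toy data are assumptions, and a practical implementation would normally rely on an established statistics library instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the mean log-loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = p - y                      # derivative of the log-loss w.r.t. the logit
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def predict_default_probability(w, x):
    """Estimated probability that a borrower with features `x` is a bad customer."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

# Hypothetical model characteristic values encoded as two 0/1 indicators;
# label 1 = bad customer, 0 = good customer.
rows = [[1, 1]] * 4 + [[1, 0]] * 2 + [[0, 1]] * 2 + [[0, 0]] * 4
labels = [1, 1, 1, 0] + [1, 0] + [1, 0] + [1, 0, 0, 0]
weights = fit_logistic(rows, labels)
p_risky = predict_default_probability(weights, [1, 1])
p_safe = predict_default_probability(weights, [0, 0])
```

A borrower would then be classified as a bad customer when the estimated probability exceeds a chosen cut-off such as 0.5, which is likewise an assumption of the example.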
In this method, decision tree classification segments the continuous variables well, so that the categories of borrower characteristics are optimized and the limitation of logistic regression with respect to nonlinearity is overcome; clustering solves the problem that small-sample data can hardly participate in the statistical calculation of the logistic regression, so that the small-sample data is fully utilized and the estimation accuracy of the model is improved; and by combining these algorithms, suitable model characteristic values can be mined and the evaluation accuracy of the credit evaluation model is improved.
Because the sources of the sample data are complex, the integrity of the sample data is difficult to guarantee. So that the sample data can still be used effectively for analysis when some of it is missing, the method of this embodiment further includes, before the logistic regression is performed, step S405: if a model characteristic value of the borrower lacks data, completing the data of the model characteristic value.
The preferred embodiment of step S405 specifically includes:
Step S511, if a model characteristic value of the borrower lacks data, searching for a replacement variable of the model characteristic value.
Replacement variables are variables that are correlated with each other to a certain degree. When the data of one variable is unavailable, the data of its replacement variable can be used instead, so that the sample data is completed and the utilization rate of the sample data is improved.
Step S512, completing the data of the model characteristic value according to the data of the replacement variable that has been found.
The method for determining the replacement variable comprises the following steps:
calculating Euclidean distances between variables;
determining two variables whose Euclidean distance is smaller than a threshold to be replacement variables for each other.
The threshold can be determined according to actual conditions and should be neither too large nor too small: if the threshold is too small, no replacement variable can be found, and if it is too large, the replacement variable found is unsuitable. Alternatively, the two variables with the smallest Euclidean distance may be used as replacement variables for each other. When the data of one variable is missing, it can be replaced by the data of its replacement variable.
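The search for replacement variables and the subsequent completion can be sketched as follows. The sketch assumes that each variable is represented by a column of values, one per borrower, that the columns have already been normalised to a comparable scale, and that distances are computed on borrowers with complete data; none of this is specified by the embodiment. The variable names and the threshold of 0.3 are hypothetical.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance between two equally long columns of values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def replacement_pairs(columns, threshold):
    """Return the pairs of variable names whose distance is below `threshold`.

    columns: dict mapping variable name -> list of already-normalised values.
    """
    return [(a, b) for a, b in combinations(columns, 2)
            if euclidean(columns[a], columns[b]) < threshold]

def fill_from_replacement(target, replacement):
    """Fill None entries of `target` with the value at the same position in `replacement`."""
    return [r if t is None else t for t, r in zip(target, replacement)]

# Hypothetical normalised columns for four borrowers with complete data.
columns = {
    "income_band": [0.2, 0.5, 0.9, 0.4],
    "tax_band":    [0.2, 0.6, 0.9, 0.4],
    "age_band":    [0.9, 0.1, 0.3, 0.8],
}
pairs = replacement_pairs(columns, threshold=0.3)

# A record set in which the second borrower's income band is missing.
completed = fill_from_replacement([0.2, None, 0.9, 0.4], columns["tax_band"])
```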
Another preferred embodiment of step S405 specifically includes:
Step S521, if a model characteristic value of the borrower lacks data, calculating the mean or the median of that model characteristic value over all the borrowers.
Step S522, completing the missing model characteristic value of the borrower according to the calculated mean or median.
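Steps S521 and S522 amount to ordinary mean or median imputation, which might look like the sketch below. Representing missing data as `None`, computing the statistic over the borrowers whose value is present, and the names `impute` and `strategy` are assumptions of the example.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical wage incomes; the second borrower's value is missing.
incomes = [3000, None, 5000, 10000]
by_median = impute(incomes, strategy="median")
by_mean = impute(incomes, strategy="mean")
```

The median is less sensitive than the mean to a few very large values, which is one reason for offering both options.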
Another preferred embodiment of step S405 specifically includes: if a model characteristic value of the borrower lacks data, completing the missing model characteristic value of the borrower according to external statistical data.
The external statistical data is acquired at the stage of acquiring the sample data. The external statistical data refers to statistical data such as the employment rate of Shenzhen, the average wage in Shenzhen, and the like.
Not all variables affect the final evaluation result. To reduce the amount of data to be processed, the variables that are invalid for the evaluation result need to be filtered out before the logistic regression is performed, which specifically includes:
calculating the information value of each variable;
checking according to a preset value threshold value, and judging whether the variable is effective or not;
excluding invalid variables from the logistic regression.
The validity judgment may be performed at different stages: the variables may be evaluated before they are classified, so as to reduce the number of variables participating in clustering; alternatively, only the variables determined to be model characteristic values may be judged, so as to further reduce the independent variables participating in model building.
In practical applications, the evidence weight (WOE) is the logarithm of the ratio of the proportion of "good" borrowers to the proportion of "bad" borrowers for a characteristic, and is used to assess and compare the relative risk of different classes of a variable. The evidence weight is calculated as follows:
WOE=ln(DistrGoods/DistrBads),
where WOE represents the evidence weight of a given characteristic variable, DistrGoods represents the distribution proportion of "good" borrowers in the sample data for the characteristic variable, and DistrBads represents the distribution proportion of "bad" borrowers in the sample data for the characteristic variable. The larger a positive WOE, the lower the risk of credit default for that customer behavior; the larger the magnitude of a negative WOE, the higher the risk of credit default for that customer behavior. WOE converts variables into a uniform, information-bearing format, so that different types of variables can be handled in the same way. Converting variables into WOE also helps preserve degrees of freedom in small-sample problems. WOE is therefore used to compare different variables in a small sample data set. The information value evaluates the predictive ability of a characteristic variable and is calculated as follows:
IV=(DistrGoods-DistrBads)*WOE,
where IV represents the information value of a given characteristic variable, DistrGoods represents the distribution proportion of "good" borrowers in the sample data for the characteristic variable, DistrBads represents the distribution proportion of "bad" borrowers in the sample data for the characteristic variable, and WOE represents the evidence weight of the characteristic variable.
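As a worked illustration of the information-value check, the sketch below computes the evidence weight and the information value for each class of a discrete variable. It assumes the usual definition WOE = ln(DistrGoods/DistrBads), which agrees with the sign convention described above, and it sums the per-class IV terms into one value per variable, which is the common convention rather than something stated in this embodiment. It further assumes that every class contains at least one good and one bad borrower; the validity threshold of 0.02 and the toy housing data are hypothetical.

```python
import math

def woe_iv(categories, labels):
    """Compute WOE and IV per class of a discrete variable.

    labels: 0 = good customer, 1 = bad customer.
    Returns ({class: (woe, iv)}, total_iv).
    Assumes each class has at least one good and one bad borrower.
    """
    good_total = labels.count(0)
    bad_total = labels.count(1)
    table = {}
    for cat in sorted(set(categories)):
        good = sum(1 for c, y in zip(categories, labels) if c == cat and y == 0)
        bad = sum(1 for c, y in zip(categories, labels) if c == cat and y == 1)
        distr_goods = good / good_total      # share of all good borrowers in this class
        distr_bads = bad / bad_total         # share of all bad borrowers in this class
        woe = math.log(distr_goods / distr_bads)
        table[cat] = (woe, (distr_goods - distr_bads) * woe)
    total_iv = sum(iv for _, iv in table.values())
    return table, total_iv

# Hypothetical variable: housing status of ten borrowers.
housing = ["owner"] * 5 + ["renter"] * 5
default = [0, 0, 0, 0, 1] + [0, 0, 1, 1, 1]
table, total_iv = woe_iv(housing, default)

IV_THRESHOLD = 0.02                          # assumed preset value threshold
is_valid = total_iv >= IV_THRESHOLD          # invalid variables are left out of the regression
```

Here owners have a positive WOE (lower default risk) and renters a negative one, and the variable's total information value of roughly 0.75 comfortably exceeds the assumed threshold.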
As shown in fig. 2, the present embodiment provides a data processing apparatus for optimizing a credit evaluation model, based on the same inventive concept as the above-described data mining method for credit evaluation, including:
the data acquisition module is used for acquiring the related information of the borrower as sample data;
the sample dividing module is used for dividing the sample data into a training set and a test set;
the model training module is used for carrying out data modeling by utilizing the training set to obtain a preliminary evaluation model;
the model testing module is used for testing the preliminary evaluation model by utilizing the test set; if the test result does not meet the evaluation standard, the sample data is re-divided into a training set and a test set, and data modeling and testing are performed again using the re-divided training set and test set; and if the test result meets the evaluation standard, the training is finished and a final evaluation model is determined.
Preferably, as shown in fig. 3, the model training module specifically includes:
the first classification module is used for performing segmentation processing on the continuous variables in the training set by adopting a decision tree algorithm and converting the continuous variables into discrete variables;
the second classification module is used for classifying the discrete variables in the training set by adopting a clustering algorithm;
the variable merging module is used for merging the variables according to the classification result and determining a preliminary model characteristic value;
and the logistic regression module is used for carrying out logistic regression on the sample data of the model characteristic value to establish a preliminary evaluation model.
Preferably, the device further comprises a replacement variable module for:
calculating Euclidean distances between variables;
determining two variables whose Euclidean distance is smaller than a threshold to be replacement variables for each other.
Preferably, the device further comprises a data completion module for: before the logistic regression is performed, completing the data of a model characteristic value if the model characteristic value of the borrower lacks data.
Preferably, the data completion module is specifically configured to:
if the model characteristic value of the borrower lacks data, searching a replacement variable of the model characteristic value;
and completing the data of the model characteristic value according to the searched data of the replacement variable.
Preferably, the data completion module is configured to:
if the model characteristic values of the borrowers lack data, calculating the mean value or the median value of the model characteristic values of all the borrowers;
and completing the model characteristic value of the missing data of the borrower according to the calculated mean value or the calculated median value.
Preferably, the data acquisition module may be further configured to acquire external statistical data; correspondingly, the data completion module is specifically configured to: and if the model characteristic value of the borrower lacks data, supplementing the model characteristic value of the borrower lacking data according to the external statistical data.
Preferably, the device further comprises a variable cleaning module for: calculating the information value of each variable before the logistic regression is performed; checking the information value against a preset value threshold to judge whether the variable is valid; and excluding invalid characteristic variables from the logistic regression.
The data mining device for credit evaluation provided by the embodiment and the data mining method for credit evaluation have the same inventive concept and the same beneficial effects, and are not repeated herein.
Based on the same inventive concept as the above-described data mining method for credit evaluation, the present implementation provides a computer-readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the method as described in any of the method embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.