[Technical Field]
The present invention relates to the field of financial technology, and in particular to a personal credit scoring method and system, an electronic device, and a storage medium.
[Background]
In recent years, as inclusive finance has deepened, personal credit business has grown rapidly. Most lending has moved online, where customers' loan requests can be met quickly, but because these loans are unsecured, delinquency rates have kept rising. In this setting, assessing an individual's creditworthiness efficiently and accurately, and identifying default risk, is particularly important. Most existing personal credit assessment methods build a credit assessment model on Internet data, including operational behavior data from specific applications (such as the time spent browsing a loan product introduction page), social network data, and historical credit record data. These methods have the following main drawbacks:
(1) They rely on Internet data whose authenticity is questionable (the data is obtained by web crawlers and similar means and has not been verified);
(2) They rely too heavily on historical credit records, which makes it difficult to assess individuals who have no credit history;
(3) They build a single assessment model, which is difficult to adjust to a specific business and is therefore inflexible.
[Summary of the Invention]
In view of this, embodiments of the present invention provide a personal credit scoring method and system, an electronic device, and a storage medium. By building multiple sub-assessment models that start from multiple dimensions, an individual's credit can be assessed comprehensively across a variety of business scenarios, which improves both the flexibility and the accuracy of the personal credit scoring method.
In one aspect, an embodiment of the present invention provides a personal credit scoring method, comprising: preprocessing original sample data to generate new sample data, wherein the data format of the new sample data meets the format requirements of a logistic regression algorithm; dividing the new sample data into a training data set and a test data set; constructing a personal credit scoring model that includes four sub-models; inputting the variables in the training data set into each sub-model according to a preset correspondence to perform logistic regression, and calculating a scoring rule for each sub-model; obtaining a weight for each sub-model; calculating the scoring rule of the personal credit scoring model according to the weight and the scoring rule of each sub-model; and receiving personal credit data of an individual, inputting the personal credit data into the personal credit scoring model, and calculating the individual's credit score according to the scoring rule of the personal credit scoring model.
The original sample data includes four categories of sub-sample data, and the four categories correspond one-to-one to the four sub-models.
In one embodiment of the present invention, the four categories of sub-sample data are: identity information sample data, asset status sample data, credit history sample data, and consumption behavior sample data; the four sub-models are: an identity information sub-model, an asset status sub-model, a credit history sub-model, and a consumption behavior sub-model.
In one embodiment of the present invention, preprocessing the original sample data further includes: cleaning the original sample data to obtain cleaned sample data; classifying the variables in the cleaned sample data into multiple continuous variables and multiple nominal variables; binning the continuous variables one by one to generate a weight-of-evidence (WOE) value and an information value for each continuous variable; and performing cardinality reduction on the nominal variables to generate at least one new nominal variable, and calculating a WOE value and an information value for each new nominal variable; wherein the new sample data includes the WOE value and information value of each continuous variable and the WOE value and information value of each new nominal variable.
In one embodiment of the present invention, cleaning the original sample data further includes: examining each group of data in the original sample data and determining whether the data value of any category in that group is missing, and, when the data value of a first category in a first group of data is missing, replacing it with zero or with the average value of the first category; or examining each group of data in the original sample data and determining whether the data value of any category in that group is abnormal, and, when the data value of a second category in a second group of data is abnormal, removing the second group of data.
In one embodiment of the present invention, dividing the new sample data into a training data set and a test data set further includes: downsampling the new sample data to generate standard sample data; and dividing the standard sample data into a training data set and a test data set.
In one embodiment of the present invention, downsampling the new sample data further includes: dividing the new sample data into good sample data and bad sample data; and randomly drawing, without replacement, a plurality of good samples from the good sample data, wherein the number of good samples drawn is 2 to 4 times the number of bad samples; the standard sample data includes the drawn good samples and all of the bad samples.
In one embodiment of the present invention, inputting the variables in the training data set into each sub-model according to the preset correspondence to perform logistic regression, and calculating the scoring rule of each sub-model, further includes: inputting the variables in the training data set into each sub-model according to the preset correspondence to perform stepwise regression, generating an initial logistic regression coefficient for each variable in each sub-model; removing interfering variables from each sub-model according to the initial logistic regression coefficients; selectively binning the remaining variables of each sub-model at least once more to generate their WOE values and information values; inputting the re-binned remaining variables into the corresponding sub-model to perform logistic regression and obtain the logistic regression coefficients of the remaining variables; and calculating the scoring rule of each sub-model according to the logistic regression coefficient of each remaining variable in that sub-model.
In one embodiment of the present invention, removing interfering variables from each sub-model according to the initial logistic regression coefficients further includes: determining whether the initial logistic regression coefficient of each variable in each sub-model is significant, and, when the initial coefficient of a first variable in a first sub-model is not significant, removing that variable; and/or determining whether the sign of the initial logistic regression coefficient of each variable in each sub-model matches a preset coefficient sign, and, when the sign of the initial coefficient of a first variable in a first sub-model does not match the preset sign, removing that variable; and/or evaluating the correlation among the variables in each sub-model, and, when the correlation among N variables in a sub-model exceeds a preset correlation, removing N-1 of those N variables, where N is an integer greater than one.
In one embodiment of the present invention, inputting the variables in the training data set into each sub-model according to the preset correspondence to perform stepwise regression, and generating the initial logistic regression coefficient of each variable in each sub-model, further includes: selecting model-entry variables according to the information values of the variables in the training data set and preset experience, and obtaining the preset correspondence between the model-entry variables and each sub-model; and inputting the variables in the sample data into each sub-model according to the preset correspondence to perform logistic regression training, obtaining the initial logistic regression coefficient of each remaining variable in each sub-model.
In one embodiment of the present invention, the variables in the training data set include the model-entry variables and variables that did not enter the model, and calculating the scoring rule of each sub-model according to the logistic regression coefficient of each remaining variable in that sub-model further includes: evaluating whether a variable that did not enter the model has a scoring rule, and, when it does, assigning a coefficient to that variable; and calculating the scoring rule of each sub-model according to the logistic regression coefficient of each remaining variable in that sub-model and the coefficients assigned to the variables that did not enter the model.
In one embodiment of the present invention, obtaining the weight of each sub-model further includes: inputting the test data set into each sub-model according to the preset correspondence for testing, and obtaining the AUC value of each sub-model; and calculating the weight of each sub-model according to its AUC value and its preset weight.
In a second aspect, an embodiment of the present invention provides a personal credit scoring system, comprising: a preprocessing unit, configured to preprocess original sample data to generate new sample data, wherein the data format of the new sample data meets the format requirements of a logistic regression algorithm; a data partitioning unit, configured to divide the new sample data into a training data set and a test data set; a sub-model construction unit, configured to construct four sub-models; a sub-model scoring rule acquisition unit, configured to input the variables in the training data set into each sub-model according to a preset correspondence to perform logistic regression, and to calculate the scoring rule of each sub-model; a sub-model weight acquisition unit, configured to obtain the weight of each sub-model; and a credit scoring unit, configured to calculate the scoring rule of the personal credit scoring model according to the weight and the scoring rule of each sub-model, and to output an individual's credit score according to received personal credit data and the scoring rule of the personal credit scoring model; wherein the original sample data includes four categories of sub-sample data, and the four categories correspond one-to-one to the four sub-models.
In one embodiment of the present invention, the four categories of sub-sample data are: identity information sample data, asset status sample data, credit history sample data, and consumption behavior sample data; the four sub-models are: an identity information sub-model, an asset status sub-model, a credit history sub-model, and a consumption behavior sub-model.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, the computer program being used to execute the personal credit scoring method described above.
In a fourth aspect, an embodiment of the present invention provides an electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the personal credit scoring method described above.
The personal credit scoring method provided by the embodiments of the present invention builds multiple sub-assessment models that start from multiple dimensions. This allows an individual's credit to be assessed more comprehensively, avoids over-reliance on any single credit record, and improves the accuracy of the method. In addition, with multiple sub-assessment models the assessment can be adjusted to suit a variety of business scenarios, which improves the flexibility of the method.
[Brief Description of the Drawings]
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic flowchart of a personal credit scoring method provided by an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a personal credit scoring method provided by another embodiment of the present invention;
FIG. 3 is a schematic flowchart of a personal credit scoring method provided by another embodiment of the present invention;
FIG. 4 is a schematic flowchart of a personal credit scoring method provided by another embodiment of the present invention;
FIG. 5 is a schematic flowchart of a personal credit scoring method provided by another embodiment of the present invention;
FIG. 6 is a schematic flowchart of a personal credit scoring method provided by another embodiment of the present invention;
FIG. 7 is a schematic flowchart of a personal credit scoring method provided by another embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a personal credit scoring system provided by an embodiment of the present invention.
[Detailed Description]
For a better understanding of the technical solutions of the present invention, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
It should be clear that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without inventive effort fall within the scope of protection of the present invention.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said", and "the" used in the embodiments and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and covers three cases. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
FIG. 1 is a schematic flowchart of a personal credit scoring method provided by an embodiment of the present invention. The method includes the following steps:
Step S101: preprocess the original sample data to generate new sample data, the data format of which meets the format requirements of a logistic regression algorithm;
Step S102: divide the new sample data into a training data set and a test data set;
Step S103: construct a personal credit scoring model that includes four sub-models;
Step S104: input the variables in the training data set into each sub-model according to a preset correspondence to perform logistic regression, and calculate the scoring rule of each sub-model;
Step S105: obtain the weight of each sub-model;
Step S106: calculate the scoring rule of the personal credit scoring model according to the weight and the scoring rule of each sub-model; and
Step S107: receive personal credit data, input it into the personal credit scoring model, and calculate the individual's credit score according to the scoring rule of the personal credit scoring model.
The original sample data includes four categories of sub-sample data, which correspond one-to-one to the four sub-models. In the personal credit scoring method provided by this embodiment, each category of sub-sample data is input into its corresponding sub-model for logistic regression, and the personal credit score is then calculated from the weight and the scoring rule of each sub-model. Building multiple sub-assessment models from multiple dimensions allows an individual's credit to be assessed more comprehensively, avoids over-reliance on any single credit record, and improves the accuracy of the method. It also allows the assessment to be adjusted to suit a variety of business scenarios, which improves the flexibility of the method.
In one embodiment of the present invention, the four categories of sub-sample data are: identity information sample data, asset status sample data, credit history sample data, and consumption behavior sample data; the four sub-models are: an identity information sub-model, an asset status sub-model, a credit history sub-model, and a consumption behavior sub-model. The identity information sample data may include the user's age, whether the user has children, the nature of the user's occupation, the nature of the occupation of the user's spouse, and so on. The asset status sample data may include the tier of the user's bank cards, the number of bank cards and the names of the issuing banks, the user's salary income, the user's investment holdings, and other data that reflects the user's financial standing. The credit history sample data may include historical loan applications, historical repayments, historical delinquencies, the date of the most recent loan, and so on. The consumption behavior sample data may include the user's total spending, the number of transactions and the amount of each, the items purchased, and other information that reflects the user's spending capacity and habits. In this embodiment, the identity information, asset status, credit history, and consumption behavior sample data are input into the corresponding identity information, asset status, credit history, and consumption behavior sub-models, respectively, for logistic regression, and the user's score is then calculated. Because the user's actual consumption scenarios are taken into account, the user's real spending capacity can be assessed, which further improves the accuracy of the credit score.
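The way the four sub-model results might be merged into one score can be sketched as follows. This is a minimal illustration only: the text states that the overall scoring rule is calculated from each sub-model's weight and scoring rule but gives no formula, so the normalized weighted sum, the function name `combine_scores`, and the sub-model keys below are all assumptions.

```python
# Hypothetical keys for the four sub-models named in the text.
SUB_MODELS = ("identity", "assets", "credit_history", "consumption")


def combine_scores(sub_scores, weights):
    """Combine per-sub-model scores into one credit score.

    Assumption: the overall score is a weighted sum of the sub-model
    scores, with the weights normalized so that they sum to 1.
    """
    total = sum(weights[name] for name in SUB_MODELS)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(sub_scores[name] * weights[name] / total for name in SUB_MODELS)
```

With equal weights, sub-model scores of 600, 650, 700, and 550 combine to 625.0. Raising the weight of one sub-model pulls the overall score toward that sub-model's score, which is one way the per-business adjustment described above could be realized.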
Step S101 in effect converts the original sample data into a format that meets the requirements of the logistic regression algorithm. As shown in FIG. 2, step S101 specifically includes the following steps:
Step S1011: clean the original sample data to obtain cleaned sample data;
Step S1012: classify the variables in the cleaned sample data into multiple continuous variables and multiple nominal variables;
Step S1013: bin the continuous variables one by one, generating a WOE value and an information value for each variable;
In step S1013, although the original sample data is grouped into four categories (for example, aggregated consumption data is grouped under consumption behavior sample data), continuous variables in different categories may be correlated or even identical. To bin the continuous variables well, all continuous variables in the original sample data are therefore binned together, rather than first being assigned to the four categories and then binned within each category.
Step S1014: perform cardinality reduction on the nominal variables to generate at least one new nominal variable, and calculate a WOE value and an information value for each new nominal variable;
The new sample data includes the WOE value and information value of each continuous variable and the WOE value and information value of each new nominal variable.
Once step S1014 is complete, preprocessing of the original sample data is finished and the new sample data is obtained, in a format that meets the requirements of the logistic regression algorithm. Step S102 is then performed.
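The WOE and information value computed in steps S1013 and S1014 can be sketched for a single continuous variable as follows. The text does not give the formulas, so this sketch uses the commonly cited definitions, WOE_i = ln((good_i / good_total) / (bad_i / bad_total)) and IV = sum over bins of (good_i / good_total - bad_i / bad_total) * WOE_i. The function name `woe_iv`, the label convention (1 = bad, 0 = good), the fixed cut points, and the 0.5 smoothing term are assumptions made for illustration.

```python
import math


def woe_iv(values, labels, edges):
    """Return (WOE per bin, information value) for one continuous variable.

    values: list of numbers; labels: 1 = bad sample, 0 = good sample
    (label convention is an assumption); edges: sorted cut points, so a
    value v falls into bin number sum(v > e for e in edges).
    """
    n_bins = len(edges) + 1
    good = [0] * n_bins
    bad = [0] * n_bins
    for v, y in zip(values, labels):
        idx = sum(v > e for e in edges)  # index of the bin containing v
        if y == 1:
            bad[idx] += 1
        else:
            good[idx] += 1
    total_good, total_bad = sum(good), sum(bad)
    woe, iv = [], 0.0
    for g, b in zip(good, bad):
        # 0.5 smoothing (an assumption) avoids log(0) for empty cells.
        p_good = (g + 0.5) / (total_good + 0.5 * n_bins)
        p_bad = (b + 0.5) / (total_bad + 0.5 * n_bins)
        w = math.log(p_good / p_bad)
        woe.append(w)
        iv += (p_good - p_bad) * w
    return woe, iv
```

A bin dominated by bad samples gets a negative WOE and a bin dominated by good samples gets a positive one; the information value is never negative, and larger values indicate a variable that separates good from bad samples more strongly.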
In practice, the data in each category of the original sample data is often incomplete and may also contain abnormal values. Both incomplete and abnormal data affect the scoring of the sub-models. Therefore, in one embodiment of the present invention, as shown in FIG. 3, step S1011 further includes:
Step S10111: examine each group of data in the original sample data and determine whether the data value of any category in that group is missing; when the data value of a first category in a first group of data is missing, replace it with zero or with the average value of the first category;
For example, in the identity information sample data, when the "has children" field is missing for a user, that field is replaced with the average value of the "has children" field across users. The value is missing because it was not observed, so the average better reflects the likely value of the missing data.
As another example, in the credit history sample data, when a user's credit borrowing record is missing, the user has never borrowed by any means, so the missing value is replaced with 0. This correctly represents a user with no credit history. In other words, when a user has no credit history, the personal credit scoring method provided by the embodiments of the present application can still assess the user's credit comprehensively from the other dimensions (such as identity information, actual consumption information, and asset status), and can therefore score such users more accurately and objectively than prior-art credit scoring methods.
It should be understood that whether a missing value is replaced with 0 or with the average can be decided according to the actual data category. When a value is missing because it was not observed and the field has few possible values, for example whether the user has children, owns real estate, or owns a car, each of which is either yes (for example, encoded as 1) or no (for example, encoded as 0), replacing the missing value with the average better reflects its likely value.
When a value is missing because of the actual situation, for example a missing credit borrowing record indicating that the user has never borrowed by any means, the absence reflects reality, and replacing the missing value with 0 better reflects the true value.
步骤S10111是对原样本数据中缺失的数据的清洗方法,在对原样本数据进性清洗时,不仅要对缺失数据进行补充,还需要对原样本数据中的异常数据进行清洗,即执行步骤S10112。Step S10111 is a method for cleaning missing data in the original sample data. When cleaning the original sample data, not only the missing data needs to be supplemented, but also the abnormal data in the original sample data needs to be cleaned, that is, execute step S10112.
步骤S10112:对原样本数据中的每组数据进行识别,判断每组数据中的每个类别的数据值是否异常,当第二组数据中第二类别的数据值存在异常时,将第二组数据剔除。Step S10112: Identify each group of data in the original sample data, and determine whether the data value of each category in each group of data is abnormal. When the data value of the second category in the second group of data is abnormal, the second group of data is eliminated.
例如在用户的资产状况样本数据中,该用户持有的银行卡数量高达几十张,甚至上百张,那么该用户的资产状况样本数据则为异常数据,将该用户的资产状况样本数据剔除。For example, in the user's asset status sample data, the number of bank cards held by the user is as high as dozens or even hundreds, then the user's asset status sample data is abnormal data and the user's asset status sample data is eliminated.
再例如,在用户的消费行为样本数据中,消费记录中的其中一项的消费金额远远大于剩余消费金额,那该用户的消费记录数据很有可能为异常数据,那么将该用户的消费记录这一数据进行剔除。For another example, in the user's consumption behavior sample data, if the consumption amount of one item in the consumption record is much greater than the remaining consumption amount, then the user's consumption record data is likely to be abnormal data, and the user's consumption record data is eliminated.
步骤S10112是对原样本数据中的异常数据进行剔除。Step S10112 is to remove abnormal data from the original sample data.
应当理解,步骤S10111是对缺失数据进行补充,步骤S10112是对异常数据进行剔除,该两个步骤可以同时执行,也可以仅执行其中一个步骤。本发明实施例对此不作限定。It should be understood that step S10111 is to supplement the missing data, and step S10112 is to remove the abnormal data, and the two steps can be performed simultaneously, or only one of the steps can be performed. The embodiment of the present invention does not limit this.
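The record-level removal in step S10112 might look like the sketch below. The source gives no numeric cutoffs, so `MAX_BANK_CARDS` and `SPEND_RATIO_LIMIT` are assumed values, and the field names `bank_cards` and `purchases` are hypothetical. Following the wording of step S10112, the whole record is dropped when any check fails.

```python
from statistics import median

MAX_BANK_CARDS = 20     # assumed cutoff; the source only says "dozens or even hundreds"
SPEND_RATIO_LIMIT = 50  # assumed: one purchase above 50x the median of the others is suspicious

def is_abnormal(record):
    """Flag a record whose bank-card count or largest purchase looks implausible."""
    if record["bank_cards"] > MAX_BANK_CARDS:
        return True
    amounts = record["purchases"]
    if len(amounts) >= 2:
        largest = max(amounts)
        rest = sorted(amounts)[:-1]  # every purchase except the largest one
        if largest > SPEND_RATIO_LIMIT * median(rest):
            return True
    return False

def drop_abnormal(records):
    """Keep only the records that pass every abnormality check."""
    return [r for r in records if not is_abnormal(r)]
```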
本发明实施例通过对缺失数据进行补充以及对异常数据进行剔除,降低了样本数据中的异常数据,提高了各子模型的评分规则的准确性,进一步提高了个人信用评分判断的准确性。The embodiment of the present invention reduces the abnormal data in the sample data by supplementing the missing data and eliminating the abnormal data, improves the accuracy of the scoring rules of each sub-model, and further improves the accuracy of the personal credit score judgment.
当步骤S101完成对原样本数据处理成符合逻辑回归算法的格式要求后,即执行步骤S102,即将新样本数据进行分为训练数据集以及测试数据集。在本发明一实施例中,如图4所示,步骤S102具体的包括以下步骤:After step S101 completes processing the original sample data into a format that meets the requirements of the logistic regression algorithm, step S102 is executed, that is, the new sample data is divided into a training data set and a test data set. In one embodiment of the present invention, as shown in FIG4 , step S102 specifically includes the following steps:
步骤S1021:对新样本数据进行下采样处理,生成标准样本数据;以及Step S1021: down-sampling the new sample data to generate standard sample data; and
步骤S1022:将标准样本数据分为训练数据集以及测试数据集。Step S1022: Divide the standard sample data into a training data set and a test data set.
在将标准样本数据划分为训练数据集以及测试数据集时,训练数据集的样本数量与测试数据集的样本数量之比可以为8:2。When the standard sample data is divided into a training data set and a test data set, the ratio of the number of samples in the training data set to the number of samples in the test data set may be 8:2.
优选的,对新样本数据进行下采样生成标准样本数据时,为了使得标准样本数据更能反映真实数据情况,如图5所示,步骤S1021(即对新样本数据进行下采样处理)具体包括以下步骤:Preferably, when downsampling the new sample data to generate standard sample data, in order to make the standard sample data more reflective of the real data situation, as shown in FIG5 , step S1021 (i.e., downsampling the new sample data) specifically includes the following steps:
步骤S10211:将新样本数据分为好样本数据和坏样本数据;以及Step S10211: Divide the new sample data into good sample data and bad sample data; and
步骤S10212:从好样本数据中无放回地随机抽取多个好样本数据,其中抽取的好样本数据的数量为坏样本数据的数量的2~4倍。Step S10212: randomly draw good samples from the good sample data without replacement, where the number of good samples drawn is 2 to 4 times the number of bad samples.
优选的,抽取的好样本数据的数量为坏样本数据的数量的3倍;Preferably, the number of good sample data extracted is three times the number of bad sample data;
由于新样本数据中好样本的数量远远大于坏样本的数量,因此为了使得标准样本数据更能真实反映实际数据,在生成标准样本数据时,将原样本数据中的坏样本数据全部保留,即标准样本数据包括抽取的好样本数据以及全部坏样本数据。Because good samples far outnumber bad samples in the new sample data, all of the bad sample data in the original sample data is retained when generating the standard sample data, so that the standard sample data better reflects the actual data. That is, the standard sample data consists of the extracted good sample data and all of the bad sample data.
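The down-sampling and the 8:2 split described above can be sketched as follows. The record layout, the `label` field (assumed here to be 1 for a bad sample and 0 for a good one), and the fixed random seeds are assumptions for the example; the 3x ratio is the preferred value stated in the text.

```python
import random

def downsample(samples, ratio=3, seed=0):
    """Keep every bad sample and draw `ratio` times as many good samples without replacement."""
    good = [s for s in samples if s["label"] == 0]
    bad = [s for s in samples if s["label"] == 1]
    rng = random.Random(seed)
    n_good = min(len(good), ratio * len(bad))  # cannot draw more good samples than exist
    return rng.sample(good, n_good) + bad      # random.sample draws without replacement

def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into training and test sets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

For 100 good and 10 bad samples, this yields 40 standard samples (30 good, 10 bad), split into 32 for training and 8 for testing.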
当步骤S102完成将新样本数据划分为训练数据集以及测试数据集后,即执行步骤S103(即构建四个子模型),步骤S103完成四个子模型的建立后,即执行步骤S104(即对四个子模型进行逻辑回归训练,计算每个子模型的评分规则),在本发明一实施例中,如图6所示,步骤S104具体包括以下步骤:When step S102 completes dividing the new sample data into a training data set and a test data set, step S103 is executed (i.e., building four sub-models). After step S103 completes building the four sub-models, step S104 is executed (i.e., performing logistic regression training on the four sub-models and calculating the scoring rules for each sub-model). In one embodiment of the present invention, as shown in FIG6 , step S104 specifically includes the following steps:
步骤S1041:将训练数据集中的变量按照预设对应关系分别输入至每个子模型中进行逐步回归,生成每个子模型中每个变量的最初逻辑回归系数;Step S1041: input the variables in the training data set into each sub-model according to the preset corresponding relationship to perform stepwise regression, and generate the initial logistic regression coefficient of each variable in each sub-model;
由于在原样本预处理阶段之前,已经将样本数据按照四个不同的类别进行了归类(例如消费集聚数据归类为消费行为样本数据),但是在步骤S1013中,将原样本数据中的所有变量一起进行分箱处理,因此在将训练数据集中的变量输入至每个子模型中进行逐步回归时,需要在多个变量中选取需要入子模型的变量,即按照预设对应关系将多个变量中的变量对应输入至四个子模型中。Since the sample data have been classified into four different categories before the original sample preprocessing stage (for example, consumption cluster data are classified into consumption behavior sample data), but in step S1013, all variables in the original sample data are binned together. Therefore, when inputting the variables in the training data set into each sub-model for stepwise regression, it is necessary to select the variables that need to be entered into the sub-model from multiple variables, that is, to input the variables in the multiple variables into the four sub-models according to the preset corresponding relationship.
例如:训练样本数据集中包括m个变量,可以按照预设对应关系在m个变量中选取f个变量输入至第一子模型中进行逻辑回归训练,选择a个变量输入至第二子模型中,选择b个变量输入至第三子模型中,选择c个变量输入至第四子模型中。而f个变量、a个变量、b个变量、c个变量中变量的类别互不重叠。For example: the training sample data set includes m variables. According to the preset corresponding relationship, f variables can be selected from the m variables and input into the first sub-model for logistic regression training, a variables can be selected to input into the second sub-model, b variables can be selected to input into the third sub-model, and c variables can be selected to input into the fourth sub-model. The categories of the variables in f variables, a variables, b variables, and c variables do not overlap.
预设对应关系的获取方法可以包括:根据训练数据集中的多个变量的信息值以及预设经验(例如专家经验)获取入模变量,以及入模变量与每个子模型的预设对应关系。The method for obtaining the preset corresponding relationship may include: obtaining the input variables according to the information values of multiple variables in the training data set and preset experience (such as expert experience), and the preset corresponding relationship between the input variables and each sub-model.
步骤S1042:根据每个子模型中的每个变量的最初逻辑回归系数,剔除每个子模型中的干扰变量;Step S1042: according to the initial logistic regression coefficient of each variable in each sub-model, eliminate the interference variables in each sub-model;
由于输入一个子模型的变量之间很有可能有关联性,也很有可能某一变量的系数并不显著,导致子模型的评分准确率低,因此,需要根据最初逻辑回归系数剔除干扰变量。Since there is a high probability that there is a correlation between the variables input into a sub-model, it is also possible that the coefficient of a certain variable is not significant, resulting in a low scoring accuracy of the sub-model. Therefore, it is necessary to eliminate the interfering variables based on the initial logistic regression coefficients.
步骤S1043:根据每个子模型中剩余变量,选择性的对所述剩余变量进行至少一次分箱处理,生成剩余变量的WOE值和信息值;Step S1043: According to the remaining variables in each sub-model, selectively perform at least one binning process on the remaining variables to generate WOE values and information values of the remaining variables;
由于在步骤S1013中,将原样本数据中的所有变量一起进行分箱处理,可能会使得每个类别的变量样本数量较少,那么该变量的好样本和坏样本的比例很不稳定(例如可能异常大或者异常小),因此若将该数量较少的变量输入至一个子模型中进行逻辑回归训练,该变量的逻辑回归系数不合理。因此,在变量输入至子模型进行逻辑回归训练后,需要根据每个子模型中的剩余变量的最初逻辑回归系数再进行至少一次的分箱处理。例如,一个用户的信用历史样本数据中某段时期内的逾期次数这类变量,按照常理,次数越多时这个人越有可能是坏用户,评分也应越低,这个趋势是单调的;而在不调整的情况下,可能出现这个趋势先下降后上升,这就有可能是分箱不合理导致的,所以需要对该用户的信用历史样本数据进行再次分箱调整。Because step S1013 bins all variables in the original sample data together, the number of samples for each category of variable may be small, which makes the ratio of good to bad samples for that variable unstable (for example, abnormally large or abnormally small). If such a variable is input into a sub-model for logistic regression training, its logistic regression coefficient will be unreasonable. Therefore, after the variables have been input into the sub-models for logistic regression training, the remaining variables in each sub-model need to be binned at least once more according to their initial logistic regression coefficients. Take the number of overdue payments within a given period in a user's credit history sample data: by common sense, the more overdue payments there are, the more likely the user is a bad user and the lower the score should be, so the trend is monotonic. Without adjustment, the trend may first fall and then rise, which may be caused by unreasonable binning, so the user's credit history sample data needs to be re-binned.
步骤S1044:将每个子模型中经过至少一次分箱处理后的剩余变量分别对应输入至每个子模型中进行逻辑回归,获取剩余变量的逻辑回归系数;以及Step S1044: inputting the remaining variables after at least one binning process in each sub-model into each sub-model to perform logistic regression, and obtaining the logistic regression coefficients of the remaining variables; and
步骤S1045:根据每个子模型中每个剩余变量的逻辑回归系数计算每个子模型的评分规则。Step S1045: Calculate the scoring rule for each sub-model according to the logistic regression coefficient of each remaining variable in each sub-model.
本发明实施例中,将变量输入至对应的子模型后进行逐步逻辑回归训练,生成每个变量的最初逻辑回归系数,并根据最初逻辑回归系数选择性的对异常的变量进行进一步的清理以及调整分箱,能够更加准确的评估个人信用值。In the embodiment of the present invention, after the variables are input into the corresponding sub-models, stepwise logistic regression training is performed to generate the initial logistic regression coefficient of each variable, and abnormal variables are selectively cleaned further and re-binned according to the initial logistic regression coefficients, so that the personal credit value can be evaluated more accurately.
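The re-binning check in step S1043 can be illustrated with a short sketch. The source does not give formulas for the WOE value or the information value, so the widely used definitions are assumed here: WOE of a bin is the natural log of (share of good samples in the bin / share of bad samples in the bin), and the information value is the sum over bins of (good share minus bad share) times WOE. Merging two adjacent bins is only one possible way to re-bin, and the function names are hypothetical.

```python
import math

def woe_iv(bins):
    """bins: list of (good_count, bad_count) per bin, in bin order.
    Returns (list of WOE values, information value). Assumes no count is zero."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        good_share = g / total_good
        bad_share = b / total_bad
        woe = math.log(good_share / bad_share)
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv

def is_monotonic(values):
    """True if the sequence never changes direction."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

def merge_bins(bins, i):
    """Merge bin i with bin i + 1 and return the new list of bins."""
    g = bins[i][0] + bins[i + 1][0]
    b = bins[i][1] + bins[i + 1][1]
    return bins[:i] + [(g, b)] + bins[i + 2:]
```

For an overdue-count variable binned as 0, 1, 2, and 3 or more, counts such as `[(800, 20), (150, 30), (30, 5), (20, 45)]` give a WOE sequence that falls, rises, then falls again; merging the two middle bins restores a monotonic trend.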
在本发明一实施例中,在步骤S1041中将训练数据集中的变量按照预设对应关系分别输入至每个子模型中进行逐步回归时,并不是训练数据集中所有的变量都选择入一个子模型中,例如一个用户是否有房车的数据输入了资产状况子模型中,并没有输入至身份信息子模型,但是用户是否有房车的数据可能对于身份信息评估时具有一定的重要性,那么在步骤S1045(即计算每个子模型的评分规则时),步骤S1045还可以包括:In one embodiment of the present invention, when the variables in the training data set are input into each sub-model according to the preset correspondence for stepwise regression in step S1041, not every variable in the training data set is selected into a given sub-model. For example, the data on whether a user owns a house or a car is input into the asset status sub-model but not into the identity information sub-model, even though it may carry some importance for evaluating identity information. In that case, step S1045 (calculating the scoring rule of each sub-model) may further include:
步骤S10451:评估未入模变量是否具备评分规则,当未入模变量具备评分规则时,赋予未入模变量的系数,例如根据用户是否有房车的数据在身份信息这个背景里的意义来评估用户是否有房这一变量的系数;以及Step S10451: evaluate whether an unmodeled variable has a scoring rule; when it does, assign a coefficient to the unmodeled variable, for example by evaluating the coefficient of the house-ownership variable from the significance that the data on whether the user owns a house or a car has in the context of identity information; and
步骤S10452:根据每个子模型中每个剩余变量的逻辑回归系数以及未入模变量的系数计算每个子模型的评分规则。Step S10452: Calculate the scoring rule for each sub-model based on the logistic regression coefficient of each remaining variable in each sub-model and the coefficient of the unmodeled variable.
本发明实施例通过将未入一个子模型的变量根据在该背景下的意义适当的赋予系数,在计算该子模型的评分规则时,除了考虑该子模型中的剩余变量的逻辑回归系数之外,还应考虑该未入模变量被赋予的系数。增加了个人信用评分的准确性。The embodiment of the present invention increases the accuracy of personal credit scoring by appropriately assigning coefficients to variables that are not included in a sub-model according to their meaning in the context, and when calculating the scoring rule of the sub-model, in addition to considering the logistic regression coefficients of the remaining variables in the sub-model, also considering the coefficients assigned to the variables that are not included in the model.
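One possible reading of steps S10451 and S10452 is sketched below. The source does not state the form of a scoring rule, so a plain linear combination is assumed: fitted logistic regression coefficients applied to the modeled variables, plus manually assigned coefficients applied to the unmodeled ones. The function name, the variable names, and the coefficient values are hypothetical.

```python
def sub_model_score(values, fitted_coefs, assigned_coefs, intercept=0.0):
    """Linear score for one sub-model.

    values:         variable name -> value for one user (e.g. a WOE-encoded value)
    fitted_coefs:   variable name -> logistic regression coefficient (modeled variables)
    assigned_coefs: variable name -> expert-assigned coefficient (unmodeled variables)
    """
    score = intercept
    for name, coef in fitted_coefs.items():
        score += coef * values[name]
    for name, coef in assigned_coefs.items():
        score += coef * values[name]
    return score
```

For instance, an identity information sub-model fitted on `age_woe` and `job_woe` could still credit `has_house` through an assigned coefficient, even though that variable was only modeled in the asset status sub-model.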
上述介绍了步骤S1042中剔除每个子模型中的干扰变量,可以使得每个子模型的评分更加准确,那么,在本发明一实施例中,如图7所示,步骤S1042具体可以包括以下步骤:As described above, removing the interference variables in each sub-model in step S1042 can make the score of each sub-model more accurate. Then, in one embodiment of the present invention, as shown in FIG7 , step S1042 may specifically include the following steps:
步骤S10421:判断每个子模型中的每个变量的最初逻辑回归系数是否显著,当第一子模型中的第一变量的最初逻辑回归系数不显著时,将变量剔除;当第一子模型中的第一变量的最初逻辑回归系数显著时,将变量归为剩余变量,并进一步被执行步骤S1043。和/或Step S10421: Determine whether the initial logistic regression coefficient of each variable in each sub-model is significant. When the initial logistic regression coefficient of the first variable in the first sub-model is not significant, the variable is eliminated; when the initial logistic regression coefficient of the first variable in the first sub-model is significant, the variable is classified as a remaining variable and step S1043 is further executed. And/or
步骤S10422:判断每个子模型中的每个变量的最初逻辑回归系数符号是否符合预设系数符号,当第一子模型中的第一变量的最初逻辑回归系数符号不符合预设系数符号,将变量剔除,当第一子模型中的第一变量的最初逻辑回归系数符号符合预设系数符号,将变量归为剩余变量,并进一步被执行步骤S1043。和/或Step S10422: Determine whether the initial logistic regression coefficient sign of each variable in each sub-model meets the preset coefficient sign. When the initial logistic regression coefficient sign of the first variable in the first sub-model does not meet the preset coefficient sign, the variable is eliminated. When the initial logistic regression coefficient sign of the first variable in the first sub-model meets the preset coefficient sign, the variable is classified as a remaining variable, and step S1043 is further executed. And/or
步骤S10423:判断每个子模型中的多个变量之间的相关性,当每个子模型中的N个变量之间的相关性大于预设相关性,剔除N个变量中的N-1个变量,其中N为大于一的整数;当每个子模型中的N个变量之间的相关性小于或者等于预设相关性,将N个变量归为剩余变量,并进一步被执行步骤S1043。Step S10423: Determine the correlation between multiple variables in each sub-model. When the correlation between N variables in each sub-model is greater than the preset correlation, eliminate N-1 variables among the N variables, where N is an integer greater than one; when the correlation between N variables in each sub-model is less than or equal to the preset correlation, classify the N variables as remaining variables, and further execute step S1043.
本发明实施例根据最初逻辑回归系数,将每个子模型中相关性较强、最初逻辑回归系数不显著、最初逻辑回归系数符号不符合实际情况的变量剔除,可以使得每个子模型的评分更加准确。Based on the initial logistic regression coefficients, the embodiment of the present invention eliminates from each sub-model the variables that are strongly correlated, whose initial logistic regression coefficients are not significant, or whose initial coefficient signs do not match the actual situation, which makes the score of each sub-model more accurate.
应当理解,步骤S10421、步骤S10422以及步骤S10423分别为剔除干扰变量的三种方式,该三个步骤可以同时进行也可以仅进行其中一个步骤或者两个步骤,本发明实施例对此不作限定。It should be understood that step S10421, step S10422 and step S10423 are three ways of eliminating interference variables, respectively. The three steps may be performed simultaneously or only one or two of the three steps may be performed, which is not limited in the embodiment of the present invention.
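The three elimination checks can be combined as in the sketch below. Several details are assumptions, since the source leaves them open: significance is judged by a p-value against `p_max`, the preset correlation is `corr_max`, and within a group of correlated variables the one with the smallest p-value is kept. The input layout and variable names are hypothetical.

```python
def remove_interfering(variables, corr, p_max=0.05, corr_max=0.7):
    """Return the names of the remaining variables for one sub-model.

    variables: name -> {"coef": float, "p": float, "expected_sign": +1 or -1}
    corr:      frozenset({a, b}) -> correlation coefficient between a and b
    """
    kept = []
    for name, v in variables.items():
        if v["p"] > p_max:                       # S10421: coefficient not significant
            continue
        if v["coef"] * v["expected_sign"] <= 0:  # S10422: sign contradicts the preset sign
            continue
        kept.append(name)
    # S10423: visit the most significant variables first and drop any variable
    # that is too strongly correlated with one that has already been kept.
    kept.sort(key=lambda n: variables[n]["p"])
    remaining = []
    for name in kept:
        if all(abs(corr.get(frozenset((name, other)), 0.0)) <= corr_max
               for other in remaining):
            remaining.append(name)
    return remaining
```

Pairs missing from `corr` are treated as uncorrelated, so only the correlations that matter need to be supplied.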
当步骤S104计算得到每个子模型的评分规则之后,进一步执行步骤S105,即获取每个子模型的权重,在本发明一实施例中,如图7所示,步骤S105具体的包括以下步骤:After step S104 calculates the scoring rule of each sub-model, step S105 is further executed to obtain the weight of each sub-model. In one embodiment of the present invention, as shown in FIG7 , step S105 specifically includes the following steps:
步骤S1051:将测试数据集按照所述预设对应关系分别输入至每个子模型中进行测试,获取每个子模型的AUC值;以及Step S1051: input the test data set into each sub-model according to the preset corresponding relationship for testing, and obtain the AUC value of each sub-model; and
步骤S1052:根据每个子模型的AUC值以及每个子模型的预设权重计算每个子模型的权重。Step S1052: Calculate the weight of each sub-model according to the AUC value of each sub-model and the preset weight of each sub-model.
当获取每个子模型的权重以及每个子模型的评分规则之后,则执行步骤S106:即根据每个子模型的权重以及每个子模型的评分规则,计算个人的信用评分,步骤S106即可得到用户的信用评分。After obtaining the weight of each sub-model and the scoring rule of each sub-model, step S106 is executed: that is, the personal credit score is calculated according to the weight of each sub-model and the scoring rule of each sub-model. Step S106 can obtain the user's credit score.
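Steps S1051, S1052 and S106 can be sketched end to end as follows. The source does not say how the AUC value and the preset weight are combined, nor how the sub-model scores are aggregated, so two assumptions are made: each weight is the preset weight multiplied by the AUC and then normalized so the weights sum to 1, and the final score is the weighted sum of the sub-model scores. The sub-model names are hypothetical, and which label counts as "positive" depends on the direction of the score.

```python
def auc(labels, scores):
    """Probability that a randomly chosen positive sample is scored above a
    randomly chosen negative sample; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sub_model_weights(aucs, preset):
    """Assumed rule: weight is proportional to preset weight times AUC."""
    raw = {name: preset[name] * aucs[name] for name in aucs}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

def credit_score(sub_scores, weights):
    """Weighted sum of the sub-model scores."""
    return sum(weights[name] * sub_scores[name] for name in weights)
```

Under this rule, a sub-model that separates good and bad users better on the test data set gains influence on the final score, while the preset weights still let the business steer the balance.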
作为本发明实施例的第二方面,图8所示为本发明一实施例提供的一种个人信用评分系统,如图8所示,该个人信用评分系统,包括:预处理单元1,用于对原样本数据进行预处理,生成新样本数据,新样本数据的数据格式符合逻辑回归算法的格式要求;数据划分单元2,用于将新样本数据划分为训练数据集以及测试数据集;子模型构建单元3,用于构建四个子模型,四个子模型分别为第一子模型31、第二子模型32、第三子模型33以及第四子模型34;子模型评分规则获取单元4,用于将训练数据集中的变量按照预设对应关系分别输入至每个子模型中进行逻辑回归,计算每个子模型的评分规则;子模型权重获取单元5,用于获取每个子模型的权重;以及信用评分单元6,用于根据每个子模型的权重以及每个子模型的评分规则,计算个人信用评分模型的评分规则,并根据接收到的个人信用数据以及个人信用评分模型的评分规则输出个人的信用评分;其中,原样本数据包括四类子样本数据,所述四类子样本数据分别一一对应所述四个子模型。As a second aspect of the embodiments of the present invention, FIG. 8 shows a personal credit scoring system provided by an embodiment of the present invention. As shown in FIG. 8, the personal credit scoring system includes: a preprocessing unit 1, configured to preprocess the original sample data to generate new sample data whose data format meets the format requirements of the logistic regression algorithm; a data division unit 2, configured to divide the new sample data into a training data set and a test data set; a sub-model construction unit 3, configured to construct four sub-models, namely a first sub-model 31, a second sub-model 32, a third sub-model 33 and a fourth sub-model 34; a sub-model scoring rule acquisition unit 4, configured to input the variables in the training data set into each sub-model according to a preset correspondence for logistic regression and to calculate the scoring rule of each sub-model; a sub-model weight acquisition unit 5, configured to obtain the weight of each sub-model; and a credit scoring unit 6, configured to calculate the scoring rule of the personal credit scoring model according to the weight and the scoring rule of each sub-model, and to output the personal credit score according to the received personal credit data and the scoring rule of the personal credit scoring model; wherein the original sample data includes four categories of sub-sample data, which correspond one-to-one to the four sub-models.
其中上述的预处理单元1,数据划分单元2,子模型构建单元3,子模型评分规则获取单元4、子模型权重获取单元5以及信用评分单元6在各自的工作过程中,分别执行上述所述的个人信用评分方法中的对应的工作步骤,在此不再做赘述。The above-mentioned preprocessing unit 1, data partitioning unit 2, sub-model construction unit 3, sub-model scoring rule acquisition unit 4, sub-model weight acquisition unit 5 and credit scoring unit 6 respectively perform the corresponding working steps in the above-mentioned personal credit scoring method in their respective working processes, and will not be repeated here.
本发明实施例提供的个人信用评分系统,包括四个不同维度的子模型,可以更全面地对个人的信用进行评估,避免了过分依赖某一项信用记录,提高了个人信用评分方法的准确性,另外,建立多个子评估模型,可以结合多种业务场景综合对个人的信用评估进行调整,提高了个人信用评分方法的灵活性。在进行个人信用评分时,可以采用评分系统中的其中一个单一的子模型,也可以采取四个子模型中的任意两个、三个、四个组合,使得评分系统更加灵活。The personal credit scoring system provided by the embodiment of the present invention includes four sub-models of different dimensions, which can evaluate the credit of an individual more comprehensively, avoid over-reliance on a certain credit record, and improve the accuracy of the personal credit scoring method. In addition, by establishing multiple sub-evaluation models, the personal credit evaluation can be adjusted in combination with various business scenarios, thereby improving the flexibility of the personal credit scoring method. When performing personal credit scoring, a single sub-model in the scoring system can be used, or any combination of two, three, or four of the four sub-models can be used, making the scoring system more flexible.
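The flexibility described above, scoring with one sub-model or with any combination of them, can be sketched as follows. Renormalizing the weights over the selected sub-models is an assumption; the source only states that any combination may be used. The function and sub-model names are hypothetical.

```python
def score_with_subset(sub_scores, weights, selected):
    """Weighted score over the selected sub-models only.

    The weights of the selected sub-models are rescaled so that they sum to 1,
    which keeps the result on the same scale as a single sub-model score.
    """
    total = sum(weights[name] for name in selected)
    return sum(weights[name] / total * sub_scores[name] for name in selected)
```

Selecting a single sub-model returns that sub-model's score unchanged, and selecting all four reproduces the full weighted score when the weights already sum to 1.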
在本发明一实施例中,四类子样本数据包括:身份信息样本数据、资产状况样本数据、信用历史样本数据以及消费行为样本数据;四个子模型包括:身份信息子模型、资产状况子模型、信用历史子模型以及消费行为子模型。其中,身份信息样本数据可以包括用户年龄、用户是否有孩子、用户的工作性质、用户的配偶工作性质等。资产状况样本数据可以包括用户的银行卡等级、银行卡的数量以及对应的银行名称、用户的工资收入、用户的理财情况等能够体现用户的财力状况的数据。信用历史样本数据可以包括历史申请贷款信息、历史还款信息、历史逾期信息、最近贷款日期等。消费行为样本数据可以包括用户的消费金额、用户的消费笔数以及每笔消费金额、用户的消费项目等能够体现用户消费能力及特征的信息。本发明实施例采用身份信息样本数据、资产状况样本数据、信用历史样本数据以及消费行为样本数据分别输入至对应的身份信息子模型、资产状况子模型、信用历史子模型以及消费行为子模型中进行逻辑回归,然后计算用户的评分值,加入了用户的实际消费场景,实现了对用户真实消费能力的评估等,进一步增加了用户信用评分的准确性。In one embodiment of the present invention, the four categories of sub-sample data are: identity information sample data, asset status sample data, credit history sample data and consumption behavior sample data; the four sub-models are: an identity information sub-model, an asset status sub-model, a credit history sub-model and a consumption behavior sub-model. The identity information sample data may include the user's age, whether the user has children, the nature of the user's job, the nature of the job of the user's spouse, and so on. The asset status sample data may include the user's bank card level, the number of bank cards and the corresponding bank names, the user's salary income, the user's wealth management situation, and other data that reflects the user's financial standing. The credit history sample data may include historical loan application information, historical repayment information, historical overdue information, the date of the most recent loan, and so on. The consumption behavior sample data may include the user's total consumption amount, the number of purchases and the amount of each purchase, the items purchased, and other information that reflects the user's spending power and characteristics. In the embodiment of the present invention, the identity information sample data, asset status sample data, credit history sample data and consumption behavior sample data are input separately into the corresponding identity information, asset status, credit history and consumption behavior sub-models for logistic regression, and the user's score is then calculated. This brings the user's actual consumption scenarios into the assessment, enables evaluation of the user's real spending power, and further increases the accuracy of the user's credit score.
示例性电子设备Exemplary Electronic Devices
作为本发明的第三方面,本发明实施例还提供了一种电子设备,包括一个或多个处理器和存储器。As a third aspect of the present invention, an embodiment of the present invention further provides an electronic device, including one or more processors and a memory.
处理器可以是中央处理单元(CPU)或者具有数据处理能力和/或指令执行能力的其他形式的处理单元,并且可以控制电子设备中的其他组件以执行期望的功能。The processor may be a central processing unit (CPU) or other forms of processing units having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
存储器可以包括一个或多个计算机程序产品,所述计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器。所述易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。所述非易失性存储器例如可以包括只读存储器(ROM)、硬盘、闪存等。在所述计算机可读存储介质上可以存储一个或多个计算机程序指令,处理器可以运行上述所述程序指令,以实现上文所述的本申请的各个实施例的个人信用评分方法以及/或者其他期望的功能。在所述计算机可读存储介质中还可以存储诸如输入信号、信号分量、噪声分量等各种内容。The memory may include one or more computer program products, and the computer program product may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache), etc. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor may run the above-mentioned program instructions to implement the personal credit scoring method of each embodiment of the present application described above and/or other desired functions. Various contents such as input signals, signal components, noise components, etc. may also be stored in the computer-readable storage medium.
示例性计算机程序产品和计算机可读存储介质Exemplary computer program products and computer-readable storage media
除了上述方法和设备以外,本申请的实施例还可以是计算机程序产品,其包括计算机程序指令,所述计算机程序指令在被处理器运行时使得所述处理器执行本说明书上述“示例性方法”部分中描述的根据本申请图1至图3以及图6所示实施例的个人信用评分的方法的步骤。In addition to the above-mentioned methods and devices, an embodiment of the present application may also be a computer program product, which includes computer program instructions, which, when executed by a processor, enable the processor to execute the steps of the method for personal credit scoring according to the embodiments shown in Figures 1 to 3 and Figure 6 of the present application described in the above "Exemplary Method" section of this specification.
所述计算机程序产品可以以一种或多种程序设计语言的任意组合来编写用于执行本申请实施例操作的程序代码,所述程序设计语言包括面向对象的程序设计语言,诸如Java、C++等,还包括常规的过程式程序设计语言,诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。The program code of the computer program product for performing the operations of the embodiments of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on a remote computing device or server.
此外,本申请的实施例还可以是计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令在被处理器运行时使得所述处理器执行本说明书上述“示例性方法”部分中描述的根据本申请各种实施例的个人信用评分方法中的步骤。In addition, an embodiment of the present application may also be a computer-readable storage medium storing computer program instructions which, when run by a processor, cause the processor to execute the steps of the personal credit scoring method according to the various embodiments of the present application described in the "Exemplary Method" section of this specification.
所述计算机可读存储介质可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以包括但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example and without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of these. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
以上结合具体实施例描述了本申请的基本原理,但是,需要指出的是,在本申请中提及的优点、优势、效果等仅是示例而非限制,不能认为这些优点、优势、效果等是本申请的各个实施例必须具备的。另外,上述公开的具体细节仅是为了示例的作用和便于理解的作用,而非限制,上述细节并不限制本申请为必须采用上述具体的细节来实现。The basic principles of the present application are described above in conjunction with specific embodiments. However, it should be noted that the advantages, strengths, effects, etc. mentioned in the present application are only examples and not limitations, and it cannot be considered that these advantages, strengths, effects, etc. are required by each embodiment of the present application. In addition, the specific details disclosed above are only for the purpose of illustration and ease of understanding, not for limitation, and the above details do not limit the present application to being implemented by adopting the above specific details.
本申请中涉及的器件、装置、设备、系统的方框图仅作为例示性的例子并且不意图要求或暗示必须按照方框图示出的方式进行连接、布置、配置。如本领域技术人员将认识到的,可以按任意方式连接、布置、配置这些器件、装置、设备、系统。The block diagrams of devices, apparatuses, equipment, and systems involved in this application are only illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner.
还需要指出的是,在本申请的装置、设备和方法中,各部件或各步骤是可以分解和/或重新组合的。这些分解和/或重新组合应视为本申请的等效方案。It should also be noted that in the apparatus, device and method of the present application, each component or each step can be decomposed and/or recombined. Such decomposition and/or recombination should be regarded as equivalent solutions of the present application.
提供所公开的方面的以上描述以使本领域的任何技术人员能够做出或者使用本申请。对这些方面的各种修改对于本领域技术人员而言是非常显而易见的,并且在此定义的一般原理可以应用于其他方面而不脱离本申请的范围。因此,本申请不意图被限制到在此示出的方面,而是按照与在此公开的原理和新颖的特征一致的最宽范围。The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present application. Therefore, the present application is not intended to be limited to the aspects shown herein, but rather to the widest scope consistent with the principles and novel features disclosed herein.
以上所述仅为本发明的较佳实施例而已,并不用以限制本发明,凡在本发明的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本发明保护的范围之内。The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present invention should be included in the scope of protection of the present invention.
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202011106851.6A (CN112258312B) | 2020-10-16 | 2020-10-16 | Personal credit scoring method and system, electronic device and storage medium | 
| Publication Number | Publication Date | 
|---|---|
| CN112258312A (en) | 2021-01-22 | 
| CN112258312B (en) | 2024-09-13 | 
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN202011106851.6AActiveCN112258312B (en) | 2020-10-16 | 2020-10-16 | Personal credit scoring method and system, electronic device and storage medium | 
| Country | Link | 
|---|---|
| CN (1) | CN112258312B (en) | 
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN113298121B (en)* | 2021-04-30 | 2023-08-18 | 上海淇玥信息技术有限公司 | Message sending method and device based on multi-data source modeling and electronic equipment | 
| CN113570259A (en)* | 2021-07-30 | 2021-10-29 | 北京房江湖科技有限公司 | Dimensional Model-Based Data Evaluation Method and Computer Program Product | 
| CN113988986A (en)* | 2021-11-02 | 2022-01-28 | 京东城市(北京)数字科技有限公司 | Credit evaluation method, credit evaluation device, electronic equipment and storage medium | 
| CN115423603B (en)* | 2022-08-31 | 2023-05-23 | 厦门国际银行股份有限公司 | Wind control model building method, system and storage medium based on machine learning | 
| CN115907971B (en)* | 2023-01-30 | 2023-05-30 | 江苏安则达信用信息服务有限公司 | Data processing method and device suitable for personal credit evaluation system | 
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN110956273A (en)* | 2019-11-07 | 2020-04-03 | 中信银行股份有限公司 | Credit scoring method and system integrating multiple machine learning models | 
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US20140188442A1 (en)* | 2012-12-27 | 2014-07-03 | Pearson Education, Inc. | System and Method for Selecting Predictors for a Student Risk Model | 
| US20160225073A1 (en)* | 2015-01-30 | 2016-08-04 | Wal-Mart Stores, Inc. | System, method, and non-transitory computer-readable storage media for predicting a customer's credit score | 
| CN106779457A (en)* | 2016-12-29 | 2017-05-31 | 深圳微众税银信息服务有限公司 | A kind of rating business credit method and system | 
| CN107633030B (en)* | 2017-09-04 | 2020-11-27 | 深圳市华傲数据技术有限公司 | Credit evaluation method and device based on data model | 
| CN107633265B (en)* | 2017-09-04 | 2021-03-30 | 深圳市华傲数据技术有限公司 | Data processing method and device for optimizing credit evaluation model | 
| CN108596757A (en)* | 2018-04-23 | 2018-09-28 | 大连火眼征信管理有限公司 | A kind of personal credit file method and system of intelligences combination | 
| CN109345368A (en)* | 2018-08-22 | 2019-02-15 | 中国平安人寿保险股份有限公司 | Credit estimation method, device, electronic equipment and storage medium based on big data | 
| CN109377058A (en)* | 2018-10-26 | 2019-02-22 | 中电科新型智慧城市研究院有限公司 | Risk assessment method of enterprise relocation based on logistic regression model | 
| AU2019100362A4 (en)* | 2019-04-05 | 2019-05-09 | Guo, Fengyu Miss | Personal Credit Rating System Based on The Logistic Regression | 
| CN110544155B (en)* | 2019-09-02 | 2023-05-19 | 中诚信征信有限公司 | User credit score acquisition method, acquisition device, server and storage medium | 
| CN111080397A (en)* | 2019-11-18 | 2020-04-28 | 支付宝(杭州)信息技术有限公司 | Credit evaluation method and device and electronic equipment | 

| Publication | Publication Date | Title | 
|---|---|---|
| CN112258312B (en) | | Personal credit scoring method and system, electronic device and storage medium | 
| CN112017040B (en) | | Credit scoring model training method, scoring system, equipment and medium | 
| JP4218099B2 (en) | | Database, customer information search method, and customer information search device | 
| CN115812209A (en) | | Machine Learning Feature Recommendation | 
| CN113516511B (en) | | A financial product purchase prediction method, device and electronic equipment | 
| CN111199469A (en) | | User payment model generation method and device and electronic equipment | 
| CN115968478A (en) | | Machine learning feature recommendation | 
| CN111210332A (en) | | Method and device for generating post-loan management strategy and electronic equipment | 
| CN117391841A (en) | | Wind control strategy evaluation method and device, storage medium and electronic equipment | 
| CN113781056A (en) | | Method and device for predicting fraudulent behavior of users | 
| KR20230094936A (en) | | Activist alternative credit scoring system model using work behavior data and method for providing the same | 
| CN117972381A (en) | | Internet insurance user feature screening method and device based on diffusion model | 
| CN118838701A (en) | | Data processing method, apparatus, device, storage medium and computer program product | 
| CN119250915A (en) | | Financial product recommendation method, device, computer equipment and storage medium | 
| CN110197382B (en) | | Method and apparatus for generating information | 
| CN112418670A (en) | | Case allocation method, device, equipment and medium | 
| CN113435987B (en) | | A method and device for dynamically identifying high-risk users in the financial field | 
| JP7700565B2 (en) | | EXPLANATION INFORMATION OUTPUT PROGRAM, EXPLANATION INFORMATION OUTPUT METHOD, AND INFORMATION PROCESSING APPARATUS | 
| CN112801563B (en) | | Risk assessment method and device | 
| CN118607723A (en) | | User behavior prediction method, device, equipment, storage medium and product | 
| CN116777065A (en) | | Data prediction method, device, equipment and storage medium based on artificial intelligence | 
| CN120219030A (en) | | A method, device, electronic device and storage medium for determining supply and demand distribution | 
| CN119205295A (en) | | Risk assessment method, device, electronic device and readable storage medium | 
| CN117634893A (en) | | Risk assessment model training method, risk prediction method | 
| CN115953233A (en) | | Risk assessment system | 

| Date | Code | Title | Description | 
|---|---|---|---|
| | PB01 | Publication | | 
| | SE01 | Entry into force of request for substantive examination | | 
| | GR01 | Patent grant | | 
| | CP03 | Change of name, title or address | Address after: 1006 and 1008 Zhangheng Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 201203; Patentee after: UnionPay Business Payment Co.,Ltd.; Country or region after: China; Address before: No. 1006 and 1008 Zhangheng Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai; Patentee before: CHINA UMS CO.,LTD.; Country or region before: China | 