
Technical Field
The present invention relates to anti-money-laundering technology in the financial industry, and in particular to a method and system for identifying money laundering transactions based on the XGBoost algorithm.
Background
At present, banks and financial institutions determine whether a customer poses a money laundering risk, and report suspicious customers to the People's Bank of China, by first defining a series of monitoring indicators. When a customer's transactions trigger certain monitoring conditions, the system automatically generates a suspicious case. Experienced screening personnel then retrieve the transactions of the customers involved over the past several months and review the cases manually, one by one, before reporting.
However, selecting the cases that need to be reported from the generated cases is done manually, which gives rise to the following problems. First, experienced screening personnel are in short supply. Under the current workflow, cases are handed to screening personnel at random, so experienced and inexperienced personnel receive cases without distinction. When inexperienced personnel encounter a case that is hard to judge, they must either seek help from experienced personnel or risk a higher rate of missed reports and false reports, which wastes labor and lowers case-handling efficiency. Second, the number of customers keeps growing, and a rule engine generates a case for every customer who satisfies the rules. As a result, the number of suspicious cases generated automatically each month is very large and the workload of the screening personnel is heavy, yet only a small proportion of the cases are ultimately reported; the high proportion of invalid cases wastes labor. Third, automatic case generation relies on a rule engine whose rules are not flexible enough: only customers who meet fixed criteria generate cases, so some suspicious customers may never have a case generated.
Therefore, improving screening efficiency and reducing labor costs is an urgent technical problem facing anti-money-laundering institutions.
Summary of the Invention
The purpose of the present invention is to overcome the above defects of the prior art by providing a method and system for identifying money laundering transactions based on the XGBoost algorithm.
The purpose of the present invention can be achieved through the following technical solutions:
A method for identifying money laundering transactions based on the XGBoost algorithm, comprising:
acquiring a customer's transaction data, feeding the customer transaction data into a trained machine learning model, the machine learning model outputting a money laundering suspicion score, screening out suspicious cases according to the money laundering suspicion score, and reviewing the suspicious cases;
wherein the machine learning model is built and trained as follows:
constructing a data set, the data set comprising multiple groups of transaction data, the transaction data being labeled as money laundering or non-money laundering, and the transaction data comprising features of multiple dimensions;
extracting suspicious features associated with money laundering operations, performing data cleaning, missing-value filling, and standardization on the values of the suspicious features, and performing dimensionality reduction on the suspicious features;
dividing the data set into a training set and a test set, building the machine learning model with the XGBoost algorithm, training the machine learning model on the training set, and measuring the prediction accuracy of the machine learning model on the test set; if the prediction accuracy is greater than a preset accuracy threshold, the trained machine learning model is obtained; otherwise, the training set is updated using the test set and the machine learning model is iteratively optimized until the trained machine learning model is obtained.
Further, the money laundering suspicion score takes a value between 0 and 1, a larger value indicating that the transaction data is more suspicious, and screening out suspicious cases according to the money laundering suspicion score specifically comprises:
setting a first score threshold F, and taking the transaction data whose money laundering suspicion score is greater than F as candidate cases;
setting a first quantity threshold K; if the number of candidate cases is greater than K, selecting the K candidate cases with the highest money laundering suspicion scores as suspicious cases; otherwise, taking all candidate cases as suspicious cases.
Further, reviewing the suspicious cases specifically comprises:
determining the experience level of the screening personnel, dividing the screening personnel into M different grades, and determining the workload of the screening personnel of each grade;
sorting the suspicious cases by money laundering suspicion score, dividing the suspicious cases into M different case sets according to the workload of the screening personnel of each grade, and having each case set reviewed by the screening personnel of the corresponding grade.
Further, constructing the data set specifically comprises:
acquiring reported transaction data and unreported transaction data, resampling the reported transaction data with a nearest-neighbor resampling method, and adding random perturbation to obtain the data set.
Reported transaction data refers to suspicious transaction data carrying a money laundering risk that has been reported to the People's Bank of China for reference, in order to alert the People's Bank that these customers pose a relatively high money laundering risk and need to be closely monitored. Reported transaction data therefore largely represents transaction data labeled as money laundering.
Because the amount of reported transaction data is far smaller than the amount of unreported transaction data, the classes in the data set are imbalanced. The reported customers' transaction data is therefore resampled with the nearest-neighbor resampling method: random perturbation is added to the nearest neighbor of each reported-class sample to generate new reported-class transaction data, so that the total number of reported-class samples and the total number of unreported-class samples are balanced, which helps avoid overfitting.
Further, the data standardization specifically comprises:
re-encoding the data of the discrete suspicious features with One-Hot encoding; and scaling the data of the continuous suspicious features proportionally with the L2-norm normalization method to remove units of measurement.
Further, performing dimensionality reduction on the suspicious features specifically comprises: reducing the dimensionality of the suspicious features using principal component analysis (PCA).
A system for identifying money laundering transactions based on the XGBoost algorithm, comprising:
a money laundering transaction identification module, configured to: acquire a customer's transaction data, feed the customer transaction data into a trained machine learning model, the machine learning model outputting a money laundering suspicion score, screen out suspicious cases according to the money laundering suspicion score, and review the suspicious cases;
a model training module, configured to: construct a data set, the data set comprising multiple groups of transaction data, the transaction data being labeled as money laundering or non-money laundering, and the transaction data comprising features of multiple dimensions;
extract suspicious features associated with money laundering operations, perform data cleaning, missing-value filling, and standardization on the values of the suspicious features, and perform dimensionality reduction on the suspicious features;
divide the data set into a training set and a test set, build the machine learning model with the XGBoost algorithm, train the machine learning model on the training set, and measure the prediction accuracy of the machine learning model on the test set; if the prediction accuracy is greater than a preset accuracy threshold, the trained machine learning model is obtained; otherwise, the training set is updated using the test set and the machine learning model is iteratively optimized until the trained machine learning model is obtained.
Further, constructing the data set specifically comprises:
acquiring reported transaction data and unreported transaction data, resampling the reported transaction data with a nearest-neighbor resampling method, and adding random perturbation to obtain the data set.
Further, the data standardization specifically comprises:
re-encoding the data of the discrete suspicious features with One-Hot encoding; and scaling the data of the continuous suspicious features proportionally with the L2-norm normalization method to remove units of measurement.
Further, performing dimensionality reduction on the suspicious features specifically comprises: reducing the dimensionality of the suspicious features using principal component analysis (PCA).
Compared with the prior art, the present invention has the following beneficial effects:
(1) It provides a scheme for assessing a customer's money laundering risk that combines anti-money-laundering work with the XGBoost machine learning algorithm. A money laundering suspicion score is produced for the customer from the customer's transaction data, giving screening personnel a direct point of reference, saving labor, and improving efficiency.
(2) Compared with the fixed criteria of a rule engine, the XGBoost-based assessment of a customer's money laundering suspicion considers the customer's data from all available dimensions and is more flexible, which can reduce the number of missed cases.
(3) Suspicious cases are screened out and graded according to the money laundering suspicion score, and suspicious cases of different grades are handled by different screening personnel, which improves screening efficiency and reduces labor costs.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the present invention.
Detailed Description of the Embodiments
The present invention is described in detail below with reference to the accompanying drawing and a specific embodiment. The embodiment is implemented on the premise of the technical solution of the present invention and gives a detailed implementation and a specific operating procedure, but the scope of protection of the present invention is not limited to the embodiment described below.
In the drawings, components with the same structure are denoted by the same reference numerals, and components with similar structure or function are denoted by similar reference numerals. The size and thickness of each component shown in the drawings are arbitrary, and the present invention does not limit the size or thickness of any component. Some parts of the drawings are exaggerated where appropriate to make the illustration clearer.
Embodiment 1:
A method for identifying money laundering transactions based on the XGBoost algorithm, as shown in FIG. 1, comprises:
acquiring a customer's transaction data, feeding the customer transaction data into a trained machine learning model, the machine learning model outputting a money laundering suspicion score, screening out suspicious cases according to the money laundering suspicion score, and reviewing the suspicious cases.
Machine learning learns from one portion of the data and then uses the learned patterns to make predictions and judgments about another portion. The process resembles human learning: having gained a certain amount of experience, a person can predict the outcome of a new problem. Using a machine learning model for anti-money-laundering risk assessment can reduce labor costs, and compared with a fixed rule engine a machine learning model is more flexible and can reduce missed reports as far as possible.
The machine learning model is built and trained as follows:
(1) Construct a data set. The data set comprises multiple groups of transaction data, the transaction data is labeled as money laundering or non-money laundering, and the transaction data comprises features of multiple dimensions.
Constructing the data set specifically comprises:
acquiring reported transaction data and unreported transaction data, resampling the reported transaction data with a nearest-neighbor resampling method, and adding random perturbation to obtain the data set.
In practice, the transaction data of reported and unreported samples from the several months preceding the month to be predicted can be selected to construct the data set. Here, reported transaction data refers to suspicious transaction data carrying a money laundering risk that has been reported to the People's Bank of China for reference, in order to alert the People's Bank that these customers pose a relatively high money laundering risk and need to be closely monitored. Reported transaction data therefore largely represents transaction data labeled as money laundering. The transaction data in the data set includes feature types such as amount, number of transactions, time, and location, and each row is labeled according to whether it was reported: a reported row is labeled 1, indicating that the row corresponds to money laundering behavior, and an unreported row is labeled 0, indicating that it does not.
Resampling is a non-parametric method of statistical inference that repeatedly draws samples from the original data and adds random perturbation to generate new samples, thereby increasing the number of samples. Overfitting refers to the situation in which, because of over-learning or imbalanced sample features, a model performs well on the training set but poorly on the test set.
Because the amount of reported transaction data is far smaller than the amount of unreported transaction data, the classes in the data set are imbalanced. The reported customers' transaction data is therefore resampled with the nearest-neighbor resampling method: random perturbation is added to the nearest neighbor of each reported-class sample to generate new reported-class transaction data, so that the total number of reported-class samples and the total number of unreported-class samples are balanced, which helps avoid overfitting.
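The resampling step above can be sketched in Python as follows. This is only an illustrative reading of "nearest-neighbor resampling with random perturbation", not the patent's reference implementation. The function names, the Gaussian form of the perturbation, and the `noise_scale` parameter are assumptions introduced here. The pairwise distance matrix is built in memory, so the sketch suits small minority classes only.

```python
import numpy as np

def oversample_minority(X_min, n_new, noise_scale=0.01, seed=0):
    """Create n_new synthetic minority rows (hypothetical helper).

    For each randomly picked minority row, take its nearest minority
    neighbour and add small Gaussian noise to that neighbour.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise squared Euclidean distances between minority rows.
    diff = X_min[:, None, :] - X_min[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)       # a row is not its own neighbour
    nearest = d2.argmin(axis=1)        # index of each row's nearest neighbour
    picks = rng.integers(0, len(X_min), size=n_new)
    base = X_min[nearest[picks]]
    # Assumed perturbation: noise proportional to each feature's spread.
    scale = noise_scale * X_min.std(axis=0)
    return base + rng.normal(0.0, 1.0, size=base.shape) * scale

def balance_dataset(X_reported, X_unreported, noise_scale=0.01, seed=0):
    """Oversample the reported class (label 1) until both classes are equal in size."""
    n_new = max(len(X_unreported) - len(X_reported), 0)
    synthetic = oversample_minority(X_reported, n_new, noise_scale, seed)
    X = np.vstack([X_unreported, X_reported, synthetic])
    y = np.concatenate([np.zeros(len(X_unreported)),
                        np.ones(len(X_reported) + n_new)])
    return X, y
```

With 3 reported rows and 10 unreported rows, `balance_dataset` returns 20 rows, 10 per class.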
(2) Extract suspicious features associated with money laundering operations, perform data cleaning, missing-value filling, and standardization on the values of the suspicious features, and perform dimensionality reduction on the suspicious features.
The core of machine learning is to "use algorithms to parse data, learn from it, and then make decisions or predictions about new data", so accurately extracting the features of suspicious customers is very important. If too few features are selected, the learned model may not predict well; if too many irrelevant features are selected, the results are also adversely affected. Finding the best feature combination requires technicians to experiment with many combinations and also requires screening by domain experts. During training of the machine learning model, the suspicious features can be pruned continually and different feature selections tried, with the prediction accuracy of the model used to verify whether the features are effective.
(2-1) The data standardization specifically comprises:
re-encoding the data of the discrete suspicious features with One-Hot encoding; and scaling the data of the continuous suspicious features proportionally with the L2-norm normalization method to remove units of measurement.
One-Hot encoding, also called one-bit effective encoding, uses an N-bit register to encode N states: each state has its own register bit, and at any time exactly one bit is active. This converts discrete categorical variables into a form that machine learning algorithms can readily use. L2-norm normalization operates on each row of data, scaling each sample to unit norm. The L2-norm of each sample is computed, and every element of the sample is divided by that norm, so that after processing the L2-norm of every sample equals 1.
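The two standardization operations described above can be illustrated with a short NumPy sketch. The helper names `one_hot` and `l2_normalize_rows` are hypothetical, and leaving all-zero rows unchanged is a choice made here to avoid division by zero, not something the text specifies.

```python
import numpy as np

def one_hot(values):
    """One-Hot encode a list of category labels.

    Returns the encoded matrix and the sorted category list that
    defines the column order.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1.0   # exactly one active bit per row
    return encoded, categories

def l2_normalize_rows(X):
    """Scale every row (sample) to unit L2-norm; zero rows are left as zeros."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0          # assumed handling of all-zero rows
    return X / norms
```

For example, the row (3, 4) has L2-norm 5 and becomes (0.6, 0.8), whose norm is 1.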
(2-2) Performing dimensionality reduction on the suspicious features specifically comprises: reducing the dimensionality of the suspicious features using principal component analysis (PCA), converting several hundred features into a small number of composite features.
Principal component analysis is a technique that simplifies a data set through dimensionality reduction, converting many indicators into a few composite indicators. It is a linear transformation that maps the data into a new coordinate system such that the largest variance of any projection of the data lies along the first coordinate (the first principal component), the second largest variance along the second coordinate (the second principal component), and so on. Finally, the leading principal components whose cumulative contribution exceeds a given threshold are selected as the new feature dimensions. Principal component analysis reduces the number of feature dimensions while retaining the feature information that contributes most to the variance.
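As a rough illustration of selecting principal components by cumulative contribution, the sketch below computes PCA through a singular value decomposition of the centered data. The function name `pca_reduce` and the default threshold of 0.95 are assumptions; the text does not give a threshold value. The sketch also assumes the data is not constant, so the total variance is non-zero.

```python
import numpy as np

def pca_reduce(X, variance_threshold=0.95):
    """Project X onto the fewest leading principal components whose
    cumulative explained-variance ratio reaches variance_threshold."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    # Rows of Vt are the principal directions, ordered by singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s ** 2 / (s ** 2).sum()              # variance share per component
    k = int(np.searchsorted(np.cumsum(ratio), variance_threshold)) + 1
    k = min(k, len(s))                           # guard against rounding at 1.0
    components = Vt[:k]
    return Xc @ components.T, components, ratio[:k]
```

The returned `components` matrix must be kept and reused to project new months of transaction data into the same reduced space.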
(3) Divide the data set into a training set and a test set, build the machine learning model with the XGBoost algorithm, train the machine learning model on the training set, and measure the prediction accuracy of the machine learning model on the test set. If the prediction accuracy is greater than a preset accuracy threshold, the trained machine learning model is obtained; otherwise, the training set is updated using the test set, the machine learning model is iteratively optimized, and the parameters are adjusted until the trained machine learning model is obtained.
XGBoost (eXtreme Gradient Boosting) is an improvement on the gradient boosting decision tree method (GBDT) and belongs to the boosting family of ensemble learning methods. Its steps are largely the same as those of GBDT; the main difference lies in the definition of the objective function, to which XGBoost adds a regularization term that penalizes complex models and guards against overfitting. In essence, the algorithm builds K decision trees so that the predictions of the tree ensemble are as close as possible to the true values while generalizing as well as possible.
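For reference, the regularized objective mentioned above is commonly written as follows. This is the standard formulation from the XGBoost literature, not a formula stated in this document. Here $l$ is the loss function, $f_k$ is the k-th tree, $T$ is the number of leaves in a tree, $w$ is its vector of leaf weights, and $\gamma$ and $\lambda$ are regularization hyperparameters. The symbol $K$ in this block denotes the number of trees and is unrelated to the first quantity threshold K used for case selection.

```latex
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k), \qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2}
```

The first sum measures how closely the ensemble fits the labels, and the second penalizes trees with many leaves or large leaf weights.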
The XGBoost algorithm parses the training samples and generates a model. The model captures the pattern, learned from the data, of which feature values lead to a case being reported. Feeding the transaction data of the month to be predicted into the model yields a money laundering suspicion score for each row of data. Screening personnel can use the money laundering suspicion score as a reference to allocate staff sensibly: customer cases with a high degree of suspicion are handed directly to experienced screening personnel, and customer cases with a very low degree of suspicion are handed to less experienced screening personnel.
In this embodiment the money laundering suspicion score takes a value between 0 and 1; the closer to 0, the less suspicious, and the closer to 1, the more suspicious. The score may also be expressed in other forms, such as a percentage scale or suspicion levels. In every case, a larger value indicates that the transaction data is more suspicious. Screening out suspicious cases according to the money laundering suspicion score specifically comprises:
setting a first score threshold F, and taking the transaction data whose money laundering suspicion score is greater than F as candidate cases; setting a first quantity threshold K; if the number of candidate cases is greater than K, selecting the K candidate cases with the highest money laundering suspicion scores as suspicious cases; otherwise, taking all candidate cases as suspicious cases.
Reviewing the suspicious cases specifically comprises:
determining the experience level of the screening personnel, dividing the screening personnel into M different grades, and determining the workload of the screening personnel of each grade; sorting the suspicious cases by money laundering suspicion score, dividing the suspicious cases into M different case sets according to the workload of the screening personnel of each grade, and having each case set reviewed by the screening personnel of the corresponding grade, which improves screening efficiency and reduces labor costs.
The value of the first score threshold F determines the number of candidate cases: the larger F is, the higher the proportion of candidate cases that genuinely carry a money laundering risk. The choice of F should also take into account how accurate the money laundering suspicion scores produced by the machine learning model are. F can be set by weighing these factors together.
The first quantity threshold K should be set according to the number and working capacity of the screening personnel, so that the suspicious cases screened out fall within the workload the screening personnel can handle and every one of them can be processed and judged.
With the money laundering suspicion score, the first score threshold F and the first quantity threshold K can be combined to control the number of suspicious cases flexibly and to reduce the number and proportion of invalid cases significantly.
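The selection rule using F and K can be sketched as a short function. The name `select_suspicious` is hypothetical, and returning row indices rather than the rows themselves is a choice made for this illustration.

```python
import numpy as np

def select_suspicious(scores, F, K):
    """Return indices of suspicious cases, highest score first.

    Candidates are rows with score > F. If there are more than K
    candidates, only the K highest-scoring ones are kept; otherwise
    all candidates are kept.
    """
    scores = np.asarray(scores, dtype=float)
    candidates = np.flatnonzero(scores > F)
    # A stable sort keeps the original order among equal scores.
    order = candidates[np.argsort(-scores[candidates], kind="stable")]
    return order[:K]
```

For scores (0.9, 0.2, 0.75, 0.95, 0.6) with F = 0.5 and K = 3, four rows pass the threshold and the three highest, at indices 3, 0, and 2, are kept.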
(1) The screening personnel are divided into M grades by experience. The first grade is the most experienced and can handle a workload of S_1 cases, the second grade can handle S_2 cases, ..., and the M-th grade is the least experienced and can handle S_M cases. One may then set K = S_1 + S_2 + ... + S_M and select K candidate cases as suspicious cases. The S_1 suspicious cases with the highest money laundering suspicion scores are assigned to the first-grade screening personnel, the following cases are assigned grade by grade in the same way, and finally the S_M suspicious cases with the lowest money laundering suspicion scores are assigned to the M-th-grade screening personnel.
(2) If the number of suspicious cases, Num, is not greater than S_1 + S_2 + ... + S_M, the cases can be assigned in proportion to capacity: the (S_1 / (S_1 + S_2 + ... + S_M)) × Num suspicious cases with the highest money laundering suspicion scores are assigned to the first-grade screening personnel, the following cases are assigned grade by grade in the same way, and finally the (S_M / (S_1 + S_2 + ... + S_M)) × Num suspicious cases with the lowest money laundering suspicion scores are assigned to the M-th-grade screening personnel.
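The two allocation cases above can be sketched as one function. The text does not say how to handle proportional shares that are not whole numbers. Rounding each share down and giving the leftover cases to the most experienced grades first is an assumption made here, as is the function name `allocate_cases`.

```python
def allocate_cases(ranked_cases, capacities):
    """Split cases, already sorted from most to least suspicious, among grades.

    capacities is [S_1, ..., S_M], with grade 1 (most experienced) first.
    Returns a list of M case batches, one per grade.
    """
    total = sum(capacities)
    n = len(ranked_cases)
    if n >= total:
        # Case (1): every grade receives its full workload.
        quotas = list(capacities)
    else:
        # Case (2): proportional shares, rounded down (assumed rule).
        quotas = [n * s // total for s in capacities]
        leftover = n - sum(quotas)
        for i in range(leftover):
            quotas[i % len(quotas)] += 1   # leftovers go to top grades first
    batches, start = [], 0
    for q in quotas:
        batches.append(ranked_cases[start:start + q])
        start += q
    return batches
```

With capacities (2, 3, 5) and 10 ranked cases, the grades receive 2, 3, and 5 cases. With only 5 ranked cases the shares are 1, 1.5, and 2.5, which this rounding rule turns into 2, 1, and 2.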
本实施例还提供一种基于XGBoost算法的洗钱交易识别系统,包括:This embodiment also provides a money laundering transaction identification system based on the XGBoost algorithm, including:
洗钱交易识别模块,被配置为:获取客户的交易数据,将客户交易数据送入训练好的机器学习模型中,机器学习模型给出洗钱可疑度得分,根据洗钱可疑度得分筛选出可疑案例,对可疑案例进行甄别;The money laundering transaction identification module is configured to: obtain the customer transaction data, send the customer transaction data into the trained machine learning model, the machine learning model gives a money laundering suspicious degree score, and screen out suspicious cases according to the money laundering suspicious degree score. Identify suspicious cases;
模型训练模块,被配置为:构建数据集,数据集包括多组交易数据,交易数据的标签为洗钱和非洗钱,交易数据包括多个维度的特征;The model training module is configured to: construct a data set, the data set includes multiple sets of transaction data, the tags of the transaction data are money laundering and non-money laundering, and the transaction data includes features of multiple dimensions;
提取与洗钱操作存在关联的可疑特征,对可疑特征所对应的值进行数据清洗、空缺值填充以及标准化处理,对可疑特征进行降维处理;Extract suspicious features associated with money laundering operations, perform data cleaning, fill in vacancies, and standardize the values corresponding to suspicious features, and perform dimensionality reduction processing on suspicious features;
将数据集划分为训练集和测试集,采用XGBoost算法构建机器学习模型,并使用训练集对机器学习模型进行训练,使用测试集测试机器学习模型的预测准确率,若预测准确率大于预设精度阈值,则得到训练好的机器学习模型,否则,利用测试集更新训练集,对机器学习模型进行迭代优化,得到训练好的机器学习模型。Divide the data set into training set and test set, use the XGBoost algorithm to build a machine learning model, use the training set to train the machine learning model, and use the test set to test the prediction accuracy of the machine learning model. If the prediction accuracy is greater than the preset accuracy If the threshold is set, the trained machine learning model is obtained; otherwise, the training set is updated with the test set, and the machine learning model is iteratively optimized to obtain the trained machine learning model.
构建数据集具体为:The construction of the dataset is as follows:
获取被报送的交易数据和未被报送的交易数据,采用最邻近重采样法对被报送的交易数据进行重采样,并加上随机扰动,得到数据集。Obtain the reported transaction data and the unreported transaction data, use the nearest neighbor resampling method to resample the reported transaction data, and add random disturbance to obtain a data set.
数据标准化处理具体为:The data standardization process is as follows:
对可疑特征中的离散型特征对应的数据,采用One-Hot重新编码;对可疑特征中的连续型特征对应的数据,用L2-范数标准化方法将数据按比例缩放,去除量纲。For the data corresponding to the discrete features in the suspicious features, use One-Hot to re-encode; for the data corresponding to the continuous features in the suspicious features, use the L2-norm normalization method to scale the data to remove the dimension.
The dimensionality reduction of the suspicious features is performed as follows: principal component analysis (PCA) is used to reduce the dimensionality of the suspicious features.
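For reference, the standard PCA projection can be written as below. The symbols are introduced here for illustration and do not appear in the original text: x_i is the preprocessed feature vector of transaction i (dimension d), n is the number of transactions, and k < d is the number of retained components, a value the text does not specify.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
C = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}
\]
\[
C\,w_j = \lambda_j w_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0
\]
\[
W_k = [\,w_1, \dots, w_k\,], \qquad z_i = W_k^{\top}(x_i - \bar{x}) \in \mathbb{R}^{k}
\]
```

The reduced vectors z_i would then replace the original suspicious features as model input. A common way to choose k is to keep the smallest number of components whose eigenvalues account for a chosen share of the total variance.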
The money laundering suspicion score lies between 0 and 1; the larger the value, the more suspicious the transaction data. The system further includes a suspicious case acquisition module configured to: set a first score threshold F and take the transaction data whose money laundering suspicion score is greater than F as candidate cases; set a first quantity threshold K; if the number of candidate cases is greater than K, select the K candidate cases with the highest money laundering suspicion scores as suspicious cases; otherwise, take all candidate cases as suspicious cases.
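The selection rule of the suspicious case acquisition module is simple enough to sketch directly. The function name, the dictionary input keyed by transaction identifier, and the example values for F and K are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def select_suspicious(scores: Dict[str, float], F: float, K: int) -> List[Tuple[str, float]]:
    """Keep transactions scoring above F, capped at the K highest scores."""
    candidates = [(tx_id, score) for tx_id, score in scores.items() if score > F]
    # Sort from the highest to the lowest suspicion score.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    if len(candidates) > K:
        return candidates[:K]
    return candidates


scores = {"t1": 0.95, "t2": 0.40, "t3": 0.80, "t4": 0.72}
top_two = select_suspicious(scores, F=0.7, K=2)
all_candidates = select_suspicious(scores, F=0.7, K=5)
```

With F = 0.7, three transactions qualify as candidates; K = 2 keeps only the two highest-scoring ones, while K = 5 keeps all three.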
Suspicious cases are screened as follows:
The experience level of each screener is determined, the screeners are divided into M different levels, and the workload of the screeners at each level is determined; the K suspicious cases are sorted by money laundering suspicion score and, according to the workload of each level, divided into M different case sets, each of which is screened by the screeners of the corresponding level.
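The allocation of ranked cases to screener levels might look like the following sketch. It assumes that workload is expressed as a case count per level, that levels are listed from most to least experienced, and that the highest-scoring cases go to the most experienced level; the text fixes none of these details, and the function and level names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def assign_cases(cases: Sequence[Tuple[str, float]],
                 workloads: Sequence[Tuple[str, int]]) -> Dict[str, List[Tuple[str, float]]]:
    """Split score-ranked cases into one case set per screener level."""
    total_capacity = sum(capacity for _, capacity in workloads)
    if total_capacity < len(cases):
        raise ValueError("total workload is smaller than the number of cases")

    ranked = sorted(cases, key=lambda pair: pair[1], reverse=True)
    assignment: Dict[str, List[Tuple[str, float]]] = {}
    start = 0
    for level, capacity in workloads:
        # Each level takes the next slice of the ranking, up to its workload.
        assignment[level] = ranked[start:start + capacity]
        start += capacity
    return assignment


cases = [("c1", 0.91), ("c2", 0.99), ("c3", 0.75), ("c4", 0.82), ("c5", 0.71)]
case_sets = assign_cases(cases, [("senior", 2), ("junior", 3)])
```

Here M = 2: the two highest-scoring cases form the case set for the senior level and the remaining three form the case set for the junior level.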
The preferred embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations based on the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning, or limited experimentation in accordance with the concept of the present invention shall fall within the scope of protection defined by the claims.
| Application Number | Publication Number | Status | Priority Date | Filing Date | Publication Date | Title |
|---|---|---|---|---|---|---|
| CN202210472746.7A | CN115222506A | Pending | 2022-04-29 | 2022-04-29 | 2022-10-21 | XGBoost algorithm-based money laundering transaction identification method and system |
| Country | Link |
|---|---|
| CN (1) | CN115222506A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115994816A (en)* | 2022-10-25 | 2023-04-21 | 上海浦东发展银行股份有限公司 | A method and system for identifying relevant customers of an anti-collection alliance |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112712369A (en)* | 2020-12-15 | 2021-04-27 | 中国建设银行股份有限公司 | Method and device for monitoring suspicious transactions of anti-money laundering |
| CN113095927A (en)* | 2021-02-23 | 2021-07-09 | 广发证券股份有限公司 | Method and device for identifying suspicious transactions of anti-money laundering |
| CN113362163A (en)* | 2021-06-29 | 2021-09-07 | 中国农业银行股份有限公司 | Early warning method and device and server |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |