CN114549164A

Info

Publication number: CN114549164A
Application number: CN202011244453.0A
Authority: CN
Inventors: 汪海涛
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd; CM Intelligent Mobility Network Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd; CM Intelligent Mobility Network Co Ltd
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2022-05-27

Abstract

The present application discloses a fraud probability analysis method, apparatus, electronic device and storage medium, belonging to the field of electronic information technology. The fraud probability analysis method includes: acquiring user information data and cleaning it to obtain a denoised data set; inputting the denoised data set into a trained machine learning model to obtain the community to which the user belongs; and calculating, according to the community to which the user belongs, the probability that the user is a fraudulent user. The method is fast and highly available, and can quickly and accurately determine whether a fraud risk exists.

Description

Fraud probability analysis method, apparatus, electronic device and storage medium

Technical field

The present application relates to the field of electronic information technology, and in particular to a fraud probability analysis method, apparatus, electronic device and storage medium.

Background

The Internet finance and credit industry has developed rapidly in recent years. As the industry has grown, the underground fraud industry chain has steadily infiltrated the field, and novel fraud schemes keep emerging, casting a shadow over the healthy development of the online credit industry. According to statistics, annual losses due to fraud amount to between 50 billion and 100 billion, which makes fraud risk prevention extremely important.

Summary of the invention

The purpose of the embodiments of the present application is to provide a fraud probability analysis method, apparatus, electronic device and storage medium, so as to at least address the existing problem of fraud in financial credit.

The technical solution of the present application is as follows:

According to a first aspect of the embodiments of the present application, a fraud probability analysis method is provided, including:

acquiring user information data and cleaning it to obtain a denoised data set;

inputting the denoised data set into a trained machine learning model to obtain the community to which the user belongs;

calculating, according to the community to which the user belongs, the probability that the user is a fraudulent user.

According to a second aspect of the embodiments of the present application, a fraud determination apparatus is provided, which may include:

an acquisition and denoising module, configured to acquire user information data and clean it to obtain a denoised data set;

a community analysis module, configured to input the denoised data set into a trained machine learning model to obtain the community to which the user belongs;

a fraud probability calculation module, configured to calculate, according to the community to which the user belongs, the probability that the user is a fraudulent user.

According to a third aspect of the embodiments of the present application, an electronic device is provided, which may include:

a processor;

a memory for storing instructions executable by the processor;

wherein the processor is configured to execute the instructions to implement the fraud determination method shown in any embodiment of the first aspect.

According to a fourth aspect of the embodiments of the present application, a storage medium is provided. When instructions in the storage medium are executed by a processor of an information processing apparatus or a server, the information processing apparatus or the server is caused to implement the fraud determination method shown in any embodiment of the first aspect.

The technical solutions provided by the embodiments of the present application bring at least the following beneficial effects:

In the embodiments of the present application, user information data is acquired and cleaned to obtain a denoised data set; the denoised data set is input into a trained machine learning model to obtain the community to which the user belongs; and the probability that the user is a fraudulent user is calculated according to that community. The method is fast and highly available, and can quickly and accurately determine whether a fraud risk exists.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present application.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the present application; they do not unduly limit the present application.

FIG. 1 is a schematic flowchart of a fraud probability analysis method according to an exemplary embodiment;

FIG. 2 is a detailed schematic flowchart of a fraud probability analysis method according to an exemplary embodiment;

FIG. 3 is a flowchart of a feature engineering process according to an exemplary embodiment;

FIG. 4 is a schematic diagram of the similarity definitions of the LINE algorithm according to an exemplary embodiment;

FIG. 5 is a schematic diagram of the best community of a vertex during the iterative process according to an exemplary embodiment;

FIG. 6 is a schematic structural diagram of an electronic device according to an exemplary embodiment;

FIG. 7 is a schematic diagram of the hardware structure of an electronic device according to an exemplary embodiment.

Detailed description

To help those of ordinary skill in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application described herein can be practiced in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.

FIG. 1 is a schematic flowchart of an embodiment of the fraud probability analysis method provided by the present application. As shown in FIG. 1, the fraud probability analysis method may include:

Step 100: acquire user information data and clean it to obtain a denoised data set;

Step 200: input the denoised data set into a trained machine learning model to obtain the community to which the user belongs;

Step 300: calculate, according to the community to which the user belongs, the probability that the user is a fraudulent user.

In the method of the above embodiment, user information data is acquired and cleaned to obtain a denoised data set; the denoised data set is input into a trained machine learning model to obtain the community to which the user belongs; and the probability that the user is a fraudulent user is calculated according to that community. The method is fast and highly available, and can quickly and accurately determine whether a fraud risk exists.

In the embodiment of the present application, acquiring user information data and cleaning it to obtain a denoised data set may include:

acquiring user information data;

deleting erroneous information from the user information data and filling in missing data to obtain the denoised data set.
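
As an illustration only, the cleaning step described above could be sketched in Python as follows. The field name `call_seconds`, the rule that a negative value marks an erroneous record, and the choice of median filling are all assumptions made for this sketch; the application does not specify how errors are detected or how gaps are filled.

```python
from statistics import median


def clean_records(records, numeric_fields):
    """Drop erroneous records, then fill missing numeric values.

    Assumptions (not from the application): a negative value marks an
    erroneous record, and missing values are filled with the median.
    """
    # Keep only records whose known numeric values are non-negative.
    valid = [
        dict(r)  # copy so the caller's records are left untouched
        for r in records
        if all(r.get(f) is None or r[f] >= 0 for f in numeric_fields)
    ]
    for f in numeric_fields:
        observed = [r[f] for r in valid if r.get(f) is not None]
        fill = median(observed) if observed else 0
        for r in valid:
            if r.get(f) is None:
                r[f] = fill
    return valid


records = [
    {"user": "a", "call_seconds": 10},
    {"user": "b", "call_seconds": None},   # missing value, will be filled
    {"user": "c", "call_seconds": -5},     # erroneous value, will be dropped
    {"user": "d", "call_seconds": 30},
]
cleaned = clean_records(records, ["call_seconds"])
```

Copying each record keeps the raw data available for auditing, which may matter when the cleaned set is written back to a database.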

In the embodiment of the present application, the user information data may include user communication behavior data, APP usage data and SMS gateway data.

In the embodiment of the present application, the trained machine learning model may be built through the following steps:

acquiring model-building information data and cleaning it to obtain a model-building denoised data set;

determining the feature information of the model-building denoised data set to obtain a data set with a relationship network;

training and testing the machine learning model with the data set with the relationship network to obtain the trained machine learning model.

The above model building process may include noise processing, feature extraction and edge weight definition. In a specific scenario, training the model and deploying it online is mainly divided into five steps, as shown in FIG. 2:

Data collection: the user's communication behavior, APP usage, SMS gateway data and other data are collected and saved to a database. This step mainly takes place on the base station side.

Data cleaning: because data may be missing or erroneous after collection, denoising is performed in this step. Missing data is filled in or deleted, and the result is imported into the database.

Feature engineering: this part mainly processes the massive data, analyzes the strength of the relationships between users, and obtains the relationship network between users.

Model training: once the user relationship network has been obtained, a new community detection algorithm is applied to find communities of strongly related users, the fraud situation of the users in each community is analyzed, and a fraud probability is obtained for each user.

Model deployment: this process uses a JanusGraph+ElasticSearch+HBase deployment. It is based on HBase storage, which can hold a large amount of data, and uses the ElasticSearch search engine, which searches quickly, so it is suitable for deployment in big data environments. In addition, the JanusGraph graph database comes with its own service that external applications can call, which saves time and effort.

Model calling process: external financial institutions call the model through the API provided by the JanusGraph database.

Input: the three identity elements of the user, namely name, mobile phone number and ID card number;

Output: the fraud probability of the user.

The purpose of feature engineering is to extract as many features as possible from the call records, short messages and other raw data, for use by the algorithm and the model. Here we study how to construct a user relationship network from user data (the weight of the edge between two users in the network represents the strength of the relationship between them), as shown in FIG. 3. The feature engineering process is mainly divided into noise processing of the data, feature extraction, and definition of the strength of the relationship between users. These three parts are described in detail below.

(1) Noise processing

Thousands of people are making phone calls at every moment. Besides everyday work and personal contacts, there are also marketing, food delivery, courier and similar calls. Using the number or duration of calls between two users as the strength of their relationship would clearly be unreasonable: the marketing, courier and food delivery calls we receive every day do not indicate a strong social tie with the caller. We therefore need to eliminate the noise from this part of the call data and extract the truly useful information. Here, based on user call behavior data, we use the PageRank algorithm to redefine the strength of the direct relationship between users. PageRank was originally an algorithm for ranking web pages; using it for noise processing is a major innovation.

The quantity assumption and the quality assumption of PageRank happen to agree with our principles for evaluating the importance of calls. If a person frequently calls others, that person is likely to work in marketing or a similar industry, so the weight of each of their calls should be relatively low, and their personal PageRank value is also relatively low. If a person often receives calls from others with relatively high PageRank values, that person's PageRank value should also be relatively high. Therefore, we use the PageRank algorithm to handle noise here.

The PageRank computation makes full use of two assumptions: the quantity assumption and the quality assumption. The steps are as follows:

Initial stage: web pages form a Web graph through their link relationships, and every page is assigned the same PageRank value. After several rounds of computation, each page obtains its final PageRank value. The current PageRank value of each page is updated in every round.

Updating PageRank scores within one round: each page distributes its current PageRank value evenly over its outgoing links, so that each link carries a corresponding weight. Each page then sums the weights carried by all incoming links pointing to it to obtain its new PageRank score (Formula 1). Once every page has an updated PageRank value, one round of PageRank computation is complete.

In Formula 1, PR(A) is the PageRank value of node A, L(B) is the out-degree of node B, and q is the damping coefficient. This step gives us the PageRank value of each user; we then define a weight for each call, which gives us the weights of all calls.

The input of this step is G(V, E) and the output is G(V, E′); that is, the edges are updated.
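
The round-by-round update described above can be sketched in Python. Because Formula 1 itself is not reproduced in this text, the sketch assumes the widely used form PR(A) = (1 - q)/N + q · Σ PR(B)/L(B), summed over the nodes B that link (here: place calls) to A; the normalization by N, the default q = 0.85, the fixed number of rounds and the even redistribution for users with no outgoing calls are assumptions of this sketch, not details taken from the application.

```python
def pagerank(out_links, q=0.85, rounds=50):
    """Iterative PageRank on a call graph.

    out_links maps each caller to the list of users they called.
    q is the damping coefficient (assumed default 0.85).
    """
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}          # every node starts with the same value
    for _ in range(rounds):
        new_pr = {v: (1.0 - q) / n for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = q * pr[u] / len(targets)   # split evenly over outgoing calls
                for v in targets:
                    new_pr[v] += share
            else:
                # Assumption: a user with no outgoing calls spreads its value evenly.
                for v in nodes:
                    new_pr[v] += q * pr[u] / n
        pr = new_pr
    return pr


# Hypothetical call graph: one account calls everyone, nobody calls it back.
calls = {"marketer": ["a", "b", "c"], "a": ["b"], "b": ["a"], "c": ["a"]}
pr = pagerank(calls)
```

In this toy graph the account that only places calls ends up with the lowest value, which matches the intuition given in the text for marketing numbers.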

(2) Feature extraction

After the first denoising step we have the weights of all calls, but these cannot be used directly to measure the relationship between users. Network communication is now highly developed, and many users with close relationships communicate over the network rather than by phone calls or short messages. We therefore need a method to uncover this information; the LINE algorithm can be used here.

LINE is a method based on the neighborhood similarity assumption. It can be viewed as an algorithm that constructs neighborhoods using BFS, and it can mine the second-order similarity between vertices. LINE can also be applied to weighted graphs. See FIG. 4.

First-order similarity describes the local similarity between pairs of vertices in the graph. Formally, if there is a direct edge between v_i and v_j, the first-order similarity is the edge weight ω_ij; if there is no direct edge, the first-order similarity is 0. For each undirected edge (i, j), a joint probability between vertices v_i and v_j is defined in terms of u_i, the low-dimensional vector representation of vertex v_i. An empirical distribution is defined alongside it, the KL divergence is used to measure the distance between the two distributions, and the first-order optimization objective O₁ is to minimize this distance.

First-order similarity alone is not enough to describe the relationship between vertices. For example, vertices 5 and 6 in FIG. 4 are not directly connected but are still very similar, which second-order similarity can capture. For a directed edge (i, j), the probability of generating the context vertex v_j given the vertex v_i is defined, where |V| is the number of context vertices. Using the KL divergence as the objective function gives the second-order optimization objective O₂.

Optimizing the first-order objective O₁ and the second-order objective O₂ with methods such as gradient descent or Newton's method yields the low-dimensional vector representation u_i of every vertex.

The input of this step is G(V, E′), and the output is the low-dimensional representation matrix U = (u_ij) of the vertices, whose dimensions are the low-dimensional vector length × the number of vertices.
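
For reference, the quantities named in this step have well-known forms in the published LINE method. The block below lists those standard forms as a reading aid; it is an assumption that the application's own formulas match them. The symbols W (total edge weight) and u′ (context vector) do not appear in the text above and are introduced here for illustration.

```latex
\begin{align}
p_1(v_i, v_j) &= \frac{1}{1 + \exp\!\left(-\vec{u}_i^{\top} \vec{u}_j\right)} \\
\hat{p}_1(i, j) &= \frac{\omega_{ij}}{W}, \qquad W = \sum_{(i,j) \in E} \omega_{ij} \\
O_1 &= -\sum_{(i,j) \in E} \omega_{ij} \log p_1(v_i, v_j) \\
p_2(v_j \mid v_i) &= \frac{\exp\!\left({\vec{u}'_j}^{\top} \vec{u}_i\right)}{\sum_{k=1}^{|V|} \exp\!\left({\vec{u}'_k}^{\top} \vec{u}_i\right)} \\
O_2 &= -\sum_{(i,j) \in E} \omega_{ij} \log p_2(v_j \mid v_i)
\end{align}
```

Both objectives result from minimizing the KL divergence between the empirical and model distributions and dropping the terms that do not depend on the vectors.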

(3) Edge weight definition

The feature extraction process above gives us a feature vector for each node. We then use formula (7) to compute the correlation between the feature vectors of each pair of nodes and use it as the weight of the edge between them.

E(i,j) = Cov(u_i, u_j)    (7)

However, this gives every pair of nodes a correlation, which is not feasible at big-data scale, so edges that carry little information should be discarded. We therefore keep only the 30 highest-weight edges of each node, which reduces the data size and speeds up computation. This gives us the user relationship network.

The input of this step is G(V, E′) and the low-dimensional representation matrix U of the vertices, and the output is G(V, E″); that is, the edge weights of the graph are updated.
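
A minimal Python sketch of this step follows. It assumes that Cov(u_i, u_j) means the sample covariance taken across the components of the two vectors (dividing by n - 1); the application does not state the normalization. The function names and the toy embeddings are hypothetical.

```python
def covariance(x, y):
    """Sample covariance across the components of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def top_k_edges(embeddings, k=30):
    """For each node keep only its k highest-weight edges (k = 30 in the text)."""
    edges = {}
    for i, u_i in embeddings.items():
        scored = [
            (covariance(u_i, u_j), j)
            for j, u_j in embeddings.items()
            if j != i
        ]
        scored.sort(reverse=True)                     # highest weight first
        edges[i] = [(j, w) for w, j in scored[:k]]
    return edges


# Toy embeddings; k is lowered to 1 so the pruning is visible.
emb = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0], "c": [3.0, 2.0, 1.0]}
pruned = top_k_edges(emb, k=1)
```

The all-pairs loop is quadratic in the number of users, so a production system would likely need an approximate nearest-neighbor search or a distributed job; that choice is outside what the text describes.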

In the embodiment of the present application, determining the feature information of the model-building denoised data set to obtain a data set with a relationship network may include:

determining the strength of the relationship between each user and other users in the model-building denoised data set;

based on the strength of the relationship between each user and other users, using the neighborhood similarity assumption to determine the relationship network of each user in the normal, non-missing data set, so as to obtain the data set with the relationship network.

In the embodiment of the present application, determining the strength of the relationship between each user and other users in the model-building denoised data set may include:

ranking the relationships between each user and other users in the model-building denoised data set using the quantity assumption and the quality assumption of the PageRank algorithm, so as to obtain the strength of the relationship between each user and other users in the normal, non-missing data set.

In the embodiment of the present application, the trained machine learning model is expressed in terms of the following quantities. Σ_in is the sum of the weights of the edges inside community c. Σ_kn is the sum of the weights of the edges that connect any user to a node of community c; because i belongs to community c, it includes both the edges between node i and nodes inside the community and the edges between node i and nodes outside the community. The corresponding sum for node j is defined in the same way: because j belongs to community c, it includes both the edges between node j and nodes inside the community and the edges between node j and nodes outside the community. Σ_tot replaces both of these sums; it is the sum of the edge weights inside community c plus the sum of the weights of the edges connecting community c to other communities.

The above model is obtained by modifying the community detection algorithm (Louvain algorithm).

The community detection algorithm does not need the exact modularity of each community; it only needs to compare the change in modularity after a node is added to a community, so ΔQ is what must be computed. The change in modularity when node i is assigned to a community is given by formula (9), in which k_i,in is the sum of the weights of the edges between node i and all nodes in the community (it corresponds to twice the actual internal weight added to the new community, because each such edge is counted twice), and k_i is the sum of the weights of all edges attached to node i.

The formula factors out a common coefficient, so an implementation only needs to evaluate the remaining expression.
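
As a reading aid, the modularity change is commonly written in the following form for the Louvain method. It is an assumption that the application's formula (9) matches it. The symbol m (the total weight of all edges in the graph) does not appear in the text above and is introduced here; k_{i,in} is taken with the doubled counting that the text describes.

```latex
\begin{align}
\Delta Q
  &= \left[ \frac{\Sigma_{in} + k_{i,in}}{2m}
       - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^{2} \right]
   - \left[ \frac{\Sigma_{in}}{2m}
       - \left( \frac{\Sigma_{tot}}{2m} \right)^{2}
       - \left( \frac{k_i}{2m} \right)^{2} \right] \\
  &= \frac{k_{i,in}}{2m} - \frac{\Sigma_{tot}\, k_i}{2m^{2}}
   = \frac{1}{2m} \left( k_{i,in} - \frac{\Sigma_{tot}\, k_i}{m} \right)
\end{align}
```

Under this form the common coefficient is 1/(2m), and because it is the same positive constant for every candidate community, comparing the bracketed expression alone is enough to rank the candidates.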

The specific calculation process is as follows:

Step 1: Construct the graph G(V, E)

The graph G(V, E″) obtained in the preceding steps is stored in a distributed manner and re-denoted as G(V, E).

Step 2: Compute the best community for each node

This step describes in detail how formula (9) is applied.

For each node i, obtain the community to which it currently belongs and the communities to which all nodes connected to node i belong, denoted as the set S, and then compute:

c′_i = argmax ΔQ(c_i)    (10)

where ΔQ(c_i) is the change in modularity when node i is assigned to community c_i, computed according to formula (9). The c_i that maximizes this change is denoted c′_i.
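
The selection in formula (10) can be sketched in Python as below. The sketch assumes the commonly used reduced gain k_i,in - Σ_tot · k_i / m, where m is the total edge weight of the graph; m, the function names and the candidate table layout are assumptions of this sketch rather than details given in the text.

```python
def modularity_gain(k_i_in, k_i, sigma_tot, m):
    """Reduced gain; multiply by 1 / (2 * m) to obtain the full change.

    k_i_in:    weight between node i and the candidate community
    k_i:       total weight of edges attached to node i
    sigma_tot: total weight attached to the candidate community
    m:         total edge weight of the graph (assumed symbol)
    """
    return k_i_in - sigma_tot * k_i / m


def best_community(candidates, k_i, m):
    """Pick the candidate community with the largest gain.

    candidates maps a community id to a (k_i_in, sigma_tot) pair and plays
    the role of the set S in the text.
    """
    return max(
        candidates,
        key=lambda c: modularity_gain(candidates[c][0], k_i, candidates[c][1], m),
    )


# Hypothetical numbers for one node with two candidate communities.
S = {"A": (1.0, 5.0), "B": (4.0, 10.0)}
choice = best_community(S, k_i=3.0, m=20.0)
```

Dropping the constant factor does not change which community wins, so the reduced form saves one multiplication per candidate.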

Step 3: Update the community to which each node belongs

That is, for each vertex v_i, set its community c_i to the newly computed best community c′_i.

Step 4: Iterate steps 2-3 p times

The reason for this is that although direct iteration converges slowly, each iteration is fast to compute.

Step 5: Compute the best community for each node again

As in step 2, compute the best community of each node i according to formula (10).

Step 6: Update the community to which each node belongs

For each vertex v_i, let v_j be a vertex connected to v_i with v_j ∈ c′_i. If one of the following conditions is satisfied, set the community c_i of v_i to the newly computed best community c′_i; otherwise do not update the community information.

The current iteration round is odd, and i < j;

The current iteration round is even, and i > j;

For example, as shown in FIG. 5, suppose that vertex 1 initially belongs to community A and vertex 2 to community B, and that in the first round of iteration the best community of vertex 1 is computed to be B, so that v_j here is vertex 2. Since 1 < 2, the condition is satisfied, and the community of vertex 1 is updated to B. This strategy effectively avoids the problem of communities that cannot merge: without it, after the first iteration vertex 1 would move to community B and vertex 2 to community A, and the two would then keep swapping without ever merging. The strategy has a similar effect in other special cases.
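
The parity rule of step 6 can be sketched in a few lines of Python. The function names and the dictionaries used to hold the state are hypothetical, and the sketch tracks only one neighbor v_j per vertex for brevity.

```python
def should_move(i, j, round_number):
    """Parity rule: odd rounds require i < j, even rounds require i > j.

    i is the vertex considering a move and j is the connected vertex that
    already belongs to the target community.
    """
    if round_number % 2 == 1:
        return i < j
    return i > j


def apply_round(community, best, neighbor, round_number):
    """Apply one synchronous update round under the parity rule."""
    updated = dict(community)
    for i, target in best.items():
        if target != community[i] and should_move(i, neighbor[i], round_number):
            updated[i] = target
    return updated


# The two-vertex example from the text: each vertex prefers the other's community.
community = {1: "A", 2: "B"}
best = {1: "B", 2: "A"}
neighbor = {1: 2, 2: 1}
after_round_1 = apply_round(community, best, neighbor, round_number=1)
```

In round 1 only vertex 1 is allowed to move, so both vertices end up in community B instead of swapping places.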

Step 7: Count the number of nodes that changed community in steps 5-6. If this number is greater than 0, repeat steps 5-6; otherwise go to step 8.

Step 8: Reconstruct the graph G′(V′, E′) according to the community information of the nodes. Each vertex in G′(V′, E′) corresponds to a community in the original graph G(V, E), and the community id in G(V, E) is used as the vertex id in G′(V′, E′). The internal weight of a vertex in G′(V′, E′) is the sum of the weights of all edges inside the corresponding community in G(V, E). The edges of G′(V′, E′) are the edges between communities in G(V, E).

Repeat steps 2-8 until no node updates its community information in step 3.
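
The graph reconstruction in step 8 could look like the following Python sketch. The edge representation (a dictionary keyed by vertex pairs, each undirected edge stored once) is an assumption, as is summing the weights of parallel edges between the same two communities; the text only says that the edges of the new graph are the edges between communities.

```python
from collections import defaultdict


def compress_graph(edges, community):
    """Collapse each community into a single vertex.

    edges:     {(u, v): weight}, each undirected edge listed once
    community: {vertex: community id}
    Returns (internal, between): the internal weight per community and the
    summed weight of the edges between each pair of communities.
    """
    internal = defaultdict(float)
    between = defaultdict(float)
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        if cu == cv:
            internal[cu] += w                  # edge stays inside one community
        else:
            key = (cu, cv) if cu <= cv else (cv, cu)
            between[key] += w                  # assumption: parallel edges are summed
    return dict(internal), dict(between)


edges = {(1, 2): 1.0, (2, 3): 2.0, (3, 4): 3.0}
community = {1: "A", 2: "A", 3: "B", 4: "B"}
internal, between = compress_graph(edges, community)
```

The compressed graph is much smaller than the original, which is what makes repeating steps 2-8 on it affordable.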

Through the above steps we obtain the community to which each node belongs. The probability that a new user is a fraudulent user is calculated by formula (11), where v_i is the new user and c_i is the community to which the new user belongs: the probability is the number of users in that community marked as fraudulent divided by the total number of users in the community.
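
The ratio described above is simple enough to sketch directly in Python. The function and variable names are hypothetical, and counting the new user among the community's members is an assumption; the text does not say whether the new user is included in the total.

```python
def fraud_probability(user, community_of, fraud_labels):
    """Share of users in the user's community that are marked as fraudulent.

    community_of: {user id: community id}, including the new user
    fraud_labels: {user id: True} for users already marked as fraudulent
    """
    c = community_of[user]
    members = [u for u, cu in community_of.items() if cu == c]
    flagged = sum(1 for u in members if fraud_labels.get(u, False))
    return flagged / len(members)


community_of = {"u1": "A", "u2": "A", "u3": "A", "u4": "B", "new": "A"}
fraud_labels = {"u1": True}
p = fraud_probability("new", community_of, fraud_labels)
```

Here one of the four members of community A is marked as fraudulent, so the new user receives a probability of 0.25.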

The feature engineering part of the method steps in the above embodiment incorporates the idea of the PageRank algorithm: edge weights are computed on top of the node weights, which effectively reduces data noise and improves the accuracy of the result compared with existing methods. The method also differs from the ordinary Louvain algorithm: it converges faster, and it includes a special optimization (step 6) for cases that would otherwise fail to converge, which overcomes the problem of communities that cannot merge and makes the resulting communities more accurate.

Based on the same inventive concept, an embodiment of the present application further provides a fraud probability analysis apparatus.

In the apparatus of the above embodiment, user information data is acquired and cleaned to obtain a denoised data set; the denoised data set is input into a trained machine learning model to obtain the community to which the user belongs; and the probability that the user is a fraudulent user is calculated according to that community. The method is fast and highly available, and can quickly and accurately determine whether a fraud risk exists.

基于同一发明构思，本申请实施例还提供了一种电子设备，具体结合图6 进行详细说明。Based on the same inventive concept, an embodiment of the present application further provides an electronic device, which will be described in detail with reference to FIG. 6 .

可选的，如图6所示，本申请实施例还提供一种电子设备600，包括处理器601，存储器602，存储在存储器602上并可在处理器601上运行的程序或指令，该程序或指令被处理器601执行时实现上述欺诈概率分析方法实施例的各个过程，且能达到相同的技术效果，为避免重复，这里不再赘述。Optionally, as shown in FIG. 6 , an embodiment of the present application further provides anelectronic device 600, including aprocessor 601, amemory 602, a program or instruction stored in thememory 602 and executable on theprocessor 601, the program Or, when the instruction is executed by theprocessor 601, each process of the above-mentioned embodiment of the fraud probability analysis method can be realized, and the same technical effect can be achieved. In order to avoid repetition, details are not repeated here.

It should be noted that the electronic devices in the embodiments of the present application include the above-mentioned mobile electronic devices and non-mobile electronic devices.

The electronic device of the above embodiment acquires user information data and cleans it to obtain a denoising data set; inputs the denoising data set into a trained machine learning model to obtain the community to which the user belongs; and calculates, according to that community, the probability that the user is a fraudulent user. The method is fast and highly available, and can quickly and accurately determine whether a fraud risk exists.

FIG. 7 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of the present application.

The electronic device 700 includes, but is not limited to, a radio frequency unit 701, a network module 702, an audio output unit 703, an input unit 704, a sensor 705, a display unit 706, a user input unit 707, an interface unit 708, a memory 709, a processor 710, and other components.

Those skilled in the art will understand that the electronic device 700 may further include a power supply (such as a battery) that supplies power to the components. The power supply may be logically connected to the processor 710 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The structure shown in FIG. 7 does not limit the electronic device: the electronic device may include more or fewer components than shown, combine certain components, or arrange the components differently, and the details are not repeated here.

It should be understood that, in this embodiment of the present application, the input unit 704 may include a graphics processing unit (GPU) 7041 and a microphone 7042. The graphics processor 7041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The display unit 706 may include a display panel 7061, which may be configured as a liquid crystal display, an organic light-emitting diode display, or the like. The user input unit 707 includes a touch panel 7071 and other input devices 7072. The touch panel 7071, also called a touch screen, may include two parts: a touch detection device and a touch controller. The other input devices 7072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, and are not described further here. The memory 709 may be used to store software programs and various data, including but not limited to application programs and an operating system. The processor 710 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 710.

The electronic device hardware of the above embodiment acquires user information data and cleans it to obtain a denoising data set; inputs the denoising data set into a trained machine learning model to obtain the community to which the user belongs; and calculates, according to that community, the probability that the user is a fraudulent user. The method is fast and highly available, and can quickly and accurately determine whether a fraud risk exists.

An embodiment of the present application further provides a computer storage medium storing computer-executable instructions, which are used to implement the fraud probability analysis method described in the embodiments of the present application.

In some possible implementations, aspects of the method provided by the present application may also be implemented as a program product that includes program code. When the program product runs on a computer device, the program code causes the computer device to perform the steps of the methods according to the various exemplary embodiments of the present application described above in this specification; for example, the computer device may perform the fraud probability analysis method described in the embodiments of the present application.

The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices, and computer program products according to the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable information processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable information processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable information processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable information processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to include them as well.

Claims

1. A fraud probability analysis method, comprising:

acquiring user information data and cleaning the user information data to obtain a denoising data set;

inputting the denoising data set into a trained machine learning model to obtain a community to which the user belongs;

and calculating the probability that the user is a fraudulent user according to the community to which the user belongs.

2. The method of claim 1, wherein acquiring the user information data and cleaning the user information data to obtain the denoising data set comprises:

acquiring user information data;

and deleting erroneous information from the user information data and filling in missing data to obtain the denoising data set.

3. The method of claim 1, wherein the user information data comprises: user communication behavior data, app usage data, and short message gateway data.

4. The method of any of claims 1-3, wherein the trained machine learning model is built by:

obtaining model construction information data, and cleaning the model construction information data to obtain a model construction denoising data set;

determining feature information of the model construction denoising data set to obtain a data set with a relationship network;

and training and testing the machine learning model by using the data set with the relationship network to obtain the trained machine learning model.

5. The method of claim 4, wherein determining the feature information of the model construction denoising data set to obtain the data set with the relationship network comprises:

determining the strength of the relationship between each user and the other users in the model construction denoising data set;

and determining the relationship network of each user in the normal non-missing data set by using a neighborhood similarity assumption based on the strength of the relationship between each user and the other users, to obtain the data set with the relationship network.

6. The method of claim 5, wherein determining the strength of the relationship between each user and the other users in the model construction denoising data set comprises:

ranking the relationships between each user and the other users in the model construction denoising data set by using the quantity assumption and the quality assumption of the webpage ranking algorithm, to obtain the strength of the relationship between each user and the other users in the normal non-missing data set.

7. The method of claim 4, wherein the trained machine learning model is:

[formula not reproduced in the source text]

wherein sigma_in is the sum of the weights of the edges within community c; sigma_kn is the sum of the weights of all the edges connecting the user to nodes in community c, which, because i belongs to community c, includes both the edges between node i and nodes inside the community and the edges between node i and nodes outside the community; the corresponding term for node j [symbol not reproduced in the source text] is the sum of the weights of all the edges connecting the user to nodes in community c, which, because j belongs to community c, includes both the edges between node j and nodes inside the community and the edges between node j and nodes outside the community; and sigma_tot, which replaces two terms [symbols not reproduced in the source text], is the sum of the weights of the edges within community c and the weights of the edges connecting community c to other communities.

8. An apparatus for fraud probability analysis, comprising:

the acquisition denoising module is used for acquiring user information data and cleaning the user information data to obtain a denoising data set;

the community analysis module is used for inputting the denoising data set into a trained machine learning model to obtain a community to which the user belongs;

and the fraud probability calculation module is used for calculating the probability that the user is a fraudulent user according to the community to which the user belongs.

9. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the fraud probability analysis method of any of claims 1-7.

10. A storage medium, characterized in that instructions in the storage medium, when executed by a processor of an information processing apparatus or a server, cause the information processing apparatus or the server to implement the fraud probability analysis method according to any one of claims 1 to 7.