CN116384551A


Info

Publication number: CN116384551A
Application number: CN202310226816.5A
Authority: CN
Inventors: 李朋骏; 辛辉; 谢镇玺; 王金龙; 熊晓芸
Original assignee: Qingdao University of Technology
Current assignee: Qingdao University of Technology
Priority date: 2023-03-10
Filing date: 2023-03-10
Publication date: 2023-07-04

Abstract

The invention discloses a knowledge-graph-based method for predicting the violation risk of listed companies. The method constructs a hypergraph of company related parties and a Hyper-GNN hypergraph representation learning model, so that the risk assimilation factors of related-party clusters are incorporated into violation prediction. It improves the propagation mechanism of a graph propagation algorithm: a community detection algorithm delimits the range over which violation risk can spread, the closeness of interest ties between companies and a probabilistic mechanism are introduced into the graph walk algorithm, the walk paths along which risk sources propagate among companies are simulated, and the violation risk of suspicious companies is evaluated from the convergence of the algorithm after many iterations. It also designs a Legal-GNN neural network that fuses company risk features with related-party cluster features, which increases the difference between the features of sparse suspicious company nodes and those of the large population of legitimate company nodes, reduces feature similarity in the node embeddings, and improves the model's precision in identifying violating companies.

Description

Translated from Chinese

A knowledge-graph-based method for predicting the violation risk of listed companies

Technical field

The present invention relates to the fields of knowledge graph technology and corporate legal risk control, and in particular to a knowledge-graph-based method for predicting the violation risk of listed companies.

Background

According to statistics, the number of breach-of-contract and violation cases involving listed companies has risen year by year, disrupting the orderly operation of financial markets. Because the equity investments and guarantee amounts of listed companies far exceed those of small and medium-sized enterprises, and because many shareholders are involved, a failure by regulators to identify and control violation risk in time damages the property interests of partners and related investors. With the rise of artificial intelligence, researchers have applied machine learning and deep learning models to corporate legal risk control. By analyzing large-scale data such as corporate operating indicators, these models infer a company's tendency to violate the law, and they have enabled automated assessment of corporate legal risk and prediction of corporate violations.

In practice, however, listed companies are linked by diverse interest relationships such as investment, guarantee, and shareholding, so risk is transmitted in intricate ways and does not depend only on a company's own operating condition. In the information age, corporate violations also tend to be concealed. Companies frequently conceal or misreport their operating information, and the authenticity of accounting data cannot be guaranteed. As a result, prediction approaches based on data analysis break down when a violating company shows no anomalies in its accounts or credit records, and regulators struggle to predict violations and control risk for listed companies.

As large-scale semantic association networks, knowledge graphs are well suited to linking knowledge and have become an emerging technology in financial risk control. To address the problems above, Huidong Wu et al. proposed a knowledge graph reasoning framework based on the path search method SFE, which infers from the graph structure which companies carry audit fraud risk. Xuting Mao et al. used a knowledge graph to obtain capital-flow features between specified companies, such as the number of transactions and the proportion of loan transactions, and combined them with machine learning models to identify companies suspected of fraud. Chunyan Xue quantified the degree of association between nodes with a risk propagation probability obtained by jointly evaluating multiple relationship types, simulated the propagation of loan-default risk between associated companies, and then issued risk warnings for companies with high credit risk.

Although existing methods use the associative nature of knowledge graphs to simulate the transmission of violation risk between companies and to predict violation tendencies, they still have shortcomings. On the one hand, path-based risk transmission methods consider only pairwise, point-to-point transmission between two companies and ignore the hidden danger posed by clusters of companies with improper shared interests. They also quantify the degree of risk transmission by default with the number of transactions, the loan amount, or self-assigned weights, which makes the results subjective. On the other hand, prediction methods that rely on graph representation learning are limited by the imbalance between legitimate and violating companies. A large cluster of legitimate companies surrounds sparse violating nodes, so when the neighborhood of a target node is sampled, the features of violating nodes are mixed with those of the far more numerous legitimate nodes. The graph neural network then fails to represent the violation features effectively, and prediction accuracy drops when a violating company has many legitimate related parties.

Summary of the invention

To solve the above technical problems, the present invention provides a knowledge-graph-based method for predicting the violation risk of listed companies. It uses knowledge graph technology to offset the over-reliance of traditional violation prediction on economic indicators, uses hypergraph representation learning to capture the clustering effect of corporate violations, and improves the graph propagation algorithm and the graph neural network structure. These changes address two problems: risk transmission being modeled subjectively and one-sidedly, and the difficulty of predicting the violation tendency of violating companies that have many legitimate related parties and trading partners.

To achieve the above object, the technical solution of the present invention is as follows:

A knowledge-graph-based method for predicting the violation risk of listed companies, comprising the following steps:

Step 1: Collect the operating indicators, violation events, and related-party history of listed companies, obtain daily incremental information on corporate violations with a web crawler, clean the collected data, and convert all of it into structured data;

Step 2: Design the schema layer of a knowledge graph of listed-company related parties and violation information, convert the multi-dimensional structured data into knowledge graph triples, and store the resulting entity and relationship data in a Neo4j graph database;

Step 3: Based on that knowledge graph, use the Cypher language to retrieve, for each listed-company node in turn, the company nodes within its second-order neighborhood that are directly or indirectly associated with it, and obtain the categories of potential association between listed companies together with the association strength values, which are expressed on different scales;

Step 4: From the query results, quantify the closeness of interest ties between companies with a two-stage risk transfer probability calculation that combines quantile division with conditional probability estimation, and then evaluate the violation risk transfer probability between companies;

Step 5: With the violation risk transfer probabilities between companies as edge weights and listed companies as nodes, build a large-scale simulated company cluster with the networkx toolkit, converting the initial knowledge graph into the corporate violation risk propagation network G_risk;

Step 6: Use the Louvain algorithm to divide G_risk into corporate risk propagation subgraphs G_sub, design the LegalRank graph propagation algorithm, simulate the walk paths along which risk sources spread within each G_sub, and derive the violation risk index of each listed company from the convergence of the algorithm after many iterations;

Step 7: Store the steady-state violation risk index of every listed company in each subgraph in the Neo4j graph database, adding it through Cypher statements as the node property LegalRiskScore, the violation risk propagation score of each company node;

Step 8: Construct a related-party hypergraph from information about shared shareholders, auditors, and investors among companies, build the Hyper-GNN hypergraph neural network, which can represent the features of related-party clusters, and use it to produce a vector representation of those cluster features;

Step 9: First produce a vector representation of each company's own features from its basic information and its LegalRiskScore, then combine it with the related-party cluster feature vector obtained from Hyper-GNN; then define the neural network loss function for violation prediction and build the forward-propagation and back-propagation modules of the network layers, completing the Legal-GNN model for predicting violations in the following year;

Step 10: Train the Legal-GNN model iteratively with k-fold cross-validation, and use the trained model to predict the violation risk of listed companies.
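The k-fold cross-validation of Step 10 can be sketched without any dependencies. The helper name `k_fold_indices`, the choice of k = 5, and the sample count are illustrative assumptions; the patent does not state k or how the folds are formed.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds.

    Hypothetical helper: fold sizes differ by at most one sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Ten labeled companies split into five folds (toy sizes).
folds = list(k_fold_indices(n_samples=10, k=5))
```

Because violating companies are sparse relative to legitimate ones, a stratified split that keeps the class ratio in every fold would probably be a better fit in practice; that refinement is not shown here.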

In the above solution, Step 4 proceeds as follows:

Step1: Use Cypher statements to retrieve the first-order interest relationships of companies stored in the Neo4j graph database, add the retrieved relationship attribute values to two lists, and sort them in ascending order by inter-company shareholding ratio or investment amount;

Step2: Find the quintile positions of the two first-order interest relationship lists, and rate the closeness level of each first-order interest tie between listed companies according to the interval into which the target company's relationship attribute value falls;

Step3: From the companies' violation records over the past five years, compute year by year the probability of a violation in the following year for each relationship category and closeness level, and use it as the first-order risk transfer probability P_lv1 for risk propagation between companies. The formula is:

P_lv1-rs = Num_illegal / Num_pair-rs, r ∈ {r_stock, r_amount}, s ∈ [1, 5]

where r is the interest relationship category, s is the closeness level for that category obtained in Step2, Num_pair-rs is the total number of related-party pairs with that relationship category and closeness level, and Num_illegal is the number of those pairs in which one party violated the law in the current year and the other party did so in the following year;

Step4: For the three kinds of inter-company interest ties (shareholding, controlling interest, and investment), decompose each second-order tie into two stages, return to Step3 to compute the conditional probability of each stage, and from these obtain the risk transfer probability for second-order interest ties. Ties through a shared related person fall into four kinds: chairman, general manager, senior executive, and auditor. Merge the first three into a single employment tie, and apply the quantile and probability calculations of Step2 and Step3 to obtain the risk transfer probability for second-order shared-related-person ties;

Step5: Together with the first-order risk transfer probabilities, use a multivariate conditional probability calculation to obtain the joint-liability risk transfer probability P_(A,B) between listed companies. The formula is:

P_(A,B) = 1 - Π_d(A,B) (1 - P_d), P_d ∈ {P_lv1, P_lv2-α, P_lv2-β}

where d(A,B) denotes an association path between any two companies and P_d is the risk transfer probability of that path.
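The quintile rating, the first-order probability estimate, and the path combination formula can be sketched as follows. The function names, the rank-based way of assigning quintile levels, and all toy numbers are assumptions made for illustration; the second-order path probability 0.10 is a stand-in rather than a value derived by the two-stage decomposition.

```python
from collections import defaultdict


def quintile_level(value, sorted_values):
    """Return a closeness level 1..5 from the rank of `value` in an ascending list."""
    n = len(sorted_values)
    # Count observations <= value, then map that rank onto five equal bands.
    rank = sum(1 for v in sorted_values if v <= value)
    return min(5, max(1, -(-5 * rank // n)))  # ceil(5 * rank / n), clamped to 1..5


def first_order_probabilities(pairs):
    """Estimate P_lv1-rs = Num_illegal / Num_pair-rs per (relation, level).

    `pairs` holds tuples (relation, level, violated_next_year); the flag is True
    when one party violated this year and the other did so the following year.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for relation, level, violated_next_year in pairs:
        totals[(relation, level)] += 1
        hits[(relation, level)] += int(violated_next_year)
    return {key: hits[key] / totals[key] for key in totals}


def combined_transfer_probability(path_probabilities):
    """P_(A,B) = 1 - prod(1 - P_d) over the association paths d between A and B."""
    survive = 1.0
    for p in path_probabilities:
        survive *= 1.0 - p
    return 1.0 - survive


# Toy data, invented for illustration only.
stakes = sorted([0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.35, 0.51, 0.80])
level = quintile_level(0.30, stakes)
pairs = [("r_stock", 4, True), ("r_stock", 4, False),
         ("r_stock", 4, False), ("r_stock", 4, False)]
p_lv1 = first_order_probabilities(pairs)[("r_stock", level)]
p_ab = combined_transfer_probability([p_lv1, 0.10])
```

The combination treats each path as an independent chance of transmitting risk, so adding a path can only raise P_(A,B), and the result stays within [0, 1] regardless of how many paths connect the two companies.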

In the above solution, Step 6 proceeds as follows:

Step1: Divide the network into corporate risk propagation subgraphs G_sub with the Louvain community detection algorithm. With modularity as the optimization objective, the corporate violation risk propagation network G_risk is divided into subgraphs of varying size, which serve as the simulated clusters for risk transmission. In the Louvain modularity evaluation formula, Q_part is the community modularity value, ∑in is the sum of the weights of all edges inside community c, ∑out is the sum of the weights of the edges connected to nodes in community c, and m is the total number of edges in G_risk;

Step2: Convert each subgraph G_sub in turn into an adjacency matrix and normalize each column of the matrix to obtain the risk propagation probability matrix W_ij;

Step3: From the number of violations committed in the current year by the companies in each subgraph, design a function that evaluates a company's degree of violation risk. Its value is the initial risk carried by the company in risk propagation, and it is used to generate the initial risk vector PR₀ of the propagation algorithm;

Step4: Simulate the propagation of corporate violation risk with the LegalRank graph propagation algorithm. Through many iterations of walks from the risk sources, obtain each listed company's violation risk propagation score, LegalRiskScore, once the algorithm has converged. In the LegalRank propagation formula, PR is the node risk value at the current iteration, a damping coefficient is applied, n_j is a company node adjacent to company node n_i, k is the iteration index, D(n_i) is the company cluster subgraph G_sub to which n_i belongs, W_ij is the risk propagation probability matrix from Step2, and Risk_comi is the initial risk value of company node n_i in the initial risk vector PR₀.

In a further technical solution, Step1 is as follows: try in turn to merge each company node with each of its adjacent company nodes, compute the modularity gain, and assign the node to the community that yields the largest gain. Repeat this until the community of every company node stops changing; this constitutes one round. Then compress all nodes of each community into a single new node, whose edge weight is the sum of the edge weights of all nodes in the original community, and run a new round of modularity calculation and node merging. Continue until the modularity of each community is essentially constant, which completes the division into company association subgraphs G_sub.
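The quantity that this merging procedure optimizes can be illustrated with a small modularity function. It uses the standard weighted modularity Q = Σ_c [L_c / m − (d_c / 2m)²]; whether the patent's own formula has exactly this form is an assumption, and the function name, toy graph, and partitions are invented for the example.

```python
def modularity(edges, community_of):
    """Weighted modularity Q = sum_c [ L_c / m - (d_c / (2 m))**2 ].

    edges: iterable of (u, v, weight) for an undirected graph.
    community_of: dict mapping each node to a community label.
    L_c is the edge weight inside community c, d_c the total weighted degree
    of its nodes, and m the total edge weight of the graph.
    """
    m = 0.0
    inside = {}
    degree = {}
    for u, v, w in edges:
        m += w
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            inside[cu] = inside.get(cu, 0.0) + w
    return sum(inside.get(c, 0.0) / m - (d / (2.0 * m)) ** 2
               for c, d in degree.items())


# Two triangles of companies joined by one edge; every weight is 1.
edges = [("A", "B", 1.0), ("B", "C", 1.0), ("A", "C", 1.0),
         ("D", "E", 1.0), ("E", "F", 1.0), ("D", "F", 1.0),
         ("C", "D", 1.0)]
singletons = {n: n for n in "ABCDEF"}
merged_ab = dict(singletons, B="A")          # one greedy move: B joins A
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}

q_single = modularity(edges, singletons)
q_merged = modularity(edges, merged_ab)
q_split = modularity(edges, two_groups)
```

Moving B into A's community raises Q, which is the kind of positive gain the greedy phase looks for, and the two-triangle partition scores highest of the three. The networkx toolkit named in the patent ships a Louvain implementation (`louvain_communities`) that could replace a hand-written loop.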

In a further technical solution, Step3 is as follows: for any subgraph, use Cypher to retrieve, for each company in the subgraph in turn, the number of associated violation event entities stored in the knowledge graph, and determine the risk value carried by each company node with the risk value evaluation formula. These values form a vector representing corporate violation risk, which is normalized to give the initial corporate risk propagation vector PR₀.

In a further technical solution, Step4 is as follows: three modifications are made to the PageRank graph propagation algorithm to fit it to the task of propagating corporate legal risk. First, the propagation range is limited to the related-party clusters obtained by the Louvain community detection algorithm, which improves convergence efficiency for large-scale graph propagation. Second, the random propagation mechanism is replaced by propagation biased according to the joint-liability probability of corporate violations. Third, the initial risk value of each node is evaluated from the company's number of violations in the current year, and every new round of iterative walking starts from violating companies.
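The three modifications can be sketched as a personalized-PageRank-style iteration run inside a single community subgraph. The update rule in the docstring, the damping value 0.85, the use of the raw violation count as the risk function, and the convention that entry [i][j] carries risk from company j to company i are all assumptions; the patent's own propagation formula may differ.

```python
def column_normalize(adjacency):
    """Scale each column of a square matrix to sum to 1 (zero columns stay zero)."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    return [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)] for i in range(n)]


def initial_risk_vector(violation_counts):
    """Normalize per-company violation counts into PR_0 (assumed risk function: raw count)."""
    total = sum(violation_counts)
    if total == 0:
        raise ValueError("at least one company must have a recorded violation")
    return [c / total for c in violation_counts]


def legal_rank(adjacency, violation_counts, damping=0.85, tol=1e-10, max_iter=1000):
    """Iterate risk scores within one community subgraph until they stabilize.

    Assumed update rule (not taken from the patent):
        PR_{k+1}(n_i) = (1 - damping) * Risk_i + damping * sum_j W[i][j] * PR_k(n_j)
    The (1 - damping) term restarts every walk at violating companies.
    """
    w = column_normalize(adjacency)
    risk = initial_risk_vector(violation_counts)
    pr = risk[:]
    n = len(pr)
    for _ in range(max_iter):
        nxt = [(1.0 - damping) * risk[i]
               + damping * sum(w[i][j] * pr[j] for j in range(n))
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pr)) < tol:
            return nxt
        pr = nxt
    return pr


# Toy community of three companies; entry [i][j] is the transfer probability from j to i.
adjacency = [[0.0, 0.2, 0.0],
             [0.6, 0.0, 0.3],
             [0.1, 0.4, 0.0]]
scores = legal_rank(adjacency, violation_counts=[3, 0, 0])
```

In this toy run only the first company has recorded violations, yet the other two end up with nonzero scores because risk flows along the weighted edges. Restricting the matrix to one Louvain community keeps each iteration small, which is the convergence benefit the first modification describes.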

In the above solution, Step 8 proceeds as follows:

Step1: Hyperedge construction: based on the knowledge graph of listed-company related parties and violation information, place the companies that share the same shareholder, auditor, investor, or senior executive into the same subset; the nodes of each subset form one hyperedge e_i;

Step2: Hypergraph construction: from the hyperedges e_i, the company-type vertices v_com of the knowledge graph, and the hyperedge weights ω_i, which are obtained through model training, construct the listed-company related-party hypergraph H_com with four kinds of hyperedge, and build the diagonal matrix W from the initial values of the hyperedge weights ω_i;

Step3: Compute the node degrees and edge degrees of the hypergraph, and generate the node degree diagonal matrix D_v and the edge degree diagonal matrix D_e;

Step4: Design the hypergraph neural network and use it to aggregate and update the features of neighboring nodes, finally obtaining the related-party cluster feature Vec_relate. In the formula that yields Vec_relate, l is the layer index of the network, the node feature matrix gives the feature vector of each node in the hypergraph after the l-th iteration, D_v and D_e are the node degree and edge degree diagonal matrices from Step3, W is the diagonal matrix of hyperedge weights, a hyperparameter that must be obtained through model training, Θ^l is the dimension transformation matrix of layer l, and h_com is the unweighted adjacency matrix of each hypergraph.
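The symbols listed for this step match the widely used hypergraph convolution of hypergraph neural networks (HGNN), so the layer update presumably takes the form below. This is a reconstruction under that assumption rather than the patent's own typeset formula; X^{(l)} (the node feature matrix at layer l) and σ (a nonlinear activation) are names introduced here.

```latex
X^{(l+1)} = \sigma\!\left( D_v^{-1/2}\, h_{com}\, W\, D_e^{-1}\, h_{com}^{\top}\, D_v^{-1/2}\, X^{(l)}\, \Theta^{l} \right)
```

Under this reading, Vec_relate would be the rows of X at the final layer: each company's features are first gathered into the hyperedges it belongs to (h_com^T), scaled by hyperedge weight and size (W D_e^{-1}), and then scattered back to the member companies (h_com).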

In a further technical solution, Step3 is as follows: first generate the adjacency matrix h_i(v_i, e_i) of each hypergraph; then compute the node degree d(v_com1) as the weighted sum over the hyperedges e_i connected to each node, using the hyperedge weights ω_i; sum the number of nodes connected by each hyperedge to obtain its edge degree δ(e_i); and place the node degrees and edge degrees on the diagonals of matrices to obtain the node degree diagonal matrix D_v and the edge degree diagonal matrix D_e, whose off-diagonal elements are all 0.
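The degree calculation above can be sketched in plain Python. The company names, hyperedge memberships, and initial weights are hypothetical, and the helper names are introduced only for this example.

```python
def incidence_matrix(nodes, hyperedges):
    """Build h[v][e] = 1 if node v belongs to hyperedge e, else 0."""
    return [[1 if node in edge else 0 for edge in hyperedges] for node in nodes]


def node_degrees(h, edge_weights):
    """d(v) = sum over hyperedges e of w_e * h[v][e]."""
    return [sum(w * x for w, x in zip(edge_weights, row)) for row in h]


def edge_degrees(h):
    """delta(e) = number of nodes contained in hyperedge e."""
    return [sum(row[e] for row in h) for e in range(len(h[0]))]


def diagonal(values):
    """Place a list of values on the diagonal of an otherwise zero matrix."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical companies and shared-party hyperedges.
companies = ["c1", "c2", "c3", "c4"]
hyperedges = [
    {"c1", "c2", "c3"},  # same shareholder
    {"c2", "c4"},        # same auditor
    {"c3", "c4"},        # same investor
]
weights = [1.0, 0.5, 2.0]  # initial hyperedge weights, later adjusted in training

h = incidence_matrix(companies, hyperedges)
D_v = diagonal(node_degrees(h, weights))
D_e = diagonal(edge_degrees(h))
```

Here company c3 gets node degree 3.0 because it sits in the shareholder hyperedge (weight 1.0) and the investor hyperedge (weight 2.0), while every hyperedge degree is just its member count.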

In the above solution, Step 9 proceeds as follows:

Step1: Vectorize the company's own operating information and its violation risk features, completing the first stage of feature combination;

Specifically, the operating information features and violation risk features together form a 44-dimensional set of attribute fields, which is passed through one fully connected layer to obtain the mixed feature of the company's own operating information and risk, denoted Vec_mix;

Step2: Apply the LeakyReLU activation function to Vec_relate as a nonlinear feature transformation, and design an MLP (multi-layer perceptron) with two hidden layers and ReLU activation so that the feature dimension of Vec_mix matches that of Vec_relate. Combine Vec_mix with the related-party cluster feature Vec_relate obtained from the Hyper-GNN hypergraph neural network to give the final feature representation Vec_com of each company node. The combination formula is:

Vec_com = ξ * LeakyReLU(Vec_relate) + (1 - ξ) * MLP(Vec_mix)

where ξ is a hyperparameter obtained during model training;

Step3: Design the neural network classification layer of the Legal-GNN model. Build the forward-propagation unit from the linear transformations between layers of the multi-layer network and the nonlinear transformations of the activation functions, design the loss function for violation prediction, and complete the back-propagation module. After iterative updates of the neurons in every layer, the network outputs the predicted violation tendency of each listed company for the following year;

The loss function is the cross-entropy loss, which is minimized. It compares the violation category that the model predicts for each company for the following year with y_com, the company's true violation category for that year, over Y, the company entries in the training set that carry violation labels.
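The feature combination formula and a cross-entropy loss can be sketched in plain Python. The layer weights, the LeakyReLU slope of 0.01, the toy vector sizes, and the binary (violation / no violation) form of the loss are assumptions for illustration; the patent does not spell out these details in the surviving text.

```python
import math


def leaky_relu(vec, slope=0.01):
    """Element-wise LeakyReLU; the negative slope 0.01 is an assumed default."""
    return [x if x > 0 else slope * x for x in vec]


def relu(vec):
    return [max(0.0, x) for x in vec]


def linear(vec, weights, bias):
    """Dense layer: `weights` is a list of rows, one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def mlp(vec, layers):
    """Apply dense layers with ReLU after every layer except the last."""
    for idx, (weights, bias) in enumerate(layers):
        vec = linear(vec, weights, bias)
        if idx < len(layers) - 1:
            vec = relu(vec)
    return vec


def fuse(vec_relate, vec_mix, layers, xi):
    """Vec_com = xi * LeakyReLU(Vec_relate) + (1 - xi) * MLP(Vec_mix)."""
    a = leaky_relu(vec_relate)
    b = mlp(vec_mix, layers)
    if len(a) != len(b):
        raise ValueError("the MLP output must match the dimension of Vec_relate")
    return [xi * p + (1.0 - xi) * q for p, q in zip(a, b)]


def cross_entropy(predicted, labels):
    """Mean binary cross-entropy over labeled companies (1 = violation next year)."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(predicted, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Two hidden layers plus an output layer, mapping a 3-d Vec_mix to a 2-d vector.
layers = [
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[2.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]
vec_com = fuse(vec_relate=[1.0, -2.0], vec_mix=[0.5, 1.0, -1.0], layers=layers, xi=0.5)
loss = cross_entropy(predicted=[0.9, 0.1], labels=[1, 0])
```

The MLP exists only to bring Vec_mix to the same dimension as Vec_relate so that the two can be blended element by element; ξ then sets how much weight the related-party cluster features receive relative to the company's own features.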

Through the above technical solution, the knowledge-graph-based method for predicting the violation risk of listed companies provided by the present invention has the following beneficial effects:

(1) Existing knowledge-graph-based methods for predicting corporate legal risk mine only the risk transmission paths between companies and ignore the risk assimilation effect of clusters of related parties around bad actors. To address this, the invention builds the Hyper-GNN hypergraph representation learning model. The hypergraph neural network establishes four kinds of hyperedge over the listed-company nodes of the knowledge graph: shared shareholder, shared auditor, shared senior executive, and shared investor. This overcomes the limitation that traditional graph representation learning and graph propagation algorithms reflect only single-node risk or single-link risk between two nodes in the company network. It represents the clustering of companies more faithfully, brings the risk assimilation factors of bad-actor clusters into violation prediction, and improves prediction accuracy for clustered corporate violations.

(2) Existing risk propagation algorithms in corporate risk control quantify the degree of risk transmission by default with the number of transactions, the loan amount, or self-assigned weights, which makes the results subjective. To address this, the invention designs the LegalRank method for propagating corporate violation risk. It improves the propagation mechanism of the graph propagation algorithm, delimits the range of risk diffusion with a community detection algorithm, introduces the closeness of interest ties and a probabilistic mechanism into the graph walk algorithm, simulates the walk paths along which risk sources spread between companies, and evaluates each company's violation risk index from the convergence of the algorithm after many iterations. This avoids the extreme case in which weights based on values such as transaction amounts are so high or so low that the transfer probability approaches 0 or 1, and it brings the risk transfer probability closer to the actual probability of violation than a subjectively assigned value with no practical meaning.

(3) In existing techniques the ratio of legitimate to violating companies is imbalanced, heterogeneous graph representation learning is easily disturbed by the features of the many legitimate company nodes, and violating companies with many legitimate related parties and trading partners are hard to identify. To address this, the invention designs the Legal-GNN neural network, which fuses the company risk features produced by the violation risk propagation algorithm with the related-party cluster features produced by Hyper-GNN. This increases the difference between the features of sparse suspicious company nodes and those of the large population of legitimate company nodes, reduces the similarity of node embeddings, and improves the precision with which violating companies are identified.

Description of the Drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below.

FIG. 1 is a technical architecture diagram of the knowledge graph-based method for predicting illegal risks of listed companies disclosed in an embodiment of the present invention;

FIG. 2 is a partial schematic diagram of the knowledge graph of listed-company related parties and legal information;

FIG. 3 is a flow chart of the knowledge graph-based method for predicting illegal risks of listed companies disclosed in an embodiment of the present invention;

FIG. 4 is a flow chart of the quantile division and risk transfer probability calculation disclosed in an embodiment of the present invention;

FIG. 5 is a flow chart of the process for computing the enterprise risk propagation score LegalRiskScore disclosed in an embodiment of the present invention;

FIG. 6 is a flow chart of the vectorized representation of enterprise related-party cluster features based on the Hyper-GNN hypergraph neural network disclosed in an embodiment of the present invention;

FIG. 7 is a flow chart of building the next-year illegality prediction and classification model for listed companies disclosed in an embodiment of the present invention;

FIG. 8 is a schematic diagram of the neural network architecture of the next-year illegality prediction and classification model for listed companies disclosed in an embodiment of the present invention;

FIG. 9 shows the feature importance evaluation results for the models involved in listed-company illegality prediction disclosed in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.

The present invention provides a knowledge graph-based method for predicting illegal risks of listed companies. By mining the interest features of enterprise related parties, it strengthens the detection of high-risk enterprise clusters and supplements illegal signals that enterprise operating indicators rarely reflect. Combined with the designed graph propagation algorithm for enterprise illegal risk, it pinpoints the sparse suspicious enterprises within large-scale enterprise clusters and improves the model's accuracy in predicting the illegal tendency of listed companies. The technical architecture of the method is shown in FIG. 1, its flow chart in FIG. 3, and the overall neural network architecture of the illegality prediction model in FIG. 8. The method includes the following steps:

Step 1: Collect the operating indicators, illegal events, and related-party history of listed companies, obtain daily incremental information on enterprise violations through a crawler, clean the collected data, and convert all of it into structured data.

Specifically, the enterprise operating information to be collected comprises 43 economic indicators of domestic listed companies over the past five years, such as return on net assets, asset-liability ratio, shareholder shareholding ratio, current ratio, current asset turnover, and audit opinions.

The related-party data to be collected covers five types of closely related parties of listed companies: shareholders, auditors, executives, controllers, and investors.

The violation data to be collected covers the illegal events of listed companies over the past five years, with attributes such as the enterprises involved, the persons responsible, the time of the case, and the case type.

It should be added that the collected data is of two kinds, structured and unstructured. The structured data comes from the authoritative Guotaian CSMAR enterprise database. The unstructured data mainly concerns enterprise violations: a crawler obtains the PDF penalty documents on the legal risk developments of listed companies disclosed by the China Securities Regulatory Commission, the Shenzhen Stock Exchange, and the Shanghai Stock Exchange, and OCR converts them into text. Because penalty documents follow a fairly regular format, entity and attribute extraction templates suffice to extract the entities and attributes of enterprises and their related parties, as well as case entities and attributes such as case category.

Step 2: Design the schema layer of the knowledge graph of listed-company related parties and illegal information, convert the multi-dimensional structured data into knowledge graph triples, and store the resulting entity and relation data in a Neo4j graph database.

Specifically, the schema layer contains 8 entity types (listed company, shareholder, auditor, controller, chairman, general manager, executive, and violation event) and 6 relation types (investment, shareholding, audit, controlling interest, office-holding, and violation). In the knowledge graph, shareholding and controlling-interest relations take the share ratio as their relation attribute, and investment relations take the investment amount. The office-holding, audit, and violation relations carry no quantifiable value and take the category name as their relation attribute. The formatted data is then stored in the Neo4j graph database as triples, completing the construction of the knowledge graph. FIG. 2 is a partial schematic diagram of the knowledge graph; the first three English letters of each node category plus the node ID serve as the node identifier in the graph database. Table 1 lists the counts and identification rules of the entities and relations in the knowledge graph.

It should be added that the amounts involved in inter-enterprise interest associations change frequently and that the negative impact of an illegal case on associated enterprises is time-limited. The graph is therefore stored by time interval: the enterprise information and legal events of each of the past five years are stored, one year per period, in separate and independent graph database backups, so that the next-year illegality prediction model can be trained later. The data in Table 1 corresponds to the graph for 2021, and the graph for the latest year is completed by regularly crawling incremental data.

Table 1

Step 3: Based on the knowledge graph of listed-company related parties and illegal information, use the Cypher language to retrieve, for each listed-company node in turn, the company nodes within its second-order neighborhood that are directly or indirectly associated with it, and obtain the potential association categories between listed companies and the association strength values under their different units of measure.

Specifically, Cypher is a declarative graph query language that supports conditional queries over multi-hop node relationships in the Neo4j graph database.

An indirect association means that two enterprises have no direct interest association but are both associated with a third-party entity. There are five kinds: a shared auditor, shareholder, executive, investor, or controller.
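The second-order neighborhood retrieval of Step 3 can be sketched as a parameterized Cypher query built in Python. This is only an illustration: the node label `Company`, the property `id`, and the function name are hypothetical and do not come from the patent's schema.

```python
def second_order_neighbors_query(max_hops: int = 2) -> str:
    """Build a Cypher query for companies within `max_hops` of one company.

    `Company` and `id` are hypothetical schema names used for illustration.
    The variable-length pattern [*1..N] follows any relation type, so it
    covers both direct links and links through a shared third-party entity.
    """
    if max_hops < 1:
        raise ValueError("max_hops must be at least 1")
    return (
        "MATCH p = (a:Company {id: $company_id})"
        "-[*1.." + str(max_hops) + "]-(b:Company) "
        "WHERE a <> b "
        "RETURN DISTINCT b.id AS neighbor_id, "
        "[r IN relationships(p) | type(r)] AS relation_path"
    )


query = second_order_neighbors_query()
```

With the official Neo4j Python driver, a string like this would typically be passed to `session.run(query, company_id=...)`; the returned `relation_path` lists the relation types along each path, which is one way to recover the association category.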

Step 4: From the query results, quantify the closeness of the interest association between enterprises using a two-stage risk transfer probability calculation that combines quantile division with conditional probability estimation, and then evaluate the probability of illegal risk transfer between enterprises.

Specifically, the quantile division and risk transfer probability calculation is shown in FIG. 4 and includes the following steps:

Step1: First-order interest associations between enterprises comprise three relation types: shareholding, controlling interest, and investment. Shareholding and controlling interest share the same unit of measure, the share ratio, so the relations are grouped by unit into shareholding ratio and investment amount, denoted r_stock and r_amount. Cypher statements retrieve the first-order interest association information stored in the Neo4j graph database, and the retrieved relation attribute values are added to two lists, List_stock and List_amount, each sorted in ascending order of shareholding ratio or investment amount.

Step2: For the first-order interest associations, find the quintile positions of the two lists and use the values at those positions as the boundaries between closeness levels. The first-order closeness level of each association between listed companies is then assigned according to the level interval into which its relation attribute value falls.

Specifically, the boundaries between closeness levels, that is, the quintile positions, are given by:

$$Q_r = (N_r + 1)\,M, \qquad r \in \{r_{stock},\ r_{amount}\}, \quad M \in \{0.2,\ 0.4,\ 0.6,\ 0.8\}$$

Here, N_r is the total number of shareholding or investment associations between enterprises, and Q_r is the corresponding quintile position for first-order shareholding or investment associations.

There are 5 closeness levels; a higher level indicates a closer interest association.
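The quintile rule Q_r = (N_r + 1)M can be sketched as follows. The position Q_r is treated as a 1-based index into the ascending list; linear interpolation for fractional positions and inclusive upper boundaries are assumptions of this sketch, since the text does not specify either.

```python
from bisect import bisect_left


def quintile_cut_points(values):
    """Return the four boundary values at positions Q = (N + 1) * M."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one relation value is required")
    cuts = []
    for k in (1, 2, 3, 4):                     # M = k / 5, kept exact
        whole, rem = divmod((n + 1) * k, 5)    # 1-based position and remainder
        frac = rem / 5
        lower = min(max(whole, 1), n)          # clamp into the list
        upper = min(lower + 1, n)
        low_val, high_val = ordered[lower - 1], ordered[upper - 1]
        # Assumption: interpolate linearly between neighbouring values.
        cuts.append(low_val + frac * (high_val - low_val))
    return cuts


def closeness_level(value, cuts):
    """Map a relation value to a closeness level 1..5 (5 = closest)."""
    return bisect_left(cuts, value) + 1
```

For example, with nine shareholding ratios the positions are 2, 4, 6 and 8, so the boundaries are simply the 2nd, 4th, 6th and 8th smallest values.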

Step3: From the enterprise violation records of the past five years, compute on a yearly basis the probability that an enterprise violates the law in the following year, for each relation category and each closeness level, and use it as the first-order risk transfer probability P_lv1 for risk propagation between enterprises. The calculation formula is as follows:

Here, r is the interest relation category, s is the closeness level obtained for that category in Step2, Num_pair-rs is the total number of related enterprise pairs with that relation category and level, and Num_illegal is the number of pairs within Num_pair-rs in which one party violated the law in the current year and the other party violated the law in the following year.
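The Step3 formula itself is not reproduced in this text. Reading Num_illegal and Num_pair-rs as the numerator and denominator of a conditional frequency, one plausible form is P_lv1(r, s) = Num_illegal / Num_pair-rs. The sketch below computes that ratio under this assumption; the tuple layout and function name are hypothetical.

```python
from collections import defaultdict


def first_order_transfer_probability(pairs):
    """Estimate P_lv1 for every (relation, level) combination.

    `pairs` is an iterable of (relation, level, triggered) tuples, one per
    related enterprise pair and year, where `triggered` is True when one
    party violated the law in that year and the other in the following year.
    Assumption: P_lv1(r, s) = Num_illegal / Num_pair-rs.
    """
    totals = defaultdict(int)   # Num_pair-rs per (r, s)
    hits = defaultdict(int)     # Num_illegal per (r, s)
    for relation, level, triggered in pairs:
        totals[(relation, level)] += 1
        if triggered:
            hits[(relation, level)] += 1
    return {key: hits[key] / totals[key] for key in totals}
```

Because the estimate is an observed frequency, it stays within [0, 1] regardless of the magnitude of the underlying share ratio or investment amount.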

In this way, the probability of triggering the joint effect of illegal behavior is computed in turn for each association category and closeness level.

Step4: Compute the second-order transfer probability of the joint effect of illegal risk between enterprises, denoted P_lv2. It comprises the second-order risk probability for interest dealings and the second-order risk probability for shared related persons, denoted P_lv2-α and P_lv2-β.

Interest dealings comprise the shareholding, controlling-interest, and investment relations. A second-order interest association is split into its two first-order hops; each hop is passed to Step3 to compute its conditional probability, and the product of the two conditional probabilities P_step1 and P_step2 is taken as the risk transfer probability for second-order interest associations.

Shared related persons are of four kinds: chairman, general manager, executive, and auditor. The first three are merged into the office-holding association, denoted r_post, and the shared-auditor association is denoted r_audit. The quantile and probability calculations of Step2 and Step3 can be used to obtain the risk transfer probability for second-order shared-related-person associations.

Step5: There may be more than one association path between any enterprise A and enterprise B, and probabilities are not additive. A multivariate conditional probability calculation is therefore used to obtain the risk transfer probability P_(A,B) of the illegal joint effect between any two listed companies:

$$P_{(A,B)} = 1 - \prod_{d(A,B)} \left(1 - P_d\right), \qquad P_d \in \{P_{lv1},\ P_{lv2\text{-}\alpha},\ P_{lv2\text{-}\beta}\}$$

It should be added that two enterprises may share several related parties of the same category, so a given probability category may appear more than once among the P_d terms.
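The Step5 combination P_(A,B) = 1 - Π(1 - P_d) is straightforward to express in code. The function name is illustrative; the input is simply the list of per-path probabilities between one pair of enterprises, with repeats allowed.

```python
from math import prod


def combined_transfer_probability(path_probabilities):
    """Combine the per-path probabilities P_d between two enterprises.

    Implements P_(A,B) = 1 - prod(1 - P_d): the probability that at least
    one association path transmits the risk, treating paths as independent.
    """
    return 1.0 - prod(1.0 - p for p in path_probabilities)
```

Two paths of probability 0.5 each give 0.75 rather than 1.0, which shows why the path probabilities are multiplied through their complements instead of being summed.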

Step 5: With the illegal risk transfer probabilities between enterprises as edge weights and the listed companies as nodes, build a large-scale simulated enterprise cluster with the networkx toolkit, converting the initial knowledge graph into the enterprise illegal risk propagation network G_risk.

Specifically, risk propagation is bidirectional: if a risk propagation path exists from enterprise A to enterprise B with propagation probability P_(A,B), the reverse path d(B,A) is added as well, with propagation probability P_(B,A) equal to P_(A,B).

It should be added that multiplying the two-stage risk transfer probabilities produces low-probability events. To keep many such events from making the simulated risk propagation redundant, transfer paths with probability below 0.01 are removed from the network, and isolated listed-company nodes with no related parties are removed as well so that they take no part in the simulated propagation.
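A minimal sketch of the network construction in Step 5, using networkx as the text specifies. An undirected graph is used here to model the symmetric probabilities P_(A,B) = P_(B,A); that choice, the function name, and the input layout are assumptions of the sketch.

```python
import networkx as nx


def build_risk_network(transfer_probabilities, companies, min_probability=0.01):
    """Build G_risk from pairwise risk transfer probabilities.

    `transfer_probabilities` maps (company_a, company_b) to P_(A,B).
    Edges below `min_probability` are dropped, then isolated nodes are removed.
    """
    graph = nx.Graph()  # undirected: each edge stands for both directions
    graph.add_nodes_from(companies)
    for (a, b), probability in transfer_probabilities.items():
        if probability >= min_probability:
            graph.add_edge(a, b, weight=probability)
    # Companies left without any edge take no part in the propagation.
    graph.remove_nodes_from(list(nx.isolates(graph)))
    return graph
```

Pruning before removing isolates matters: a company whose only links fall below the 0.01 threshold becomes isolated and is dropped too.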

Step 6: Use the Louvain algorithm to divide the enterprise illegal risk propagation network G_risk into enterprise risk propagation subgraphs G_sub, and design the LegalRank graph transfer algorithm to simulate the graph walk paths along which bad risk sources diffuse within each subgraph G_sub. The convergence trend of the graph transfer algorithm after deep iteration is used to derive the illegal risk index of each listed company.

Specifically, the process by which the LegalRank graph transfer algorithm computes the enterprise risk propagation score LegalRiskScore is shown in FIG. 5 and includes the following steps:

Step1: Use the Louvain algorithm, with modularity as the optimization objective, to divide the enterprise illegal risk propagation network G_risk into enterprise risk propagation subgraphs G_sub of varying size.

Specifically, each vertex of G_risk starts as its own community, so the number of communities equals the number of nodes. Each enterprise node is tentatively merged with each of its neighboring enterprise nodes in turn, the modularity gain Q_part is computed, and the node is greedily assigned to the community giving the largest gain. This is repeated until no enterprise node changes community, which completes one round. All nodes of each community are then compressed into a single new node, whose edge weight is the sum of the edge weights of all nodes in the original community, and a new round of modularity calculation and node merging is carried out. The process stops when the modularity of the communities is essentially constant, which completes the division into subgraphs G_sub.

Specifically, the modularity evaluation formula of the Louvain algorithm is as follows:

Here, Q_part is the community modularity, ∑in is the sum of the weights of all edges inside community c, ∑out is the sum of the weights of the edges connected to nodes in community c, and m is the total number of edges in the enterprise illegal risk propagation network G_risk.
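The community division can be sketched with the Louvain implementation that ships with networkx (`louvain_communities`, available in networkx 2.8 and later). Using this library routine rather than a hand-written implementation is an assumption of the sketch, as are the function name and the fixed seed.

```python
import networkx as nx


def split_into_risk_subgraphs(risk_graph, seed=0):
    """Partition G_risk into subgraphs G_sub by weighted Louvain modularity.

    Edge weights are read from the `weight` attribute, i.e. the risk
    transfer probabilities. A fixed seed keeps the partition reproducible.
    """
    communities = nx.community.louvain_communities(
        risk_graph, weight="weight", seed=seed
    )
    # Copy each induced subgraph so later steps can modify it independently.
    return [risk_graph.subgraph(nodes).copy() for nodes in communities]
```

Each returned subgraph can then be converted to an adjacency matrix on its own, which is what confines the later propagation to one related-party cluster at a time.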

Step2: Convert each subgraph G_sub in turn into adjacency matrix form.

Specifically, to ensure that the subsequent risk propagation converges, the elements of each column of the adjacency matrix are normalized, giving the risk propagation probability matrix W_ij. The number of risk propagation matrices equals the number of subgraphs G_sub.

Step3: From the number of violations each enterprise in a subgraph committed in the current year, design an evaluation function for the enterprise's degree of illegal risk, use it as the initial risk value the enterprise carries into risk propagation, and generate the initial risk vector PR₀ of the risk propagation algorithm.

Specifically, for each subgraph, Cypher is used to retrieve, for every enterprise in the subgraph in turn, the number of associated illegal event entities stored in the knowledge graph. The risk value carried by each enterprise node is determined by the risk evaluation formula, forming an h-dimensional vector that represents enterprise illegal risk, where h is the number of enterprises in the subgraph; there is one such vector per subgraph. The h-dimensional vector is then normalized to obtain the initial enterprise risk propagation vector PR₀ for the subsequent simulated propagation. The initial risk value Risk_com of an enterprise is evaluated as:

$$Risk_{com} = \ln(x + 1)$$

where x is the number of illegal events of the target enterprise in the current year.
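The initial risk vector can be sketched as below. The text says the vector is normalized but not how; dividing by the sum so that the entries add up to 1 is an assumption of this sketch, and the function name is illustrative.

```python
import math


def initial_risk_vector(violation_counts):
    """Compute PR_0 for one subgraph from current-year violation counts.

    Each enterprise gets Risk_com = ln(x + 1). Assumption: the vector is
    then normalized to sum to 1; an all-zero vector is returned unchanged.
    """
    raw = [math.log(x + 1) for x in violation_counts]
    total = sum(raw)
    if total == 0:
        return raw
    return [value / total for value in raw]
```

The logarithm dampens repeat offenders: an enterprise with 7 violations gets ln 8, three times the risk of one with a single violation (ln 2), not seven times.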

Step4: Simulate enterprise illegal risk propagation with the designed LegalRank graph propagation algorithm. Through deep iterative walks from the risk sources, obtain the illegal risk propagation score LegalRiskScore of each listed company once the algorithm has converged.

Specifically, the LegalRank risk propagation algorithm is designed on the basis of the PageRank graph propagation algorithm, with three improvements that fit it to the task of enterprise legal risk propagation. First, risk propagation is confined to the enterprise related-party clusters obtained by the Louvain community division algorithm, which improves convergence efficiency on large graphs. Second, the random propagation mechanism is replaced by propagation biased according to the probability of the joint effect of enterprise violations. Third, the initial risk value of each node is evaluated from the enterprise's number of violations in the current year, and every new round of iterative walking starts from an enterprise that has violated the law. The LegalRank risk propagation formula is as follows:

Specifically, PR denotes the node risk value at the current iteration, the damping coefficient is usually taken as 0.85 in graph propagation research, n_j is an enterprise node adjacent to enterprise node n_i, k is the iteration round, D(n_i) is the enterprise cluster subgraph G_sub to which n_i belongs, W_ij is the risk propagation probability matrix obtained in Step2, and Risk_comi is the initial risk value of enterprise node n_i in the initial risk vector PR₀.

It should be added that the LegalRank risk propagation algorithm may occasionally need an excessive number of iterations to converge. In testing, setting the maximum number of iterations k_max to 20 gave good results in both iteration efficiency and risk identification accuracy.
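The propagation formula is not reproduced in this text, so the sketch below assumes a personalized-PageRank-style update, PR(k+1) = d · W · PR(k) + (1 - d) · PR₀, which matches the three described properties (restriction to one subgraph, probability-weighted transitions, restarts at violating enterprises). It is an illustration under that assumption and not the patent's exact formula; all names are illustrative.

```python
def column_normalize(adjacency):
    """Normalize each column of a square matrix to sum to 1 (Step2)."""
    n = len(adjacency)
    sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    return [
        [adjacency[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def legal_rank(adjacency, pr0, damping=0.85, k_max=20, tol=1e-8):
    """Iterate risk scores on one subgraph for at most `k_max` rounds.

    Assumed update: pr <- damping * W @ pr + (1 - damping) * pr0, where W
    is the column-normalized risk transfer matrix and pr0 the normalized
    initial risk vector, so restarts land on enterprises with violations.
    """
    w = column_normalize(adjacency)
    n = len(pr0)
    pr = list(pr0)
    for _ in range(k_max):
        nxt = [
            damping * sum(w[i][j] * pr[j] for j in range(n))
            + (1 - damping) * pr0[i]
            for i in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(nxt, pr)) < tol
        pr = nxt
        if converged:
            break
    return pr
```

With a column-stochastic W and a PR₀ that sums to 1, the scores keep summing to 1 at every round, so they can be compared across enterprises within a subgraph.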

Step 7: Store the illegal risk indices of the listed companies in each subgraph, once they have reached a steady state, in the Neo4j graph database, adding them through Cypher statements as the LegalRiskScore property of each enterprise node.

Step 8: Construct the enterprise related-party hypergraph from the shareholders, auditors, and investors that enterprises have in common, then build the Hyper-GNN hypergraph neural network, which can represent the features of enterprise related-party clusters, and use it to produce a vectorized representation of those cluster features. The flow chart is shown in FIG. 6, and the process includes the following steps:

Step1: Hyperedge construction. Based on the knowledge graph of listed-company related parties and illegal information constructed in Step 3, enterprises that share the same shareholder, auditor, investor, or executive are placed in the same subset, and the nodes of each subset form one hyperedge e_i, written as:

$$e_i = \{\, key_{type} : \{v_{com1},\ v_{com2},\ v_{com3},\ \ldots,\ v_{comn}\} \,\}$$

Here, type is one of the four subset types (same shareholder, auditor, investor, or executive), key_type is the specific attribute of the subset, namely the name of the shareholder, auditor, investor, or executive, and v_com denotes an enterprise vertex joined by the hyperedge.

Specifically, hyperedges may overlap: a node in the intersection of two hyperedges belongs to both.
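Hyperedge construction amounts to grouping enterprises by the related party they share. The sketch below assumes the knowledge graph has already been flattened into (company, party_type, party_name) records; that record layout, the function name, and the choice to drop groups with a single enterprise are assumptions.

```python
from collections import defaultdict


def build_hyperedges(links):
    """Group enterprises into hyperedges keyed by (party_type, party_name).

    `links` is an iterable of (company, party_type, party_name) records.
    Assumption: a related party linked to only one enterprise expresses no
    association between enterprises, so such groups are dropped.
    """
    groups = defaultdict(set)
    for company, party_type, party_name in links:
        groups[(party_type, party_name)].add(company)
    return {key: members for key, members in groups.items() if len(members) >= 2}
```

An enterprise that shares an auditor with one group and a shareholder with another appears in both hyperedges, which is the overlap described above.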

Step2: Hypergraph construction. From the constructed hyperedges e_i, the enterprise vertices v_com in the knowledge graph, and the hyperedge weights ω_i, which are obtained through model training, construct in turn every listed-company related-party hypergraph H_com whose number of hyperedge types is greater than 1 and at most 4, and build the diagonal matrix W from the initial values of the hyperedge weights ω_i. The hypergraph H_com is written as:

$$H_{com} = \{\, (V_{com}, \varepsilon, W) \mid V_{com} = (v_{com1}, v_{com2}, \ldots, v_{comn}),\ \varepsilon = (e_1, e_2, \ldots, e_n),\ W = (\omega_1, \omega_2, \ldots, \omega_n) \,\}$$

It should be added that the diagonal hyperedge weight matrix W can be treated as a hyperparameter and updated in subsequent iterations. Isolated listed-company nodes take no part in hypergraph construction and are removed.

Step3: Compute the node degrees and edge degrees of the hypergraph, and generate the diagonal node degree matrix D_v and the diagonal edge degree matrix D_e.

Specifically, unlike in an ordinary graph, a hyperedge in a hypergraph can connect several nodes at once, so the node degree d(v_i) and the edge degree δ(e_i) are defined differently. First, the adjacency matrix h_i(v_i, e_i) of each hypergraph is generated. The node degree d(v_i) is then the weighted sum, over the hyperedges e_i connected to the node, of the hyperedge weights ω_i, and the edge degree δ(e_i) is the number of nodes connected by the hyperedge. The node degrees and edge degrees of each hypergraph are placed on the diagonals of D_v and D_e; all off-diagonal elements of both matrices are 0. The formulas are:

$$d(v_i) = \sum_{e_i \in \varepsilon} \omega(e_i)\, h_i(v_i, e_i)$$

$$\delta(e_i) = \sum_{v_i \in V} h_i(v_i, e_i)$$
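The two degree formulas can be checked on a small example. In this sketch the matrix h has one row per enterprise vertex and one column per hyperedge, with h[v][e] = 1 when vertex v belongs to hyperedge e; the function names are illustrative.

```python
def incidence_matrix(vertices, hyperedges):
    """Build h with h[v][e] = 1 if vertex v lies in hyperedge e, else 0."""
    return [[1 if v in members else 0 for members in hyperedges] for v in vertices]


def node_degrees(h, weights):
    """d(v) = sum over hyperedges of w(e) * h(v, e)."""
    return [sum(w * h_ve for w, h_ve in zip(weights, row)) for row in h]


def edge_degrees(h):
    """delta(e) = sum over vertices of h(v, e), the size of the hyperedge."""
    return [sum(row[j] for row in h) for j in range(len(h[0]))]
```

The diagonal matrices D_v and D_e are then just these two lists placed on the diagonal.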

Step4: Design the hypergraph neural network and use it to aggregate and update the features of neighboring nodes, finally obtaining the enterprise related-party cluster feature Vec_relate.

Specifically, the Hyper-GNN hypergraph neural network performs two stages of computation. In the first stage, the hyperedge features of each hypergraph are obtained from the features of its enterprise nodes by summing the feature vectors of the nodes connected by each hyperedge. In the second stage, the hyperedge features are aggregated to update the features of the enterprise nodes in each hypergraph. The formula by which the Hyper-GNN network obtains the enterprise related-party cluster feature Vec_relate is as follows:

Here, l is the network layer (iteration) index, the node feature matrix holds the feature vectors of the nodes in the hypergraph after the l-th iteration, D_v and D_e are the diagonal node degree and edge degree matrices obtained in Step3, W is the diagonal hyperedge weight matrix, a hyperparameter obtained through model training, Θ^l is the dimension transformation matrix of layer l, and h_com is the unweighted adjacency matrix of each hypergraph.

Note that when l is 0, the node feature matrix is the initial feature representation of each node in the hypergraph generated from the knowledge graph. The vector feature update of the Step4 hypergraph neural network module is executed iteratively, and the vector output after the last iteration is the final enterprise related-party cluster feature Vec_relate.
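Because the layer formula is not reproduced in the text, the following is only a sketch of one plausible form. It assumes the widely used hypergraph convolution X^(l+1) = σ(D_v^(-1/2) H W D_e^(-1) H^T D_v^(-1/2) X^(l) Θ^(l)), which uses the symbols defined above, and it assumes ReLU for σ. The function name `hgnn_layer` and all toy data are hypothetical.

```python
import numpy as np

def hgnn_layer(X, H, w, Theta):
    """One assumed hypergraph convolution layer.

    X: (n_nodes, d_in) node features; H: (n_nodes, n_edges) 0/1 matrix;
    w: (n_edges,) hyperedge weights; Theta: (d_in, d_out) transformation matrix.
    Assumes every node lies on a hyperedge and every hyperedge is non-empty.
    """
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    d_v = H @ w                      # node degrees
    d_e = H.sum(axis=0)              # edge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v))
    De_inv = np.diag(1.0 / d_e)
    # Stage 1: gather node features onto each hyperedge.
    edge_feats = De_inv @ H.T @ Dv_inv_sqrt @ X
    # Stage 2: send weighted hyperedge features back to the nodes.
    node_feats = Dv_inv_sqrt @ H @ np.diag(w) @ edge_feats
    return np.maximum(node_feats @ Theta, 0.0)  # ReLU (assumed activation)

# Toy example (assumed data): 4 enterprises, 2 hyperedges, 3 -> 2 feature dims.
rng = np.random.default_rng(0)
H = [[1, 0], [1, 1], [0, 1], [0, 1]]
X0 = rng.normal(size=(4, 3))
Theta0 = rng.normal(size=(3, 2))
X1 = hgnn_layer(X0, H, [1.0, 1.0], Theta0)
```

Stacking several such calls, each with its own Θ^l, corresponds to the iterative update described above; the output of the last call would play the role of Vec_relate.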

Step nine: first, the enterprise's own features are vectorized from the basic information of the listed enterprise and its illegal risk propagation score LegalRiskScore; they are then spliced with the enterprise related-party cluster feature vector obtained from the Hyper-GNN hypergraph neural network; the neural network loss function for enterprise violation prediction is then formulated, and the forward-propagation and back-propagation modules of the neural network layers are constructed, completing the Legal-GNN model for predicting enterprise violations in the following year.

Specifically, the construction process of the Legal-GNN next-year enterprise violation prediction model is shown in Figure 7 and includes the following steps:

Step1: Vectorize the enterprise's own operating information and its illegal risk feature, realizing the first stage of feature splicing.

Specifically, the enterprise's own operating information comprises 43 feature dimensions for domestic listed enterprises over the last five years, including return on net assets, asset-liability ratio, shareholder shareholding ratio, current ratio, current asset turnover rate and audit opinions. The LegalRiskScore attribute field of each enterprise node in the knowledge graph is retrieved in turn with Cypher statements to obtain the enterprise illegal risk propagation score, which serves as the enterprise illegal risk feature. The operating information features and the illegal risk feature together form a 44-dimensional attribute vector, which is passed through one fully connected layer to obtain the mixed feature of the enterprise's own operating information and risk, denoted Vec_mix.

Step2: Apply the LeakyReLU activation function to Vec_relate to complete a nonlinear feature transformation, and design an MLP (multi-layer perceptron) with two hidden layers and a ReLU activation layer so that the feature dimension of Vec_mix matches that of Vec_relate. A hyperparameter ξ, which can be tuned through model training, is introduced to splice the mixed feature Vec_mix of the enterprise's own operating and legal risk information with the enterprise related-party cluster feature Vec_relate obtained from the Hyper-GNN hypergraph neural network, giving the final feature representation Vec_com of each enterprise node. The feature splicing formula is as follows:

$$Vec_{com} = \xi \cdot \mathrm{LeakyReLU}(Vec_{relate}) + (1 - \xi) \cdot \mathrm{MLP}(Vec_{mix})$$
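The splicing of Step2, Vec_com = ξ·LeakyReLU(Vec_relate) + (1 − ξ)·MLP(Vec_mix) as stated in claim 9, can be sketched as follows. The feature sizes (8, 12, 16), the LeakyReLU slope, the exact MLP layout and the value ξ = 0.6 are all assumptions chosen for illustration; the patent does not give them.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # slope is an assumed value; the patent does not specify it
    return np.where(x > 0, x, slope * x)

def mlp(x, W1, b1, W2, b2):
    """Small perceptron that maps Vec_mix to the dimension of Vec_relate."""
    hidden = np.maximum(x @ W1 + b1, 0.0)  # ReLU activation
    return hidden @ W2 + b2

def fuse(vec_relate, vec_mix, xi, mlp_params):
    """Weighted combination of cluster features and own-risk features."""
    return xi * leaky_relu(vec_relate) + (1.0 - xi) * mlp(vec_mix, *mlp_params)

# Toy example with random stand-in features (assumed sizes).
rng = np.random.default_rng(0)
vec_relate = rng.normal(size=16)
vec_mix = rng.normal(size=8)
mlp_params = (rng.normal(size=(8, 12)), np.zeros(12),
              rng.normal(size=(12, 16)), np.zeros(16))
vec_com = fuse(vec_relate, vec_mix, xi=0.6, mlp_params=mlp_params)
```

With ξ close to 1 the final representation is dominated by the related-party cluster feature, and with ξ close to 0 by the enterprise's own operating and risk features.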

Step3: Design the neural network classification layer of the Legal-GNN next-year enterprise violation prediction model. The forward-propagation unit is constructed from the linear transformations between the layers of the multi-layer neural network and the nonlinear transformations of the activation functions; the loss function for enterprise violation prediction is designed and the back-propagation module of the neural network is completed; after iterative updating of the neurons in each layer, the predicted value of the listed enterprise's tendency to commit violations in the following year is output.

Specifically, the final enterprise violation category is output through the softmax activation function, expressed as follows:

[Formula not reproduced in the text.]

where the model output is the predicted next-year violation category of the enterprise, which is a binary classification output; W_x is a parameter obtained by training the prediction model; and b_x is the bias vector.
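Since the classification formula is not reproduced in the text, the sketch below assumes the common form softmax(Vec_com · W_x + b_x) with two output classes. Feeding Vec_com directly into the softmax layer is an assumption, and the toy weights are invented for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(vec_com, W_x, b_x):
    """Return class probabilities and the predicted class (0 or 1)."""
    probs = softmax(vec_com @ W_x + b_x)
    return probs, probs.argmax(axis=-1)

# Toy example (assumed data): two enterprises with 2-dimensional features.
vec_com_batch = np.array([[1.0, -1.0],
                          [-2.0, 0.5]])
W_x = np.eye(2)
b_x = np.zeros(2)
probs, labels = classify(vec_com_batch, W_x, b_x)
```

Each row of `probs` sums to 1, and `labels` holds the index of the larger probability, which would be read as "no violation" or "violation" in the following year depending on how the classes are encoded.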

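Specifically, the loss function is the cross-entropy loss, which is minimized during training and is expressed as follows:

[Formula not reproduced in the text.]

where the model output is the predicted next-year violation category of the enterprise, y_com is the true next-year violation category of the enterprise, and Y is the set of enterprise entries in the training set that carry violation labels.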
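As an illustration of the loss, the following sketch computes a binary cross-entropy over labeled enterprises. Averaging over the samples and clipping probabilities with `eps` are assumptions; the patent's exact formula is not reproduced in the text.

```python
import math

def cross_entropy(y_true, p_violation, eps=1e-12):
    """Mean binary cross-entropy.

    y_true: true next-year labels (1 = violation, 0 = no violation).
    p_violation: predicted probabilities of a violation.
    """
    total = 0.0
    for y, p in zip(y_true, p_violation):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# A confident correct prediction costs less than a confident wrong one.
good = cross_entropy([1], [0.9])
bad = cross_entropy([1], [0.1])
```

Minimizing this quantity pushes the predicted probability toward 1 for enterprises that did commit a violation in the following year and toward 0 for those that did not.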

Step ten: the constructed Legal-GNN next-year enterprise violation prediction model is trained iteratively with k-fold cross-validation and tested against baseline models; the trained model is used to predict the illegal risk of listed enterprises and its prediction accuracy is checked. In addition, feature analysis is used to evaluate how important the enterprise illegal risk propagation score LegalRiskScore and the enterprise related-party cluster feature obtained by the method are for the listed-enterprise violation prediction task.

Specifically, after several rounds of testing and validation of the enterprise violation prediction model, and subject to avoiding overfitting, the model performs best with the following parameters: a learning rate of 0.01, a weight decay rate of 0.0004, the AdamW optimizer, a batch size (batch_size) of 128 and 70 training epochs (epoch).
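The k-fold procedure of step ten and the reported training settings can be sketched as below. The value k = 5, the shuffling seed and the helper name `k_fold_indices` are assumptions, since the patent does not state k; the configuration values are those reported in the text.

```python
import random

# Training settings as reported in the text.
TRAIN_CONFIG = {
    "learning_rate": 0.01,
    "weight_decay": 0.0004,
    "optimizer": "AdamW",
    "batch_size": 128,
    "epochs": 70,
}

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

# Toy example: 10 samples split into 5 folds (k = 5 is an assumed value).
splits = list(k_fold_indices(10, 5))
```

In each round the model would be trained on `train` with `TRAIN_CONFIG` and evaluated on `validation`, so that every sample is used for validation exactly once.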

Specifically, three machine learning models that are mainstream for prediction in financial risk control are chosen as baselines: Random Forest (RF), LightGBM and XGBoost. Comparative testing against these baselines verifies that the model completes the listed-enterprise violation prediction task effectively and outperforms the baseline models. The experimental results are shown in Table 2.

Table 2

| Model | P | R | F1-score |
| --- | --- | --- | --- |
| RF | 73.91% | 70.53% | 72.18% |
| XGBoost | 79.69% | 78.02% | 78.85% |
| LightGBM | 80.29% | 77.14% | 78.68% |
| This method | 86.53% | 84.65% | 85.57% |
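Assuming P and R in Table 2 denote precision and recall, the F1-score column can be cross-checked with the standard definition F1 = 2PR / (P + R). The dictionary below simply restates the values reported in Table 2.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F1) in percent, taken from Table 2.
TABLE_2 = {
    "RF": (73.91, 70.53, 72.18),
    "XGBoost": (79.69, 78.02, 78.85),
    "LightGBM": (80.29, 77.14, 78.68),
    "This method": (86.53, 84.65, 85.57),
}

differences = {
    name: abs(f1_score(p, r) - reported)
    for name, (p, r, reported) in TABLE_2.items()
}

for name, diff in differences.items():
    print(f"{name}: |recomputed - reported| = {diff:.3f}")
```

Small differences are expected here because P and R are themselves rounded to two decimal places.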

Specifically, to test how important the enterprise illegal risk propagation score LegalRiskScore and the Hyper-GNN-based enterprise related-party cluster feature Vec_relate are for the violation prediction task, the feature vectors of each dimension in the best model obtained after iterative training are saved as a pkl file. The feature vectors in the pkl file are then imported into the XGBoost model described above for a feature analysis experiment, and the importance of each feature dimension is evaluated with the feature evaluation method of the XGBoost model.

Specifically, the XGBoost model performs classification with a gradient-boosted tree architecture, and the feature evaluation method counts the number of subtree splits in which each feature participates within that architecture; this evaluation method is likewise applicable to the feature importance analysis of the present method. The importance index of each feature is shown in Figure 9. For ease of presentation, the enterprise related-party cluster feature Vec_relate is denoted HyperVector. Among the 45 dimensions of comprehensive enterprise features, the enterprise illegal risk propagation score LegalRiskScore and HyperVector rank 1st and 7th in importance, respectively. In the enterprise illegal risk prediction task both are more important than the vast majority of economic indicator features, which further verifies the effectiveness of the features obtained by the method.
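The split-count evaluation described above can be illustrated without any library by counting, over a set of trees, how often each feature is used as a split. The nested-dictionary tree format, the feature names and the toy trees below are invented for illustration and are not the patent's data; in XGBoost itself this kind of count corresponds to the "weight" importance type.

```python
from collections import Counter

def count_splits(node, counts):
    """Recursively count the split feature of every internal node."""
    if "feature" not in node:  # leaf node
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)

def split_importance(trees):
    """Return (feature, split count) pairs, most frequently used first."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts.most_common()

# Two invented toy trees over generic features.
leaf = {"leaf": 0.0}
toy_trees = [
    {"feature": "feature_a",
     "left": {"feature": "feature_b", "left": leaf, "right": leaf},
     "right": {"feature": "feature_a", "left": leaf, "right": leaf}},
    {"feature": "feature_a", "left": leaf,
     "right": {"feature": "feature_c", "left": leaf, "right": leaf}},
]
ranking = split_importance(toy_trees)
```

A feature that is chosen for many splits across the ensemble receives a high importance index, which is the sense in which LegalRiskScore and HyperVector are ranked in Figure 9.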

The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A knowledge-graph-based method for predicting the illegal risk of listed enterprises, characterized by comprising the following steps:

step one, acquiring historical records of the operating indicators, illegal events and related parties of listed enterprises, acquiring daily incremental information on enterprise illegal events through a crawler, cleaning the acquired information, and converting the acquired data into structured data;

step two, designing the schema layer of a knowledge graph of listed-enterprise related parties and illegal information, converting the multidimensional structured data into knowledge-graph triples, and storing the entity and relationship data of the constructed knowledge graph in a Neo4j graph database;

step three, based on the knowledge graph of listed-enterprise related parties and illegal information, using the Cypher language to search in turn, within the second-order neighborhood of each listed-enterprise node, for the enterprise nodes that have a direct or indirect association with it, and acquiring the categories of potential association between listed enterprises and the association values under different dimensions;

step four, according to the retrieved query results, quantifying the closeness of the benefit association between enterprises with a two-stage risk transfer probability calculation method that combines quantile division with conditional probability evaluation, so as to evaluate the illegal risk transfer probability between enterprises;

step five, taking the illegal risk transfer probability between enterprises as the edge weights and the listed enterprises as the nodes, constructing a large-scale enterprise simulation cluster with the networkx toolkit, and converting the knowledge graph in its initial state into an enterprise illegal risk propagation network G_risk;

step six, dividing the enterprise illegal risk propagation network G_risk into enterprise risk propagation subgraphs G_sub based on the Louvain algorithm, designing a LeagalRank graph transfer algorithm, simulating the graph walk paths along which adverse risk sources diffuse within the enterprise risk propagation subgraphs G_sub, and evaluating the illegal risk by means of the convergence trend after deep iteration of the graph transfer algorithm;

step seven, storing the illegal risk indices of the listed enterprises in each subgraph, once they have stabilized, in the Neo4j graph database, and adding them through Cypher statements as node attributes that serve as the illegal risk propagation score (LegalRiskScore) attribute field of each enterprise node;

step eight, constructing an enterprise related-party hypergraph according to the stakeholders, auditors and investors that enterprises have in common, further constructing a Hyper-GNN hypergraph neural network capable of representing enterprise related-party cluster features, and completing the vectorized representation of the enterprise related-party cluster features based on the hypergraph neural network;

step nine, first completing the vectorized representation of the enterprise's own features according to the basic information of the listed enterprise and the enterprise illegal risk propagation score LegalRiskScore, then splicing it with the enterprise related-party cluster feature vector obtained from the Hyper-GNN hypergraph neural network, further formulating the neural network loss function for enterprise violation prediction, and constructing the forward-propagation and back-propagation modules of the neural network layers, thereby completing the Legal-GNN next-year enterprise violation prediction model;

and step ten, iteratively training the constructed Legal-GNN next-year enterprise violation prediction model with k-fold cross-validation, and predicting the illegal risk of listed enterprises with the trained prediction model.

2. The knowledge-graph-based method for predicting the illegal risk of listed enterprises according to claim 1, wherein step four is specifically as follows:

Step1: retrieving the first-order benefit association information of enterprises stored in the Neo4j graph database with Cypher statements, adding the retrieved benefit-relationship attribute values to two lists, and sorting them in ascending order of the shareholding ratio or investment amount between enterprises;

Step2: computing the quintiles of the two first-order benefit association lists, and evaluating in turn the first-order benefit association closeness grade between listed enterprises according to the quintile interval into which the relationship attribute value of the target enterprise falls;

Step3: according to the enterprise violation records of the last five years, calculating year by year the next-year enterprise violation probability for each relationship category and each benefit association closeness grade, and taking it as the first-order risk transfer probability P_lv1 of risk propagation between enterprises, with the specific calculation formula:

$$P_{lv1\text{-}rs} = \frac{Num_{illegal}}{Num_{pair\text{-}rs}}, \quad r \in \{r_{stock}, r_{amount}\}, \quad s \in [1, 5]$$

where r denotes the benefit relationship category; s denotes the benefit closeness grade of that relationship category obtained in Step2; Num_pair-rs is the total number of related enterprise pairs with that benefit association and grade; and Num_illegal is the number of pairs among Num_pair-rs in which one party had an illegal event in a given year and the other party had an illegal event in the following year;
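Steps 2 and 3 can be sketched for a single relationship category as follows. The record layout `(attribute value, party A violated this year, party B violated next year)`, the toy records and the use of `statistics.quantiles` to obtain the quintile cut points are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import quantiles

def closeness_grades(values):
    """Assign each value a closeness grade from 1 to 5 by quintile."""
    cuts = quantiles(values, n=5)  # four cut points
    return [bisect_left(cuts, v) + 1 for v in values]

def first_order_transfer_prob(records):
    """Return {grade s: Num_illegal / Num_pair} for one relationship category."""
    grades = closeness_grades([value for value, _, _ in records])
    num_pair = defaultdict(int)
    num_illegal = defaultdict(int)
    for (_, a_violated, b_violated_next), s in zip(records, grades):
        num_pair[s] += 1
        if a_violated and b_violated_next:
            num_illegal[s] += 1
    return {s: num_illegal[s] / num_pair[s] for s in sorted(num_pair)}

# Invented toy records: shareholding ratios 1..10 with violation flags.
records = [
    (1, True, False), (2, False, False),
    (3, True, True), (4, False, False),
    (5, True, False), (6, True, True),
    (7, True, True), (8, True, True),
    (9, True, True), (10, False, True),
]
p_lv1 = first_order_transfer_prob(records)
```

The resulting dictionary holds one estimated probability per closeness grade; repeating the calculation for each relationship category r would give the full set of P_lv1-rs values.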

Step4: for the three kinds of benefit association between enterprises, namely shareholding, controlling and investment, decomposing each second-order benefit association and, returning to Step3, calculating the conditional probabilities of the two stages separately, thereby obtaining the risk transfer probability of the second-order benefit association category; for associations through common related-party persons, which comprise four types, namely chairman of the board, general manager, senior executive and auditor, merging the first three types into a single position-holding association, and obtaining the risk transfer probability for second-order common related-party persons with the quantile and probability calculation method of Step2 and Step3;

Step5: combining the first-order risk transfer probabilities of enterprises and using a multivariate conditional probability calculation method to obtain the risk transfer probability P_(A,B) of the illegal association effect between listed enterprises, with the specific calculation formula:

$$P_{(A,B)} = 1 - \prod_{d(A,B)} (1 - P_d), \quad P_d \in \{P_{lv1}, P_{lv2\text{-}\alpha}, P_{lv2\text{-}\beta}\}$$

where d(A,B) denotes an association path between the two enterprises, and P_d denotes the risk transfer probability corresponding to that path.
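The combination formula of Step5 can be sketched directly: risk fails to transfer from A to B only if it fails along every association path. The function name and the example probabilities are illustrative assumptions.

```python
from math import prod

def combined_transfer_prob(path_probs):
    """P(A,B) = 1 - product over paths d of (1 - P_d)."""
    return 1.0 - prod(1.0 - p for p in path_probs)

# Toy example: a first-order path and two second-order paths (assumed values).
p_ab = combined_transfer_prob([0.5, 0.5])
p_single = combined_transfer_prob([0.2])
p_none = combined_transfer_prob([])
```

Adding a further association path can only raise the combined probability, which matches the intuition that more closely tied enterprises transmit risk more readily.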

3. The knowledge-graph-based method for predicting the illegal risk of listed enterprises according to claim 1, wherein step six is specifically as follows:

Step1: based on the Louvain community division algorithm and taking modularity as the optimization target, dividing the enterprise illegal risk propagation network G_risk into enterprise risk propagation subgraphs G_sub of unequal size, which serve as enterprise risk transfer simulation clusters, wherein the model evaluation formula of the Louvain algorithm is as follows:

[Formula not reproduced in the text.]

where Q_part is the community modularity value, Σin is the sum of the weights of all edges inside community c, Σout is the sum of the weights of the edges connected to the nodes in community c, and m is the sum of the edges of the enterprise illegal risk propagation network G_risk;
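Because the evaluation formula is not reproduced in the text, the sketch below assumes the standard modularity for an undirected weighted graph, Q = Σ_c [ L_c / m − (d_c / 2m)² ], where L_c is the total weight of edges inside community c (each edge counted once), d_c is the sum of the weighted degrees of its nodes and m is the total edge weight. The edge list and community assignment are invented toy data.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an undirected weighted graph.

    edges: list of (u, v, weight); community_of: dict node -> community id.
    """
    m = sum(weight for _, _, weight in edges)
    internal = {}  # L_c: weight of edges with both ends in community c
    degree = {}    # d_c: summed weighted degree of nodes in community c
    for u, v, weight in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0.0) + weight
        degree[cv] = degree.get(cv, 0.0) + weight
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + weight
    return sum(internal.get(c, 0.0) / m - (degree[c] / (2 * m)) ** 2
               for c in degree)

# Toy graph: two triangles joined by a single edge.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "a" for n in range(6)}
q_split = modularity(edges, two_groups)
q_merged = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles gives a higher modularity than keeping everything in one community, which is the kind of gain the Louvain algorithm searches for.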

Step2: converting each subgraph G_sub in turn into adjacency-matrix form and normalizing each column of the adjacency matrix to obtain the risk propagation probability matrix W_ij;

Step3: according to the number of violations committed by each enterprise in each subgraph during the year, designing an enterprise illegal risk degree evaluation function to give the initial risk value carried in enterprise risk propagation, and generating the risk initial vector PR₀ of the enterprise risk propagation algorithm;

Step4: simulating enterprise illegal risk propagation with the designed LeagalRank graph propagation algorithm, and obtaining the illegal risk propagation score LegalRiskScore of each enterprise after the graph propagation algorithm has converged through deep iterative walks from the risk sources, wherein the specific formula of the LeagalRank risk simulation propagation is as follows:

[Formula not reproduced in the text.]

where PR is the node risk iteration value of the current round; the damping coefficient appears in the formula under a symbol that is not reproduced in the text; N_j is a quantity associated with enterprise node n_i whose definition is incomplete in the text; k is the iteration round; D(n_i) is the enterprise cluster subgraph G_sub to which enterprise node n_i belongs; W_ij is the risk propagation probability matrix obtained in Step2; and the remaining term is the initial risk value corresponding to enterprise node n_i in the risk initial vector PR₀.
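Since the propagation formula is not reproduced in the text, the sketch below assumes a personalized-PageRank-style update inside one subgraph, PR_(k+1) = d · W · PR_k + (1 − d) · PR₀, which matches the description of a column-normalized matrix W_ij, a damping coefficient and restarts from violating enterprises. The damping value 0.85, the tolerance and the toy data are assumptions.

```python
import numpy as np

def column_normalize(A):
    """Scale each column of the adjacency matrix to sum to 1."""
    A = np.asarray(A, dtype=float)
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # leave all-zero columns unchanged
    return A / col_sums

def legal_rank(A, pr0, damping=0.85, tol=1e-10, max_iter=1000):
    """Iterate the assumed update until the scores stop changing.

    A[i, j]: risk transfer weight from enterprise j to enterprise i.
    pr0: initial risk values; at least one entry must be positive.
    """
    W = column_normalize(A)
    pr0 = np.asarray(pr0, dtype=float)
    pr0 = pr0 / pr0.sum()
    pr = pr0.copy()
    for _ in range(max_iter):
        nxt = damping * (W @ pr) + (1.0 - damping) * pr0
        if np.abs(nxt - pr).sum() < tol:
            return nxt
        pr = nxt
    return pr

# Toy subgraph of 3 enterprises; only enterprise 0 has recorded violations.
A = [[0.0, 0.5, 0.0],
     [0.8, 0.0, 0.3],
     [0.0, 0.2, 0.0]]
scores = legal_rank(A, pr0=[1.0, 0.0, 0.0])
```

In this toy case enterprise 2, which is reached only indirectly through enterprise 1, ends with the lowest score; the converged values would play the role of LegalRiskScore.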

4. The knowledge-graph-based method for predicting the illegal risk of listed enterprises according to claim 3, wherein Step1 is specifically as follows: attempting in turn to merge each enterprise node with each of its adjacent enterprise nodes, calculating the modularity gain, selecting the community with the largest modularity gain, and repeating this operation until the communities to which all enterprise nodes belong no longer change, which constitutes one round; then compressing all nodes of each community into a new node, taking the edge weight of the compressed node as the sum of the edge weights of all nodes in the original community, and performing a new round of modularity calculation and node merging until the modularity value of each community is essentially constant, thereby completing the division into enterprise association subgraphs G_sub.

5. The knowledge-graph-based method for predicting the illegal risk of listed enterprises according to claim 3, wherein Step3 is specifically as follows: for any subgraph, using the Cypher language to retrieve in turn the number of associated illegal-event entities stored in the knowledge graph for every enterprise in the subgraph, determining the risk value carried by each enterprise node with a risk value evaluation formula, forming a vector that represents the enterprises' illegal risk, and normalizing it to obtain the initialized enterprise risk propagation vector PR₀.

6. The knowledge-graph-based method for predicting the illegal risk of listed enterprises according to claim 3, wherein Step4 is specifically as follows: on the basis of the PageRank graph propagation algorithm, three improvements are made to adapt it to the enterprise legal risk propagation task: the risk propagation range is limited to the enterprise related-party cluster obtained by the Louvain community division algorithm, which improves the convergence efficiency of large-scale graph propagation; the random propagation mechanism is replaced by propagation that follows the probability of the illegal association effect between enterprises; and the risk value initially carried by each node is evaluated according to the enterprise's number of violations in the year, with each new iterative walk starting from a violating enterprise.

7. The knowledge-graph-based method for predicting the illegal risk of listed enterprises according to claim 1, wherein step eight is specifically as follows:

Step1: hyperedge construction: based on the constructed knowledge graph of listed-enterprise related parties and illegal information, placing enterprises associated with the same stakeholders, auditors, investors or senior executives in the same subset, the nodes in each subset forming a hyperedge e_i;

Step2: hypergraph construction: from the constructed hyperedges e_i, the enterprise-type vertices v_com in the knowledge graph and the hyperedge weights ω_i to be obtained through model training, constructing the listed-enterprise related-party hypergraph H_com with 4 hyperedge categories, and constructing the diagonal matrix W from the initial values of the hyperedge weights ω_i;

Step3: computing the node degrees and edge degrees of the hypergraph and generating the node-degree diagonal matrix D_v and the edge-degree diagonal matrix D_e;

Step4: designing a hypergraph neural network and using it to aggregate and update the features of neighborhood nodes, finally obtaining the enterprise related-party cluster feature Vec_relate, wherein the formula by which the hypergraph neural network obtains Vec_relate is as follows:

[Formula not reproduced in the text.]

where l is the number of network iteration layers; the node feature matrix of layer l is the feature-vector representation of each node in the hypergraph after the l-th iteration; D_v and D_e are the node-degree and edge-degree diagonal matrices obtained in Step3; W is the hyperedge-weight diagonal matrix, whose entries are obtained through model training; Θ^l is the vector dimension transformation matrix of layer l; and h_com is the unweighted adjacency matrix of each hypergraph.

8. The knowledge-graph-based method for predicting the illegal risk of listed enterprises according to claim 7, wherein Step3 is specifically as follows: first generating the adjacency matrix h_i(v_i, e_i) of each hypergraph; then calculating the node degree d(v_com1) as a weighted sum over the hyperedges e_i connected to each node, using the hyperedge weights ω_i; summing the number of nodes connected by each hyperedge to obtain the edge degree δ(e_i); and placing the node degrees and edge degrees of each hypergraph into matrices and performing diagonalization to obtain the node-degree diagonal matrix D_v and the edge-degree diagonal matrix D_e, the off-diagonal elements of both diagonal matrices being 0.

9. The knowledge-graph-based method for predicting the illegal risk of listed enterprises according to claim 1, wherein step nine is specifically as follows:

Step1: vectorizing the enterprise's own operating information and its illegal risk feature, realizing the first stage of feature splicing;

specifically, the enterprise operating information features and the illegal risk feature together form a 44-dimensional attribute vector, which is passed through one fully connected layer to obtain the mixed feature of the enterprise's own operating information and risk, denoted Vec_mix;

Step2: applying the LeakyReLU activation function to Vec_relate to complete a nonlinear feature transformation, and further designing an MLP multi-layer perceptron neural network, comprising two hidden layers and a ReLU activation layer, so that the feature dimension of Vec_mix matches that of Vec_relate; splicing the mixed feature Vec_mix of the enterprise's own operating and legal risk information with the enterprise related-party cluster feature Vec_relate obtained from the Hyper-GNN hypergraph neural network to obtain the final feature representation Vec_com of each enterprise node, with the feature splicing formula:

$$Vec_{com} = \xi \cdot \mathrm{LeakyReLU}(Vec_{relate}) + (1 - \xi) \cdot \mathrm{MLP}(Vec_{mix})$$

where ξ is a hyperparameter obtained during model training;

Step3: designing the neural network classification layer of the Legal-GNN next-year enterprise violation prediction model, constructing the forward-propagation unit from the linear transformations between the layers of the multi-layer neural network and the nonlinear transformations of the activation functions, designing the loss function for enterprise violation prediction, completing the back-propagation module of the neural network, and, after iterative updating of the neurons in each layer, outputting the predicted value of the listed enterprise's tendency to commit violations in the following year;

the loss function is the cross-entropy loss, which is minimized during training and is expressed as follows:

[Formula not reproduced in the text.]

where the model output is the predicted next-year violation category of the enterprise, y_com is the true next-year violation category of the enterprise, and Y is the set of enterprise entries in the training set that carry violation labels.