CN118798633A


Info

Publication number: CN118798633A
Application number: CN202410733147.5A
Authority: CN
Inventors: 王裕祥; 代睿; 周宏林; 石致远; 王立闻
Original assignee: Dongfang Electric Group Research Institute of Science and Technology Co Ltd
Current assignee: Dongfang Electric Group Research Institute of Science and Technology Co Ltd
Priority date: 2024-06-07
Filing date: 2024-06-07
Publication date: 2024-10-18

Abstract

The invention discloses an industrial chain risk information monitoring method based on network data and artificial intelligence technology, which comprises the following steps: establishing an industrial chain knowledge graph of the energy equipment manufacturing industry; acquiring information of the latest industry, companies and main products, performing data cleaning and natural language processing, and marking related terms; analyzing and quantifying risk factors of an industrial chain and a supply chain, estimating the risk level of the industrial chain, estimating the risk condition of the industrial chain, finding a risk event and determining the risk level; finally, searching companies and products with potential relation with the risk event by utilizing the industrial chain knowledge graph of the energy equipment manufacturing industry; if the company and the product with potential relation are found, the risk event is pushed to related department personnel; if not, supplementing nodes and information corresponding to the potential relationship in the industrial chain knowledge graph of the energy equipment manufacturing industry. The method and the system can effectively monitor and acquire the associated information of the risk event, and are convenient to know and deal with in time.

Description

Translated from Chinese

基于网络数据和人工智能技术的产业链风险信息监测方法Industrial chain risk information monitoring method based on network data and artificial intelligence technology

技术领域Technical Field

本发明涉及产业链、供应链风险管理技术领域，具体涉及一种基于网络数据和人工智能技术的产业链风险信息监测方法。The present invention relates to the technical field of industrial chain and supply chain risk management, and in particular to an industrial chain risk information monitoring method based on network data and artificial intelligence technology.

背景技术Background Art

在当今竞争激烈的商业环境中，企业需要不断了解和监测其所在产业链中的风险信息。这些风险可能来自于供应链中的某个环节、市场需求的变化、法律法规的调整或技术创新的崛起等多种因素。这种风险信息监测对于企业来说至关重要，以便及时应对和降低潜在风险对业务运营的不利影响。In today's highly competitive business environment, companies need to constantly understand and monitor risk information in their industry chain. These risks may come from a certain link in the supply chain, changes in market demand, adjustments to laws and regulations, or the rise of technological innovation. This risk information monitoring is crucial for companies to respond to and reduce the adverse effects of potential risks on business operations in a timely manner.

产业链风险信息监测技术是一种整合了数据分析、人工智能、自然语言处理以及信息采集和处理的先进技术。通过大规模数据的收集、分析和处理，这种技术能够识别并预测可能对产业链造成影响的潜在风险。它依赖于大数据的采集和处理，通过各种渠道收集来自供应商、合作伙伴、市场趋势、法律法规变化等方面的数据，并利用先进的算法和分析工具对这些数据进行挖掘和分析。这样的数据处理能力有助于企业识别和评估潜在的风险因素，提前采取相应措施以减少风险带来的损失。Industrial chain risk information monitoring technology is an advanced technology that integrates data analysis, artificial intelligence, natural language processing, and information collection and processing. Through the collection, analysis, and processing of large-scale data, this technology can identify and predict potential risks that may affect the industrial chain. It relies on the collection and processing of big data, collects data from suppliers, partners, market trends, changes in laws and regulations, etc. through various channels, and uses advanced algorithms and analytical tools to mine and analyze these data. Such data processing capabilities help companies identify and evaluate potential risk factors and take corresponding measures in advance to reduce losses caused by risks.

与传统的通过人工搜索、调查或阅读文献来获取相关数据方法相比，产业链风险信息监测技术具有自动化数据采集、数据驱动客观分析、主动性反应模式、多样化数据来源、实时性和高效率等优点。Compared with the traditional method of obtaining relevant data through manual search, investigation or reading literature, industrial chain risk information monitoring technology has the advantages of automated data collection, data-driven objective analysis, proactive response mode, diversified data sources, real-time and high efficiency.

然而能源装备制造业在风险信息监测技术方面存在一些不足之处。尽管这个行业在生产设备和技术方面取得了显著进步，但仍面临着一些挑战。However, the energy equipment manufacturing industry has some deficiencies in risk information monitoring technology. Although the industry has made significant progress in production equipment and technology, it still faces some challenges.

首先，能源装备制造业往往依赖于传统的生产模式和方法，对新兴的风险信息监测技术采用较为保守。这可能导致企业在监测供应链风险、市场变化以及法规调整方面落后于其他行业，错失了利用先进技术提前预知和规避风险的机会。First, the energy equipment manufacturing industry often relies on traditional production models and methods, and is relatively conservative in adopting emerging risk information monitoring technologies. This may cause companies to lag behind other industries in monitoring supply chain risks, market changes, and regulatory adjustments, and miss the opportunity to use advanced technologies to predict and avoid risks in advance.

其次，这个行业通常拥有复杂的供应链结构，涉及多个环节和合作伙伴。然而，对供应链中各个环节风险的监测和管理可能存在局限，尤其是在信息收集、整合和实时性方面存在挑战。这可能使企业难以全面了解和应对潜在的风险。Secondly, this industry usually has a complex supply chain structure involving multiple links and partners. However, there may be limitations in monitoring and managing risks at each link in the supply chain, especially in terms of information collection, integration and real-time challenges. This may make it difficult for companies to fully understand and respond to potential risks.

另外，能源装备制造业通常在对新技术的应用上较为谨慎，这可能导致在风险信息监测技术方面的投资不足。缺乏对新技术的全面了解和投入可能会使企业错失提高竞争力和应对风险的机会。In addition, the energy equipment manufacturing industry is usually cautious in the application of new technologies, which may lead to insufficient investment in risk information monitoring technology. Lack of comprehensive understanding and investment in new technologies may cause companies to miss opportunities to improve competitiveness and cope with risks.

发明内容Summary of the invention

本发明为了解决能源装备制造业的风险预警问题，提出了基于网络数据和人工智能技术的产业链风险信息监测方法，本发明通过爬取网络数据，监控到影响产业链的风险信息，根据建立的可扩展的产业链知识图谱，通过对图谱中的关联信息进行处理，得到与风险事件关联的公司及产品信息，从而推送给相应管理人员，及时了解和应对相应风险。In order to solve the risk warning problem in the energy equipment manufacturing industry, the present invention proposes an industrial chain risk information monitoring method based on network data and artificial intelligence technology. The present invention monitors the risk information affecting the industrial chain by crawling network data, and obtains company and product information associated with risk events by processing the associated information in the graph according to the established scalable industrial chain knowledge graph, which is then pushed to the corresponding managers to timely understand and respond to the corresponding risks.

为了实现上述目的，本发明采用了如下技术方案：In order to achieve the above object, the present invention adopts the following technical solutions:

基于网络数据和人工智能技术的产业链风险信息监测方法，其监测步骤如下：The industrial chain risk information monitoring method based on network data and artificial intelligence technology has the following monitoring steps:

步骤1，在已有的先验知识基础上，建立能源装备制造业产业链知识图谱。Step 1: Based on the existing prior knowledge, establish a knowledge graph of the energy equipment manufacturing industry chain.

所述能源装备制造业产业链知识图谱包含实体的信息和相互关系。具体的，所述实体主要包括上市公司、行业类别和产品类型。构建能源装备制造业产业链知识图谱的过程中，对各类公司、行业和产品进行识别和收集，建立对应实体的基本信息，根据所述基本信息建立实体之间的联系，得到实体之间的关联和相互作用；按照实体的建立、不同实体之间的关系建立，逐步绘制整个图谱。The energy equipment manufacturing industry chain knowledge graph contains the information and relationships of entities. Specifically, the entities mainly include listed companies, industry categories and product types. In the process of constructing the energy equipment manufacturing industry chain knowledge graph, various companies, industries and products are identified and collected, basic information of corresponding entities is established, and connections between entities are established based on the basic information to obtain the associations and interactions between entities; according to the establishment of entities and the relationship between different entities, the entire graph is gradually drawn.

步骤2，采用爬虫程序，从第三方网站和/或微信小程序等获取最新的产业、公司和主营产品的信息，对获取的信息依次进行数据清理和自然语言处理。Step 2: Use a crawler program to obtain the latest information on industries, companies, and main products from third-party websites and/or WeChat applets, and perform data cleaning and natural language processing on the obtained information.

所述获取的信息为合法信息，包括公开的供应商信息、市场报告、经济数据、政策法规等。The information obtained is legal information, including public supplier information, market reports, economic data, policies and regulations, etc.

所述数据清理和自然语言处理的过程中，包括对产业、公司、产品和风险相关术语的识别和标签打标，以便进一步筛选和处理信息。The data cleaning and natural language processing process includes the identification and labeling of industry, company, product and risk related terms to further filter and process the information.

所述数据清理包括：第一步，进行数据清洗时，首先使用正则表达式、数据清洗库去除HTML标签，其次去除文本数据中异常字符或无效数据，如特殊符号、控制字符等。第二步，进行数据预处理时，首先对清洗后的文本数据进行去重、缺失值填充、数据格式转换的操作，用于确保数据的一致性和准确性；其次将文本数据进行归一化处理，如：进行小写转化，将文本全部转换为小写，以消除大小写差异；使用停用词表(如NLTK的停用词表)去除常见但无实际意义的词语。The data cleaning includes: the first step, when performing data cleaning, first use regular expressions and data cleaning libraries to remove HTML tags, and then remove abnormal characters or invalid data in text data, such as special symbols, control characters, etc. The second step, when performing data preprocessing, first perform deduplication, missing value filling, and data format conversion operations on the cleaned text data to ensure the consistency and accuracy of the data; secondly, normalize the text data, such as: perform lowercase conversion, convert all text to lowercase to eliminate uppercase and lowercase differences; use a stop word list (such as the stop word list of NLTK) to remove common but meaningless words.
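The two cleaning steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation: the names `clean_text`, `deduplicate`, and the tiny `STOP_WORDS` set are assumptions made for the example (the text suggests a full list such as NLTK's), and whitespace tokenization is only adequate for space-delimited languages.

```python
import html
import re

# Hypothetical, deliberately tiny stop-word list; a real list (e.g. NLTK's) is much larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

TAG_RE = re.compile(r"<[^>]+>")              # HTML tags
CONTROL_RE = re.compile(r"[\x00-\x1f\x7f]")  # ASCII control characters
SYMBOL_RE = re.compile(r"[^\w\s]")           # punctuation and other special symbols


def clean_text(raw: str) -> str:
    """Strip tags and abnormal characters, lowercase, and drop stop words."""
    text = TAG_RE.sub(" ", raw)        # step 1: remove HTML tags
    text = html.unescape(text)         # decode entities such as &amp;
    text = CONTROL_RE.sub(" ", text)   # step 1: remove control characters
    text = SYMBOL_RE.sub(" ", text)    # step 1: remove special symbols
    text = text.lower()                # step 2: lowercase normalization
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


def deduplicate(texts):
    """Drop empty strings and exact duplicates while keeping the original order."""
    seen = set()
    unique = []
    for t in texts:
        if t and t not in seen:
            seen.add(t)
            unique.append(t)
    return unique
```

For Chinese source text, the whitespace split would need to be replaced by a word segmenter, as the natural language processing step describes.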

对经过清理的数据进行自然语言处理，所述自然语言处理依次包括文本分词、语义分析和关键术语识别。所述文本分词是使用自然语言处理库(如NLTK、spacy)将文本拆分为单词或短语或句子，所述语义分析是利用机器学习模型进行语义理解和识别，所述关键术语识别是使用词袋模型、TF-IDF算法或深度学习模型识别特定领域的关键词和术语。The cleaned data is subjected to natural language processing, which includes text segmentation, semantic analysis, and key term identification in sequence. The text segmentation is to split the text into words, phrases, or sentences using a natural language processing library (such as NLTK, spacy), the semantic analysis is to use a machine learning model for semantic understanding and identification, and the key term identification is to use a bag-of-words model, TF-IDF algorithm, or deep learning model to identify keywords and terms in a specific field.

步骤3，对产业链、供应链的风险因素进行分析和量化，对步骤2获取的信息进行整理并建立文本数据结构，将所述文本数据结构存储到文档型数据库NoSQL中；筛选影响产业链的各种风险因素，确定风险必选指标，建立概率统计模型评估不同风险因素的概率和影响程度；根据所述概率统计模型，使用风险必选指标，对风险事件进行定量评估。Step 3: Analyze and quantify the risk factors of the industrial chain and supply chain, organize the information obtained in step 2 and establish a text data structure, and store the text data structure in the document-type database NoSQL; screen various risk factors that affect the industrial chain, determine the required risk indicators, and establish a probability statistical model to evaluate the probability and impact of different risk factors; according to the probability statistical model, use the required risk indicators to quantitatively evaluate risk events.
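As a rough illustration of the "text data structure" stored in a document database, the sketch below builds one JSON-style record per crawled item. The field names (`source`, `title`, `body`, `tags`, `published`) and the function `build_risk_document` are assumptions for this example; the patent does not specify a schema or a particular database product.

```python
import json
from datetime import date


def build_risk_document(source, title, body, tags, published):
    """Return a JSON-serializable record for one crawled and cleaned text item."""
    return {
        "source": source,                    # where the text was collected
        "title": title,
        "body": body,                        # cleaned text from step 2
        "tags": sorted(set(tags)),           # labeled industry/company/product/risk terms
        "published": published.isoformat(),  # ISO date string for easy querying
    }


doc = build_risk_document(
    source="example-site",                   # hypothetical source name
    title="Gearbox supply notice",
    body="gearbox supply shortage reported",
    tags=["gearbox", "shortage", "gearbox"],
    published=date(2024, 6, 7),
)
serialized = json.dumps(doc, ensure_ascii=False)
```

A record of this shape can be written to most document stores as-is, since it contains only strings and lists.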

鉴于订单驱动的制造产业链中风险因素的因果关系不明确、重要性难以区分，风险因素权重难以通过具体指标进行量化，将根据风险概率值的大小来确定风险等级。Given that the causal relationship of risk factors in the order-driven manufacturing industry chain is unclear and its importance is difficult to distinguish, the weight of risk factors is difficult to quantify through specific indicators. The risk level will be determined based on the size of the risk probability value.
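Since the risk level is derived from the size of the risk probability, the mapping can be sketched as a simple threshold lookup. The cut-off values 0.7 and 0.4 and the level names below are placeholders chosen for illustration; the patent does not state the thresholds.

```python
def risk_level(probability, thresholds=((0.7, "high"), (0.4, "medium"))):
    """Map a risk probability in [0, 1] to a level label.

    `thresholds` must be ordered from the highest cut-off to the lowest.
    The default cut-offs are illustrative assumptions only.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    for cutoff, label in thresholds:
        if probability >= cutoff:
            return label
    return "low"
```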

步骤4，利用建立的能源装备制造业产业链知识图谱寻找与风险事件存在潜在关系的公司和产品。Step 4: Use the established knowledge graph of the energy equipment manufacturing industry chain to find companies and products that have potential relationships with risk events.

如果找到存在潜在关系的公司和产品，则将风险事件及其影响的行业、公司和产品信息推送给相关部门人员，以便及时了解和应对可能出现的风险；If companies and products with potential relationships are found, the risk events and the information on the industries, companies and products they affect will be pushed to personnel in relevant departments so that they can understand and respond to possible risks in a timely manner;

如果未找到存在潜在关系的公司和产品，即能源装备制造业产业链知识图谱中缺乏与风险事件相关的信息，则在能源装备制造业产业链知识图谱中补充对应的节点和潜在关系的信息，以扩充该能源装备制造业产业链知识图谱的内容。If no companies and products with potential relationships are found, that is, the energy equipment manufacturing industry chain knowledge graph lacks information related to the risk event, then the corresponding nodes and potential relationship information are supplemented in the energy equipment manufacturing industry chain knowledge graph to expand the content of the energy equipment manufacturing industry chain knowledge graph.
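The two branches of step 4 (push when related nodes are found, otherwise extend the graph) can be sketched as below. The graph is modeled as a plain dictionary of neighbor sets purely for illustration; `handle_risk_event` and the returned dictionary layout are assumptions, and a real deployment would query the graph database instead.

```python
def handle_risk_event(graph, event_terms):
    """Look up the terms of a risk event in the knowledge graph.

    graph: dict mapping a node name to the set of its neighbor node names.
    event_terms: company/product names extracted from the risk event.
    """
    matched = sorted(t for t in event_terms if t in graph)
    if matched:
        # Branch 1: related nodes exist, so collect them for the push message.
        related = set()
        for term in matched:
            related |= graph[term]
        return {
            "action": "push",
            "matched": matched,
            "related": sorted(related - set(matched)),
        }
    # Branch 2: nothing found, so supplement the graph with new (unlinked) nodes.
    for term in event_terms:
        graph.setdefault(term, set())
    return {"action": "extend", "added": sorted(event_terms)}
```

In this sketch the newly added nodes carry no relationships yet; filling in their potential relationships would still require the extraction steps described earlier.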

作为上述技术方案的进一步描述：As a further description of the above technical solution:

步骤1中，知识图谱的构建是针对产业链中的主要实体，主要包括三大类：A股上市公司、行业类别和产品类型。实体之间的关系，涵盖上市公司与其所属行业之间的关联、行业之间的层级关系、产品生产所需的原材料来源、产品流向下游相关产品的关系等。总体上可分为六大类关系，包括上市公司的所属行业关系、行业之间的上下级关系、产品生产所需的原材料链路、产品销售后续相关产品链路、公司主要经营的产品范畴以及产品的详细分类。In step 1, the knowledge graph is constructed for the main entities in the industrial chain, which mainly include three categories: A-share listed companies, industry categories and product types. The relationship between entities covers the relationship between listed companies and their industries, the hierarchical relationship between industries, the source of raw materials required for product production, and the relationship between products and downstream related products. In general, it can be divided into six categories of relationships, including the industry relationship of listed companies, the superior-subordinate relationship between industries, the raw material link required for product production, the subsequent related product link of product sales, the main product scope of the company, and the detailed classification of products.
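The three entity categories and six relationship categories can be expressed as a small typed schema. The sketch below is an in-memory illustration only: the relation names (`BELONGS_TO_INDUSTRY` and so on), the `KnowledgeGraph` class, and the example node names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field

ENTITY_TYPES = {"company", "industry", "product"}

# relation name -> (head entity type, tail entity type); names are illustrative.
RELATION_SCHEMA = {
    "BELONGS_TO_INDUSTRY": ("company", "industry"),   # company and its industry
    "SUB_INDUSTRY_OF": ("industry", "industry"),      # industry hierarchy
    "UPSTREAM_MATERIAL": ("product", "product"),      # raw material link
    "DOWNSTREAM_PRODUCT": ("product", "product"),     # downstream product link
    "MAIN_PRODUCT": ("company", "product"),           # company's main products
    "SUBCLASS_OF": ("product", "product"),            # detailed product classification
}


@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)  # node name -> entity type
    edges: set = field(default_factory=set)    # (head, relation, tail) triples

    def add_node(self, name, entity_type):
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type}")
        self.nodes[name] = entity_type

    def add_edge(self, head, relation, tail):
        expected = RELATION_SCHEMA[relation]
        actual = (self.nodes[head], self.nodes[tail])
        if actual != expected:
            raise ValueError(f"{relation} expects {expected}, got {actual}")
        self.edges.add((head, relation, tail))

    def neighbours(self, name):
        """Return every node directly connected to `name`, in either direction."""
        out = set()
        for head, _, tail in self.edges:
            if head == name:
                out.add(tail)
            elif tail == name:
                out.add(head)
        return out
```

Validating each edge against the schema keeps the graph extensible while preventing, for example, an industry from being recorded as a raw material.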

步骤2中，构建知识图谱的过程中，对各类公司、行业和产品等重要实体进行识别和收集，建立对应实体的基本信息；在实体之间建立相互联系的桥梁，揭示它们之间的关联和相互作用。In step 2, in the process of building the knowledge graph, important entities such as various companies, industries, and products are identified and collected to establish basic information of the corresponding entities; bridges of connection are established between entities to reveal the connections and interactions between them.

知识图谱的构建过程涵盖了大量的数据收集、清洗、整合以及复杂的关联分析和图谱绘制工作，为产业链的深入研究提供了强有力的数据支持和可视化呈现。所述构建过程包括实体的建立、不同实体之间关系的建立以及整个图谱的绘制。The construction process of the knowledge graph covers a large amount of data collection, cleaning, integration, and complex association analysis and graph drawing, providing strong data support and visual presentation for in-depth research on the industrial chain. The construction process includes the establishment of entities, the establishment of relationships between different entities, and the drawing of the entire graph.

步骤3中，采用主成分分析法和Logistic回归方法构建概率统计模型。其中，对风险必选指标采用主成分分析方法进行处理，用于确定主成分的标准有两种：第一，以方差累积贡献率大于某一比例为标准；第二，以特征值大于1为标准，如果特征值小于1，说明该主成分的解释力度不如直接引入一个原变量的平均解释力度大，则用特征值大于1作为纳入标准。可以根据具体情况选择确定主成分的标准。In step 3, the principal component analysis method and the logistic regression method are used to construct a probability statistical model. Among them, the principal component analysis method is used to process the required risk indicators. There are two standards for determining the principal component: first, the cumulative variance contribution rate is greater than a certain proportion; second, the eigenvalue is greater than 1. If the eigenvalue is less than 1, it means that the explanatory power of the principal component is not as strong as the average explanatory power of directly introducing an original variable. In this case, the eigenvalue is greater than 1 as the inclusion standard. The standard for determining the principal component can be selected according to the specific situation.

进一步，通过Logistic回归方法建立模型时，引入样本企业的风险必选指标特征后，进行主成分分析的基础上得到降维数据，按照确定主成分数量的标准选用的主成分数据通过Logistic回归计算其在一段时间内的风险概率。Furthermore, when the model is established through the Logistic regression method, the characteristics of the required risk indicators of the sample enterprises are introduced, and the dimensionality reduction data is obtained based on the principal component analysis. The principal component data selected according to the standard for determining the number of principal components are used to calculate their risk probability over a period of time through Logistic regression.
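Once the principal component scores of an enterprise are available, a fitted Logistic regression turns them into a risk probability through the standard logistic function. The sketch below shows only that final scoring step; the coefficient values passed in are assumed to come from a previously fitted model, and the function name `logistic_probability` is an illustrative choice.

```python
import math


def logistic_probability(scores, coefficients, intercept):
    """Return P = 1 / (1 + exp(-(b0 + sum(b_k * F_k)))) for one enterprise.

    scores: principal component scores F_1..F_k of the enterprise.
    coefficients, intercept: parameters of an already fitted model (assumed given).
    """
    if len(scores) != len(coefficients):
        raise ValueError("scores and coefficients must have the same length")
    z = intercept + sum(b * f for b, f in zip(coefficients, scores))
    # Use the algebraically equivalent form for negative z to avoid overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```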

综上所述，本发明的有益效果是：In summary, the beneficial effects of the present invention are:

本发明通过建立对应产业链知识图谱，利用知识图谱结合获取的即时数据，可以有效及时地进行风险因素分析，预测相应的风险事件等级，从而使得关联人员得到风险的预判信息，以避免企业风险事件和避免重大错误决策。The present invention establishes a corresponding industrial chain knowledge graph, and utilizes the knowledge graph in combination with the acquired real-time data to effectively and timely analyze risk factors and predict the corresponding risk event levels, so that related personnel can obtain risk prediction information to avoid enterprise risk events and major wrong decisions.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为本发明的方法流程图。FIG. 1 is a flow chart of the method of the present invention.

图2为实施例中构建知识图谱的架构示意图。FIG. 2 is a schematic diagram of the architecture for constructing a knowledge graph in an embodiment.

图3为实施例中的信息获取与数据预处理模块流程图。FIG. 3 is a flow chart of the information acquisition and data preprocessing module in the embodiment.

图4为实施例中查询知识图谱找出的风险事件相关的公司和产品节点及其关联信息的示意图。FIG. 4 is a schematic diagram of the company and product nodes related to a risk event, and their associated information, found by querying the knowledge graph in an embodiment.

图5为实施例中推送风险的示意图。FIG. 5 is a schematic diagram of pushing risks in an embodiment.

具体实施方式DETAILED DESCRIPTION

如图1所示，本实施例提供了一种基于网络数据和人工智能技术的产业链风险信息监测方法，其监测步骤如下：As shown in FIG. 1, this embodiment provides a method for monitoring industrial chain risk information based on network data and artificial intelligence technology, and the monitoring steps are as follows:

步骤1，建立能源装备制造业产业链知识图谱。Step 1: Establish a knowledge graph of the energy equipment manufacturing industry chain.

步骤2，获取最新的产业、公司和产品信息，并进行数据清理和自然语言处理。Step 2: Obtain the latest industry, company, and product information, and perform data cleaning and natural language processing.

步骤3，对产业链、供应链的风险因素进行分析和量化，预估产业链的风险等级，对产业链的风险状况进行评估。Step 3: Analyze and quantify the risk factors of the industrial chain and supply chain, estimate the risk level of the industrial chain, and evaluate the risk status of the industrial chain.

步骤4，利用建立的能源装备制造业产业链知识图谱寻找与风险事件存在潜在关系的公司和产品；找到后，将风险事件及其影响的行业、公司和产品信息推送给相关部门人员，以便及时了解和应对可能出现的风险；如果未找到，则补充其节点和关系信息。Step 4: Use the established knowledge graph of the energy equipment manufacturing industry chain to search for companies and products that have potential relationships with risk events. Once found, push information about risk events and the industries, companies, and products they affect to personnel in relevant departments so that they can understand and respond to possible risks in a timely manner. If not found, supplement their node and relationship information.

以风电产业链为研究对象，典型的产业链结构。该产业链涵盖了上游、中游和下游多个环节，其中涉及大量公司，各自专注于不同环节的生产和服务。这些公司间存在错综复杂的关系，其主营产品相互交织，相互依存并形成了一个相互支撑的产业生态系统。以下为风电产业链风险信息监测方法案例实施。This embodiment takes the wind power industry chain, which has a typical industry chain structure, as its research object. The chain covers multiple upstream, midstream and downstream links and involves a large number of companies, each focusing on production and services in a different link. The relationships between these companies are intricate: their main products are intertwined and interdependent, forming a mutually supportive industrial ecosystem. The following is a case implementation of the risk information monitoring method for the wind power industry chain.

1.知识图谱构建1. Knowledge graph construction

知识图谱的构建是针对产业链中的主要实体，主要包括三大类：A股上市公司、行业类别和产品类型。这些实体之间的关系，涵盖上市公司与其所属行业之间的关联、行业之间的层级关系、产品生产所需的原材料来源、产品流向下游相关产品的关系等，如图2所示。总体上可分为六大类关系，包括上市公司的所属行业关系、行业之间的上下级关系、产品生产所需的原材料链路、产品销售后续相关产品链路、公司主要经营的产品范畴以及产品的详细分类。The construction of the knowledge graph is aimed at the main entities in the industrial chain, which mainly includes three categories: A-share listed companies, industry categories and product types. The relationship between these entities covers the relationship between listed companies and their industries, the hierarchical relationship between industries, the source of raw materials required for product production, the relationship between products and downstream related products, etc., as shown in Figure 2. In general, it can be divided into six categories of relationships, including the industry relationship of listed companies, the superior-subordinate relationship between industries, the raw material link required for product production, the subsequent related product link of product sales, the main product scope of the company and the detailed classification of products.

构建这一知识图谱的过程中，会进行多种数据处理和抽取技术，最终形成一个规模庞大的图谱，涵盖数十万个节点。这些节点所涉及的数据来源主要包括申万行业分类、深圳证券交易所、上海证券交易所等多个权威数据来源、以及公司官网。主要通过从结构化数据源(经过人工整理过后得到的知识表格)和非结构化数据源(如文本、网页)中收集数据，最终将汇总得到的专家知识表格导入neo4j图数据库，构建出风电产业链知识图谱。In the process of building this knowledge graph, a variety of data processing and extraction technologies will be used to eventually form a large-scale graph covering hundreds of thousands of nodes. The data sources involved in these nodes mainly include multiple authoritative data sources such as Shenwan Industry Classification, Shenzhen Stock Exchange, Shanghai Stock Exchange, and company official websites. Mainly by collecting data from structured data sources (knowledge tables obtained after manual sorting) and unstructured data sources (such as text, web pages), the expert knowledge tables obtained are finally imported into the neo4j graph database to build a wind power industry chain knowledge graph.

构建知识图谱的过程中，实体的建立是指，对各类公司、行业和产品等重要实体进行识别和收集，建立它们的基本信息；关系的建立指，在实体之间建立相互联系的桥梁，揭示它们之间的关联和相互作用；知识图谱的绘制则是整合实体和关系，将其呈现为一个可视化、具有结构性和连通性的整体。In the process of constructing a knowledge graph, the establishment of entities refers to the identification and collection of important entities such as various companies, industries and products, and the establishment of their basic information; the establishment of relationships refers to the establishment of bridges of connection between entities, revealing the connections and interactions between them; the drawing of a knowledge graph is to integrate entities and relationships, and present them as a visual, structured and connected whole.

知识图谱的构建过程涵盖了大量的数据收集、清洗、整合以及复杂的关联分析和图谱绘制工作，为产业链的深入研究提供了强有力的数据支持和可视化呈现。过程主要包括实体的建立、不同实体之间关系的建立以及整个图谱的绘制。The construction process of the knowledge graph covers a large amount of data collection, cleaning, integration, and complex correlation analysis and graph drawing, providing strong data support and visual presentation for in-depth research on the industrial chain. The process mainly includes the establishment of entities, the establishment of relationships between different entities, and the drawing of the entire graph.

2.信息获取和数据预处理2. Information acquisition and data preprocessing

信息获取与数据预处理流程图如图3所示：The information acquisition and data preprocessing flow chart is shown in Figure 3:

(1)数据爬取阶段：(1) Data crawling stage:

爬虫程序：使用Python等编程语言中的爬虫框架(如Scrapy、BeautifulSoup)或网络抓取工具，合法从第三方网站(如北极星风力发电网)和微信小程序(如智风电商)等获取文本信息数据。Crawler program: Use crawling and parsing libraries available in programming languages such as Python (for example Scrapy and BeautifulSoup), or other web scraping tools, to lawfully obtain text data from third-party websites (such as Polaris Wind Power Grid) and WeChat mini programs (such as Zhifeng E-commerce).

(2)数据清洗与预处理：(2) Data cleaning and preprocessing:

对爬取得到的文本数据进行清洗与预处理是确保数据质量和准确性的重要步骤。Cleaning and preprocessing the crawled text data is an important step to ensure data quality and accuracy.

数据清洗：使用正则表达式、数据清洗库(如Pandas)等工具去除HTML标签、异常字符或无效数据。Data cleaning: Use regular expressions, data cleaning libraries (such as Pandas), and other tools to remove HTML tags, abnormal characters, or invalid data.

数据预处理：包括去除重复数据(使用Pandas的drop_duplicates()方法去除重复数据)、缺失值填充(使用Pandas的fillna()方法对缺失值进行填充，可以根据具体情况选择均值、中位数、众数等进行填充)、数据格式转换等操作，确保数据的一致性和准确性。Data preprocessing: including removing duplicate data (using Pandas' drop_duplicates() method to remove duplicate data), filling missing values (using Pandas' fillna() method to fill missing values, you can choose mean, median, mode, etc. to fill according to the specific situation), data format conversion and other operations to ensure data consistency and accuracy.

(3)自然语言处理阶段：(3) Natural language processing stage:

对预处理后的文本数据进行识别和分析。Identify and analyze the preprocessed text data.

文本分词：使用自然语言处理库(如NLTK、jieba)将文本拆分为单词或短语。Text segmentation: Use natural language processing libraries (such as NLTK, Jieba) to split text into words or phrases.

语义分析：借助机器学习模型(如BERT、Word2Vec)进行语义理解，识别上下文、情感、语义关系等。Semantic analysis: Use machine learning models (such as BERT, Word2Vec) to perform semantic understanding and identify context, sentiment, semantic relationships, etc.

关键术语识别：本实施例使用词袋模型、TF-IDF算法来识别风电领域的关键词和术语。关键词的确定基于事先确定的列表，例如根据表1中的风电产品目录。在处理文本数据时，将每个文档表示为一个向量，其中每个元素对应一个词语或者词汇项，表示该词在文档中的出现频率、TF-IDF值等。通过这种方式，可以将文本数据转换为数值型数据，使得可以应用PCA等基于数值型数据的分析方法。Key term identification: This embodiment uses the bag-of-words model and the TF-IDF algorithm to identify keywords and terms in the field of wind power. The keywords are determined based on a predetermined list, such as the wind power product catalog in Table 1. When processing text data, each document is represented as a vector, in which each element corresponds to a word or vocabulary item, indicating the frequency of occurrence of the word in the document, the TF-IDF value, etc. In this way, the text data can be converted into numerical data, so that analysis methods based on numerical data such as PCA can be applied.
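The vector representation described above can be sketched as follows: each tokenized document becomes one TF-IDF vector over a fixed, predetermined keyword list. This is a minimal standard-library illustration, not the patent's code. The function name `tfidf_vectors`, the sample keywords, and the particular smoothed IDF variant `ln((1 + N) / (1 + df)) + 1` are assumptions; other TF-IDF weightings are equally valid.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs, vocabulary):
    """Return one TF-IDF vector per document, with one element per vocabulary term.

    tokenized_docs: list of token lists (output of the word segmentation step).
    vocabulary: fixed keyword list, e.g. drawn from a product catalog.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = {term: sum(1 for doc in tokenized_docs if term in doc) for term in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        row = []
        for term in vocabulary:
            tf = counts[term] / total
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed IDF
            row.append(tf * idf)
        vectors.append(row)
    return vectors


docs = [["blade", "tower", "blade"], ["tower", "gearbox"]]  # illustrative tokens
vectors = tfidf_vectors(docs, ["blade", "tower", "gearbox"])
```

Because every document is mapped onto the same keyword list, the resulting rows have equal length and can be stacked into a numeric matrix for later analysis.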

表1风电产品目录Table 1 Wind power product catalog

(4)信息处理与分类：(4) Information processing and classification:

术语标记：标记识别到的行业、公司、产品、风险相关术语。Terminology tagging: Tag identified industry, company, product, and risk-related terms.

信息筛选与分类：根据标记的关键词和术语对信息进行分类、归档或筛选，以便进一步处理或展示可视化结果。Information filtering and classification: Classify, archive or filter information based on tagged keywords and terms for further processing or display of visualization results.
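Term tagging and the subsequent filtering can be sketched as a dictionary lookup followed by a simple rule. The `LEXICON` entries, the function names, and the relevance rule (a risk term plus at least one industry, company, or product term) are all assumptions for illustration; substring matching is used for brevity and would also match inflected forms such as "blades".

```python
# Hypothetical term lexicon: term -> category.
LEXICON = {
    "wind power": "industry",
    "gearbox": "product",
    "blade": "product",
    "shortage": "risk",
    "price increase": "risk",
}


def tag_terms(text, lexicon=LEXICON):
    """Return {category: sorted list of lexicon terms found in the text}."""
    lowered = text.lower()
    tags = {}
    for term, category in lexicon.items():
        if term in lowered:
            tags.setdefault(category, []).append(term)
    return {category: sorted(terms) for category, terms in tags.items()}


def is_risk_relevant(tags):
    """Keep a text only if it mentions a risk term and something it could affect."""
    affected = {"industry", "company", "product"}
    return "risk" in tags and any(category in tags for category in affected)
```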

3.风险因素分析及量化3. Risk factor analysis and quantification

对于步骤2获取的文本数据，其中可能包括供应商信息、市场报告、经济数据、政策法规等进行整理并建立一个清晰的文本数据结构并存储到文档型数据库NoSQL中。结合所制定的风电行业风险指标，如表2所示，其中一级指标为风险源，二级指标为风电行业风险筛选必选指标，采用Logistic回归模型，对风险因素进行量化分析和评估。The text data obtained in step 2, which may include supplier information, market reports, economic data, and policies and regulations, is organized into a clear text data structure and stored in a NoSQL document database. The risk factors are then quantitatively analyzed and evaluated with a Logistic regression model, using the wind power industry risk indicators shown in Table 2, in which the first-level indicators are risk sources and the second-level indicators are the required indicators for wind power industry risk screening.

表2风电行业风险指标Table 2 Risk indicators of wind power industry

具体方案：Specific plan:

Logistic回归是一种概率型非线性回归模型，其研究的是观察结果Y与多个影响因素X之间的关系。作为一种多变量分析方法，在对企业是否面临供应链风险的二分类情况进行预测时，Logistic回归具有较高稳定性和准确性，其对变量和数据的分布情况、性质没有很高的要求。比如，Logistic回归并不需要数据服从正态分布。根据已确立的供应链风险评价指标体系，用主成分分析法和Logistic回归方法构建评价模型，评估与风险事件相关企业的风险等级。Logistic regression is a probabilistic nonlinear regression model that studies the relationship between the observed result Y and multiple influencing factors X. As a multivariate analysis method, Logistic regression has high stability and accuracy when predicting the binary classification of whether an enterprise faces supply chain risks. It does not have high requirements on the distribution and properties of variables and data. For example, Logistic regression does not require data to follow a normal distribution. Based on the established supply chain risk evaluation indicator system, the principal component analysis method and Logistic regression method are used to construct an evaluation model to evaluate the risk level of enterprises related to risk events.

3.1数据处理3.1 Data processing

本实施例所使用的样本数据均来自风电行业的企业。The sample data used in this embodiment are all from enterprises in the wind power industry.

用于供应链风险评估实证分析的指标共有13个(即表2中的二级指标)，如果不经处理直接进行Logistic回归，可能会把许多包含了企业供应链风险信息的指标排除在模型之外。为了避免这种情况的发生，最大限度地检验各种指标对企业供应链风险的解释作用，本实施例设计了三个指标等级，在进行指标之间的相关性分析剔除高度相关的指标后，再用主成分分析提取数量更少的几个新变量作为自变量。经过上述处理后，可以解决高维度指标之间的相关问题，避免多重共线性对模型产生的不利影响。There are 13 indicators used for empirical analysis of supply chain risk assessment (i.e., the secondary indicators in Table 2). If Logistic regression is performed directly without processing, many indicators containing enterprise supply chain risk information may be excluded from the model. In order to avoid this situation and to maximize the explanatory effect of various indicators on enterprise supply chain risks, this embodiment designs three indicator levels. After performing correlation analysis between indicators to eliminate highly correlated indicators, principal component analysis is used to extract a smaller number of new variables as independent variables. After the above processing, the correlation problem between high-dimensional indicators can be solved, avoiding the adverse effects of multicollinearity on the model.
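The correlation screening that precedes principal component analysis can be sketched as below: indicators are scanned in order, and an indicator is dropped when its absolute Pearson correlation with any already kept indicator reaches a threshold. The threshold of 0.9, the greedy keep-first strategy, and the function names are assumptions for illustration; the patent does not state how "highly correlated" is defined.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def drop_highly_correlated(columns, threshold=0.9):
    """Return the indicator names kept after greedy correlation screening.

    columns: dict mapping indicator name -> list of values (one per enterprise).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```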

主成分分析的基本思路是：将原来众多具有一定相关性的变量通过线性变换重新组合成一组新的互相无关的几个综合变量，同时按照一定的标准从中选取少数几个综合变量，使其尽可能多地反映原来变量的信息。一般来说，用于确定主成分的标准有两种：第一，以方差累积贡献率大于某一比例(如80％)为标准；第二，以特征值大于1为标准，如果特征值小于1，说明该主成分的解释力度还不如直接引入一个原变量的平均解释力度大，因此一般可以用特征值大于1作为纳入标准。实际过程中，可以根据具体情况选择两种标准中的一种，确定主成分的数量。The basic idea of principal component analysis is to recombine the original numerous variables with certain correlation into a new set of several independent comprehensive variables through linear transformation, and select a few comprehensive variables according to certain standards to reflect as much information as possible from the original variables. Generally speaking, there are two standards for determining the principal components: first, the cumulative variance contribution rate is greater than a certain proportion (such as 80%); second, the eigenvalue is greater than 1. If the eigenvalue is less than 1, it means that the explanatory power of the principal component is not as strong as the average explanatory power of directly introducing an original variable. Therefore, the eigenvalue greater than 1 can generally be used as the inclusion standard. In the actual process, one of the two standards can be selected according to the specific situation to determine the number of principal components.
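The two criteria for choosing the number of principal components can be compared with a short sketch. The eigenvalues used in the usage note are made-up illustrative numbers, not results from the patent, and the function names are assumptions.

```python
def select_by_eigenvalue(eigenvalues, cutoff=1.0):
    """Criterion 2: count the components whose eigenvalue exceeds the cutoff."""
    return sum(1 for value in eigenvalues if value > cutoff)


def select_by_cumulative_variance(eigenvalues, target=0.80):
    """Criterion 1: smallest number of components whose cumulative variance
    contribution rate exceeds the target proportion."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total > target:
            return count
    return len(ordered)
```

For the illustrative eigenvalues `[2.5, 1.2, 0.8, 0.5]`, the eigenvalue criterion keeps 2 components while the 80% cumulative criterion keeps 3, which shows why the choice of criterion depends on the specific situation.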

本实施例中，主成分分析的数学模型如下：In this embodiment, the mathematical model of principal component analysis is as follows:

$$
\begin{cases}
F_1 = a_{11} ZX_1 + a_{21} ZX_2 + \cdots + a_{m1} ZX_m \\
F_2 = a_{12} ZX_1 + a_{22} ZX_2 + \cdots + a_{m2} ZX_m \\
\quad \vdots \\
F_n = a_{1n} ZX_1 + a_{2n} ZX_2 + \cdots + a_{mn} ZX_m
\end{cases}
\tag{1}
$$

式(1)中：X表示影响因素的矩阵，包括原始变量X_1，X_2，……，X_m；m表示影响因素的数量，本实施例中，m＝13；ZX_1，ZX_2，……，ZX_m表示经过标准化处理后的各变量的值；F_1，F_2，……，F_n表示提取出的主成分，是通过将原始变量X_1，X_2，……，X_m进行线性组合得到的新变量；a_1n，a_2n，……，a_mn为与X的协方差阵Σ的特征值相对应的特征向量；n表示主成分的数量，本实施例中，n也为13；Z表示标准化处理，如公式(2)所示。In formula (1): X is the matrix of influencing factors, consisting of the original variables $X_1, X_2, \ldots, X_m$; m is the number of influencing factors, and in this embodiment m = 13; $ZX_1, ZX_2, \ldots, ZX_m$ are the values of the variables after standardization; $F_1, F_2, \ldots, F_n$ are the extracted principal components, which are new variables obtained as linear combinations of the original variables $X_1, X_2, \ldots, X_m$; $a_{1n}, a_{2n}, \ldots, a_{mn}$ are the components of the eigenvector corresponding to an eigenvalue of the covariance matrix Σ of X; n is the number of principal components, and in this embodiment n is also 13; Z denotes standardization, as shown in formula (2).

对于风险筛选的必选指标，其指标数据标准化处理的具体方法如下：For the required risk-screening indicators, the indicator data are standardized as follows:

$$
ZX_i = \frac{X_i - \bar{X}_i}{\sigma_i} \tag{2}
$$

式(2)中：ZX_i表示经过标准化处理后的第i个变量的值，i＝1，2，……，m；$\bar{X}_i$表示第i个变量的均值，σ_i表示第i个变量的标准差。In formula (2): $ZX_i$ is the value of the i-th variable after standardization, i = 1, 2, ..., m; $\bar{X}_i$ is the mean of the i-th variable, and $\sigma_i$ is the standard deviation of the i-th variable.

The covariance matrix Σ of the standardized data X is then calculated; its elements are:
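Formula (3) is not reproduced in this text. A plausible reconstruction is the sample covariance below. It rests on two assumptions that the text does not confirm: that k indexes the N observed samples, and that the normalizing factor is N − 1 rather than N.

```latex
\begin{equation}
\tau_{ij} = \frac{1}{N-1} \sum_{k=1}^{N} \left(x_{ki} - \bar{x}_i\right)\left(x_{kj} - \bar{x}_j\right),
\qquad i, j = 1, 2, \ldots, m
\tag{3}
\end{equation}
```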

In formula (3): τ_ij denotes an element of the covariance matrix Σ; x̄_i and x̄_j are the means of the i-th and j-th variables, respectively; x_ki and x_kj denote the values of the i-th and j-th variables in the k-th sample, respectively.

If the eigenvalue corresponding to the k-th principal component is λ_k, the variance contribution rate V of that principal component is:
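The formula for the variance contribution rate is not reproduced in this text. The standard definition, consistent with the symbols λ_k, λ_j and n used around it, would be the following; it is numbered (4) here on the assumption that it follows formula (3) in sequence:

```latex
\begin{equation}
V = \frac{\lambda_k}{\sum_{j=1}^{n} \lambda_j}
\tag{4}
\end{equation}
```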

In formula (4): λ_j denotes the eigenvalue of a single principal component, and ∑λ_j (j = 1, ..., n) denotes the sum of the eigenvalues of the n principal components.
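The chain from standardization to variance contribution rates can be sketched numerically. This is an illustration under stated assumptions, not the patented implementation: it uses NumPy, a synthetic 5-variable data set in place of the 13 indicators, and the N − 1 convention for the covariance; the function name `pca_from_indicators` is hypothetical.

```python
import numpy as np


def pca_from_indicators(X):
    """Return eigenvalues, variance contribution rates and component scores."""
    X = np.asarray(X, dtype=float)
    # Standardize each column to zero mean and unit standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance matrix of the standardized data (N - 1 in the denominator).
    cov = np.cov(Z, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Share of total variance carried by each component.
    contribution = eigenvalues / eigenvalues.sum()
    # Principal components as linear combinations of the standardized variables.
    scores = Z @ eigenvectors
    return eigenvalues, contribution, scores


# Synthetic data: 200 samples, 5 correlated variables (illustrative only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixed = latent @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(200, 3))
X = np.hstack([latent, mixed])

eigenvalues, contribution, scores = pca_from_indicators(X)
print(eigenvalues.round(3))
print(contribution.cumsum().round(3))
```

Because the data are standardized, the eigenvalues sum to the number of variables, which is what makes "eigenvalue greater than 1" a comparison against the average variance of one original variable.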

Principal component analysis was applied to the 13 secondary indicators in Table 2 to extract common factors; the results are shown in Table 3.

Table 3 Total variance explained

In this embodiment, four principal component factors are extracted under the criterion that the eigenvalue is greater than 1. The cumulative variance contribution rate of these four factors is 74.358%, so they capture the features that have a significant influence on the enterprise supply chain risk probability.

3.2 Logistic regression model construction

The response variable in the Logistic regression model is a binary variable taking values in {0, 1}. If the enterprise supply chain faces risk, Y = 1; if it is risk-free, Y = 0. Y_i = 0 means that the supply chain of the i-th enterprise is risk-free, and Y_i = 1 means that it is at risk.

The idea behind this model is as follows: after the 13 mandatory indicator features of the sample enterprises are introduced, principal component analysis is performed on the original data to obtain dimension-reduced data, and the 4 principal components with eigenvalues greater than 1 are fed into Logistic regression to calculate the risk probability over a period of time. Let P be the probability that the enterprise's supply chain is at risk, so that 1-P is the probability that it is risk-free; define
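Formula (5) is not reproduced in this text. Assuming the standard logit (log-odds) definition, which matches the constant term a, the coefficients b and the features x described for it, it would read:

```latex
\begin{equation}
\ln\!\left(\frac{P}{1-P}\right) = a + b_1 x_1 + b_2 x_2 + \cdots + b_{n'} x_{n'}
\tag{5}
\end{equation}
```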

In formula (5): P denotes the probability that the enterprise supply chain is at risk; a is the constant term; b₁, b₂, ..., b_n′ are the Logistic regression coefficients, which express how strongly each independent variable x₁, x₂, ..., x_n′ influences the dependent variable P; these coefficients are estimated during model fitting and reflect the contribution of each principal component to the risk probability; n′ denotes the number of principal components obtained by reducing the dimension of the original data through principal component analysis, that is, the number of features after dimensionality reduction, and in this embodiment n′ = 4; x₁, x₂, ..., x_n′ denote the features of the dimension-reduced data obtained by principal component analysis.

Rearranging formula (5) gives:
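Formula (6) is not reproduced in this text. If formula (5) has the standard log-odds form, exponentiating both sides and solving for P would give:

```latex
\begin{equation}
P = \frac{e^{\,a + b_1 x_1 + b_2 x_2 + \cdots + b_{n'} x_{n'}}}
         {1 + e^{\,a + b_1 x_1 + b_2 x_2 + \cdots + b_{n'} x_{n'}}}
\tag{6}
\end{equation}
```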

Formula (6) is the Logistic regression equation.

Because the Logistic regression model is nonlinear and its error term follows a binomial distribution rather than a normal distribution, the parameters are estimated by maximum likelihood when the model is fitted. In line with Logistic regression theory, 0.5 is chosen as the threshold for judging the state of an enterprise's supply chain: if the supply chain score computed by the Logistic regression model is less than or equal to 0.5, the enterprise is judged to have a supply chain in good condition; otherwise, it is judged to have a supply chain in poor condition.
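Maximum likelihood fitting and the 0.5 threshold can be illustrated with a small pure-Python sketch. Several things here are assumptions made for illustration: the likelihood is maximized by plain gradient ascent (the text does not say which optimizer is used), the supply chain score is taken to be the risk probability P, only two features are used instead of the four principal components, and the data and function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def risk_probability(a, b, x):
    """P = sigmoid(a + b1*x1 + ... ) for one enterprise."""
    return sigmoid(a + sum(bj * xj for bj, xj in zip(b, x)))


def fit_logistic(features, labels, learning_rate=0.1, iterations=2000):
    """Estimate a and b by gradient ascent on the Bernoulli log-likelihood."""
    n_samples = len(features)
    a = 0.0
    b = [0.0] * len(features[0])
    for _ in range(iterations):
        grad_a = 0.0
        grad_b = [0.0] * len(b)
        for x, y in zip(features, labels):
            error = y - risk_probability(a, b, x)  # gradient of the log-likelihood
            grad_a += error
            for j, xj in enumerate(x):
                grad_b[j] += error * xj
        a += learning_rate * grad_a / n_samples
        b = [bj + learning_rate * g / n_samples for bj, g in zip(b, grad_b)]
    return a, b


def classify(probability, threshold=0.5):
    """Score <= threshold: supply chain in good condition; otherwise poor."""
    return "good" if probability <= threshold else "poor"


# Hypothetical component scores and risk labels (1 = at risk), not real data.
features = [[-2.0, 0.5], [-1.0, -0.3], [-0.5, 1.0], [-0.1, 0.2], [0.3, 0.0],
            [-0.3, 0.0], [0.2, -1.0], [0.8, 0.4], [1.5, -0.2], [2.0, 1.1]]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

a, b = fit_logistic(features, labels)
for x in ([3.0, 0.0], [-3.0, 0.0]):
    p = risk_probability(a, b, x)
    print(x, round(p, 3), classify(p))
```

In practice a library routine with a second-order optimizer would normally replace the hand-written loop; the loop is kept here only to make the likelihood gradient visible.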

3.3 Risk result evaluation and analysis

The present invention considers both the supply chain risk evaluation index system and the traditional supply chain risk evaluation index system. The former is built on quantitative data and quantified indicators: it quantifies supply chain risk through data collection and measurement, and usually relies on mathematical models or algorithms for analysis and evaluation. The latter is built on qualitative information and human judgment; its indicators are hard to quantify and depend more on subjective judgment and professional expertise. Traditional methods for evaluating supply chain risk include SWOT analysis (strengths, weaknesses, opportunities and threats), expert interviews and questionnaires. The present method adopts a comprehensive evaluation framework: principal component analysis is performed on the 13 secondary indicators involved in supply chain evaluation, and after the principal components are extracted, Logistic regression is used to evaluate manufacturing supply chain risk.

4. Association of risk events with the knowledge graph

4.1 Finding nodes related to risk events

Risk event identification: risk sources are classified and identified according to the text data labels stored by technique 2.

Graph query: the knowledge graph is queried using a graph database or a knowledge graph query language (SPARQL) to find the company and product nodes related to the identified risk event, together with their associated information, as shown in Figure 4.

4.2 Knowledge graph expansion and supplementation

Information supplementation: if the knowledge graph lacks risk-related information, new information is obtained from reliable sources using techniques such as natural language processing and data mining.

Node expansion: new nodes (companies, products) are added to the graph based on existing information and newly acquired data.

Relationship update: new relationships are established to connect new nodes with existing nodes, and existing relationships or attribute information are updated.
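The lookup, node expansion and relationship update steps can be sketched with a small in-memory graph. This stands in for the graph database only for illustration: the class `IndustryGraph`, the function `link_risk_event`, the relation names and the sample entities are all hypothetical, and the entity list is assumed to come from an upstream text-processing step.

```python
from collections import defaultdict


class IndustryGraph:
    """Tiny stand-in for a knowledge graph: typed nodes and labelled edges."""

    def __init__(self):
        self.node_types = {}             # node name -> "company" / "product" / "risk_event"
        self.edges = defaultdict(set)    # node name -> {(relation, neighbour)}

    def add_node(self, name, node_type):
        # Existing nodes keep their original type.
        self.node_types.setdefault(name, node_type)

    def add_relation(self, source, relation, target):
        # Stored in both directions so lookups work from either end.
        self.edges[source].add((relation, target))
        self.edges[target].add((relation, source))

    def related(self, name, node_types=("company", "product")):
        neighbours = {n for _, n in self.edges.get(name, ())}
        return sorted(n for n in neighbours if self.node_types.get(n) in node_types)


def link_risk_event(graph, event, extracted_entities):
    """Attach a risk event to its entities, adding any node that is missing."""
    graph.add_node(event, "risk_event")
    for name, node_type in extracted_entities:
        if name not in graph.node_types:
            graph.add_node(name, node_type)           # node expansion
        graph.add_relation(event, "affects", name)    # relationship update
    return graph.related(event)


graph = IndustryGraph()
graph.add_node("Company A", "company")
graph.add_node("Turbine blade", "product")
graph.add_relation("Company A", "produces", "Turbine blade")

# "Company B" is not yet in the graph and is added during linking.
related = link_risk_event(graph, "Port closure",
                          [("Company A", "company"), ("Company B", "company")])
print(related)
```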

4.3 Knowledge graph analysis and application

Path analysis: the Neo4j graph database is used to find potential association paths between companies, products and risk events.

Relevance analysis: relevance or similarity measures between nodes are calculated to determine potential associations.

Result display and utilization: the analysis results are presented visually to support decision making, risk assessment or other business applications.
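Path analysis and relevance analysis can be illustrated without a database. The sketch below substitutes a breadth-first search over a plain adjacency dictionary for the graph database query, and uses Jaccard similarity of neighbour sets as one possible relevance measure; the text does not specify either choice, and the node names are hypothetical.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk back from the goal to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


def jaccard_similarity(adjacency, first, second):
    """Shared neighbours divided by all neighbours of the two nodes."""
    a = set(adjacency.get(first, ()))
    b = set(adjacency.get(second, ()))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical undirected graph linking a risk event to companies via a product.
adjacency = {
    "Port closure": ["Supplier X"],
    "Supplier X": ["Port closure", "Forged rotor"],
    "Forged rotor": ["Supplier X", "Company A", "Company B"],
    "Company A": ["Forged rotor"],
    "Company B": ["Forged rotor"],
}

path = shortest_path(adjacency, "Port closure", "Company A")
similarity = jaccard_similarity(adjacency, "Company A", "Company B")
print(path, similarity)
```

The returned path shows how a risk event reaches a company through intermediate supplier and product nodes, which is the kind of indirect association the path analysis is meant to surface.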

5. Risk information push

Finally, the results of the above techniques are compiled into risk events together with the industries, companies and products they affect, and this information is pushed to the relevant department personnel so that they can learn of and respond to possible risks in a timely manner, as shown in Figure 5.