CN116681056B


Info

Publication number: CN116681056B
Application number: CN202310596067.5A
Authority: CN
Inventors: 张勇东; 毛震东; 刘毅; 郭俊波; 陈伟东
Original assignee: University of Science and Technology of China USTC; People Co Ltd
Current assignee: University of Science and Technology of China USTC; Konami Sports Club Co Ltd
Priority date: 2023-05-24
Filing date: 2023-05-24
Publication date: 2024-01-26
Anticipated expiration: 2043-05-24
Also published as: CN116681056A

Abstract

An embodiment of the invention discloses a text value calculation method and device based on a value scale. The method comprises: performing word segmentation on a text to obtain a keyword set containing a plurality of keywords; traversing the keyword set against a preset value scale and querying for node keywords that match the keywords, to obtain matching node sets of different levels, wherein the preset value scale comprises preset nodes of a plurality of levels and each node includes a node keyword; and calculating value data of the text according to the number and weights of the matching node sets of the different levels. By segmenting the text and matching its keywords against the node keywords in the preset value scale, the method determines the matching node sets of different levels contained in the text and then computes the value data of the text from the number and weights of those sets, so that the value of the text is determined on the basis of the preset value scale.

Description

Translated from Chinese

Text value calculation method and device based on a value scale

Technical Field

Embodiments of the present invention relate to the field of artificial intelligence, and in particular to a text value calculation method and device based on a value scale.

Background Art

With the development of technology, we have entered the era of self-media, whose ecology differs from that of traditional media. In traditional media, information is produced and published mainly by professional entities, and it is characterized by high credibility and strict editorial control. In the self-media era, anyone can create and publish content over the Internet, so the quality of the information circulating online is seriously lacking in assurance. Content on the various media platforms is of uneven quality, and much of it has a poor value orientation. Because such content is cheap to produce and easy to consume, a large amount of low-value content exists online; it tends to spread excessively and challenges the dissemination of content that reflects mainstream values. If low-value content is left to grow unchecked, useless and harmful information will flood the Internet, pollute cyberspace, negatively affect the social climate, and gradually skew public values.

Existing methods for guiding online information mainly include rumor detection, public opinion monitoring, standard setting, and popularity prediction. Their main purpose is to identify fabricated information, monitor how trending events develop, and so on. Standard setting, for example, formulates standards and specifications that define the content and form in which online information may be published, thereby managing and guiding the publishers and disseminators of information; however, this approach is rigid and inflexible. Popularity prediction generally assumes that more popular information has greater value, but this deviates from reality: sensationalist, cheap, low-value information sometimes spreads more easily. It is therefore necessary to evaluate the text of online content at the level of values, rather than focusing only on narrow aspects such as fabrication or popularity.

Summary of the Invention

In view of the above problems, embodiments of the present invention are proposed to provide a text value calculation method and device based on a value scale that overcome, or at least partially solve, those problems.

According to one aspect of the embodiments of the present invention, a text value calculation method based on a value scale is provided, comprising:

performing word segmentation on a text to obtain a keyword set containing a plurality of keywords;

traversing the keyword set against a preset value scale and querying for node keywords that match the keywords, to obtain matching node sets of different levels, wherein the preset value scale comprises preset nodes of a plurality of levels and each node includes a node keyword;

calculating value data of the text according to the number and weights of the matching node sets of the different levels.

According to another aspect of the embodiments of the present invention, a text value calculation device based on a value scale is provided, comprising:

a word segmentation module, adapted to perform word segmentation on a text to obtain a keyword set containing a plurality of keywords;

a matching module, adapted to traverse the keyword set against a preset value scale and query for node keywords that match the keywords, to obtain matching node sets of different levels, wherein the preset value scale comprises preset nodes of a plurality of levels and each node includes a node keyword;

a value calculation module, adapted to calculate value data of the text according to the number and weights of the matching node sets of the different levels.

According to yet another aspect of the embodiments of the present invention, a computing device is provided, comprising a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another via the communication bus;

the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the above text value calculation method based on a value scale.

According to a further aspect of the embodiments of the present invention, a computer storage medium is provided, in which at least one executable instruction is stored, the executable instruction causing a processor to perform the operations corresponding to the above text value calculation method based on a value scale.

According to the text value calculation method and device based on a value scale provided by the embodiments of the present invention, the text is segmented into words, and the keywords in the text are matched against the node keywords in the preset value scale to determine the matching node sets of different levels contained in the text. The value data of the text is then calculated from the number and weights of the matching node sets of the different levels, so that the value of the text is determined on the basis of the preset value scale.

The above is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features, and advantages of the embodiments are more readily apparent, specific implementations of the embodiments are set out below.

Brief Description of the Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the embodiments of the present invention. Throughout the drawings, the same reference signs denote the same components. In the drawings:

Figure 1 shows a flow chart of a text value calculation method based on a value scale according to an embodiment of the present invention;

Figure 2 shows a flow chart for updating the preset value scale;

Figure 3 shows a schematic structural diagram of a text value calculation device based on a value scale according to an embodiment of the present invention;

Figure 4 shows a schematic structural diagram of a computing device according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.

Figure 1 shows a flow chart of a text value calculation method based on a value scale according to an embodiment of the present invention. As shown in Figure 1, the method includes the following steps:

Step S101: perform word segmentation on the text to obtain a keyword set containing a plurality of keywords.

In this embodiment, text value is calculated by parsing the various texts that users publish online and computing how well each text matches mainstream values. The approach takes upholding an orientation toward social justice as its foundation and safeguards the correct understanding and accurate dissemination of content that reflects mainstream values.

Specifically, after the text is obtained, it is first preprocessed. Preprocessing includes, for example, format filtering and stop-word filtering. It removes formatting information and words with no value-related meaning, reducing the number of words irrelevant to the value calculation and helping to ensure the accuracy of the subsequent word segmentation. For example, dates in the text, dateline phrases in news such as "本报电" ("our newspaper reports"), and URLs are filtered out. For stop words, a stop-word list can be preset and used for filtering; the list contains words or symbols with no value-related meaning, such as "@", "()", and "emmmm". The formatting information and stop-word list above are examples only; they can be set according to the implementation and are not limited here.

The preprocessed text is then processed according to its punctuation: the text is first split into sentences at punctuation marks, and each sentence is segmented to obtain the phrases it contains. Segmentation can use, for example, a natural language processing NER (Named Entity Recognition) tool, yielding phrases such as "人类" (humanity), "命运" (destiny), and "共同体" (community). Because segmentation merely cuts a sentence into phrases and does not take the relationships between them into account, this embodiment additionally combines the phrases on the basis of a preset extended vocabulary to obtain the corresponding keywords, which make up the keyword set. The preset extended vocabulary is set according to the implementation.

The keyword set contains the keywords obtained from the text, and the subsequent text value calculation is based on these keywords.
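Step S101 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the regular expressions, the stop-word list, the single extended-vocabulary entry, and the function names (`preprocess`, `split_sentences`, `combine_phrases`, `extract_keywords`) are all hypothetical, and the segmenter is passed in as a plain callable because the source leaves the tool open.

```python
import re

# Hypothetical resources; the source says these are set per implementation.
STOP_WORDS = {"@", "()", "emmmm"}
EXTENDED_VOCAB = {("人类", "命运", "共同体"): "人类命运共同体"}

URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&_%:#~+-]+")
DATE_RE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日|\d{4}-\d{1,2}-\d{1,2}")
SENT_SPLIT_RE = re.compile(r"[。！？!?；;\n]+")


def preprocess(text):
    """Format filtering (URLs, dates) followed by crude stop-word removal."""
    text = URL_RE.sub("", text)
    text = DATE_RE.sub("", text)
    for stop_word in STOP_WORDS:
        text = text.replace(stop_word, "")
    return text


def split_sentences(text):
    """Split on sentence-ending punctuation and drop empty pieces."""
    return [s.strip() for s in SENT_SPLIT_RE.split(text) if s.strip()]


def combine_phrases(tokens):
    """Greedily merge adjacent tokens that form an extended-vocabulary entry."""
    merged, i = [], 0
    max_len = max((len(key) for key in EXTENDED_VOCAB), default=1)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in EXTENDED_VOCAB:
                merged.append(EXTENDED_VOCAB[key])
                i += n
                break
        else:  # no multi-token entry starts here; keep the single token
            merged.append(tokens[i])
            i += 1
    return merged


def extract_keywords(text, segment):
    """segment: any callable mapping a sentence to a list of tokens."""
    keywords = []
    for sentence in split_sentences(preprocess(text)):
        for keyword in combine_phrases(segment(sentence)):
            if keyword not in keywords:  # the keyword set holds no duplicates
                keywords.append(keyword)
    return keywords
```

In practice `segment` would wrap a real Chinese segmenter; a whitespace splitter such as `str.split` is enough to exercise the pipeline on pre-tokenized text.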

Step S102: traverse the keyword set against the preset value scale and query for node keywords that match the keywords, to obtain matching node sets of different levels.

The preset value scale can be set in advance, for example as a hierarchical label semantic knowledge graph containing preset nodes of several levels, which are, in order, core nodes, sub-core nodes, and peripheral nodes. Core nodes carry more value than sub-core nodes, and sub-core nodes carry more value than peripheral nodes. How nodes are divided into core, sub-core, and peripheral nodes is set according to the implementation and determined in light of current mainstream values; it is not limited here. Each node includes a node keyword and also includes, for example, a node frequency, related nodes and similar nodes, a node number, and an entity type. The node number makes it quick to look up the node to which a node keyword belongs, and it corresponds to the node level: for example, a node number beginning with A denotes a core node, one beginning with B a sub-core node, and one beginning with C a peripheral node. These are examples only. Depending on the implementation, a query can also return other information about the node, such as its node frequency, related nodes, and similar nodes. The degree of a node (that is, the total number of its related nodes and similar nodes) can be obtained by counting the related and similar nodes returned. Here, the related and similar nodes of a node can be queried by its node keyword, and the second-order related and similar nodes of the original node can in turn be found from them (that is, the related and similar nodes are used as query terms against the preset value scale to obtain their own related and similar nodes). Whether to run a single query or several further queries based on the results can be chosen according to the implementation and is not limited here.

After the keyword set is obtained, it is traversed. For each keyword, the preset value scale is queried to check whether a matching node keyword exists. If so, the node keyword is classified by the level of the node it belongs to, giving matching node sets of different levels: a core node set, a sub-core node set, and a peripheral node set. For example, if the node number of a node keyword is AXXXX, the level of its node can be determined from the node number and the node keyword is placed in the core node set. If querying the preset value scale yields no node keyword matching a keyword, that keyword is placed in a non-value matching node set, which is not used in the text value calculation. No keyword is repeated within the core node set, the sub-core node set, the peripheral node set, or the non-value matching node set.
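The lookup described in step S102 can be illustrated with a small sketch. The `ScaleNode` class, its field names, the dictionary-based scale, and the `match_keywords` function are assumptions made for illustration; only the A/B/C node-number prefixes and the notion of node degree come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ScaleNode:
    """One node of the value scale (field names are hypothetical)."""
    node_id: str                     # e.g. "A0001"; the first letter encodes the level
    keyword: str
    frequency: int = 0
    related: list = field(default_factory=list)
    similar: list = field(default_factory=list)

    @property
    def degree(self):
        # Degree = total number of related and similar nodes.
        return len(self.related) + len(self.similar)


LEVEL_BY_PREFIX = {"A": "core", "B": "sub_core", "C": "peripheral"}


def match_keywords(keywords, scale):
    """Classify keywords by the level of the matching node.

    scale maps node keyword -> ScaleNode. Unmatched keywords go to the
    non-value set, which is ignored in the later value calculation.
    """
    matched = {"core": [], "sub_core": [], "peripheral": [], "non_value": []}
    for keyword in dict.fromkeys(keywords):  # de-duplicate, keep order
        node = scale.get(keyword)
        if node is None:
            level = "non_value"
        else:
            level = LEVEL_BY_PREFIX.get(node.node_id[:1], "non_value")
        matched[level].append(keyword)
    return matched
```

A plain dictionary keyed by node keyword stands in for the knowledge graph here; a graph store would serve the same lookup while also returning related and similar nodes.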

Furthermore, in addition to being set in advance, the preset value scale can be updated from new text, as shown in Figure 2:

Step S201: split the first text into a plurality of sentences, perform first word segmentation on the sentences, and obtain the part-of-speech information, syntactic dependency information, and semantic dependency information of each first segment.

Any new text (hereinafter the first text) is first split into sentences, and first word segmentation is then performed on each sentence, for example with HanLP (Han Language Processing, a Chinese language processing package), which can perform word segmentation, part-of-speech tagging, entity recognition, and so on. This yields the first segments of each sentence together with their part-of-speech information, syntactic dependency information, and semantic dependency information. For example, splitting the first text gives $D = \{s_i,\ i = 1, 2, \ldots, N\}$, where $s_i$ is the $i$-th sentence of the first text and $N$ is the total number of sentences in the first text; and $s_i = \{w_j,\ j = 1, 2, \ldots, V\}$, where $w_j$ is the $j$-th first segment in sentence $s_i$ and $V$ is the total number of first segments.

Step S202: extract candidate segments according to the part-of-speech information, syntactic dependency information, and semantic dependency information of each first segment, and filter the candidate segments to obtain a candidate segment set.

Based on the part-of-speech information, syntactic dependency information, and semantic dependency information of each first segment $w_j$, the frequency of each first segment $w_j$ can be counted, and a sliding window of size $n$, measured in bytes, is applied to extract the candidate segments. The candidate segments take the form of, for example, n-grams.
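The sliding-window extraction can be sketched as below. Note the assumption: the source describes a window sized in bytes, whereas this sketch slides the window over already segmented tokens, which is a common way to build n-grams; `ngram_candidates` and its `n_max` parameter are hypothetical names.

```python
from collections import Counter


def ngram_candidates(sentences, n_max=3):
    """Count every n-gram (n = 1..n_max) over token lists.

    sentences: iterable of token lists, one list per sentence.
    Returns a Counter mapping the joined n-gram string to its frequency.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(1, n_max + 1):
            # Slide a window of n tokens across the sentence.
            for start in range(len(tokens) - n + 1):
                counts["".join(tokens[start:start + n])] += 1
    return counts
```

The resulting frequencies can feed the filtering stage that follows, for instance to drop low-frequency person names.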

Further, once the candidate segments are obtained, they are filtered. Filtering includes stop-word filtering, digit filtering, low-frequency person-name filtering, numeral and quantifier filtering, part-of-speech filtering, segment part-of-speech filtering, and keyword filtering. Filter lists can be configured for these steps so that everyday segments are removed and new segments for updating the preset value scale are found more quickly. Filtering yields the candidate segment set.

Step S203: using a preset model, extract the segment feature set of the candidate segment set, the core feature set of the core node keywords of the preset value scale, the sub-core feature set of the sub-core node keywords, and the peripheral feature set of the peripheral node keywords; calculate the core similarity of each segment in the candidate segment set from the segment feature set, the core feature set, and the number of core node keywords; calculate the sub-core similarity of each segment from the segment feature set, the sub-core feature set, and the number of sub-core node keywords; and calculate the peripheral similarity of each segment from the segment feature set, the peripheral feature set, and the number of peripheral node keywords.

For the candidate segment set $O = \{n_1, n_2, \ldots, n_i, \ldots, n_m\}$, where $m$ is the total number of n-gram candidate segments obtained, a preset model, such as a pre-trained auto-encoding language model like BERT, can be used to extract the segment feature set $f_O = \{f_{O_i},\ i = 1, 2, \ldots, m\}$, with $f_O \in \mathbb{R}^{m \times d}$, where $\mathbb{R}^{m \times d}$ is the real space of dimension $m \times d$ and $d$ is the feature dimension. Each $f_{O_i}$ is obtained by the following formula:

$$f_{O_i} = \mathrm{LM}(n_i)$$

Here, LM denotes the preset model, $n_i$ is the $i$-th segment in the candidate segment set, and $f_{O_i}$ is the segment feature of $n_i$. In the same way, the preset model can be used to obtain the core feature set $f_A$ of the core node keywords of the preset value scale, the sub-core feature set $f_B$ of the sub-core node keywords, and the peripheral feature set $f_C$ of the peripheral node keywords.

Once the segment feature set $f_O$, the core feature set $f_A$, the sub-core feature set $f_B$, and the peripheral feature set $f_C$ have been obtained, the similarities can be calculated from these feature sets: the core similarity from $f_O$, $f_A$, and the number of core node keywords; the sub-core similarity from $f_O$, $f_B$, and the number of sub-core node keywords; and the peripheral similarity from $f_O$, $f_C$, and the number of peripheral node keywords. Taking the core similarity as an example, refer to the following formula:

Here, $|A|$ is the number of core node keywords, $f_{A_j}$ is the feature of the $j$-th core node keyword, $T$ denotes the transpose, and $sim_A$ is the core similarity. In the same way, the sub-core similarity $sim_B$ can be calculated from $f_O$, $f_B$, and the number of sub-core node keywords, and the peripheral similarity $sim_C$ from $f_O$, $f_C$, and the number of peripheral node keywords. The values of $sim_A$, $sim_B$, and $sim_C$ range from 0 to 1.
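The similarity formula itself did not survive in this copy of the text, so the sketch below is only one plausible reading that uses the same ingredients (the segment feature, the level's keyword features, their count, and a dot product via the transpose): the mean cosine similarity between a segment feature and every keyword feature of a level, clipped to the stated 0 to 1 range. The averaging, the cosine normalization, the clipping, and the function names are assumptions, and plain Python lists stand in for model embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def level_similarity(segment_feature, level_features):
    """Assumed form: average cosine similarity against one level's features.

    segment_feature: feature vector of one candidate segment (f_Oi).
    level_features: feature vectors of that level's node keywords (e.g. f_A).
    """
    if not level_features:
        return 0.0
    mean = sum(cosine(segment_feature, f) for f in level_features) / len(level_features)
    return min(1.0, max(0.0, mean))  # keep the result inside [0, 1]
```

Calling `level_similarity` three times, once each with the core, sub-core, and peripheral features, would give the three scores used in the steps that follow.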

Step S204: traverse the candidate segment set and, for each segment, compare its core similarity with a preset core threshold to determine whether the core similarity is greater than or equal to the preset core threshold.

After the core similarity, sub-core similarity, and peripheral similarity of each segment in the candidate segment set have been calculated, the candidate segment set is traversed. For each segment, its core similarity $sim_A$ is first compared with the preset core threshold. If $sim_A$ is greater than or equal to the preset core threshold, step S207 is executed and the segment is added to the preset value scale; if $sim_A = 1$, the segment is already in the preset value scale and does not need to be added. If $sim_A$ is less than the preset core threshold, step S205 is executed.

Step S205: compare the sub-core similarity of the segment with a preset sub-core threshold to determine whether the sub-core similarity is greater than or equal to the preset sub-core threshold.

If the core similarity $sim_A$ is less than the preset core threshold, the sub-core similarity $sim_B$ of the segment is compared with the preset sub-core threshold. If $sim_B$ is greater than or equal to the preset sub-core threshold, step S207 is executed and the segment is added to the preset value scale; if $sim_B = 1$, the segment is already in the preset value scale and does not need to be added. If $sim_B$ is less than the preset sub-core threshold, step S206 is executed.

Step S206: compare the peripheral similarity of the segment with a preset peripheral threshold to determine whether the peripheral similarity is greater than or equal to the preset peripheral threshold.

If the sub-core similarity $sim_B$ is less than the preset sub-core threshold, the peripheral similarity $sim_C$ of the segment is compared with the preset peripheral threshold. If $sim_C$ is greater than or equal to the preset peripheral threshold, step S207 is executed and the segment is added to the preset value scale; if $sim_C = 1$, the segment is already in the preset value scale and does not need to be added. If $sim_C$ is less than the preset peripheral threshold, the segment does not meet the requirements of the preset value scale and does not reflect mainstream values, so it is discarded. After the segment is discarded, the traversal moves on to the next segment in the candidate segment set and evaluates its core, sub-core, and peripheral similarities in the same way, until every segment in the candidate segment set has been processed and the update of the preset value scale is complete.

Step S207: add the segment to the preset value scale.

When the core similarity is greater than or equal to the preset core threshold, or the sub-core similarity is greater than or equal to the preset sub-core threshold, or the peripheral similarity is greater than or equal to the preset peripheral threshold, the segment can be added to the preset value scale. Depending on which condition was satisfied, it is added to the core node keywords, the sub-core node keywords, or the peripheral node keywords. After the segment is added, the traversal moves on to the next segment in the candidate segment set and evaluates its core, sub-core, and peripheral similarities in the same way, until every segment in the candidate segment set has been processed and the update of the preset value scale is complete.
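Steps S204 to S207 amount to a cascade of threshold checks, which can be sketched as follows. The threshold values, the level names, and the functions `assign_level` and `update_scale` are hypothetical; the source only says that the thresholds are preset.

```python
# Hypothetical thresholds, checked in order: core, then sub-core, then peripheral.
THRESHOLDS = (("core", 0.8), ("sub_core", 0.7), ("peripheral", 0.6))


def assign_level(sims, thresholds=THRESHOLDS):
    """Decide what to do with one candidate segment.

    sims maps level name -> similarity in [0, 1].
    Returns the level to add the segment to, "already_present" when the
    similarity that passed its threshold equals 1, or "discard".
    """
    for level, threshold in thresholds:
        similarity = sims[level]
        if similarity >= threshold:
            return "already_present" if similarity >= 1.0 else level
    return "discard"


def update_scale(scale, candidates):
    """Add every accepted candidate to its level of the scale.

    scale maps level name -> set of node keywords.
    candidates maps segment -> sims dict as accepted by assign_level.
    """
    for segment, sims in candidates.items():
        decision = assign_level(sims)
        if decision in scale:  # skips "already_present" and "discard"
            scale[decision].add(segment)
    return scale
```

Because the checks run in order, a segment that passes the core threshold is never considered for the lower levels, mirroring the S204, S205, S206 sequence.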

Step S103: calculate the value data of the text according to the number and weights of the matching node sets of the different levels.

After the matching node sets are obtained, the value data of the text can be calculated from the number of keywords contained in the matching node set of each level and the weight corresponding to that level. Specifically, four products are computed: the first product of the size of the core node set and the core node weight, the second product of the size of the sub-core node set and the sub-core node weight, the third product of the size of the peripheral node set and the peripheral node weight, and the fourth product of the size of the keyword set and the core node weight. The first, second, and third products are summed, and the ratio of that sum to the fourth product is calculated, as in the following formula:

$$v = \frac{|A|\,\alpha'_A + |B|\,\alpha'_B + |C|\,\alpha'_C}{|S|\,\alpha'_A} \tag{1}$$

In formula (1), |A| is the number of keywords in the core node set, |B| is the number of keywords in the sub-core node set, |C| is the number of keywords in the peripheral node set, |S| is the number of keywords in the keyword set, α′_A is the core node weight, α′_B is the sub-core node weight, α′_C is the peripheral node weight, and v is the intermediate value data of the text.

The keyword set obtained by word segmentation may contain some keywords of no value, so the non-value matching node set may contain too many keywords, which makes the calculated intermediate value data v of the text too small. This embodiment therefore corrects the ratio with a preset exponent to obtain the value data of the text, according to the following formula:

$$ v' = v^{0.3} \tag{2} $$

In formula (2), v′ is the value data of the text and the preset exponent is, for example, 0.3; the power function stretches v to give the corrected value data v′. Under this calculation, if every matching node set obtained by matching the keyword set is a core node set, the value data of the text is v′ = 1; if the matching node set is the non-value matching node set, the value data v′ of the text is determined to be 0.
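The two-stage calculation in formulas (1) and (2) can be illustrated with a short sketch. The function and parameter names are hypothetical, the weights are passed in as plain numbers, and the guard for an empty keyword set or a zero core weight is an assumption added so the example never divides by zero.

```python
def text_value(
    n_core: int,
    n_sub_core: int,
    n_peripheral: int,
    n_keywords: int,
    w_core: float,
    w_sub_core: float,
    w_peripheral: float,
    exponent: float = 0.3,
) -> float:
    """Weighted match ratio (formula 1) followed by the power correction (formula 2)."""
    if n_keywords == 0 or w_core == 0:
        # Assumed convention: nothing to score means a value of 0.
        return 0.0
    weighted_matches = (
        w_core * n_core + w_sub_core * n_sub_core + w_peripheral * n_peripheral
    )
    v = weighted_matches / (w_core * n_keywords)  # intermediate value v
    return v ** exponent  # corrected value v'
```

With an exponent below 1 the correction raises small ratios more than large ones, which offsets the dilution caused by keywords that match nothing in the value scale.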

Further, the weights are calculated as follows. The core node weight is obtained by normalizing the first sum value over the node keywords in the core node set; the first sum value is obtained by accumulating, for each node keyword in the core node set, the product of its number of related nodes and similar nodes and the preset weight, plus the node frequency of the node keyword. The sub-core node weight is obtained by normalizing the second sum value over the node keywords in the sub-core node set; the second sum value is obtained in the same way over the sub-core node set. The peripheral node weight is obtained by normalizing the third sum value over the node keywords in the peripheral node set; the third sum value is obtained in the same way over the peripheral node set. See the following formulas:

$$ \alpha'_A = \mathrm{softmax}\Big(\sum_{x \in A} [f_x + \lambda d_x]\Big) \tag{3} $$

In formula (3), α′_A is the core node weight, d_x is the number of related nodes and similar nodes of each node keyword in the core node set, A is the core node set, x ranges over the core node set, f_x is the node frequency of each node keyword in the core node set, λ is the preset weight, and softmax is the normalization function.

$$ \alpha'_B = \mathrm{softmax}\Big(\sum_{x \in B} [f_x + \lambda d_x]\Big) \tag{4} $$

In formula (4), α′_B is the sub-core node weight, d_x is the number of related nodes and similar nodes of each node keyword in the sub-core node set, B is the sub-core node set, x ranges over the sub-core node set, f_x is the node frequency of each node keyword in the sub-core node set, λ is the preset weight, and softmax is the normalization function.

$$ \alpha'_C = \mathrm{softmax}\Big(\sum_{x \in C} [f_x + \lambda d_x]\Big) \tag{5} $$

In formula (5), α′_C is the peripheral node weight, d_x is the number of related nodes and similar nodes of each node keyword in the peripheral node set, C is the peripheral node set, x ranges over the peripheral node set, f_x is the node frequency of each node keyword in the peripheral node set, λ is the preset weight, which balances the difference in scale between the number of related and similar nodes and the node frequency and is set according to the implementation, and softmax is the normalization function.

Each weight is determined by the attribute information, in the preset value scale, of the keywords in the matching node sets of different levels, such as the number of related and similar nodes and the node frequency. The higher a keyword's node frequency in the preset value scale, the larger the value data, that is, the larger the weight; the more related and similar nodes a keyword has, the more it acts as an important hub in the preset value scale, and likewise the larger its weight.
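One way to read formulas (3) to (5) is that the three per-level sums are normalized jointly by a single softmax, so that the three weights add up to 1. The sketch below follows that reading, which is an assumption, since the text applies softmax to each sum separately without stating the vector it normalizes over. The node layout (`freq` for f_x, `degree` for d_x) and the default λ of 0.5 are likewise hypothetical.

```python
import math
from typing import Dict, List, Tuple

Node = Dict[str, float]  # assumed layout: {"freq": f_x, "degree": d_x}


def level_score(nodes: List[Node], lam: float) -> float:
    """Sum of f_x + lambda * d_x over the nodes of one level."""
    return sum(node["freq"] + lam * node["degree"] for node in nodes)


def level_weights(
    core: List[Node],
    sub_core: List[Node],
    peripheral: List[Node],
    lam: float = 0.5,
) -> Tuple[float, float, float]:
    """Softmax over the three level scores, returned as (core, sub_core, peripheral)."""
    scores = [level_score(level, lam) for level in (core, sub_core, peripheral)]
    top = max(scores)
    # Subtracting the maximum keeps exp() from overflowing on large sums.
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return exps[0] / total, exps[1] / total, exps[2] / total
```

Because softmax is exponential in the raw sums, large frequencies can push one weight very close to 1; rescaling the sums first may be worth considering in practice.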

According to the text value calculation method based on a value scale provided by the embodiment of the present invention, the text is segmented into words; the keywords in the text are matched against the node keywords in the preset value scale to determine the matching node sets of different levels contained in the text; and the value data of the text is then calculated from the number and weight of those sets, so that the value of the text is determined on the basis of the preset value scale.

图3示出了本发明实施例提供的基于价值量表的文本价值计算装置的结构示意图。如图3所示，该装置包括：Figure 3 shows a schematic structural diagram of a text value calculation device based on a value scale provided by an embodiment of the present invention. As shown in Figure 3, the device includes:

分词模块310，适于对文本进行分词处理，得到包含多个关键词的关键词集合；The word segmentation module 310 is suitable for word segmentation processing of text to obtain a keyword set containing multiple keywords;

The matching module 320 is adapted to traverse the keyword set based on the preset value scale and query the node keywords that match the keywords, obtaining matching node sets of different levels; wherein the preset value scale includes nodes of multiple preset levels, and each node includes a node keyword;

价值计算模块330，适于根据不同级别的匹配节点集合的数量及权重，计算得到文本的价值数据。The value calculation module 330 is adapted to calculate the value data of the text based on the number and weight of matching node sets at different levels.

可选地，预设多个级别节点包括：核心节点、次核心节点、外围节点；每个节点还包括：节点编号、节点频率、相关节点和相似节点。Optionally, multiple levels of nodes are preset including: core nodes, sub-core nodes, and peripheral nodes; each node also includes: node number, node frequency, related nodes, and similar nodes.

可选地，匹配模块320进一步适于：Optionally, the matching module 320 is further adapted to:

遍历关键词集合，针对任一关键词，查询预设价值量表，得到与关键词匹配的节点关键词；Traverse the keyword collection, query the preset value scale for any keyword, and obtain the node keywords matching the keyword;

将节点关键词按照所属节点的级别进行归类，得到不同级别的匹配节点集合；匹配节点集合包括核心节点集合、次核心节点集合、外围节点集合。The node keywords are classified according to the level of the node to which they belong, and matching node sets of different levels are obtained; the matching node set includes a core node set, a sub-core node set, and a peripheral node set.
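The lookup and classification performed by the matching module can be sketched as below. The value scale is modelled here as a dictionary from node keyword to a record with a `level` field; that layout, the exact-string match, and the function name are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

LEVELS = ("core", "sub_core", "peripheral")


def match_keywords(
    keywords: List[str],
    value_scale: Dict[str, Dict[str, str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group text keywords by the level of the node they match.

    Returns (matched, unmatched); unmatched keywords correspond to the
    non-value matching node set.
    """
    matched: Dict[str, List[str]] = {level: [] for level in LEVELS}
    unmatched: List[str] = []
    for keyword in keywords:
        node = value_scale.get(keyword)  # exact match on the node keyword
        if node is None:
            unmatched.append(keyword)
        else:
            matched[node["level"]].append(keyword)
    return matched, unmatched
```

The list lengths of `matched["core"]`, `matched["sub_core"]`, and `matched["peripheral"]` then play the roles of |A|, |B|, and |C| in the value calculation.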

可选地，价值计算模块330进一步适于：Optionally, the value calculation module 330 is further adapted to:

Calculate the first product of the size of the core node set and the core node weight, the second product of the size of the sub-core node set and the sub-core node weight, the third product of the size of the peripheral node set and the peripheral node weight, and the fourth product of the size of the keyword set and the core node weight; wherein the core node weight is obtained by normalizing the first sum value over the node keywords in the core node set; the first sum value is obtained by accumulating, for each node keyword in the core node set, the product of its number of related nodes and similar nodes and the preset weight, plus the node frequency of the node keyword; the sub-core node weight is obtained by normalizing the second sum value over the node keywords in the sub-core node set; the second sum value is obtained by accumulating, for each node keyword in the sub-core node set, the product of its number of related nodes and similar nodes and the preset weight, plus the node frequency of the node keyword; the peripheral node weight is obtained by normalizing the third sum value over the node keywords in the peripheral node set; the third sum value is obtained by accumulating, for each node keyword in the peripheral node set, the product of its number of related nodes and similar nodes and the preset weight, plus the node frequency of the node keyword;

Accumulate the first product, the second product, and the third product, calculate the ratio of the accumulated result to the fourth product, and correct the ratio according to the preset exponent to obtain the value data of the text.

可选地，装置还包括：非匹配模块340，适于若查询预设价值量表，未得到与关键词匹配的节点关键词，将关键词归类至非价值匹配节点集合。Optionally, the device further includes: a non-matching module 340, adapted to classify the keyword into a non-value matching node set if no node keyword matching the keyword is obtained when querying the preset value scale.

可选地，装置还包括：非匹配价值模块350，适于若匹配节点集合为非价值匹配节点集合，则确定文本的价值数据为0。Optionally, the device further includes: a non-matching value module 350, adapted to determine that the value data of the text is 0 if the matching node set is a non-value matching node set.

可选地，分词模块310进一步适于：Optionally, the word segmentation module 310 is further adapted to:

对文本进行预处理；预处理包括格式过滤处理和停用词过滤处理；Preprocess the text; preprocessing includes format filtering and stop word filtering;

根据标点符号对文本进行处理，将文本拆分为多个句子；Process the text based on punctuation marks and split the text into multiple sentences;

对每个句子进行分词处理，得到每个句子包含的各个短语；Perform word segmentation processing on each sentence to obtain the individual phrases contained in each sentence;

基于预设扩展词表对各个短语进行组合，得到对应的关键词，组成关键词集合。Combining each phrase based on the preset expanded word list to obtain the corresponding keywords to form a keyword set.
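The sentence-splitting and phrase-combination steps of the word segmentation module can be sketched as follows. The word segmenter itself is left out; the example starts from an already tokenized sentence. The punctuation set, the greedy longest-match strategy, the `max_len` limit, and joining tokens without a separator (suitable for Chinese) are all assumptions, since the text only says that phrases are combined based on a preset extended vocabulary.

```python
import re
from typing import List, Set


def split_sentences(text: str) -> List[str]:
    """Split text into sentences on common Chinese and ASCII end punctuation."""
    parts = re.split(r"[。！？；!?;\n]+", text)
    return [part.strip() for part in parts if part.strip()]


def combine_phrases(tokens: List[str], extended_vocab: Set[str], max_len: int = 3) -> List[str]:
    """Merge runs of adjacent tokens whose concatenation is in the extended vocabulary.

    Longer runs are tried first; tokens that start no known run are kept as they are.
    """
    keywords: List[str] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate in extended_vocab:
                keywords.append(candidate)
                i += n
                break
        else:
            # No multi-token entry starts here: keep the single token.
            keywords.append(tokens[i])
            i += 1
    return keywords
```

Trying longer runs first keeps a multi-word term intact even when one of its parts is also a vocabulary entry on its own.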

Optionally, the device further includes an update module 360, adapted to: split a first text into multiple sentences, perform first word segmentation on the sentences, and obtain the part-of-speech information, grammatical dependency, and semantic dependency information of each first segmented word; extract the segmented words to be processed according to the part-of-speech information, grammatical dependency, and semantic dependency information of each first segmented word, and filter them to obtain a set of segmented words to be processed, where the filtering includes stop-word filtering, number filtering, low-frequency person-name filtering, numeral and quantifier filtering, part-of-speech filtering, segmented-word part-of-speech filtering, and keyword filtering; extract, based on a preset model, the segmented-word feature set of the set of segmented words to be processed, the core feature set of the core node keywords of the preset value scale, the sub-core feature set of the sub-core node keywords, and the peripheral feature set of the peripheral node keywords; calculate the core similarity of each segmented word in the set from the segmented-word feature set, the core feature set, and the number of core node keywords; calculate the sub-core similarity of each segmented word from the segmented-word feature set, the sub-core feature set, and the number of sub-core node keywords; calculate the peripheral similarity of each segmented word from the segmented-word feature set, the peripheral feature set, and the number of peripheral node keywords; and traverse the set of segmented words to be processed and, for any segmented word, compare its core similarity with the preset core threshold. If the core similarity is greater than or equal to the preset core threshold, the segmented word is added to the preset value scale; if it is less than the preset core threshold, the sub-core similarity of the segmented word is compared with the preset sub-core threshold. If the sub-core similarity is greater than or equal to the preset sub-core threshold, the segmented word is added to the preset value scale; if it is less than the preset sub-core threshold, the peripheral similarity of the segmented word is compared with the preset peripheral threshold, and if the peripheral similarity is greater than or equal to the preset peripheral threshold, the segmented word is added to the preset value scale.

For a description of the above modules, refer to the corresponding description in the method embodiment; it is not repeated here.

本发明实施例还提供了一种非易失性计算机存储介质，计算机存储介质存储有至少一可执行指令，可执行指令可执行上述任意方法实施例中的基于价值量表的文本价值计算方法。Embodiments of the present invention also provide a non-volatile computer storage medium. The computer storage medium stores at least one executable instruction. The executable instruction can execute the text value calculation method based on the value scale in any of the above method embodiments.

图4示出了根据本发明实施例的一种计算设备的结构示意图，本发明实施例的具体实施例并不对计算设备的具体实现做限定。FIG. 4 shows a schematic structural diagram of a computing device according to an embodiment of the present invention. The specific implementation of the embodiment of the present invention does not limit the specific implementation of the computing device.

如图4所示，该计算设备可以包括：处理器(processor)402、通信接口(Communications Interface)404、存储器(memory)406、以及通信总线408。As shown in FIG. 4 , the computing device may include: a processor 402 , a communications interface 404 , a memory 406 , and a communications bus 408 .

Wherein:

处理器402、通信接口404、以及存储器406通过通信总线408完成相互间的通信。The processor 402, the communication interface 404, and the memory 406 complete communication with each other through the communication bus 408.

通信接口404，用于与其它设备比如客户端或其它服务器等的网元通信。The communication interface 404 is used to communicate with network elements of other devices such as clients or other servers.

处理器402，用于执行程序410，具体可以执行上述基于价值量表的文本价值计算方法实施例中的相关步骤。The processor 402 is configured to execute the program 410. Specifically, it can execute the relevant steps in the above embodiment of the text value calculation method based on the value scale.

具体地，程序410可以包括程序代码，该程序代码包括计算机操作指令。Specifically, program 410 may include program code including computer operating instructions.

处理器402可能是中央处理器CPU，或者是特定集成电路ASIC(ApplicationSpecific Integrated Circuit)，或者是被配置成实施本发明实施例的一个或多个集成电路。计算设备包括的一个或多个处理器，可以是同一类型的处理器，如一个或多个CPU；也可以是不同类型的处理器，如一个或多个CPU以及一个或多个ASIC。The processor 402 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the computing device may be the same type of processor, such as one or more CPUs; or they may be different types of processors, such as one or more CPUs and one or more ASICs.

存储器406，用于存放程序410。存储器406可能包含高速RAM存储器，也可能还包括非易失性存储器(non-volatile memory)，例如至少一个磁盘存储器。Memory 406 is used to store programs 410. The memory 406 may include high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.

程序410具体可以用于使得处理器402执行上述任意方法实施例中的基于价值量表的文本价值计算方法。程序410中各步骤的具体实现可以参见上述基于价值量表的文本价值计算实施例中的相应步骤和单元中对应的描述，在此不赘述。所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的设备和模块的具体工作过程，可以参考前述方法实施例中的对应过程描述，在此不再赘述。The program 410 may be specifically used to cause the processor 402 to execute the text value calculation method based on the value scale in any of the above method embodiments. For the specific implementation of each step in program 410, please refer to the corresponding steps and corresponding descriptions in the units in the above embodiment of text value calculation based on value scale, and will not be described again here. Those skilled in the art can clearly understand that for the convenience and simplicity of description, the specific working processes of the above-described devices and modules can be referred to the corresponding process descriptions in the foregoing method embodiments, and will not be described again here.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such systems is apparent from the description above. Furthermore, embodiments of the present invention are not directed to any specific programming language. It should be understood that the embodiments of the present invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given to disclose the preferred implementation of the embodiments of the present invention.

Numerous specific details are set forth in the description provided here. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that in the above description of exemplary embodiments of the invention, in order to streamline the disclosure and aid in understanding one or more of the various inventive aspects, various features of the embodiments are sometimes grouped together into a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

此外，本领域的技术人员能够理解，尽管在此的一些实施例包括其它实施例中所包括的某些特征而不是其它特征，但是不同实施例的特征的组合意味着处于本发明的范围之内并且形成不同的实施例。例如，在下面的权利要求书中，所要求保护的实施例的任意之一都可以以任意的组合方式来使用。Furthermore, those skilled in the art will understand that although some embodiments herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention. and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

本发明的各个部件实施例可以以硬件实现，或者以在一个或者多个处理器上运行的软件模块实现，或者以它们的组合实现。本领域的技术人员应当理解，可以在实践中使用微处理器或者数字信号处理器(DSP)来实现根据本发明实施例的一些或者全部部件的一些或者全部功能。本发明实施例还可以实现为用于执行这里所描述的方法的一部分或者全部的设备或者装置程序(例如，计算机程序和计算机程序产品)。这样的实现本发明实施例的程序可以存储在计算机可读介质上，或者可以具有一个或者多个信号的形式。这样的信号可以从因特网网站上下载得到，或者在载体信号上提供，或者以任何其他形式提供。Various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some or all components according to embodiments of the present invention. Embodiments of the present invention may also be implemented as equipment or device programs (eg, computer programs and computer program products) for performing part or all of the methods described herein. Such a program for implementing embodiments of the present invention may be stored on a computer-readable medium, or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, or provided on a carrier signal, or in any other form.

应该注意的是上述实施例对本发明实施例进行说明而不是对本发明进行限制，并且本领域技术人员在不脱离所附权利要求的范围的情况下可设计出替换实施例。在权利要求中，不应将位于括号之间的任何参考符号构造成对权利要求的限制。单词“包含”不排除存在未列在权利要求中的元件或步骤。位于元件之前的单词“一”或“一个”不排除存在多个这样的元件。本发明实施例可以借助于包括有若干不同元件的硬件以及借助于适当编程的计算机来实现。在列举了若干装置的单元权利要求中，这些装置中的若干个可以是通过同一个硬件项来具体体现。单词第一、第二、以及第三等的使用不表示任何顺序。可将这些单词解释为名称。上述实施例中的步骤，除有特殊说明外，不应理解为对执行顺序的限定。It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In the element claim enumerating several means, several of these means may be embodied by the same item of hardware. The use of the words first, second, third, etc. does not indicate any order. These words can be interpreted as names. Unless otherwise specified, the steps in the above embodiments should not be understood as limiting the order of execution.

Claims

1.一种基于价值量表的文本价值计算方法，其特征在于，方法包括：1. A text value calculation method based on a value scale, characterized in that the method includes:

based on a preset value scale, traversing the keyword set and, for any keyword, querying the preset value scale to obtain the node keyword that matches the keyword; classifying the node keywords according to the level of the node to which they belong, to obtain matching node sets of different levels; wherein the preset value scale includes nodes of multiple preset levels; each node includes a node keyword; the nodes of multiple preset levels include core nodes, sub-core nodes, and peripheral nodes; each node further includes a node number, a node frequency, related nodes, and similar nodes; the matching node sets include a core node set, a sub-core node set, and a peripheral node set; the preset value scale: splitting a first text into multiple sentences, performing first word segmentation on the multiple sentences, and obtaining the part-of-speech information, grammatical dependency, and semantic dependency information of each first segmented word; the first text being any new text; extracting segmented words to be processed according to the part-of-speech information, grammatical dependency, and semantic dependency information of each first segmented word, and filtering the segmented words to be processed to obtain a set of segmented words to be processed; the filtering including stop-word filtering, number filtering, low-frequency person-name filtering, numeral and quantifier filtering, part-of-speech filtering, segmented-word part-of-speech filtering, and keyword filtering; extracting, based on a preset model, the segmented-word feature set of the set of segmented words to be processed, the core feature set of the core node keywords of the preset value scale, the sub-core feature set of the sub-core node keywords, and the peripheral feature set of the peripheral node keywords; calculating the core similarity of each segmented word in the set of segmented words to be processed from the segmented-word feature set, the core feature set, and the number of core node keywords; calculating the sub-core similarity of each segmented word in the set from the segmented-word feature set, the sub-core feature set, and the number of sub-core node keywords; calculating the peripheral similarity of each segmented word in the set from the segmented-word feature set, the peripheral feature set, and the number of peripheral node keywords; traversing the set of segmented words to be processed and, for any segmented word, comparing the core similarity of the segmented word with a preset core threshold; if the core similarity is greater than or equal to the preset core threshold, adding the segmented word to the preset value scale; if the core similarity is less than the preset core threshold, comparing the sub-core similarity of the segmented word with a preset sub-core threshold; if the sub-core similarity is greater than or equal to the preset sub-core threshold, adding the segmented word to the preset value scale; if the sub-core similarity is less than the preset sub-core threshold, comparing the peripheral similarity of the segmented word with a preset peripheral threshold; and if the peripheral similarity is greater than or equal to the preset peripheral threshold, adding the segmented word to the preset value scale;

according to the number and weight of the matching node sets of different levels, calculating the first product of the size of the core node set and the core node weight, the second product of the size of the sub-core node set and the sub-core node weight, the third product of the size of the peripheral node set and the peripheral node weight, and the fourth product of the size of the keyword set and the core node weight; wherein the core node weight is obtained by normalizing the first sum value over the node keywords in the core node set; the first sum value is obtained by accumulating, for each node keyword in the core node set, the product of its number of related nodes and similar nodes and the preset weight, plus the node frequency of the node keyword; the sub-core node weight is obtained by normalizing the second sum value over the node keywords in the sub-core node set; the second sum value is obtained by accumulating, for each node keyword in the sub-core node set, the product of its number of related nodes and similar nodes and the preset weight, plus the node frequency of the node keyword; the peripheral node weight is obtained by normalizing the third sum value over the node keywords in the peripheral node set; the third sum value is obtained by accumulating, for each node keyword in the peripheral node set, the product of its number of related nodes and similar nodes and the preset weight, plus the node frequency of the node keyword; accumulating the first product, the second product, and the third product, calculating the ratio of the accumulated result to the fourth product, and correcting the ratio according to a preset exponent to obtain the value data of the text.

2. The method according to claim 1, characterized in that the method further comprises:

if querying the preset value scale yields no node keyword matching the keyword, classifying the keyword into a non-value matched node set.

3. The method according to claim 2, characterized in that the method further comprises:

if the matched node set is the non-value matched node set, determining that the value data of the text is 0.
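
The handling of unmatched keywords in claims 2 and 3 can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: the value scale is modelled as a plain dictionary from node keyword to level name, and the names `classify_keywords`, `is_zero_value` and `NON_VALUE` are hypothetical.

```python
from collections import defaultdict

NON_VALUE = "non_value"  # hypothetical label for the non-value matched node set


def classify_keywords(keywords, value_scale):
    """Group keywords by the level of the node they match.

    value_scale is assumed to map a node keyword to its level, for example
    "core", "sub_core" or "peripheral". Keywords without a match are placed
    in the NON_VALUE set.
    """
    matched = defaultdict(list)
    for keyword in keywords:
        level = value_scale.get(keyword, NON_VALUE)
        matched[level].append(keyword)
    return dict(matched)


def is_zero_value(matched):
    """Return True when no keyword matched any node of the value scale."""
    return all(level == NON_VALUE for level in matched)
```

When `is_zero_value` returns True, the value data of the text is set to 0 and the weighted calculation can be skipped.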

4. The method according to claim 1, characterized in that performing word segmentation on the text to obtain a keyword set containing a plurality of keywords further comprises:

preprocessing the text, the preprocessing comprising format filtering and stop-word filtering;

processing the text according to punctuation marks to split the text into a plurality of sentences;
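
The preprocessing and sentence-splitting steps listed above can be sketched in Python. Everything specific here is an assumption made for illustration: the stop-word list, the treatment of "format filtering" as removal of markup tags and extra whitespace, the set of sentence-ending punctuation marks, and the whitespace tokenizer (Chinese text would need a dedicated word segmenter instead). For simplicity the sketch applies stop-word filtering to the tokens of each sentence.

```python
import re

STOP_WORDS = {"the", "a", "an", "of"}    # hypothetical stop-word list
_TAG = re.compile(r"<[^>]+>")             # markup tags removed by format filtering
_SENTENCE_END = re.compile(r"[。！？；!?;.]+")  # assumed sentence-ending punctuation


def strip_format(text):
    """Format filtering: drop markup tags and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", _TAG.sub(" ", text)).strip()


def split_sentences(text):
    """Split text into sentences at sentence-ending punctuation marks."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def remove_stop_words(tokens):
    """Stop-word filtering on an already tokenized sentence."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


def preprocess(text):
    """Return one list of candidate keywords per sentence."""
    sentences = split_sentences(strip_format(text))
    # Whitespace tokenization stands in for a real word segmenter.
    return [remove_stop_words(sentence.split()) for sentence in sentences]
```

Splitting on "." also breaks decimal numbers and abbreviations, so a production splitter would need a more careful rule.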

5. A text value calculation device based on a value scale, characterized in that the device comprises:

a value calculation module, adapted to calculate, according to the numbers and weights of the matched node sets of the different levels, a first product of the number of nodes in the core node set and a core node weight, a second product of the number of nodes in the sub-core node set and a sub-core node weight, a third product of the number of nodes in the peripheral node set and a peripheral node weight, and a fourth product of the number of keywords in the keyword set and the core node weight; wherein the core node weight is obtained by normalizing a first sum value over the node keywords in the core node set; the first sum value is obtained by accumulating, for each node keyword in the core node set, the sum of the node frequency of the node keyword and the product of the number of its related nodes and similar nodes and a preset weight; the sub-core node weight is obtained by normalizing a second sum value over the node keywords in the sub-core node set; the second sum value is obtained by accumulating, for each node keyword in the sub-core node set, the sum of the node frequency of the node keyword and the product of the number of its related nodes and similar nodes and the preset weight; the peripheral node weight is obtained by normalizing a third sum value over the node keywords in the peripheral node set; the third sum value is obtained by accumulating, for each node keyword in the peripheral node set, the sum of the node frequency of the node keyword and the product of the number of its related nodes and similar nodes and the preset weight; and to accumulate the first product, the second product and the third product, calculate the ratio of the accumulated result to the fourth product, and correct the ratio according to a preset index to obtain the value data of the text.
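
The level weights consumed by the value calculation module could be derived roughly as in the sketch below. The claim does not specify the normalization, so dividing each level's sum value by the total of the three sum values is an assumption, as are the use of a single preset weight for related and similar nodes together, its default of 0.5, and the dictionary layout of a node (`related`, `similar`, `frequency`).

```python
def level_sum(nodes, relation_weight=0.5):
    """Sum value of one level.

    For each matched node: (related count + similar count) * preset weight
    plus the node frequency, accumulated over all nodes of the level.
    """
    return sum(
        (node["related"] + node["similar"]) * relation_weight + node["frequency"]
        for node in nodes
    )


def level_weights(core, sub_core, peripheral, relation_weight=0.5):
    """Normalize the three sum values so that the weights add up to 1.

    Assumption: normalization divides each sum value by their total.
    """
    sums = {
        "core": level_sum(core, relation_weight),
        "sub_core": level_sum(sub_core, relation_weight),
        "peripheral": level_sum(peripheral, relation_weight),
    }
    total = sum(sums.values())
    if total == 0:
        return {name: 0.0 for name in sums}
    return {name: value / total for name, value in sums.items()}
```

A level with many well-connected, frequent nodes therefore receives a larger share of the total weight than a sparsely populated one.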

6. A computing device, characterized by comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the text value calculation method based on a value scale according to any one of claims 1-4.

7. A computer storage medium, characterized in that at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to perform the operations corresponding to the text value calculation method based on a value scale according to any one of claims 1-4.