CN115204519A


Info

Publication number: CN115204519A
Application number: CN202210972465.8A
Authority: CN
Inventors: 吕学强; 游新冬; 董志安; 滕尚志
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2022-08-17
Filing date: 2022-08-17
Publication date: 2022-10-18

Abstract

The invention relates to patent quality grade prediction in the field of natural language processing and mainly comprises the following steps: 1. identifying efficacy phrases with an extraction model that fuses multiple features; 2. extracting the subject words contained in the patent text with an ALBERT-BiLSTM model; 3. clustering the extracted efficacy phrases and subject words with the K-means algorithm and manually constructing a technical efficacy matrix to obtain the corresponding technical efficacy and technical scale; 4. quantifying the structured numerical information contained in a patent, individually or in combination, and combining it with the long text to obtain 132 evaluation indicators, training a transfer learning model on US patent data, and expanding the Chinese data set with active learning; 5. combining the technical efficacy matrix with the 132 indicators for transfer training and parameter updating to obtain the final prediction model. The invention effectively improves the accuracy of patent quality evaluation.

Description

Translated from Chinese

A method for predicting domain patent quality grades by fusing knowledge information

Technical Field

The invention relates to patent quality grade prediction in the field of natural language processing, and in particular to a method for extracting patent subject words and efficacy phrases and for constructing a technical efficacy matrix.

Background

A patent contains a title, an abstract, claims, and other content. Early assessments of patent quality mostly relied on single-dimension indicators such as technical, legal, or economic ones. However, a single dimension cannot give a full picture of a patent's quality, so experts and scholars began combining these three dimensions to derive more evaluation indicators. At present there is still no clear definition of patent quality evaluation indicators, in China or abroad, but it is certain that evaluating patent quality with multiple dimensions, levels, and indicators analyzes patent texts more comprehensively and accurately. Research is therefore no longer limited to three dimensions and is moving toward more, including strategic value, patent activity, demand level, stock value, and conversion rate. As research deepens, more and more experts no longer restrict themselves to quantitative indicators but combine them with the content of the patent text. The combination and selection of key indicators is thus becoming multi-dimensional and the standards of quality assessment more diverse; how to identify more precisely the indicators that affect assessment results is a topic that needs in-depth study.

In fact, the indicators for evaluating and predicting patent quality grades, and the related data, are difficult to quantify and generally hard to obtain. Evaluation relies heavily on a combination of manual statistics and subjective predictions by domain experts, which makes the predicted quality grades highly subjective and uncertain; because different experts judge from different perspectives, their results may also deviate from one another. With rapid technical progress, machine learning models can learn the features contained in patent texts on their own. Methods that incorporate machine learning for patent quality evaluation currently include the analytic hierarchy process, fuzzy comprehensive evaluation, logistic regression, and decision trees. In recent years, neural network and deep learning techniques such as transfer learning and active learning have also been applied to patent grade prediction; they improve prediction to some extent, but considerable room for improvement remains.

Summary of the Invention

To solve the above technical problems, the purpose of the invention is to propose a method for predicting domain patent quality grades by fusing knowledge information, building on the existing dimensions and on deep learning and transfer learning techniques.

The method for predicting domain patent quality grades by fusing knowledge information comprises the following steps:

1. Obtain the data required for the experiment, mainly by screening the sentences in patent abstracts that contain efficacy phrases, and train an extraction model that fuses multiple features (radical, Wubi encoding, word length, part of speech) to identify efficacy words in patent text.

2. Label the subject words in patent titles and abstracts in one pass with a self-built lexicon, followed by several rounds of manual proofreading, then train an extraction model based on ALBERT-BiLSTM to extract technical topics from patent text.

3. Cluster the extracted subject words and efficacy phrases with the K-means algorithm and, after further manual review and supplementation, construct a technical efficacy matrix from the patent texts. The scale values in the matrix for the new energy patent field are used in the subsequent grading of patent quality in that field.

4. Use the relatively mature US patent quality assessment model: translate US patents into Chinese for training and fine-tune the model with a small number of Chinese patent texts that carry quality labels. At the same time, quantify the long text and the numerical indicators, individually and in combination, into 132 indicators, and train to obtain the transfer learning model.

5. Take the knowledge information (the technical efficacy matrix) as a new dimension and combine it with the other 132 indicators for patent quality rating: all indicators are vectorized or normalized and the resulting vectors are concatenated. The prediction model is trained on top of the better-performing transfer-learning-based model, and patent quality grades are predicted on the test set.
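As a minimal sketch of the normalize-then-concatenate operation described in step 5, the following assumes min-max scaling, since the text does not say which normalization is used. The function names and the toy values are hypothetical and serve only as illustration.

```python
def min_max_normalize(column):
    """Scale one indicator column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def normalize_indicators(rows):
    """Normalize each indicator (column) across all patents (rows)."""
    columns = [min_max_normalize(col) for col in zip(*rows)]
    return [list(r) for r in zip(*columns)]


def concat_features(text_vec, indicator_vec, knowledge_vec):
    """Concatenate text, indicator, and knowledge vectors into one feature vector."""
    return list(text_vec) + list(indicator_vec) + list(knowledge_vec)


# Toy data: three patents with two numerical indicators each (assumed values).
indicators = [[1, 200], [3, 400], [5, 300]]
normalized = normalize_indicators(indicators)

# Feature vector of the first patent: a 2-dim text vector, its normalized
# indicators, and a 1-dim knowledge value taken from the efficacy matrix.
feature = concat_features([0.1, 0.2], normalized[0], [7.0])
```

In the method itself the indicator part would have 132 entries; two are used here only to keep the example readable.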

In step 1 of the method, the patent data in the Chinese patent database is first analyzed and corresponding crawling rules are designed; the patent data required for the experiment is obtained by searching for "new energy vehicles". After stop words are removed from the crawled content, the efficacy phrases in the text are annotated and tagged by feature category, covering four parts: radical, Wubi encoding, word length, and part of speech. The annotated patent text is preprocessed, and BERT is used for word vector training to learn the semantic features of the patent text. The vectorized patent data is used to train attention-based BiLSTM-CRF models, and the feature model that recognizes efficacy phrases best is selected for the extraction task.

In step 2 of the method, crawler technology is first used to obtain patent texts in the new energy vehicle field, and the patent titles and abstracts are selected for text preprocessing. The processed data is then passed through the ALBERT pre-trained model layer, which performs word embedding training on the input text. The vectorized data is fed into the BiLSTM layer for encoding, and the subject word extraction model is trained.

In step 3 of the method, the K-means clustering algorithm groups highly similar words into the same category, and the k cluster-center words obtained are taken as the key efficacy phrases and subject words. This finally yields 11 representative efficacy phrases. Through manual data annotation and statistics, the technical efficacy matrix is built with technical topics on the horizontal axis and efficacy phrases on the vertical axis. Each entry in the matrix expresses the scale of patent texts under that technical topic that achieve the corresponding efficacy.
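The matrix construction in step 3 can be sketched as a simple count table. This is an illustration under two assumptions not stated in the text: each patent has been labeled with one (topic, efficacy) pair, and the "scale" in a cell is the number of patents with that pair. The function name and the English labels are hypothetical.

```python
from collections import Counter


def build_efficacy_matrix(records, topics, effects):
    """Count patents for every (efficacy, topic) cell.

    records: iterable of (topic, effect) labels, one pair per patent.
    Rows follow `effects` (vertical axis), columns follow `topics`
    (horizontal axis).
    """
    counts = Counter(records)
    return [[counts[(topic, effect)] for topic in topics] for effect in effects]


# Hypothetical labels for illustration only.
topics = ["battery", "motor"]
effects = ["improve safety", "reduce cost"]
records = [
    ("battery", "improve safety"),
    ("battery", "improve safety"),
    ("motor", "reduce cost"),
]
matrix = build_efficacy_matrix(records, topics, effects)
```

A zero cell, such as motor patents that improve safety in this toy data, corresponds to a region with no patents for that topic and efficacy combination.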

In step 4 of the method, US patents serve as the source data and Chinese patents as the target data domain. The 3-14 layers that perform best are selected for transfer, and the model is fine-tuned with a small number of Chinese patent texts that carry quality labels. Long text information and numerical information are then quantified and combined as quality evaluation indicators, divided into six dimensions and 132 indicators. The transfer learning model uses BERT to vectorize the text and concatenates the text vector with the indicator vector to obtain a feature vector containing the patent text information, yielding the transfer model for automatic patent quality assessment.

In step 5 of the method, the constructed technical efficacy matrix is concatenated, as a new knowledge information dimension, with the other 132 vectors to form a new feature vector. Because US patent quality assessment is relatively mature, the concatenated vector is used as the input of the transfer-based quality assessment model. After a fully connected neural network with layers of 512, 128, and 32 nodes and a SoftMax layer, the quality grade prediction for the patent text is obtained, which effectively improves accuracy.

Compared with the prior art, the invention has the following beneficial effects. In previous research, the greatest difficulty in building a technical efficacy matrix was that the resulting matrix was of poor quality, which led to errors in patent analysis. The root cause lies in the accuracy with which technical topics and technical efficacy are judged and extracted. The invention uses the multi-feature recognition model and the extraction model based on ALBERT embeddings with BiLSTM to identify the efficacy words and subject words in patent text, which safeguards the quality and accuracy of the technical efficacy matrix in the experiment and lays a solid foundation for using it as a new evaluation dimension. Research shows that evaluating patent quality with multi-dimensional, multi-level indicators analyzes patent texts more comprehensively and accurately. The invention is the first to use the technical efficacy matrix as a knowledge information dimension for patent quality grade prediction. A larger number in the matrix means more technical efficacy results in that area, which to some degree represents how saturated invention and innovation in the area are. A smaller number means fewer patented inventions in the area: on the one hand this may mark a technical difficulty where breakthroughs are urgently needed; on the other hand it may mean that inventions there have no practical significance. Combining the innovativeness, technical scale, and future development reflected by the technical efficacy matrix with other indicators contributes positively to patent quality grade prediction.

Description of the Drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:

Figure 1 is a flow chart of the method for predicting domain patent quality grades by fusing knowledge information;

Figure 2 is a schematic comparison of the results of different models for patent quality grade prediction with those of the prediction model that fuses knowledge information.

Detailed Description

Exemplary embodiments of the invention are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and its scope conveyed completely to those skilled in the art.

Figure 1 is the flow chart of the method for predicting domain patent quality grades by fusing knowledge information, which comprises the following steps:

1. Obtain the data set required for the experiment and annotate it. The embedding layer consists of four features: radical, Wubi encoding, word length, and part of speech. Each of the four features is vectorized with Word2vec, and the vectors of the different features are concatenated to form the vector input on which the model is trained. A sentence of the input text is represented as the sequence $s = \{s_1, s_2, s_3, \dots, s_{n-1}, s_n\} \in V_s$, where $V_s$ denotes the character set, and is vectorized with Word2vec:

The vectorization process using the BERT model is as follows:

For each Chinese character in the patent text, the corresponding radical is looked up in a lexicon built from the Xinhua Dictionary (新编新华字典); if the character is not in the lexicon, an interface is called to search Baidu. The radicals obtained are then passed through Word2vec to get the mapping between radicals and vectors, as follows.

The official Wubi conversion table is used to obtain the Wubi code of each character in the patent text, and the codes are then vectorized with Word2vec.

Part-of-speech tagging is used to obtain the part of speech of each character in the patent text, and the part-of-speech codes are then vectorized with Word2vec:

Each character is also labeled with the number of characters in the efficacy word it belongs to, that is, the word length, and the resulting numeric codes are then vectorized with Word2vec:

With BiLSTM as the encoding layer and CRF as the decoding layer, training yields the multi-feature efficacy phrase recognition model.
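The feature side of step 1 can be sketched as follows: label every character with the length of the efficacy word that contains it, then concatenate the per-feature vectors of a character. This is an illustration only. The lookup tables stand in for trained Word2vec vectors, all values in them are placeholders, giving characters outside any efficacy word a length of 0 is an assumption, and the function names are hypothetical.

```python
def word_length_features(text, efficacy_phrases):
    """Label each character with the length of the efficacy phrase containing it.

    Characters outside every phrase get 0 (an assumption of this sketch).
    """
    lengths = [0] * len(text)
    for phrase in efficacy_phrases:
        start = text.find(phrase)
        while start != -1:
            for i in range(start, start + len(phrase)):
                lengths[i] = max(lengths[i], len(phrase))
            start = text.find(phrase, start + 1)
    return lengths


def embed_char(char_features, tables):
    """Concatenate the radical, Wubi, part-of-speech, and length vectors."""
    vector = []
    for name in ("radical", "wubi", "pos", "length"):
        vector.extend(tables[name][char_features[name]])
    return vector


# Placeholder lookup tables standing in for trained Word2vec vectors.
tables = {
    "radical": {"扌": [0.1, 0.2]},
    "wubi": {"rjgh": [0.3, 0.4]},
    "pos": {"v": [0.5]},
    "length": {5: [0.6]},
}

text = "提高安全性并降低成本"
lengths = word_length_features(text, ["提高安全性", "降低成本"])

# Features of the first character; the tag values are illustrative.
first_char = {"radical": "扌", "wubi": "rjgh", "pos": "v", "length": lengths[0]}
vector = embed_char(first_char, tables)
```

The concatenated vector of each character is what the BiLSTM encoder would receive as input, one vector per position in the sentence.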

2. Crawler technology is first used to obtain patent texts in the new energy vehicle field, and the patent titles and abstracts are selected for text preprocessing. The processed data is then passed through the ALBERT pre-trained model layer, which performs word embedding training on the input text. The vectorized data is fed into the BiLSTM layer for encoding, and training the BiLSTM model learns the semantic features and contextual information of the input samples.

3. K-means clustering proceeds as follows. The data is first given basic processing, and the number k is determined from the categories present in the data; k is the number of final clusters, and the elbow method and the Canopy algorithm are commonly used to choose it. Then k points are taken as the initial cluster centers, and each point is assigned to the class of its nearest center by computing distances, which completes the first clustering pass. Next, each new cluster center is computed as the mean of the coordinates of the points assigned to it. Euclidean distance, cosine similarity, or similar measures are commonly used for the point-to-center distance. This process is iterated, updating the centers each time, until the centers no longer change; at that point the algorithm has converged, the iteration ends, and clustering is complete. Manually annotated data is then used to build the technical efficacy matrix.
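The K-means steps above can be sketched in plain Python. This is a minimal illustration, assuming Euclidean distance and random initial centers drawn from the data; the function name, the seed, and the toy points are not from the source, and in practice the points would be word vectors of the extracted phrases.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (tuples of floats) into k groups with Lloyd's iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k data points as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return centers, clusters


# Toy 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, k=2)
```

Convergence here means the centers stopped changing, which gives a local optimum that can depend on the initial centers; running with several seeds and keeping the best result is a common safeguard.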

4. A comparative analysis of Chinese and US patent data summarizes the similarities and differences between the two countries' patents, and the quantitative indicators are aligned. A multi-task learning network trained on US patents that carry patent quality grade labels is then transferred to Chinese patents. The transfer involves three parts: selecting the part of the model to transfer, cross-language transfer, and expanding the data with active learning. Each data item is also processed in depth to obtain quantitative indicators related to patent quality. The experiment quantifies them along several main dimensions, including the time, quantity, technical, and legal dimensions and the inventor and agent dimension; these dimensions are quantified individually or in combination.
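The text does not say which active learning strategy is used to expand the Chinese data. As one common option, the sketch below uses least-confidence sampling: the unlabeled patents whose top predicted class probability is lowest are sent for manual labeling. The strategy, the function name, and the probabilities are all assumptions made for illustration.

```python
def select_for_labeling(probabilities, budget):
    """Return indices of the `budget` least-confident predictions.

    probabilities: one list of class probabilities per unlabeled sample.
    """
    confidence = [(max(p), i) for i, p in enumerate(probabilities)]
    confidence.sort()  # lowest top-class probability first
    return [i for _, i in confidence[:budget]]


# Hypothetical model outputs for three unlabeled patents, two grades each.
pool = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4]]
chosen = select_for_labeling(pool, budget=2)
```

The selected samples would be labeled by annotators, added to the Chinese training set, and the model fine-tuned again before the next selection round.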

5. The constructed technical efficacy matrix is first concatenated, as a new knowledge information dimension, with the other vectors to form a new feature vector. Because US patent quality assessment is relatively mature, the concatenated vector is used as the input of the transfer-based quality assessment model. After a fully connected neural network with layers of 512, 128, and 32 nodes and a SoftMax layer, the quality grade prediction for the patent text is obtained, which effectively improves accuracy.
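The classification head in step 5 can be sketched as a forward pass through dense layers of 512, 128, and 32 nodes followed by softmax. Only the layer sizes come from the text. The ReLU activation, the random untrained weights, the 16-dimensional input, and the four quality grades are assumptions chosen to keep the example small.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, biases, relu=True):
    """One fully connected layer, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases; a trained model would load real ones."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_head(n_in, n_classes, seed=0):
    """Stack layers of 512, 128, and 32 nodes plus an output layer."""
    rng = random.Random(seed)
    sizes = [n_in, 512, 128, 32, n_classes]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(x, layers):
    """Forward pass: ReLU on hidden layers, softmax on the output layer."""
    for i, (w, b) in enumerate(layers):
        x = dense(x, w, b, relu=(i < len(layers) - 1))
    return softmax(x)


layers = build_head(n_in=16, n_classes=4)
probs = predict([0.5] * 16, layers)
grade = probs.index(max(probs))  # index of the predicted quality grade
```

With untrained weights the probabilities carry no meaning; the sketch only shows how the concatenated feature vector flows through the layers to a grade distribution.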

Example 1:

The experimental results in this embodiment were obtained on a data set built by manually annotating the text passages of historical engineering consulting reports provided by an engineering consulting company, together with the extracted title data. The tests demonstrate the technical effect of the invention as follows:

Figure 2 is a schematic comparison of the results of different models for patent quality grade prediction with those of the prediction model that fuses knowledge information, where:

To make the prediction results of the models easier to compare and to read, the models are defined as follows:

Model_1: a model that uses a support vector machine (SVM) classifier to predict patent quality grades from the text vector together with the 132 quantitative indicators.

Model_2: a quality grade prediction model trained with BiLSTM, taking as input the concatenation of the text vector and the vector of the 132 quantitative indicators.

Model_3: a quality grade prediction model trained with BiLSTM, taking as input the concatenation of the text vector, the 132 quantitative indicators, and the vector of the knowledge mining dimension proposed in the experiment (efficacy phrases and technical subject words).

Model_4: a quality grade prediction model trained with BiLSTM, taking as input the concatenation of the text vector, the 132 quantitative indicators, and the vector of the knowledge information dimension proposed in the experiment (technical efficacy matrix).

Model_5: the transfer learning model trained from US patent text to Chinese patent text, taking as input the concatenation of the text vector and the vector of the 132 quantitative indicators.

Model_6: the transfer learning model trained from US patent text to Chinese patent text, taking as input the concatenation of the text vector, the 132 quantitative indicators, and the vector of the knowledge mining dimension proposed in the experiment (efficacy phrases and technical subject words).

Model_7: the transfer learning model trained from US patent text to Chinese patent text, taking as input the concatenation of the text vector, the 132 quantitative indicators, and the vector of the knowledge information dimension proposed in the experiment (technical efficacy matrix).

The knowledge information dimension, that is, the technical efficacy matrix, reflects the technical scale of the field to which a patent belongs and clearly shows the technical focus and development trends of that field. The matrix gives further insight into patent applications under the subcategories of a professional field and into the degree of patent innovation and saturation there. The experimental results in Figure 2 show that using the generated matrix as a new evaluation indicator, combined with the other key indicators, had a positive effect on model training and effectively improved the accuracy of the prediction task.

The above are only preferred embodiments of the invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the technical principle of the invention, and such improvements and modifications should also be regarded as falling within the scope of protection of the invention.

Claims


1. A method for predicting domain patent quality grades by fusing knowledge information, characterized by comprising the following steps:

(1) obtaining the data required for the experiment, mainly by screening the sentences in patent abstracts that contain efficacy phrases, and training an extraction model that fuses multiple features (radical, Wubi encoding, word length, part of speech) to identify efficacy words in patent text;

(2) labeling the subject words in patent titles and abstracts in one pass with a self-built lexicon, followed by several rounds of manual proofreading, and training an extraction model based on ALBERT-BiLSTM to extract technical topics from patent text;

(3) clustering the extracted subject words and efficacy phrases with the K-means algorithm and, after further manual review and supplementation, constructing a technical efficacy matrix from the patent texts, the scale values in the matrix for the new energy patent field being used in the subsequent grading of patent quality in that field;

(4) using the relatively mature US patent quality assessment model, translating US patents into Chinese for training and fine-tuning the model with a small number of Chinese patent texts that carry quality labels, while quantifying the long text and the numerical indicators, individually and in combination, into 132 indicators, and training to obtain the transfer learning model;

(5) taking the knowledge information (the technical efficacy matrix) as a new dimension and combining it with the other 132 indicators for patent quality rating, that is, vectorizing or normalizing all indicators and concatenating the resulting vectors, training the prediction model on top of the better-performing transfer-learning-based model, and predicting patent quality grades on the test set.

2. The method for predicting domain patent quality grades by fusing knowledge information according to claim 1, characterized in that: by analyzing the features of the efficacy phrases contained in patent text, the characteristic relationships of efficacy phrases in glyph structure, pronunciation, part-of-speech coloring, word roots, semantic relations, word length, and other aspects are summarized and distilled into radical features, Wubi encoding features, part-of-speech features, and word length; efficacy phrases are extracted by fusing these features, which lays the foundation for the subsequent construction of the technical efficacy matrix.

3. The method for predicting domain patent quality grades by fusing knowledge information according to claim 2, characterized in that: to address the large parameter count and slow training of subject word extraction models, ALBERT pre-training is used to vectorize the patent text, which improves model accuracy while using as few parameters as possible to raise model performance, allows broader practical application to extraction tasks, and lays the foundation for the subsequent construction of the technical efficacy matrix.

4. The method for predicting domain patent quality grades by fusing knowledge information according to claim 3, characterized in that: the technical efficacy matrix is built with technical subject words on the horizontal axis and efficacy phrases on the vertical axis; the data in the matrix can be used to analyze technical concentration points and blank points, where concentration points are the currently popular areas and blank points may be directions for future technical innovation; the knowledge information dimension is thereby introduced into the patent quality evaluation indicators to improve patent quality grade prediction.

5. The method for predicting domain patent quality grades by fusing knowledge information according to claim 4, characterized in that: because US patent quality assessment is relatively mature, transfer learning is used to improve Chinese patent quality assessment, and long text is combined with numerical quantitative indicators to further extract the relevant features in the patent data.

6. The method for predicting domain patent quality grades by fusing knowledge information according to claim 5, characterized in that: a new feature dimension, the knowledge information dimension (technical efficacy matrix), is added to patent quality assessment; the patent quality grade is judged from the perspective of patent technical scale and functional utility, and a quality grade prediction model is trained on the basis of the transfer learning model, with a marked improvement in the final prediction.