






This application claims priority to Chinese patent application No. 201910884825.7, entitled "Method, Apparatus, Device, and Storage Medium for Locally Optimizing Keywords", filed with the Chinese Patent Office on September 19, 2019, the entire contents of which are incorporated herein by reference.
This application relates to the field of big data technology, and in particular to a method, an apparatus, a server, and a computer-readable storage medium for extracting locally optimized keywords.
In natural language processing research, keywords represent the central idea of a text and play an important role in tasks such as text retrieval and text classification, so keyword extraction has attracted wide scholarly attention. Traditional keyword extraction methods based on statistical features focus too heavily on the attributes of individual word segments, such as part of speech, word frequency, and position, and overlook the overall central idea of the article. Most current keyword extraction algorithms therefore extend the traditional statistical-feature algorithms with features such as the association relationships between word segments to obtain the final keywords. Many researchers in China and abroad use tf-idf weighted word frequency to filter out word segments that occur in large numbers across the corpus, but this approach depends heavily on the size of the corpus and may push the estimated importance of a word segment away from its normal value. The inventor realized that keyword extraction methods based on complex networks do take the degree of association between word segments into account, but they focus too heavily on the "small world" property and ignore both "big world" influence and the central idea at the level of text content, which lowers the accuracy of keyword extraction.
Summary of the Invention
The main purpose of this application is to provide a method for extracting locally optimized keywords. It aims to solve the technical problem that prior-art keyword methods based only on statistical features focus too heavily on word-segment attributes such as part of speech, word frequency, and position while ignoring the overall central idea of the article, which makes the extracted keywords inaccurate.
To achieve the above purpose, this application provides a method for extracting locally optimized keywords, the method including:
receiving a text to be processed, and recognizing the characters in the title, the first paragraph, and the last paragraph of the text to be processed;
segmenting the characters in the title, the first paragraph, and the last paragraph based on a preset Chinese word segmentation system, obtaining a word-segment set for the title, the first paragraph, and the last paragraph, and updating the part of speech of each target word segment in the word-segment set to a keyword part of speech;
recording, by means of a part-of-speech score lookup table in the Chinese word segmentation system, the weight parameters corresponding to each target word segment in a preset hash table, where the weight parameters are a part-of-speech score and a word frequency;
traversing the text to be processed, obtaining the associated word segments of the target word segments and the parts of speech of the associated word segments, and recording the weight parameters of the associated word segments in the hash table;
extracting, according to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the parts of speech of the associated word segments, the five target word segments and/or associated word segments with the highest total scores as the keywords of the text to be processed.
In addition, to achieve the above purpose, this application further provides an apparatus for extracting locally optimized keywords, the apparatus including:
a recognition unit, configured to receive a text to be processed and recognize the characters in the title, the first paragraph, and the last paragraph of the text to be processed;
an update unit, configured to segment the characters in the title, the first paragraph, and the last paragraph based on a preset Chinese word segmentation system, obtain a word-segment set for the title, the first paragraph, and the last paragraph, and update the part of speech of each target word segment in the word-segment set to a keyword part of speech;
a first recording unit, configured to record, by means of a part-of-speech score lookup table in the Chinese word segmentation system, the weight parameters corresponding to each target word segment in a preset hash table, where the weight parameters are a part-of-speech score and a word frequency;
a second recording unit, configured to traverse the text to be processed, obtain the associated word segments of the target word segments and the parts of speech of the associated word segments, and record the weight parameters of the associated word segments in the hash table;
an extraction unit, configured to extract, according to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the parts of speech of the associated word segments, the five target word segments and/or associated word segments with the highest total scores as the keywords of the text to be processed.
In addition, to achieve the above purpose, this application further provides a server, the server including a memory, a processor, and a program for extracting locally optimized keywords that is stored in the memory and executable on the processor, where the program, when executed by the processor, implements the steps of the method for extracting locally optimized keywords described above.
In addition, to achieve the above purpose, this application further provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the method for extracting locally optimized keywords described above.
The method, apparatus, server, and computer-readable storage medium for extracting locally optimized keywords proposed in the embodiments of this application receive a text to be processed and recognize the characters in its title, first paragraph, and last paragraph; segment the characters in the title, first paragraph, and last paragraph based on a preset Chinese word segmentation system, obtain a word-segment set for the title, first paragraph, and last paragraph, and update the part of speech of each target word segment in the set to a keyword part of speech; record, by means of the part-of-speech score lookup table in the Chinese word segmentation system, the weight parameters corresponding to each target word segment in a preset hash table, where the weight parameters are a part-of-speech score and a word frequency; traverse the text to be processed, obtain the associated word segments of the target word segments and their parts of speech, and record the weight parameters of the associated word segments in the hash table; and extract, according to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the parts of speech of the associated word segments, the five target word segments and/or associated word segments with the highest total scores as the keywords of the text to be processed. Keywords are thus selected as the target or associated word segments with the highest total scores, based on the part-of-speech scores and word frequencies of the target word segments in the central idea and of their associated word segments, which reduces error and improves the accuracy of the text keywords.
FIG. 1 is a schematic structural diagram of a server in the hardware operating environment involved in the embodiments of this application;
FIG. 2 is a schematic flowchart of a first embodiment of the method for extracting locally optimized keywords of this application;
FIG. 3 is a schematic flowchart detailing step S10 in FIG. 2;
FIG. 4 is a schematic flowchart detailing step S20 in FIG. 2;
FIG. 5 is a schematic flowchart detailing step S30 in FIG. 2;
FIG. 6 is a schematic flowchart of a second embodiment of the method for extracting locally optimized keywords of this application;
FIG. 7 is a schematic flowchart detailing step S50 in FIG. 2.
It should be understood that the specific embodiments described here are intended only to explain this application and not to limit it.
The main solution of the embodiments of this application is as follows: receive a text to be processed and recognize the characters in its title, first paragraph, and last paragraph; segment the characters in the title, first paragraph, and last paragraph based on a preset Chinese word segmentation system, obtain the word-segment set of the title, first paragraph, and last paragraph, and update the part of speech of each target word segment in the word-segment set to a keyword part of speech; record, by means of the part-of-speech score lookup table in the Chinese word segmentation system, the weight parameters corresponding to each target word segment in a preset hash table, where the weight parameters are a part-of-speech score and a word frequency; traverse the text to be processed, obtain the associated word segments of the target word segments and their parts of speech, and record the weight parameters of the associated word segments in the hash table; and extract, according to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the parts of speech of the associated word segments, the five target word segments and/or associated word segments with the highest total scores as the keywords of the text to be processed.
Prior-art keyword methods based on statistical features focus too heavily on word-segment attributes such as part of speech, word frequency, and position, and ignore the overall central idea of the article, which leads to the technical problem of inaccurate keywords.
This application provides a solution: using the part-of-speech scores and word frequencies of the target word segments in the central idea, together with the part-of-speech scores and word frequencies of their associated word segments, the target or associated word segments with the highest total scores are taken as keywords, which reduces error and improves the accuracy of the text keywords.
FIG. 1 is a schematic structural diagram of a server in the hardware operating environment involved in the embodiments of this application.
The terminal in the embodiments of this application is a server.
As shown in FIG. 1, the terminal may include a processor 1001 (for example, a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 provides the connections and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be high-speed RAM or non-volatile memory, such as disk storage. The memory 1005 may optionally also be a storage device separate from the processor 1001.
Those skilled in the art will understand that the terminal structure shown in FIG. 1 does not limit the terminal, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in FIG. 1, the memory 1005, as a computer storage medium, may contain an operating system, a network communication module, a user interface module, and a program for extracting locally optimized keywords.
In the terminal shown in FIG. 1, the network interface 1004 is mainly used to connect to a back-end server and exchange data with it; the user interface 1003 is mainly used to connect to a client (user side) and exchange data with it; and the processor 1001 may be used to invoke the program for extracting locally optimized keywords stored in the memory 1005 and perform the following operations:
receiving a text to be processed, and recognizing the characters in the title, the first paragraph, and the last paragraph of the text to be processed;
segmenting the characters in the title, the first paragraph, and the last paragraph based on a preset Chinese word segmentation system, obtaining a word-segment set for the title, the first paragraph, and the last paragraph, and updating the part of speech of each target word segment in the word-segment set to a keyword part of speech;
recording, by means of the part-of-speech score lookup table in the Chinese word segmentation system, the weight parameters corresponding to each target word segment in a preset hash table, where the weight parameters are a part-of-speech score and a word frequency;
traversing the text to be processed, obtaining the associated word segments of the target word segments and the parts of speech of the associated word segments, and recording the weight parameters of the associated word segments in the hash table;
extracting, according to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the parts of speech of the associated word segments, the five target word segments and/or associated word segments with the highest total scores as the keywords of the text to be processed.
Further, the processor 1001 may invoke the program for extracting locally optimized keywords stored in the memory 1005 and also perform the following operations:
receiving the text to be processed, and obtaining the positions of the space characters in the text to be processed and the number N of space characters, where N is greater than 3;
taking the characters between the first space-character position and the second space-character position as the title of the text to be processed, taking the characters between the second space-character position and the third space-character position as the first paragraph of the text to be processed, and taking the characters between the N-(N-1) space-character position and the N-th space-character position as the last paragraph of the text to be processed;
invoking a preset character recognition program to recognize the characters in the title, the first paragraph, and the last paragraph.
Further, the processor 1001 may invoke the program for extracting locally optimized keywords stored in the memory 1005 and also perform the following operations:
when the characters in the title, the first paragraph, and the last paragraph have been recognized, starting the preset Chinese word segmentation system to classify those characters by part of speech as noun, verb, adjective, preposition, punctuation, measure word, or new word;
obtaining, from the part-of-speech score lookup table in the Chinese word segmentation system, the part-of-speech score of the characters classified as noun, verb, adjective, preposition, punctuation, measure word, or new word, and determining the characters whose part-of-speech score is greater than 0 to be target word segments;
collecting the target word segments into a word-segment set, and marking the part of speech of each target word segment in the set as the keyword part of speech.
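The filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score values in `POS_SCORES`, the tag names, and the function `select_target_segments` are hypothetical, since the source does not give the contents of the part-of-speech score lookup table.

```python
# Hypothetical part-of-speech score table; the source does not list actual values.
POS_SCORES = {
    "noun": 3.0,
    "verb": 2.0,
    "adjective": 1.5,
    "new_word": 3.0,
    "measure_word": 0.0,
    "preposition": 0.0,
    "punctuation": 0.0,
}


def select_target_segments(tagged_segments):
    """Keep segments whose part-of-speech score is > 0 and mark them as keywords.

    tagged_segments is a list of (word, part_of_speech) pairs.
    The original tag is kept so that its score can still be looked up later.
    """
    targets = {}
    for word, pos in tagged_segments:
        if POS_SCORES.get(pos, 0.0) > 0:
            targets[word] = {"pos": "keyword", "source_pos": pos}
    return targets


tagged = [
    ("关键词", "noun"),
    ("的", "preposition"),
    ("提取", "verb"),
    ("。", "punctuation"),
]
targets = select_target_segments(tagged)
print(sorted(targets))
```

Segments tagged as preposition or punctuation score 0 in this assumed table, so they never enter the word-segment set.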
Further, the processor 1001 may invoke the program for extracting locally optimized keywords stored in the memory 1005 and also perform the following operations:
retrieving the part-of-speech score lookup table in the preset Chinese word segmentation system, and obtaining the score corresponding to the keyword part of speech in that table;
using each target word segment in turn as a search condition, looking up the word frequency of each target word segment in the title, the first paragraph, and the last paragraph, and recording the score and the word frequency of each target word segment in the hash table.
Further, the processor 1001 may invoke the program for extracting locally optimized keywords stored in the memory 1005 and also perform the following operations:
traversing the text to be processed with the preset character recognition program to recognize the characters in the text, and splitting those characters into multiple word segments with the preset Chinese word segmentation system;
extracting a first word segment from the text to be processed, and judging whether the first word segment is a target word segment in the word-segment set;
when the first word segment is a target word segment in the word-segment set, determining that the second word segment immediately before it and the third word segment immediately after it are associated word segments of the target word segment, and obtaining the part of speech and word frequency of the associated word segments;
obtaining the part-of-speech score corresponding to each associated word segment by looking it up in the part-of-speech score lookup table in the Chinese word segmentation system, and recording the part-of-speech score and word frequency of the associated word segment in the hash table.
Further, the processor 1001 may invoke the program for extracting locally optimized keywords stored in the memory 1005 and also perform the following operations:
when the first word segment is not a target word segment in the word-segment set, judging whether the first word segment is an associated word segment of a target word segment;
when the first word segment is determined to be an associated word segment of the target word segment, recording the part of speech and word frequency of the first word segment in the hash table.
Further, the processor 1001 may invoke the program for extracting locally optimized keywords stored in the memory 1005 and also perform the following operations:
obtaining a preset calculation rule, and calculating the total score of each target word segment and associated word segment in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;
sorting the total scores in the hash table in descending or ascending order, extracting the five target word segments and/or associated word segments with the highest total scores, and taking them as the keywords of the text to be processed.
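As a minimal sketch of the calculation rule just described (total score = word frequency × part-of-speech score, then the top five), the following Python uses a plain `dict` as the hash table. The function name `top_keywords`, the table layout, and the sample values are assumptions made for illustration only.

```python
def top_keywords(table, k=5):
    """Rank words by total score and return the k best (word, total) pairs.

    table maps each word to a (part_of_speech_score, word_frequency) pair.
    """
    totals = {word: score * freq for word, (score, freq) in table.items()}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Illustrative values only; none of these numbers come from the source.
table = {
    "keyword": (3.0, 4),   # 12.0
    "extract": (2.0, 3),   # 6.0
    "text": (3.0, 5),      # 15.0
    "local": (1.5, 3),     # 4.5
    "server": (3.0, 1),    # 3.0
    "hash": (3.0, 3),      # 9.0
    "fast": (1.5, 1),      # 1.5
}

for word, total in top_keywords(table):
    print(word, total)
```

Sorting in descending order and slicing the first five entries is equivalent to sorting in ascending order and taking the last five, which is why the text allows either direction.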
Referring to FIG. 2, in a first embodiment of the method for extracting locally optimized keywords of this application, the method includes:
Step S10: receive a text to be processed, and recognize the characters in the title, the first paragraph, and the last paragraph of the text to be processed.
When the server receives the text to be processed from the terminal, it determines the positions of the title, the first paragraph, and the last paragraph of the text. Specifically, once the server has obtained the text to be processed, the title is usually centered on the first line of the text, although it may instead sit on the line above some paragraph, and title characters are usually in bold. The first paragraph usually begins on the second line of the text and is usually preceded by the first space character (an indent of two blank characters); the content between the first space character on the second line and the second space is taken as the first paragraph of the text to be processed. The last paragraph lies between the last character and the second space on the second line. The server obtains the positions of the spaces that precede characters in the text to be processed and thereby determines the positions of the first paragraph and the last paragraph. It then invokes character recognition software, scans the text to be processed, and obtains the characters in the title, the first paragraph, and the last paragraph.
Step S20: based on a preset Chinese word segmentation system, segment the characters in the title, the first paragraph, and the last paragraph, obtain the word-segment set of the title, the first paragraph, and the last paragraph, and update the part of speech of each target word segment in the word-segment set to a keyword part of speech.
Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words. It is the foundation of text mining: successfully segmenting an input passage of Chinese makes it possible to identify the meaning of its sentences automatically. All words are stored in the Chinese word segmentation system; the text to be processed is scanned, all possible words are looked up, and the system then decides which words to output. For example, for the text to be processed 我是学生 ("I am a student"), the resulting words are 我/是/学生 (I / am / student). The server invokes the preset Chinese word segmentation system and uses it to segment the characters in the title, the first paragraph, and the last paragraph of the text to be processed. It reads the resulting word segments and collects them to obtain the word-segment set of the title, the first paragraph, and the last paragraph. Each word segment in the set is taken as a target word segment, and its part of speech is marked as the keyword part of speech.
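To make the dictionary lookup concrete, here is a small Python sketch that segments the example sentence with forward maximum matching. The source does not say which matching strategy the preset Chinese word segmentation system uses, so forward maximum matching is a substituted, well-known technique chosen for illustration; the function name `forward_max_match` and the toy dictionary are likewise assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Split text into words by always taking the longest dictionary match.

    Characters that start no dictionary word are emitted as single-character
    segments so that the scan always makes progress.
    """
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                segments.append(candidate)
                i += size
                break
    return segments


toy_dictionary = {"我", "是", "学生"}
print("/".join(forward_max_match("我是学生", toy_dictionary)))
```

A production segmenter would also assign a part of speech to each word, which this sketch leaves out.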
Step S30: by means of the part-of-speech score lookup table in the Chinese word segmentation system, record the weight parameters corresponding to each target word segment in a preset hash table, where the weight parameters are a part-of-speech score and a word frequency.
When the server has obtained the word-segment set, it retrieves the part-of-speech score table in the Chinese word segmentation system, obtains the part of speech of each target word segment in the set based on the Chinese word segmentation system, looks up the score corresponding to each target word segment in the part-of-speech score table, takes that score as a weight parameter of the target word segment, and records it in the hash table.
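Step S30 can be pictured as filling a dictionary keyed by word, as in the Python sketch below. A Python `dict` is a hash table, so it stands in for the preset hash table here; the score values, tag names, and the function `record_weights` are hypothetical, because the source does not disclose the lookup table's contents.

```python
from collections import Counter

# Hypothetical score table; real values are not given in the source.
POS_SCORES = {"noun": 3.0, "verb": 2.0, "adjective": 1.5}


def record_weights(tagged_targets, pos_scores):
    """Record the part-of-speech score and word frequency of each target word.

    tagged_targets lists every occurrence of a target word in the title,
    first paragraph, and last paragraph as a (word, part_of_speech) pair.
    """
    frequency = Counter(word for word, _ in tagged_targets)
    table = {}
    for word, pos in tagged_targets:
        table[word] = {
            "pos_score": pos_scores.get(pos, 0.0),
            "freq": frequency[word],
        }
    return table


occurrences = [("关键词", "noun"), ("提取", "verb"), ("关键词", "noun")]
weights = record_weights(occurrences, POS_SCORES)
print(weights)
```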
Step S40: traverse the text to be processed, obtain the associated word segments of the target word segments and the parts of speech of the associated word segments, and record the weight parameters of the associated word segments in the hash table.
The server then traverses the text to be processed. Specifically, the server invokes character recognition software to traverse the text, recognizes all of its characters, and segments the recognized characters based on the preset Chinese word segmentation system. Each word segment obtained from the text is matched against the target word segments in the word-segment set. When a word segment is a target word segment, the server records how often it occurs, takes the word segments immediately before and after it as associated word segments, records the word frequency of those associated word segments, and performs step S30. When the word segment is not a target word segment, the server moves on to match the next word segment, until all word segments in the text to be processed have been matched.
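The traversal can be sketched in Python as a single pass over the segmented text. Two points here are assumptions rather than statements from the source: a neighbor that is itself a target word segment is not counted again as an associated word segment, and the frequency of an associated word segment is counted as the number of times it appears next to a target. The function name `collect_associated` is hypothetical.

```python
from collections import Counter


def collect_associated(segments, targets):
    """Count target occurrences and the neighbors found around them.

    segments is the fully segmented text; targets is the set of target words.
    Returns (target_freq, associated_freq) as two Counter objects.
    """
    target_freq = Counter()
    associated_freq = Counter()
    for i, word in enumerate(segments):
        if word not in targets:
            continue  # not a target: move on to the next word segment
        target_freq[word] += 1
        for j in (i - 1, i + 1):  # the segments just before and just after
            if 0 <= j < len(segments) and segments[j] not in targets:
                associated_freq[segments[j]] += 1
    return target_freq, associated_freq


words = ["我", "是", "学生", "学生", "爱", "学习"]
target_freq, associated_freq = collect_associated(words, {"学生"})
print(dict(target_freq), dict(associated_freq))
```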
Step S50: according to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the parts of speech of the associated word segments, extract the five target word segments and/or associated word segments with the highest total scores as the keywords of the text to be processed.
After the server has matched all word segments in the text to be processed, it sorts the weight parameters recorded in the hash table for each keyword and each associated word segment in descending order, extracts the keywords corresponding to the five highest weight parameters, and determines them to be the target keywords of the text to be processed.
In this embodiment, the title, first paragraph, and last paragraph of the text are taken to carry its central idea. They are analyzed and segmented to obtain the word frequency and part of speech of multiple target word segments. The parts of speech and word frequencies of the associated word segments of those target word segments in the text to be processed are then obtained, giving the total score of each target word segment and associated word segment. Based on the part-of-speech scores and word frequencies of the target word segments in the central idea and of their associated word segments, the target or associated word segments with the highest total scores are taken as keywords, which reduces error and improves the accuracy of the text keywords.
Further, referring to FIG. 3, FIG. 3 shows a second embodiment of the method for extracting locally optimized keywords of this application. Based on the embodiment shown in FIG. 2, step S10 includes:
Step S11, receiving the text to be processed, and obtaining the positions of the space characters in the text and the number N of space characters, where N is greater than 3;
Step S12, taking the characters between the first and second space character positions as the title of the text to be processed, the characters between the second and third space character positions as the first paragraph, and the characters between the (N-1)-th and N-th space character positions as the last paragraph;
Step S13, invoking a preset character recognition program to recognize the characters in the title, first paragraph, and last paragraph.
After receiving the text to be processed sent by the terminal, the server obtains the positions of the space characters in the text and the number N of space characters. Specifically, the server receives the text to be processed, scans it, locates the blank in each line, and records the position of each blank and the total number N. The characters between the first and second blank positions are taken as the title of the text; the title is generally on the first line, and its first character is generally preceded by two blank characters. The characters between the second and third blank positions are taken as the first paragraph. The characters between the (N-1)-th and N-th blank positions are taken as the last paragraph. If, for example, the last paragraph ends not with a blank character but with a special symbol such as "。", "!", or "?", that symbol is treated as a blank character. The server then invokes the preset character recognition software to recognize the title, first paragraph, and last paragraph and obtain all of the characters in them.
In this embodiment, the number and positions of the space characters in the text to be processed are used to split the text quickly into a title, a first paragraph, and a last paragraph, and a character recognition program then obtains the characters in each of them.
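The position-based split of steps S11 to S12 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `split_sections` is hypothetical, a newline is assumed as the delimiter standing in for the "space character" of the text, and the last paragraph is assumed to lie between the last two delimiter positions.

```python
def split_sections(text, delimiter="\n"):
    """Split text into (title, first paragraph, last paragraph) by delimiter positions.

    Assumption: `delimiter` is a single character that plays the role of the
    "space character" in step S11; the source does not fix a concrete character.
    """
    # Step S11: record every delimiter position and the count N.
    positions = [i for i, ch in enumerate(text) if ch == delimiter]
    n = len(positions)
    if n <= 3:
        raise ValueError("the number of delimiters N must be greater than 3")

    # Step S12: slice between consecutive delimiter positions.
    title = text[positions[0] + 1:positions[1]]
    first_paragraph = text[positions[1] + 1:positions[2]]
    # Assumed reading: the last paragraph sits between the last two positions.
    last_paragraph = text[positions[n - 2] + 1:positions[n - 1]]
    return title, first_paragraph, last_paragraph


if __name__ == "__main__":
    sample = "\nTitle\nFirst para\nMiddle\nLast para\n"
    print(split_sections(sample))
```

Raising an error when N is not greater than 3 mirrors the condition in step S11; a production system might instead fall back to treating the whole text as one section.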
Referring to FIG. 4, FIG. 4 shows a third embodiment of the method for extracting locally optimized keywords of this application. Based on the embodiment shown in FIG. 2, step S20 includes:
Step S21, when the characters in the title, first paragraph, and last paragraph have been recognized, starting the preset Chinese word segmentation system to classify those characters by part of speech as nouns, verbs, adjectives, prepositions, punctuation marks, quantifiers, or new words;
Step S22, looking up, in the part-of-speech score table of the Chinese word segmentation system, the part-of-speech score of each segmented word whose part of speech is noun, verb, adjective, preposition, punctuation, quantifier, or new word, and determining the segmented words whose part-of-speech score is greater than 0 to be target words;
Step S23, collecting the target words into a word set, and marking the part of speech of each target word in the word set as the keyword part of speech.
Once all characters in the title, first paragraph, and last paragraph of the text to be processed have been recognized, the server starts the preset Chinese word segmentation system, which segments the recognized characters. Specifically, the Chinese word segmentation system stores nouns, verbs, adjectives, prepositions, punctuation marks, quantifiers, and new words, and matches the obtained characters against these stored entries. For example, it first tries to match a single character; if that match fails, it tries to match two characters, and so on until a match succeeds. The server obtains the nouns, verbs, adjectives, prepositions, punctuation marks, quantifiers, and new words that the Chinese word segmentation system segmented from the title, first paragraph, and last paragraph, looks up the part-of-speech score of each in the part-of-speech score table of the Chinese word segmentation system, and determines those whose part-of-speech score is greater than 0 to be target words. The target words are collected into a word set with duplicates removed (for example, if the same noun occurs twice, only one copy is kept). The part of speech of each target word in the word set is then updated: whatever its original part of speech (noun, verb, adjective, preposition, punctuation, quantifier, new word, and so on), it is marked as the keyword part of speech.
In this embodiment, the preset Chinese word segmentation system segments the title, first paragraph, and last paragraph into separate segmented words, the part-of-speech score of each is obtained from the part-of-speech score table, and the segmented words with a score greater than 0 are determined to be target words whose part of speech is the keyword part of speech. The target words in the title, first paragraph, and last paragraph are thereby extracted quickly and accurately.
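Steps S22 to S23 amount to a filter followed by de-duplication and relabeling. The sketch below assumes the segmenter has already produced `(word, part_of_speech)` pairs; the function name, the tag strings, and every value in `POS_SCORES` are hypothetical, since the score table itself is not reproduced in this text.

```python
# Hypothetical part-of-speech scores for illustration only.
POS_SCORES = {
    "noun": 2.0,
    "verb": 1.5,
    "adjective": 1.0,
    "new_word": 2.0,
    "quantifier": 0.5,
    "preposition": 0.0,
    "punctuation": 0.0,
}


def select_target_words(tagged_tokens, pos_scores=POS_SCORES):
    """Return {word: "keyword"} for every distinct word whose POS score is > 0."""
    targets = {}
    for word, pos in tagged_tokens:
        # Step S22: keep only words whose part-of-speech score exceeds 0.
        if pos_scores.get(pos, 0.0) > 0 and word not in targets:
            # Step S23: one copy per word, relabeled with the keyword part of speech.
            targets[word] = "keyword"
    return targets


if __name__ == "__main__":
    tokens = [("network", "noun"), ("extract", "verb"), ("of", "preposition"),
              ("network", "noun"), (",", "punctuation")]
    print(select_target_words(tokens))
```

A dictionary is used for the word set so that duplicates collapse automatically while the first-seen order is kept.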
Referring to FIG. 5, FIG. 5 shows a fourth embodiment of the method for extracting locally optimized keywords of this application. Based on the embodiment shown in FIG. 2, step S30 includes:
Step S31, retrieving the part-of-speech score table from the preset Chinese word segmentation system, and obtaining the score value corresponding to the keyword part of speech in that table;
Step S32, using each target word as a search condition, looking up the word frequency of each target word in the title, first paragraph, and last paragraph, and recording the score value and word frequency of each target word in the hash table.
The server retrieves the part-of-speech score table from the preset Chinese word segmentation system. The table records score values for parts of speech such as noun, verb, adjective, preposition, punctuation, quantifier, keyword, and new word. The table is as follows:
According to the part-of-speech score table, the score value corresponding to the keyword part of speech is 3.0. The server searches the title, first paragraph, and last paragraph for each target word in the word set to obtain its word frequency, and records each target word's word frequency together with the keyword score value in the hash table.
In this embodiment, the part-of-speech score of each target word is obtained from the part-of-speech score table, the word frequency of each target word in the title, first paragraph, and last paragraph is obtained by lookup, and both are recorded in the hash table, so that the word frequency and part of speech of each target word in the title, first paragraph, and last paragraph are obtained quickly.
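Steps S31 to S32 can be sketched with a Python dictionary standing in for the hash table. The score 3.0 for the keyword part of speech comes from the description above; the function name, the record layout (`"score"` and `"freq"` keys), and the assumption that the three sections are already tokenized are illustrative choices.

```python
from collections import Counter

KEYWORD_POS_SCORE = 3.0  # score of the keyword part of speech stated in the description


def record_target_words(section_tokens, target_words):
    """Build a hash table {word: {"score": ..., "freq": ...}} for the target words.

    `section_tokens` is assumed to be the segmented words of the title,
    first paragraph, and last paragraph concatenated into one list.
    """
    # Count only occurrences of target words in the three sections.
    counts = Counter(tok for tok in section_tokens if tok in target_words)
    return {
        word: {"score": KEYWORD_POS_SCORE, "freq": counts[word]}
        for word in target_words
    }


if __name__ == "__main__":
    tokens = ["network", "keyword", "network", "of"]
    print(record_target_words(tokens, {"network", "keyword"}))
```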
Referring to FIG. 6, FIG. 6 shows a fifth embodiment of the method for extracting locally optimized keywords of this application. Based on the embodiment shown in FIG. 2, step S40 includes:
Step S41, traversing the text to be processed with a preset character recognition program to recognize the characters in the text, the preset Chinese word segmentation system then segmenting those characters into multiple segmented words;
Step S42, extracting a first segmented word from the text to be processed, and determining whether the first segmented word is a target word in the word set;
Step S43, when the first segmented word is a target word in the word set, determining the second segmented word immediately before it and the third segmented word immediately after it to be associated words of the target word, and obtaining the part of speech and word frequency of the associated words;
Step S44, looking up the part-of-speech score table of the Chinese word segmentation system to obtain the part-of-speech score of each associated word, and recording the part-of-speech score and word frequency of each associated word in the hash table.
The preset character recognition software is started to traverse the text to be processed and recognize its characters, and the preset Chinese word segmentation system segments those characters into multiple segmented words. A first segmented word is extracted from the text, and it is determined whether the first segmented word is a target word in the word set. When it is, the second segmented word before it and the third segmented word after it are read. Specifically, the server obtains the positions of the segmented words produced by the Chinese word segmentation system and extracts the first segmented word; when the first segmented word is a target word, the server reads the part of speech and word frequency of the second and third segmented words, looks up their parts of speech in the part-of-speech score table of the Chinese word segmentation system to obtain the corresponding part-of-speech scores, and records the part-of-speech score and word frequency of each associated word in the hash table. When the second segmented word before the first segmented word, or the third segmented word after it, is a blank character or a special symbol, that segmented word is not read and the next segmented word is obtained instead.
When the server determines that the first segmented word is not a target word in the word set, it determines whether the first segmented word is an associated word of a target word. Specifically, the characters of the first segmented word are compared with those of the target words; if they differ, they are compared with the characters of the associated words of the target words. If the characters of the first segmented word match those of an associated word, the part of speech and word frequency of the first segmented word are recorded in the hash table, with the word frequency recorded as one occurrence.
In this embodiment, the preset character recognition software traverses the text to be processed and recognizes its characters, and the preset Chinese word segmentation system segments those characters into multiple segmented words. A first segmented word is extracted and checked against the target words in the word set; when it is a target word, the second segmented word before it and the third segmented word after it are read. The associated words of the target words in the text to be processed are thereby obtained quickly.
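The neighbor lookup of steps S42 to S44 can be sketched as a single pass over the segmented text. This is an illustration under stated assumptions: the input is a list of `(word, part_of_speech)` pairs, the `skip` set of blank and special symbols is hypothetical, the scores passed in are placeholders, and the frequency recorded here is simplified to the number of times a word appears next to a target word.

```python
def record_associated_words(tagged_tokens, target_words, pos_scores,
                            skip=frozenset({" ", "。", "!", "?", ","})):
    """Return {associated_word: {"score": ..., "freq": ...}} for neighbors of targets."""
    table = {}
    for i, (word, _pos) in enumerate(tagged_tokens):
        # Step S42: only target words trigger a neighbor lookup.
        if word not in target_words:
            continue
        # Step S43: inspect the word immediately before and immediately after.
        for j in (i - 1, i + 1):
            if j < 0 or j >= len(tagged_tokens):
                continue
            neighbor, neighbor_pos = tagged_tokens[j]
            # Blank characters and special symbols are not read.
            if neighbor in skip or neighbor.strip() == "":
                continue
            # Step S44: record the part-of-speech score and count the occurrence.
            entry = table.setdefault(
                neighbor, {"score": pos_scores.get(neighbor_pos, 0.0), "freq": 0})
            entry["freq"] += 1
    return table


if __name__ == "__main__":
    tokens = [("deep", "adjective"), ("network", "noun"), ("。", "punctuation"),
              ("train", "verb"), ("network", "noun"), ("fast", "adjective")]
    print(record_associated_words(tokens, {"network"},
                                  {"adjective": 1.0, "verb": 1.5}))
```

The associated words are returned in their own table here; merging them into the same hash table as the target words, and handling a neighbor that is itself a target word, are left open because the description does not specify them.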
Referring to FIG. 7, FIG. 7 shows a seventh embodiment of the method for extracting locally optimized keywords of this application. Based on the embodiment shown in FIG. 2, after step S50 the method further includes:
Step S51, obtaining a preset calculation rule, and calculating the total score of each target word and associated word in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;
Step S52, sorting the total scores in the hash table in descending or ascending order, extracting the five target words and/or associated words with the highest total scores, and taking them as the keywords of the text to be processed.
The server obtains the preset calculation rule and uses it to calculate the total score of each target word and associated word in the hash table. Specifically, for any target word, the server obtains its word frequency (the number of times the target word occurs in the text to be processed) and its part-of-speech score, and multiplies the two to obtain the total score of that target word. After the total scores of all target words and associated words in the hash table have been calculated, they are sorted in descending or ascending order, and the five target words or associated words with the largest total scores are extracted as the keywords of the text to be processed.
In this embodiment, the server obtains the preset calculation rule, uses it to calculate the total score of each target word and associated word in the hash table, sorts the total scores in descending or ascending order, and extracts the five target words or associated words with the largest total scores as the keywords of the text to be processed. This reduces error and improves the accuracy of the extracted keywords.
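The scoring rule of steps S51 to S52 (total score equals word frequency times part-of-speech score, then take the top five) is short enough to show directly. The function name and the `{"score", "freq"}` record layout are assumptions carried over for illustration; tie-breaking between equal totals is not specified in the description and simply follows Python's stable sort here.

```python
def top_keywords(table, k=5):
    """Rank hash-table entries by freq * score and return the k best words."""
    # Step S51: total score = word frequency x part-of-speech score.
    totals = {word: entry["freq"] * entry["score"] for word, entry in table.items()}
    # Step S52: sort in descending order of total score and keep the first k.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _total in ranked[:k]]


if __name__ == "__main__":
    table = {
        "a": {"freq": 3, "score": 3.0},  # total 9.0
        "b": {"freq": 1, "score": 3.0},  # total 3.0
        "c": {"freq": 4, "score": 1.0},  # total 4.0
        "d": {"freq": 5, "score": 1.5},  # total 7.5
        "e": {"freq": 1, "score": 1.0},  # total 1.0
        "f": {"freq": 2, "score": 3.0},  # total 6.0
    }
    print(top_keywords(table))
```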
In addition, an embodiment of this application further provides an apparatus for extracting locally optimized keywords, the apparatus including:
a recognition unit, configured to receive the text to be processed and recognize the characters in the title, first paragraph, and last paragraph of the text;
an update unit, configured to segment the characters in the title, first paragraph, and last paragraph based on the preset Chinese word segmentation system, obtain the word set of the title, first paragraph, and last paragraph, and update the part of speech of the target words in the word set to the keyword part of speech;
a first recording unit, configured to record, by means of the part-of-speech score table in the Chinese word segmentation system, the weight parameters corresponding to each target word in a preset hash table, where the weight parameters are the part-of-speech score and the word frequency;
a second recording unit, configured to traverse the text to be processed, obtain the associated words of the target words and the parts of speech of the associated words, and record the weight parameters of the associated words in the hash table;
an extraction unit, configured to extract, according to the weight parameters recorded in the hash table for the keyword part of speech of the target words and for the parts of speech of the associated words, the five target words and/or associated words with the highest total scores as the keywords of the text to be processed.
Further, the recognition unit is specifically configured to: receive the text to be processed, and obtain the positions of the space characters in the text and the number N of space characters, where N is greater than 3;
take the characters between the first and second space character positions as the title of the text to be processed, the characters between the second and third space character positions as the first paragraph, and the characters between the (N-1)-th and N-th space character positions as the last paragraph;
invoke a preset character recognition program to recognize the characters in the title, first paragraph, and last paragraph.
Further, the update unit is specifically configured to: when the characters in the title, first paragraph, and last paragraph have been recognized, start the preset Chinese word segmentation system to classify those characters by part of speech as nouns, verbs, adjectives, prepositions, punctuation marks, quantifiers, or new words;
look up, in the part-of-speech score table of the Chinese word segmentation system, the part-of-speech score of each segmented word whose part of speech is noun, verb, adjective, preposition, punctuation, quantifier, or new word, and determine the segmented words whose part-of-speech score is greater than 0 to be target words;
collect the target words into a word set, and mark the part of speech of each target word in the word set as the keyword part of speech.
Further, the first recording unit is specifically configured to: retrieve the part-of-speech score table from the preset Chinese word segmentation program, and obtain the score value corresponding to the keyword part of speech in that table;
use each target word as a search condition, look up the word frequency of each target word in the title, first paragraph, and last paragraph, and record the score value and word frequency of each target word in the hash table.
Further, the second recording unit includes: a recognition subunit, configured to traverse the text to be processed with preset character recognition software and recognize the characters in the text, the preset Chinese word segmentation system then segmenting those characters into multiple segmented words;
a first judgment subunit, configured to extract a first segmented word from the text to be processed and determine whether the first segmented word is a target word in the word set;
a first determination subunit, configured to, when the first segmented word is a target word in the word set, determine the second segmented word immediately before it and the third segmented word immediately after it to be associated words of the target word, and obtain the part of speech and word frequency of the associated words;
an acquisition subunit, configured to look up the part-of-speech score table of the Chinese word segmentation system to obtain the part-of-speech score of each associated word, and record the part-of-speech score and word frequency of each associated word in the hash table.
Further, the apparatus for extracting locally optimized keywords also includes:
a second judgment subunit, configured to, when the first segmented word is not a target word in the word set, determine whether the first segmented word is an associated word of a target word;
a second determination subunit, configured to, when the first segmented word is determined to be an associated word of a target word, record the part of speech and word frequency of the first segmented word in the hash table.
Further, the extraction unit is specifically configured to:
obtain a preset calculation rule, and calculate the total score of each target word and associated word in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;
sort the total scores in the hash table in descending or ascending order, extract the five target words and/or associated words with the highest total scores, and take them as the keywords of the text to be processed.
The functions of the units in the above apparatus for extracting locally optimized keywords correspond to the steps in the above method embodiments, so their functions and implementation are not repeated here.
In addition, an embodiment of this application further provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the following steps:
receiving the text to be processed, and recognizing the characters in the title, first paragraph, and last paragraph of the text;
segmenting the characters in the title, first paragraph, and last paragraph based on the preset Chinese word segmentation system, obtaining the word set of the title, first paragraph, and last paragraph, and updating the part of speech of the target words in the word set to the keyword part of speech;
recording, by means of the part-of-speech score table in the Chinese word segmentation system, the weight parameters corresponding to each target word in a preset hash table, where the weight parameters are the part-of-speech score and the word frequency;
traversing the text to be processed, obtaining the associated words of the target words and the parts of speech of the associated words, and recording the weight parameters of the associated words in the hash table;
extracting, according to the weight parameters recorded in the hash table for the keyword part of speech of the target words and for the parts of speech of the associated words, the five target words and/or associated words with the highest total scores as the keywords of the text to be processed.
It should be noted that, in this document, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or system. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes the element.
The serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above implementations, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of this application.

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910884825.7A (published as CN110765767B (en)) | 2019-09-19 | 2019-09-19 | Extraction method, device, server and storage medium of local optimization keywords |
| CN201910884825.7 | 2019-09-19 | | |

| Publication Number | Publication Date |
|---|---|
| WO2021051599A1 (en) | 2021-03-25 |

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2019/118273 (ceased; published as WO2021051599A1 (en)) | 2019-09-19 | 2019-11-14 | Method and apparatus for extracting locally optimized keywords, device and storage medium |

| Country | Link |
|---|---|
| CN (1) | CN110765767B (en) |
| WO (1) | WO2021051599A1 (en) |

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114003712A (en)* | 2021-10-29 | 2022-02-01 | 平安国际智慧城市科技股份有限公司 | Document searching method, device, equipment and storage medium based on artificial intelligence | 
| CN114282092A (en)* | 2021-12-07 | 2022-04-05 | 咪咕音乐有限公司 | Information processing method, apparatus, device, and computer-readable storage medium | 
| CN118627972A (en)* | 2024-07-24 | 2024-09-10 | 武汉华林梦想科技有限公司 | Vocational skills assessment method and system based on big data | 
| CN118734850A (en)* | 2024-06-27 | 2024-10-01 | 浪潮智慧科技有限公司 | Chinese word segmentation method, device and medium based on hash table and binary search tree | 

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113378141A (en)* | 2021-08-12 | 2021-09-10 | 明品云(北京)数据科技有限公司 | Text data transmission method, system, equipment and medium | 
| CN114372122A (en)* | 2021-12-08 | 2022-04-19 | 阿里云计算有限公司 | Information acquisition method, computing device and storage medium | 
| CN118569817B (en)* | 2024-08-01 | 2024-11-22 | 江苏苏鹰信息科技有限公司 | Enterprise service management system based on cloud platform | 

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2013020439A (en)* | 2011-07-11 | 2013-01-31 | Nec Corp | Synonym extraction system, method and program | 
| US20140101243A1 (en)* | 2012-10-05 | 2014-04-10 | Facebook, Inc. | Method and apparatus for identifying common interest between social network users | 
| CN110069599A (en)* | 2019-03-13 | 2019-07-30 | 平安城市建设科技(深圳)有限公司 | Search method, device, equipment and readable storage medium storing program for executing based on approximate word | 

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107239455B (en)* | 2016-03-28 | 2021-06-11 | 阿里巴巴集团控股有限公司 | Core word recognition method and device | 
| CN108304378B (en)* | 2018-01-12 | 2019-09-24 | 深圳壹账通智能科技有限公司 | Text similarity computing method, apparatus, computer equipment and storage medium | 
| CN109086355B (en)* | 2018-07-18 | 2022-05-17 | 北京航天云路有限公司 | Hot-spot association relation analysis method and system based on news subject term | 
| CN109635273B (en)* | 2018-10-25 | 2023-04-25 | 平安科技(深圳)有限公司 | Text keyword extraction method, device, equipment and storage medium | 

| Publication number | Publication date |
|---|---|
| CN110765767B (en) | 2024-01-19 | 
| CN110765767A (en) | 2020-02-07 | 

| Publication | Title | Publication Date |
|---|---|---|
| WO2021051599A1 (en) | Method and apparatus for extracting locally optimized keywords, device and storage medium | |
| CN107045496B (en) | Error correction method and error correction device for text after voice recognition | |
| WO2021174717A1 (en) | Text intent recognition method and apparatus, computer device and storage medium | |
| WO2020215554A1 (en) | Speech recognition method, device, and apparatus, and computer-readable storage medium | |
| WO2019218527A1 (en) | Multi-system combined natural language processing method and apparatus | |
| US20120284308A1 (en) | Statistical spell checker | |
| US9317608B2 (en) | Systems and methods for parsing search queries | |
| CN113468176B (en) | Information input method and device, electronic equipment and computer readable storage medium | |
| CN108027814B (en) | Stop word recognition method and device | |
| CN105631009A (en) | Retrieval method and system based on word vector similarity | |
| CN109033212B (en) | Text classification method based on similarity matching | |
| JP2018040906A (en) | Dictionary updating apparatus and program | |
| CN109977397B (en) | News hotspot extracting method, system and storage medium based on part-of-speech combination | |
| WO2017020454A1 (en) | Search method and apparatus | |
| CN108717459A (en) | A kind of mobile application defect positioning method of user oriented comment information | |
| CN105760359B (en) | Question processing system and method thereof | |
| WO2017215242A1 (en) | Method and device for searching resumes | |
| CN113743094B (en) | Text error correction method, electronic device and computer readable storage medium | |
| US8806455B1 (en) | Systems and methods for text nuclearization | |
| CN111126201A (en) | Character recognition method and device in script | |
| WO2021051600A1 (en) | Method, apparatus and device for identifying new word based on information entropy, and storage medium | |
| CN114822527B (en) | A speech-to-text error correction method, device, electronic device, and storage medium | |
| KR102710905B1 (en) | Apparatus, method and computer program for summarizing document | |
| CN112084777B (en) | Entity linking method | |
| CN116303913A (en) | Intelligent question-answering method, device, equipment and storage medium | 

| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19946138; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 19946138; Country of ref document: EP; Kind code of ref document: A1 |