CN111507090B - Abstract extraction method, device, equipment and computer readable storage medium - Google Patents

Abstract extraction method, device, equipment and computer readable storage medium
Download PDF

Info

Publication number
CN111507090B
CN111507090B (application CN202010125189.2A)
Authority
CN
China
Prior art keywords
sentence
sentences
candidate set
candidate
abstract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010125189.2A
Other languages
Chinese (zh)
Other versions
CN111507090A (en)
Inventor
郑立颖
徐亮
阮晓雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010125189.2A
Publication of CN111507090A
Priority to PCT/CN2020/112340 (WO2021169217A1)
Application granted
Publication of CN111507090B
Legal status: Active
Anticipated expiration

Links

Classifications

Landscapes

Abstract

Translated from Chinese

The present application provides a summary extraction method, device, equipment and computer-readable storage medium, the method comprising: calculating the sentence similarity between every two sentences in a sentence set, and filtering out a first summary candidate set from the sentence set based on the TextRank algorithm and the sentence similarity; calculating the cosine similarity between every two sentences in the sentence set, and filtering out a second summary candidate set from the sentence set based on the TextRank algorithm and the cosine similarity; filtering out a third summary candidate set and a fourth summary candidate set from the first summary candidate set and the second summary candidate set based on the MMR algorithm and the preset number of sentences; selecting sentences with a preset number of summary sentences from the four summary candidate sets to form a fused summary candidate set; counting the number of occurrences of each sentence in the fused summary candidate set, and filtering out a summary result set of the target text from the fused summary candidate set according to the number of occurrences of each sentence. The present application relates to data processing, which can improve the accuracy of summary extraction.

Description

Digest extraction method, digest extraction device, digest extraction apparatus, and computer-readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a computer readable storage medium for extracting a digest.
Background
At present, summarization techniques fall into two broad categories: extractive and abstractive. Extractive summarization selects important sentences directly from the text, then orders and combines them to output the final abstract. Abstractive summarization distills the abstract from the original content and is allowed to generate new words or sentences. However, abstractive methods require a large amount of labeled data, and abstract annotation has no unified standard and is time-consuming, so the abstract of a text cannot be extracted accurately. The most common extractive method is TextRank, but the original TextRank extracts the abstract based only on sentence similarity, the extracted sentences are redundant, and extraction accuracy is low. How to improve the accuracy of abstract extraction is therefore a problem to be solved.
Disclosure of Invention
The main purpose of the present application is to provide an abstract extraction method, apparatus, device, and computer-readable storage medium, aiming to improve the accuracy of abstract extraction.
In a first aspect, the present application provides a digest extraction method, the digest extraction method comprising the steps of:
Acquiring a sentence set of a target text, wherein the target text is the text from which the abstract is to be extracted;
Calculating sentence similarity between every two sentences in the sentence set, and screening a first abstract candidate set from the sentence set according to the sentence similarity based on a TextRank algorithm;
Calculating cosine similarity between every two sentences in the sentence set, and screening a second abstract candidate set from the sentence set according to the cosine similarity based on a TextRank algorithm;
Screening a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set based on a maximal marginal relevance (MMR) algorithm and a preset number of sentences;
Selecting sentences with preset abstract sentence numbers from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set respectively to form a fusion abstract candidate set;
Counting the occurrence times of each sentence in the fusion abstract candidate set, and screening the abstract result set of the target text from the fusion abstract candidate set according to the occurrence times of each sentence.
In a second aspect, the present application also provides a digest extracting apparatus, including:
the acquisition module is used for acquiring a sentence set of a target text, wherein the target text is the text from which the abstract is to be extracted;
The first abstract screening module is used for calculating sentence similarity between every two sentences in the sentence set, and screening a first abstract candidate set from the sentence set according to the sentence similarity based on a TextRank algorithm;
the second abstract screening module is used for calculating cosine similarity between every two sentences in the sentence set, and screening a second abstract candidate set from the sentence set based on a TextRank algorithm according to the cosine similarity;
The third abstract screening module is used for screening a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set based on a maximal marginal relevance (MMR) algorithm and a preset number of sentences;
the selection module is used for selecting sentences with preset abstract sentence numbers from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set respectively to form a fusion abstract candidate set;
the abstract determining module is used for counting the occurrence times of each sentence in the fused abstract candidate set and screening the abstract result set of the target text from the fused abstract candidate set according to the occurrence times of each sentence.
In a third aspect, the present application also provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program when executed by the processor implements the steps of the summary extraction method as described above.
In a fourth aspect, the present application also provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the summary extraction method as described above.
The present application provides an abstract extraction method, apparatus, device, and computer-readable storage medium. A first abstract candidate set is screened from a sentence set by the TextRank algorithm according to the sentence similarity between every two sentences in the set, and a second abstract candidate set is screened from the sentence set by the TextRank algorithm according to the cosine similarity between every two sentences. A third abstract candidate set is then screened from the first abstract candidate set, and a fourth abstract candidate set from the second abstract candidate set, based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences. Finally, the four abstract candidate sets are fused to determine the abstract result set of the text. This reduces redundancy between the extracted abstract sentences and effectively improves the accuracy of text abstract extraction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a summary extraction method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating sub-steps of the method for extracting a digest of FIG. 1;
FIG. 3 is a flowchart of another method for extracting a summary according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a summary extracting apparatus according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of another summary extracting apparatus according to an embodiment of the present application;
fig. 6 is a schematic block diagram of a computer device according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
The embodiment of the application provides an abstract extraction method, apparatus, device, and computer-readable storage medium. The abstract extraction method can be applied to a server or a terminal device, where the server may be a single server or a server cluster consisting of multiple servers, and the terminal device may be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer, personal digital assistant, or wearable device. The following description takes a server as an example.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flow chart of a summary extracting method according to an embodiment of the application.
As shown in fig. 1, the summary extraction method includes steps S101 to S106.
Step S101, acquiring a sentence set of a target text, wherein the target text is the text from which the abstract is to be extracted.
When a user needs to extract an abstract from a text, the text can be uploaded to a server through a terminal device. The server splits the received text into sentences to obtain an initial sentence set, and cleans the initial sentence set to remove punctuation marks, stop words, and other such characters, thereby obtaining the sentence set of the text whose abstract is to be extracted. The server acquires this sentence set periodically or in real time.
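As a minimal sketch of this preprocessing step (the sentence terminators, the tokenizer, and the stop-word handling are illustrative assumptions, not the patent's implementation):

```python
import re

def build_sentence_set(text, stop_words=frozenset()):
    """Split raw text into sentences, then strip punctuation and stop words.

    Returns (sentences, cleaned): the original sentences are kept for
    assembling the final abstract, while the cleaned token lists feed
    the similarity computations of steps S102 and S103.
    """
    # Split on common Chinese and English sentence terminators.
    sentences = [s.strip() for s in re.split(r"[。！？.!?]+", text) if s.strip()]
    cleaned = []
    for s in sentences:
        tokens = re.findall(r"\w+", s.lower())  # drops punctuation
        cleaned.append([t for t in tokens if t not in stop_words])
    return sentences, cleaned
```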
Step S102, calculating sentence similarity between every two sentences in the sentence set, and screening a first abstract candidate set from the sentence set according to the sentence similarity based on a TextRank algorithm.
After a sentence set of the target text is obtained, sentence similarity between every two sentences in the sentence set is calculated, and based on a TextRank algorithm, a first abstract candidate set is screened from the sentence set according to the sentence similarity between every two sentences in the sentence set.
Specifically, count the number of words shared by every two sentences in the sentence set and the number of words contained in each sentence; calculate the sentence similarity of every two sentences from these counts; based on the TextRank algorithm, determine a first importance value of each sentence from the sentence similarities; and screen a first abstract candidate set from the sentence set according to the first importance value of each sentence. The first importance value represents how important a sentence is in the target text: a sentence with a higher first importance value is more important, and one with a lower value is less important. The formula for calculating the first importance value of a sentence based on the TextRank algorithm is as follows:
WS(V_i) = (1 − d) + d · Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] · WS(V_j)

where WS(V_i) on the left side of the equation is the importance value of sentence V_i; w_ji is the weight of the edge from sentence V_j to sentence V_i; d is the damping coefficient, representing the probability that a sentence points to any other sentence (for example 0.85); In(V_i) is the set of sentences pointing to V_i; and Out(V_j) is the set of sentences that V_j points to. The weight w_ji is the similarity of the two sentences S_i and S_j, and w_jk is the similarity between sentence S_j and any sentence S_k that V_j points to. The sentence similarity between every two sentences in the sentence set is calculated as follows:
Similarity(S_i, S_j) = |{t_k | t_k ∈ S_i ∧ t_k ∈ S_j}| / (log|S_i| + log|S_j|)

where the numerator is the number of words occurring in both sentences S_i and S_j, t_k is the k-th word, |S_i| is the number of words contained in sentence S_i, and |S_j| is the number of words contained in sentence S_j. The sentence similarity of every two sentences in the sentence set can be calculated with this similarity formula, and the first importance value of each sentence can then be calculated with the importance formula above.
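The word-overlap similarity and the iterative importance computation above can be sketched as follows; the function names, the fixed iteration count, and the dense edge-weight matrix are assumptions for illustration, not the patent's implementation:

```python
import math

def overlap_similarity(si, sj):
    """TextRank similarity: shared words over summed log sentence lengths."""
    shared = len(set(si) & set(sj))
    denom = math.log(len(si)) + math.log(len(sj))
    return shared / denom if denom > 0 else 0.0

def textrank(sentences, sim, d=0.85, iterations=50):
    """Iterate WS(V_i) = (1-d) + d * sum_j (w_ji / sum_k w_jk) * WS(V_j)."""
    n = len(sentences)
    # Edge weights: pairwise similarity, no self-loops.
    w = [[sim(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]  # sum_k w_jk for each sentence j
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                                    for j in range(n) if out_sum[j] > 0)
                  for i in range(n)]
    return scores
```

A sentence that shares no words with any other receives the minimum score 1 − d, so the screening of the first abstract candidate set naturally drops it.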
In one embodiment, the first abstract candidate set is screened from the sentence set according to the first importance value of each sentence as follows: sort the sentences in the sentence set by their first importance values to obtain the first abstract candidate set; or sort the sentences by their first importance values, take sentences from the sentence set in ranking order until the number taken reaches a set number, and collect the taken sentences into the first abstract candidate set. The set number may be chosen based on actual conditions, and the present application does not limit it.
And step 103, calculating cosine similarity between every two sentences in the sentence set, and screening a second abstract candidate set from the sentence set according to the cosine similarity based on a TextRank algorithm.
After the sentence set of the target text is obtained, the cosine similarity between every two sentences in the sentence set is calculated, and, based on the TextRank algorithm, a second abstract candidate set is screened from the sentence set according to these cosine similarities.
Specifically, each sentence in the sentence set is encoded to obtain its sentence vector; the cosine similarity between every two sentences is calculated from these sentence vectors; based on the TextRank algorithm, a second importance value of each sentence is determined from the cosine similarities; and a second abstract candidate set is screened from the sentence set according to the second importance value of each sentence. The second importance value represents how important a sentence is in the target text: a sentence with a higher second importance value is more important, and one with a lower value is less important. The formula for calculating the second importance value of a sentence based on the TextRank algorithm is as follows:
WS(V_i) = (1 − d) + d · Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] · WS(V_j)

where WS(V_i) on the left side of the equation is the importance value of sentence V_i; w_ji is the weight of the edge from sentence V_j to sentence V_i; d is the damping coefficient, representing the probability that a sentence points to any other sentence (for example 0.85); In(V_i) is the set of sentences pointing to V_i; and Out(V_j) is the set of sentences that V_j points to. Here the weight w_ji is the cosine similarity of the two sentences S_i and S_j, and w_jk is the cosine similarity between sentence S_j and any sentence S_k that V_j points to.
The cosine similarity of two sentences S_i and S_j is calculated as follows:

cos(S_i, S_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖)

where v_i is the sentence vector of sentence S_i and v_j is the sentence vector of sentence S_j. The cosine similarity of every two sentences in the sentence set can be calculated with this formula, and the second importance value of each sentence can then be calculated with the importance formula above.
In an embodiment, the statement vector of the statement may be determined in the following manner: each word in the sentence is encoded to obtain a word vector corresponding to each word, an average word vector is calculated according to the word vector corresponding to each word, and the average word vector is used as the sentence vector of the sentence.
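A minimal sketch of the sentence-vector encoding and cosine similarity described above, assuming word vectors are supplied as a plain dictionary (out-of-vocabulary words are simply skipped, a choice the patent does not specify):

```python
import math

def sentence_vector(tokens, word_vectors):
    """Average the word vectors of a sentence's tokens to get its vector."""
    dim = len(next(iter(word_vectors.values())))
    acc = [0.0] * dim
    known = 0
    for t in tokens:
        if t in word_vectors:  # skip words without a vector
            known += 1
            for k, v in enumerate(word_vectors[t]):
                acc[k] += v
    return [v / known for v in acc] if known else acc

def cosine_similarity(u, v):
    """cos(u, v) = (u · v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```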
In one embodiment, the second abstract candidate set is screened from the sentence set according to the second importance value of each sentence as follows: sort the sentences in the sentence set by their second importance values to obtain the second abstract candidate set; or sort the sentences by their second importance values, take sentences from the sentence set in ranking order until the number taken reaches a set number, and collect the taken sentences into the second abstract candidate set. The set number may be chosen based on actual conditions, and the present application does not limit it.
Step S104, screening a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences.
After the first and second abstract candidate sets are obtained, the server screens a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences. The third abstract candidate set is a subset of the first abstract candidate set, and the fourth abstract candidate set is a subset of the second abstract candidate set. The preset number of sentences may be set based on actual conditions, and the present application does not particularly limit it. The MMR algorithm eliminates redundancy among sentences and improves the accuracy of abstract extraction.
In one embodiment, as shown in fig. 2, step S104 includes sub-steps S1041 to S1047.
S1041, sorting each sentence in the first abstract candidate set according to the first importance value of each sentence in the first abstract candidate set, and obtaining the sorting number of each sentence.
And sequencing each sentence in the first abstract candidate set according to the first importance value of each sentence in the first abstract candidate set, and acquiring the sequencing number of each sentence. The ranking number of the sentence having the higher first importance value is smaller, and the ranking number of the sentence having the lower first importance value is larger.
S1042, acquiring sentences with the sorting numbers smaller than or equal to the preset sorting numbers from the first abstract candidate set to form a candidate sentence set.
After each sentence in the first abstract candidate set is ranked, the sentences whose ranking numbers are less than or equal to a preset ranking number are obtained from the first abstract candidate set to form a candidate sentence set. The preset ranking number may be set based on actual conditions, and the present application does not particularly limit it. For example, if the preset ranking number is 10, the sentences with ranking numbers less than or equal to 10 are obtained from the first abstract candidate set to form the candidate sentence set.
S1043, moving the sentence with the highest first importance value in the candidate sentence set to a blank abstract candidate set so as to update the abstract candidate set and the candidate sentence set.
Specifically, the server acquires the sentence with the highest first importance value from the candidate sentence set, and moves the sentence to a preset blank abstract candidate set to update the abstract candidate set and the candidate sentence set. For example, the candidate sentence set includes 5 sentences, namely, sentence a, sentence B, sentence C, sentence D and sentence E, and the first importance value of sentence C is the highest, and the updated abstract candidate set includes sentence C, and the updated candidate sentence set includes sentence a, sentence B, sentence D and sentence E.
S1044, calculating MMR values of the abstract candidate set and each sentence in the candidate sentence set respectively according to the first importance value of each sentence in the candidate sentence set based on a preset MMR value calculation formula.
Wherein, the MMR value is used for representing the similarity degree between the sentences in the candidate sentence set and the abstract candidate set, and the preset calculation formula of the MMR value is as follows:
MMR_i = α · WS(V_i) − (1 − α) · sim(i, set)
where MMR_i is the MMR value of sentence V_i, α is a weight coefficient with a value range of 0 to 1, WS(V_i) is the first importance value of sentence V_i, set is the abstract candidate set, and sim(i, set) is the semantic similarity between sentence V_i and the abstract candidate set. With the first importance value of each sentence in the candidate sentence set and this formula, the MMR value between each sentence in the candidate sentence set and the abstract candidate set can be calculated.
Specifically, encoding the abstract candidate set to obtain a vector corresponding to the abstract candidate set; encoding each sentence in the candidate sentence set respectively to obtain a vector corresponding to each sentence in the candidate sentence set; calculating semantic similarity between vectors corresponding to the abstract candidate set and vectors corresponding to each sentence in the candidate sentence set respectively; based on a calculation formula of the MMR value, according to each semantic similarity and a first importance value of each sentence in the candidate sentence set, calculating the MMR value respectively corresponding to each sentence in the abstract candidate set and the candidate sentence set. For example, if the first importance value of one sentence in the candidate sentence sets is x and the similarity between the sentence and the abstract candidate set is s, the MMR value between the sentence and the abstract candidate set is α·x- (1- α) ·s.
The method for encoding the abstract candidate set to obtain the vector corresponding to the abstract candidate set specifically comprises the following steps: encoding each sentence in the abstract candidate set to obtain a sentence vector corresponding to each sentence in the abstract candidate set; and calculating an average vector according to the sentence vector corresponding to each sentence in the abstract candidate set, and taking the average vector as the vector of the abstract candidate set.
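The set-vector encoding just described can be sketched as follows; representing vectors as plain lists is an assumption for illustration:

```python
import math

def set_vector(sentence_vectors):
    """Encode a summary candidate set as the average of its sentence vectors."""
    n = len(sentence_vectors)
    return [sum(col) / n for col in zip(*sentence_vectors)]

def semantic_similarity(u, v):
    """Cosine similarity between a sentence vector and the set vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Feeding `semantic_similarity(v_i, set_vector(chosen))` into the MMR formula gives the sim(i, set) term.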
S1045, the sentence with the highest MMR value is moved to the abstract candidate set so as to update the abstract candidate set and the candidate sentence set.
After the MMR values of the abstract candidate set and each sentence in the candidate sentence set are obtained through calculation, the server stores the sentence with the highest MMR value into the abstract candidate set so as to update the abstract candidate set and the candidate sentence set. For example, the summary candidate set includes a sentence C, the candidate sentence set includes a sentence a, a sentence B, a sentence D, and a sentence E, and the sentence with the highest MMR value is a sentence E, and the updated summary candidate set includes a sentence C and a sentence E, and the updated candidate sentence set includes a sentence a, a sentence B, and a sentence D.
S1046, determining whether the number of the updated sentences in the abstract candidate set reaches the preset number of sentences.
The server determines whether the number of sentences in the updated summary candidate set reaches the preset number of sentences, if the number of sentences in the updated summary candidate set does not reach the preset number of sentences, executing a substep S1044, that is, calculating, based on a preset MMR value calculation formula, the MMR value of each sentence in the summary candidate set, which corresponds to each sentence in the candidate sentence set, according to the first importance value of each sentence in the candidate sentence set. It should be noted that the number of the preset sentences may be set based on practical situations, which is not particularly limited in the present application.
S1047, if the number of the updated sentences in the summary candidate set reaches the preset number of sentences, taking the updated summary candidate set as a third summary candidate set.
And if the number of sentences in the updated abstract candidate set reaches the preset number of sentences, taking the updated abstract candidate set as a third abstract candidate set. For example, the number of the preset sentences is 5, the updated abstract candidate set includes sentences a, B, C, D and E, and a total of 5 sentences, and at this time, the number of the sentences in the abstract candidate set reaches the preset number of the sentences, so that the abstract candidate set including the sentences a, B, C, D and E is taken as the third abstract candidate set.
It can be understood that the fourth abstract candidate set is extracted in a manner similar to the third, specifically: sort each sentence in the second abstract candidate set according to its second importance value and obtain each sentence's ranking number; obtain the sentences whose ranking numbers are less than or equal to a preset ranking number from the second abstract candidate set to form a candidate sentence set; move the sentence with the highest second importance value in the candidate sentence set to a blank abstract candidate set, so as to update the abstract candidate set and the candidate sentence set; based on the preset MMR value calculation formula, calculate the MMR value between each sentence in the candidate sentence set and the abstract candidate set according to the second importance value of each sentence in the candidate sentence set, where the MMR value represents the degree of similarity between a sentence in the candidate sentence set and the abstract candidate set; move the sentence with the highest MMR value to the abstract candidate set, so as to update the abstract candidate set and the candidate sentence set; determine whether the number of sentences in the updated abstract candidate set reaches the preset number of sentences; if not, return to the MMR calculation step; and if the number of sentences in the updated abstract candidate set reaches the preset number of sentences, take the updated abstract candidate set as the fourth abstract candidate set.
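The greedy MMR selection loop of sub-steps S1041 to S1047 can be sketched as follows; `set_sim` stands in for the semantic-similarity computation, and the names and call shape are assumptions, not the patent's implementation:

```python
def mmr_select(candidates, scores, set_sim, k, alpha=0.7):
    """Greedy MMR over a ranked candidate sentence set.

    scores[s] is the TextRank importance WS(V_s); set_sim(s, chosen)
    returns sim(s, set), the similarity between sentence s and the
    already-chosen abstract candidate set. Seeds with the top-scored
    sentence, then repeatedly moves over the sentence maximising
    MMR_i = alpha * WS(V_i) - (1 - alpha) * sim(i, set).
    """
    pool = list(candidates)
    chosen = [max(pool, key=lambda s: scores[s])]  # S1043: seed the set
    pool.remove(chosen[0])
    while pool and len(chosen) < k:  # S1046: stop at the preset count
        best = max(pool, key=lambda s: alpha * scores[s]
                                       - (1 - alpha) * set_sim(s, chosen))
        chosen.append(best)  # S1045: move the highest-MMR sentence over
        pool.remove(best)
    return chosen
```

In the toy run below, sentence 1 scores higher than sentence 2 on importance alone, but its redundancy with the seed sentence 0 pushes its MMR value down, so the less redundant sentence 2 is chosen — exactly the redundancy elimination the MMR step is for.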
Step S105, selecting sentences of a preset summary sentence number from the first summary candidate set, the second summary candidate set, the third summary candidate set and the fourth summary candidate set, respectively, so as to form a fused summary candidate set.
After obtaining the four abstract candidate sets of the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set, the server selects sentences with preset abstract sentence numbers from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set respectively to form a fusion abstract candidate set. It should be noted that, the number of the preset summary sentences is smaller than the number of the preset sentences, and the number of the preset summary sentences may be set based on actual situations, which is not particularly limited in the present application.
In an embodiment, according to the magnitude of the importance value, the sentences in the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set are respectively sequenced, and according to the sequencing sequence of each sentence, the preset abstract sentence quantity is selected from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set respectively, and is written into the fusion abstract candidate set. Wherein, the larger the importance value, the earlier the ranking, the smaller the importance value, and the later the ranking.
For example, the first abstract candidate set is [A, B, C, D, E, F, G, H, I, J], the second abstract candidate set is [A, B, C, D, E, G, H, I, J, K], the third abstract candidate set is [C, D, E, F, G, H, I], the fourth abstract candidate set is [D, E, G, H, I, J, K], and the preset number of abstract sentences is 5. The sentences selected from the first abstract candidate set are [A, B, C, D, E], the sentences selected from the second abstract candidate set are [A, B, C, D, E], the sentences selected from the third abstract candidate set are [C, D, E, F, G], and the sentences selected from the fourth abstract candidate set are [D, E, G, H, I]; the fusion abstract candidate set is therefore {[A, B, C, D, E], [A, B, C, D, E], [C, D, E, F, G], [D, E, G, H, I]}.
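Continuing this example, forming the fusion abstract candidate set reduces to taking the first `n_summary` sentences from each importance-ranked candidate set; a minimal sketch (the function name is illustrative):

```python
def fuse_candidates(candidate_sets, n_summary):
    """Take the first n_summary sentences from each importance-ranked
    candidate set; the fusion abstract candidate set keeps the selections
    separate so that later occurrence counting can vote across them."""
    return [cand[:n_summary] for cand in candidate_sets]
```

This assumes each candidate set is already sorted by its importance value, as stated above.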
Step S106, counting the occurrence times of each sentence in the fusion abstract candidate set, and screening the abstract result set of the target text from the fusion abstract candidate set according to the occurrence times of each sentence.
After the fusion abstract candidate set is obtained, the number of occurrences of each sentence in the fusion abstract candidate set is counted, and the abstract result set of the target text is screened from the fusion abstract candidate set according to the number of occurrences of each sentence, namely, sentences whose number of occurrences is greater than or equal to a preset number of occurrences are screened from the fusion abstract candidate set as the abstract result set of the target text. Here, the number of occurrences of a sentence is the number of times that sentence appears across the fusion abstract candidate set.
For example, the fusion abstract candidate set is {[A, B, C, D, E], [C, D, E, F, G], [D, E, G, H, I]}: sentence A occurs 2 times, sentence B occurs 2 times, sentence C occurs 3 times, sentence D occurs 4 times, sentence E occurs 4 times, sentence F occurs 1 time, sentence G occurs 2 times, sentence H occurs 1 time, and sentence I occurs 1 time; if the preset number of occurrences is 3, sentences C, D and E are screened out as the abstract result set of the target text.
According to the abstract extraction method provided by the embodiment, the TextRank algorithm is used for screening the first abstract candidate set from the sentence set according to the sentence similarity between every two sentences in the sentence set, the TextRank algorithm is used for screening the second abstract candidate set from the sentence set according to the cosine similarity between every two sentences in the sentence set, then the third abstract candidate set is screened from the first abstract candidate set and the fourth abstract candidate set is screened from the second abstract candidate set based on the maximum edge correlation MMR algorithm and the preset sentence number, and finally the four abstract candidate sets are fused to determine the abstract result set of the text, so that the redundancy between the extracted abstract sentences can be reduced, and the accuracy of text abstract extraction is effectively improved.
Referring to fig. 3, fig. 3 is a flow chart of another method for extracting a summary according to an embodiment of the application.
As shown in fig. 3, the abstract extraction method includes steps S201 to S208.
Step S201, acquiring a sentence set of a target text, wherein the target text is a text from which an abstract is to be extracted.
When a user needs to extract an abstract from a text, the text from which the abstract is to be extracted can be uploaded to a server through a terminal device. The server splits the received text into sentences to obtain an initial sentence set, and cleans the initial sentence set to remove punctuation marks, stop words and other characters, thereby obtaining the sentence set of the text. The server acquires the sentence set of the text at fixed intervals or in real time.
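A minimal sketch of the sentence splitting and cleaning described above, assuming a regex-based splitter and an illustrative stop-word list (the actual system may use its own tokenizer and stop-word dictionary):

```python
import re

STOP_WORDS = {"the", "a", "an", "of"}  # illustrative stop-word list

def build_sentence_set(text):
    """Split text into sentences, then strip punctuation and stop words.

    Returns the raw sentences (kept for the final abstract) and the
    cleaned token lists (used for similarity computation)."""
    # Split on common sentence-ending punctuation (Latin and CJK).
    raw = [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    cleaned = []
    for sent in raw:
        words = re.findall(r"\w+", sent.lower())
        cleaned.append([w for w in words if w not in STOP_WORDS])
    return raw, cleaned
```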
Step S202, calculating sentence similarity between every two sentences in the sentence set, and screening a first abstract candidate set from the sentence set according to the sentence similarity based on a TextRank algorithm.
After a sentence set of the target text is obtained, sentence similarity between every two sentences in the sentence set is calculated, and based on a TextRank algorithm, a first abstract candidate set is screened from the sentence set according to the sentence similarity between every two sentences in the sentence set.
Step S203, calculating cosine similarity between every two sentences in the sentence set, and screening a second abstract candidate set from the sentence set according to the cosine similarity based on a TextRank algorithm.
After a sentence set of the target text is obtained, the cosine similarity between every two sentences in the sentence set is calculated, and based on a TextRank algorithm, a second abstract candidate set is screened from the sentence set according to the cosine similarity between every two sentences in the sentence set.
Step S204, screening a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set based on a maximal marginal relevance (MMR) algorithm and a preset number of sentences.
After the first abstract candidate set and the second abstract candidate set are obtained through screening, the server screens a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set based on a maximal marginal relevance (Maximal Marginal Relevance, MMR) algorithm and the preset number of sentences. The third abstract candidate set is a subset of the first abstract candidate set, and the fourth abstract candidate set is a subset of the second abstract candidate set. It should be noted that the preset number of sentences may be set based on practical situations, which is not particularly limited in the present application. Redundancy among sentences can be eliminated through the MMR algorithm, and the accuracy of abstract extraction is thereby improved.
Step S205, selecting sentences with preset abstract sentence numbers from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set respectively to form a fused abstract candidate set.
After obtaining the four abstract candidate sets of the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set, the server selects sentences with preset abstract sentence numbers from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set respectively to form a fusion abstract candidate set. It should be noted that, the number of the preset summary sentences is smaller than the number of the preset sentences, and the number of the preset summary sentences may be set based on actual situations, which is not particularly limited in the present application.
Step S206, counting the occurrence times of each sentence in the fusion abstract candidate set, and determining whether the number of sentences with the occurrence times larger than the preset occurrence times is larger than or equal to the preset abstract sentence number.
After the fusion abstract candidate set is obtained, the number of occurrences of each sentence in the fusion abstract candidate set is counted, and it is determined whether the number of sentences whose occurrences exceed the preset number of occurrences is greater than or equal to the preset number of abstract sentences. Here, the number of occurrences of a sentence is the number of times that sentence appears across the fusion abstract candidate set. It should be noted that the number of abstract sentences may be set based on practical situations, which is not particularly limited in the present application.
Step S207, if the number of sentences with the occurrence frequency greater than the preset occurrence frequency is greater than or equal to the preset summary sentence number, sorting the sentences in the fusion summary candidate set according to the occurrence frequency.
And if the number of sentences with the occurrence frequency larger than the preset occurrence frequency is larger than or equal to the preset summary sentence number, sorting the sentences in the fusion summary candidate set according to the occurrence frequency. The higher the number of occurrences, the earlier the ranking of the sentences, and the lower the number of occurrences, the later the ranking of the sentences.
In an embodiment, if the number of sentences with the occurrence frequency greater than the preset occurrence frequency is less than the number of summary sentences, the sentences with the occurrence frequency greater than the preset occurrence frequency in the fusion summary candidate set are moved to a summary result set of the target text so as to update the fusion summary candidate set; acquiring an importance value of each statement in the updated fusion abstract candidate set, and sequencing the statements in the updated fusion abstract candidate set according to the importance value; and sequentially selecting sentences from the updated fusion abstract candidate set to write into the abstract result set according to the sequence of each sentence in the updated fusion abstract candidate set until the number of sentences in the abstract result set reaches the preset number of abstract sentences.
And step S208, according to the sequence of each sentence in the fusion abstract candidate set, sequentially selecting sentences from the fusion abstract candidate set to write the sentences into the abstract result set of the target text until the number of the sentences in the abstract result set reaches the preset number of the abstract sentences.
After the sentences in the fusion abstract candidate set are ranked, sentences are sequentially selected from the fusion abstract candidate set according to the ranking of each sentence and written into the abstract result set of the target text until the number of sentences in the abstract result set reaches the preset number of abstract sentences. For example, the fusion abstract candidate set is {[A, B, C, D, E], [C, D, E, F, G], [D, E, G, H, I]}: sentence A occurs 2 times, sentence B occurs 2 times, sentence C occurs 3 times, sentence D occurs 4 times, sentence E occurs 4 times, sentence F occurs 1 time, sentence G occurs 2 times, sentence H occurs 1 time, and sentence I occurs 1 time, so the ranking of the sentences in the fusion abstract candidate set is [D, E, C, A, B, G, F, H, I]. If the preset number of abstract sentences is 5 and the preset number of occurrences is 2, the abstract result set of the target text is [D, E, C, A, B].
According to the abstract extraction method provided by the embodiment, the TextRank algorithm is used for screening a first abstract candidate set according to sentence similarity between every two sentences in the sentence set, the TextRank algorithm is used for screening a second abstract candidate set according to cosine similarity between every two sentences, then the MMR algorithm is used for screening a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set, and sentences with the preset abstract sentence number are selected from the four abstract candidate sets, so that a fusion abstract candidate set is formed; finally, counting the occurrence times of each sentence in the fusion abstract candidate set, and when the occurrence times are greater than or equal to the preset occurrence times, selecting sentences from the fusion abstract candidate set to write the sentences into the abstract result set of the target text according to the order of the occurrence times, so that the redundancy among the extracted abstract sentences can be reduced, and the accuracy of extracting the text abstract can be effectively improved.
Referring to fig. 4, fig. 4 is a schematic block diagram of a summary extracting apparatus according to an embodiment of the application.
As shown in fig. 4, the digest extracting apparatus 300 includes: an acquisition module 301, a first summary screening module 302, a second summary screening module 303, a third summary screening module 304, a selection module 305, and a summary determination module 306.
The obtaining module 301 is configured to obtain a sentence set of a target text, where the target text is a text of a summary to be extracted;
The first abstract screening module 302 is configured to calculate a sentence similarity between every two sentences in the sentence set, and screen a first abstract candidate set from the sentence set according to the sentence similarity based on a TextRank algorithm;
A second abstract screening module 303, configured to calculate a cosine similarity between every two sentences in the sentence set, and screen a second abstract candidate set from the sentence set according to the cosine similarity based on a TextRank algorithm;
A third abstract screening module 304, configured to screen a third abstract candidate set from the first abstract candidate set and screen a fourth abstract candidate set from the second abstract candidate set based on a maximal marginal relevance (MMR) algorithm and a preset number of sentences;
a selecting module 305, configured to select a statement with a preset number of abstract statements from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set, and the fourth abstract candidate set, respectively, so as to form a fused abstract candidate set;
The abstract determining module 306 is configured to count the occurrence times of each sentence in the fused abstract candidate set, and screen the abstract result set of the target text from the fused abstract candidate set according to the occurrence times of each sentence.
In one embodiment, the first summary screening module 302 is further configured to:
Counting the number of the same words of every two sentences in the sentence set and the number of words contained in each sentence in the sentence set;
Calculating the sentence similarity of each two sentences in the sentence set according to the number of the same words of each two sentences in the sentence set and the number of words contained in each sentence in the sentence set;
Based on a TextRank algorithm, determining a first importance value of each sentence according to sentence similarity between every two sentences in the sentence set, wherein the first importance value is used for representing the importance degree of the sentence in the target text;
And screening a first abstract candidate set from the sentence set according to the first importance value of each sentence in the sentence set.
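The sub-steps above (count shared words, normalize by the word counts of the two sentences, then iterate TextRank to obtain the first importance values) can be sketched as follows. The normalization by the logarithm of each sentence's length and the damping factor 0.85 follow the classical TextRank formulation and are assumptions here; the text states only that the similarity is computed from the shared-word count and the word counts of the two sentences.

```python
import math

def textrank_scores(sentences, d=0.85, iters=50):
    """sentences: list of token lists. Returns one importance value per
    sentence, computed by power iteration over the similarity graph."""
    n = len(sentences)

    # Assumed classical TextRank sentence similarity: shared-word count
    # normalized by the log word counts of the two sentences.
    def sim(a, b):
        shared = len(set(a) & set(b))
        denom = math.log(len(a)) + math.log(len(b))
        return shared / denom if denom > 0 else 0.0

    w = [[sim(si, sj) if i != j else 0.0 for j, sj in enumerate(sentences)]
         for i, si in enumerate(sentences)]
    scores = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(w[j])  # total outgoing weight of sentence j
                if w[j][i] > 0 and out > 0:
                    rank += w[j][i] / out * scores[j]
            new.append((1 - d) + d * rank)
        scores = new
    return scores
```

The first abstract candidate set can then be screened by sorting sentences on these scores.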
In one embodiment, the second summary screening module 303 is further configured to:
Encoding each sentence in the sentence set to obtain a sentence vector corresponding to each sentence in the sentence set;
according to the sentence vectors corresponding to each sentence in the sentence set, calculating the cosine similarity between every two sentences in the sentence set;
Based on a TextRank algorithm, determining a second importance value of each sentence according to cosine similarity between every two sentences in the sentence set, wherein the second importance value is used for representing the importance degree of the sentence in the target text;
And screening a second abstract candidate set from the sentence set according to the second importance value of each sentence in the sentence set.
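For the cosine-similarity branch, a sketch under the assumption of a simple bag-of-words encoder (the text does not fix the encoding method, so `encode` is an illustrative stand-in for whatever sentence encoder the system uses):

```python
import math
from collections import Counter

def encode(sentence_tokens, vocab):
    """Bag-of-words sentence vector over a fixed vocabulary
    (illustrative stand-in for the actual encoder)."""
    counts = Counter(sentence_tokens)
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

The cosine similarities between every two sentence vectors then feed the same TextRank iteration to produce the second importance values.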
In one embodiment, the third summary screening module 304 is further configured to:
Sequencing each sentence in the first abstract candidate set according to the first importance value of each sentence in the first abstract candidate set, and acquiring the sequencing number of each sentence;
acquiring sentences with the sorting numbers smaller than or equal to a preset sorting number from the first abstract candidate set to form a candidate sentence set;
shifting the sentence with the highest first importance value in the candidate sentence set to a blank abstract candidate set so as to update the abstract candidate set and the candidate sentence set;
Based on a preset MMR value calculation formula, calculating MMR values respectively corresponding to each sentence in the abstract candidate set and each sentence in the candidate sentence set according to a first importance value of each sentence in the candidate sentence set, wherein the MMR values are used for representing the similarity degree between the sentences in the candidate sentence set and the abstract candidate set;
shifting the sentence with the highest MMR value to the abstract candidate set so as to update the abstract candidate set and the candidate sentence set;
Determining whether the number of the updated sentences in the abstract candidate set reaches the preset number of the sentences;
If the number of the updated sentences in the abstract candidate set does not reach the preset number of sentences, executing the steps: based on the MMR algorithm and a preset MMR value calculation formula, calculating MMR values respectively corresponding to each sentence in the abstract candidate set and the candidate sentence set according to the first importance value of each sentence in the candidate sentence set;
And if the number of the updated sentences in the abstract candidate set reaches the preset number of sentences, taking the updated abstract candidate set as a third abstract candidate set.
In one embodiment, the third summary screening module 304 is further configured to:
encoding the abstract candidate set to obtain a vector corresponding to the abstract candidate set;
Encoding each sentence in the candidate sentence set respectively to obtain a vector corresponding to each sentence in the candidate sentence set;
Calculating semantic similarity between vectors corresponding to the abstract candidate set and vectors corresponding to each sentence in the candidate sentence set respectively;
And calculating MMR values respectively corresponding to the abstract candidate set and each sentence in the candidate sentence set according to the semantic similarity and the first importance value of each sentence in the candidate sentence set.
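One common linear form of combining the first importance value with the semantic similarity into an MMR value is sketched below; the weighting `lam` and the linear form itself are assumptions, since the passage above specifies only the inputs of the calculation:

```python
def mmr_value(importance, semantic_sim, lam=0.7):
    """Assumed linear MMR form: a high importance value raises the MMR
    value, while semantic similarity to the current abstract candidate
    set lowers it, penalizing redundant sentences."""
    return lam * importance - (1 - lam) * semantic_sim
```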
Referring to fig. 5, fig. 5 is a schematic block diagram of another summary extracting apparatus according to an embodiment of the present application.
As shown in fig. 5, the digest extracting apparatus 400 includes: an acquisition module 401, a first summary screening module 402, a second summary screening module 403, a third summary screening module 404, a selection module 405, a determination module 406, a ranking module 407, and a summary determination module 408.
An obtaining module 401, configured to obtain a sentence set of a target text, where the target text is a text of a summary to be extracted;
a first abstract screening module 402, configured to calculate a sentence similarity between every two sentences in the sentence set, and screen a first abstract candidate set from the sentence set according to the sentence similarity based on a TextRank algorithm;
a second abstract screening module 403, configured to calculate a cosine similarity between every two sentences in the sentence set, and screen a second abstract candidate set from the sentence set according to the cosine similarity based on a TextRank algorithm;
a third abstract screening module 404, configured to screen a third abstract candidate set from the first abstract candidate set and screen a fourth abstract candidate set from the second abstract candidate set based on a maximal marginal relevance (MMR) algorithm and a preset number of sentences;
A selecting module 405, configured to select a statement with a preset number of abstract statements from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set, and the fourth abstract candidate set, respectively, so as to form a fused abstract candidate set;
A determining module 406, configured to determine whether the number of sentences that occur more than a preset number of occurrences is greater than or equal to a preset number of summary sentences;
The sorting module 407 is configured to sort the sentences in the fusion summary candidate set according to the number of occurrences if the number of the sentences with the number of occurrences greater than the preset number of occurrences is greater than or equal to the preset number of summary sentences;
and the summary determining module 408 is configured to sequentially select sentences from the fusion summary candidate set to write into the summary result set of the target text according to the ranking of each sentence in the fusion summary candidate set until the number of sentences in the summary result set reaches the preset number of summary sentences.
In an embodiment, the summary determining module 408 is further configured to:
If the number of sentences with the occurrence times larger than the preset occurrence times is smaller than the number of the preset abstract sentences, the sentences with the occurrence times larger than the preset occurrence times in the fusion abstract candidate set are moved to the abstract result set of the target text so as to update the fusion abstract candidate set;
acquiring an importance value of each statement in the updated fusion abstract candidate set, and sequencing the updated statements in the fusion abstract candidate set according to the importance value;
and according to the sequence of each statement in the updated fusion abstract candidate set, sequentially writing the selected statements in the abstract result set from the updated fusion abstract candidate set until the number of the statements in the abstract result set reaches the preset number of the abstract statements.
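The fallback branch above (move the sentences passing the occurrence threshold into the result set, then top up by importance value until the preset number of abstract sentences is reached) can be sketched as follows; the function and parameter names are illustrative:

```python
def fill_summary(fused_sentences, occurrences, importance, n_summary, min_occ):
    """Take every sentence whose occurrence count exceeds min_occ, then
    top up with the remaining sentences in descending importance order
    until n_summary sentences are collected."""
    result = [s for s in fused_sentences if occurrences[s] > min_occ]
    rest = [s for s in fused_sentences if s not in result]
    rest.sort(key=lambda s: importance[s], reverse=True)
    for s in rest:
        if len(result) >= n_summary:
            break
        result.append(s)
    return result[:n_summary]
```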
It should be noted that, for convenience and brevity of description, specific working processes of the above-described apparatus and each module and unit may refer to corresponding processes in the foregoing abstract extraction method embodiment, and will not be described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program which may be run on a computer device as shown in fig. 6.
Referring to fig. 6, fig. 6 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a server or a terminal device.
As shown in fig. 6, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause a processor to perform any one of the abstract extraction methods described above.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of a computer program in a non-volatile storage medium, which when executed by a processor, causes the processor to perform any one of the digest extraction methods.
The network interface is used for network communication such as transmitting assigned tasks and the like. It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored thereon, where the computer program includes program instructions, where the method implemented when the program instructions are executed may refer to the embodiments of the summary extraction method of the present application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) provided on the computer device.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (9)

1.一种摘要提取方法,其特征在于,包括:1. A method for extracting a summary, comprising:获取目标文本的语句集,其中,所述目标文本为待提取摘要的文本;Obtaining a sentence set of a target text, wherein the target text is a text to be extracted as a summary;计算所述语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据所述句子相似度,从所述语句集中筛选出第一摘要候选集;Calculating sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, selecting a first summary candidate set from the sentence set according to the sentence similarity;计算所述语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据所述余弦相似度,从所述语句集中筛选出第二摘要候选集;Calculating the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, selecting a second summary candidate set from the sentence set according to the cosine similarity;基于最大边缘相关MMR算法和预设语句个数,从所述第一摘要候选集中筛选出第三摘要候选集以及从所述第二摘要候选集中筛选出第四摘要候选集;Based on the Maximum Margin Relevance (MMR) algorithm and a preset number of sentences, a third summary candidate set is selected from the first summary candidate set, and a fourth summary candidate set is selected from the second summary candidate set;分别从所述第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集;Selecting a preset number of summary sentences from the first summary candidate set, the second summary candidate set, the third summary candidate set, and the fourth summary candidate set, respectively, to form a fused summary candidate set;统计所述融合摘要候选集中各语句的出现次数,并根据各语句的出现次数,从所述融合摘要候选集中筛选出所述目标文本的摘要结果集;Counting the number of occurrences of each sentence in the fused summary candidate set, and filtering out a summary result set of the target text from the fused summary candidate set according to the number of occurrences of each sentence;其中,所述计算所述语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据所述句子相似度,从所述语句集中筛选出第一摘要候选集,包括:The step of calculating the sentence similarity between every two sentences in the sentence set and selecting a first summary candidate set from the sentence set according to the sentence similarity based on the TextRank algorithm 
includes:统计所述语句集中每两个语句的相同词的数量和所述语句集中每个语句包含的词的个数;Counting the number of identical words between every two sentences in the sentence set and the number of words contained in each sentence in the sentence set;根据所述语句集中每两个语句的相同词的数量和所述语句集中每个语句包含的词的个数,计算所述语句集中每两个语句的句子相似度;Calculate the sentence similarity between every two sentences in the sentence set according to the number of identical words between every two sentences in the sentence set and the number of words contained in each sentence in the sentence set;基于TextRank算法,根据所述语句集中每两个语句之间的句子相似度,确定每个语句的第一重要性值,其中,所述第一重要性值用于表征语句在所述目标文本中的重要程度;Based on the TextRank algorithm, determining a first importance value of each sentence according to the sentence similarity between every two sentences in the sentence set, wherein the first importance value is used to characterize the importance of the sentence in the target text;根据所述语句集中每个语句的第一重要性值,从所述语句集中筛选出第一摘要候选集。A first summary candidate set is screened out from the sentence set according to a first importance value of each sentence in the sentence set.2.根据权利要求1所述的摘要提取方法,其特征在于,所述计算所述语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据所述余弦相似度,从所述语句集中筛选出第二摘要候选集,包括:2. 
The summary extraction method according to claim 1, wherein the calculating the cosine similarity between every two sentences in the sentence set, and screening, based on the TextRank algorithm and according to the cosine similarity, a second summary candidate set out of the sentence set comprises:
encoding each sentence in the sentence set to obtain a sentence vector for each sentence;
calculating the cosine similarity between every two sentences in the sentence set according to their sentence vectors;
determining, based on the TextRank algorithm and according to the cosine similarity between every two sentences, a second importance value for each sentence, wherein the second importance value characterizes how important the sentence is within the target text;
screening the second summary candidate set out of the sentence set according to the second importance value of each sentence.
3.
The summary extraction method according to claim 1, wherein the screening, based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences, a third summary candidate set out of the first summary candidate set comprises:
sorting the sentences in the first summary candidate set according to their first importance values, and assigning a sorting number to each sentence;
acquiring, from the first summary candidate set, the sentences whose sorting number is less than or equal to a preset sorting number, to form a candidate sentence set;
moving the sentence with the highest first importance value out of the candidate sentence set into an initially empty summary candidate set, thereby updating both the summary candidate set and the candidate sentence set;
calculating, based on a preset MMR value formula and according to the first importance value of each sentence in the candidate sentence set, an MMR value between the summary candidate set and each sentence in the candidate sentence set, wherein the MMR value characterizes the similarity between a sentence in the candidate sentence set and the summary candidate set;
moving the sentence with the highest MMR value into the summary candidate set, thereby updating both the summary candidate set and the candidate sentence set;
determining whether the number of sentences in the updated summary candidate set has reached the preset number of sentences;
if it has not, returning to the step of calculating, based on the preset MMR value formula and according to the first importance value of each sentence in the candidate sentence set, an MMR value between the summary candidate set and each sentence in the candidate sentence set;
if it has, taking the updated summary candidate set as the third summary candidate set.
4. The summary extraction method according to claim 3, wherein the calculating, based on a preset MMR value formula and according to the first importance value of each sentence in the candidate sentence set, an MMR value between the summary candidate set and each sentence in the candidate sentence set comprises:
encoding the summary candidate set to obtain a vector for the summary candidate set;
encoding each sentence in the candidate sentence set to obtain a vector for each sentence;
calculating the semantic similarity between the vector of the summary candidate set and the vector of each sentence in the candidate sentence set;
calculating, according to each semantic similarity and the first importance value of each sentence in the candidate sentence set, the MMR value between the summary candidate set and each sentence in the candidate sentence set.
5.
The summary extraction method according to any one of claims 1 to 4, wherein the screening, according to the number of occurrences of each sentence, a summary result set of the target text out of the fused summary candidate set comprises:
determining whether the number of sentences whose occurrence count exceeds a preset occurrence count is greater than or equal to a preset number of summary sentences;
if it is, sorting the sentences in the fused summary candidate set according to their occurrence counts;
selecting sentences from the fused summary candidate set in sorted order and writing them into the summary result set of the target text until the number of sentences in the summary result set reaches the preset number of summary sentences.
6.
The summary extraction method according to claim 5, further comprising, after the determining whether the number of sentences whose occurrence count exceeds the preset occurrence count is greater than or equal to the preset number of summary sentences:
if that number is less than the preset number of summary sentences, moving the sentences whose occurrence count exceeds the preset occurrence count out of the fused summary candidate set into the summary result set of the target text, thereby updating the fused summary candidate set;
acquiring the importance value of each sentence in the updated fused summary candidate set, and sorting the sentences in the updated fused summary candidate set according to those importance values;
selecting sentences from the updated fused summary candidate set in sorted order and writing them into the summary result set until the number of sentences in the summary result set reaches the preset number of summary sentences.
7.
A summary extraction device, comprising:
an acquisition module, configured to acquire a sentence set of a target text, wherein the target text is the text from which a summary is to be extracted;
a first summary screening module, configured to calculate the sentence similarity between every two sentences in the sentence set, and to screen, based on the TextRank algorithm and according to the sentence similarity, a first summary candidate set out of the sentence set;
a second summary screening module, configured to calculate the cosine similarity between every two sentences in the sentence set, and to screen, based on the TextRank algorithm and according to the cosine similarity, a second summary candidate set out of the sentence set;
a third summary screening module, configured to screen, based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences, a third summary candidate set out of the first summary candidate set and a fourth summary candidate set out of the second summary candidate set;
a selection module, configured to select a preset number of summary sentences from each of the first, second, third and fourth summary candidate sets to form a fused summary candidate set;
a summary determination module, configured to count the number of occurrences of each sentence in the fused summary candidate set, and to screen, according to the number of occurrences of each sentence, a summary result set of the target text out of the fused summary candidate set;
wherein the first summary screening module is further configured to:
count the number of words shared by every two sentences in the sentence set and the number of words contained in each sentence in the sentence set;
calculate the sentence similarity between every two sentences in the sentence set according to the number of shared words and the number of words contained in each sentence;
determine, based on the TextRank algorithm and according to the sentence similarity between every two sentences, a first importance value for each sentence, wherein the first importance value characterizes how important the sentence is within the target text;
screen the first summary candidate set out of the sentence set according to the first importance value of each sentence.
8. A computer device, comprising a processor, a memory, and a computer program stored in the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the summary extraction method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the summary extraction method according to any one of claims 1 to 6.
CN202010125189.2A (granted as CN111507090B, status: Active), priority and filing date 2020-02-27: Abstract extraction method, device, equipment and computer readable storage medium

Priority Applications (2)

Application Number / Priority Date / Filing Date / Title
CN202010125189.2A (CN111507090B): 2020-02-27 / 2020-02-27 / Abstract extraction method, device, equipment and computer readable storage medium
PCT/CN2020/112340 (WO2021169217A1): Abstract extraction method and apparatus, device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number / Priority Date / Filing Date / Title
CN202010125189.2A (CN111507090B): 2020-02-27 / 2020-02-27 / Abstract extraction method, device, equipment and computer readable storage medium

Publications (2)

Publication Number / Publication Date
CN111507090A: 2020-08-07
CN111507090B: 2024-11-15 (granted)

Family

Family ID: 71868960

Family Applications (1)

Application Number / Title / Priority Date / Filing Date
CN202010125189.2A (CN111507090B, Active): Abstract extraction method, device, equipment and computer readable storage medium, 2020-02-27 / 2020-02-27

Country Status (2)

Country / Link
CN: CN111507090B
WO: WO2021169217A1

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number / Priority date / Publication date / Assignee / Title
CN111507090B* (priority 2020-02-27, published 2024-11-15), 平安科技(深圳)有限公司: Abstract extraction method, device, equipment and computer readable storage medium
CN112307738B* (priority 2020-11-11, published 2024-06-14), 北京沃东天骏信息技术有限公司: Method and device for processing text
CN114117035A (priority 2021-11-25, published 2022-03-01), 北京航空航天大学: Unsupervised extractive summarization method for Cantonese forums
CN114328897A (priority 2021-12-27, published 2022-04-12), 浪潮通信信息系统有限公司: Conference summary method and model based on an improved TextRank algorithm
CN114203169B (priority 2022-01-26, published 2025-01-24), 合肥讯飞数码科技有限公司: Method, device, equipment and storage medium for determining speech recognition results
CN114595684B (priority 2022-02-11, published 2024-11-29), 北京三快在线科技有限公司: Digest generation method and device, electronic equipment and storage medium
CN115329087A (priority 2022-09-01, published 2022-11-11), 中电信数智科技有限公司: Data processing method, device, equipment and storage medium
CN115438654B (priority 2022-11-07, published 2023-03-24), 华东交通大学: Article title generation method, device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number / Priority date / Publication date / Assignee / Title
CN107766419A* (priority 2017-09-08, published 2018-03-06), 广州汪汪信息技术有限公司: TextRank document summarization method and device based on threshold denoising
CN110837556A* (priority 2019-10-30, published 2020-02-25), 深圳价值在线信息科技股份有限公司: Abstract generation method and device, terminal equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number / Priority date / Publication date / Assignee / Title
US7607083B2* (priority 2000-12-12, published 2009-10-20), NEC Corporation: Text summarization using relevance measures and latent semantic analysis
JP5235918B2* (priority 2010-01-21, published 2013-07-10), 日本電信電話株式会社: Text summarization apparatus, text summarization method, and text summarization program
CN105868175A* (priority 2015-12-03, published 2016-08-17), 乐视网信息技术(北京)股份有限公司: Abstract generation method and device
CN109977219B* (priority 2019-03-19, published 2021-04-09), 国家计算机网络与信息安全管理中心: Automatic text summary generation method and device based on heuristic rules
CN110362674B* (priority 2019-07-18, published 2020-08-04), 中国搜索信息科技股份有限公司: Extractive summary generation method for microblog news based on a convolutional neural network
CN111507090B* (priority 2020-02-27, published 2024-11-15), 平安科技(深圳)有限公司: Abstract extraction method, device, equipment and computer readable storage medium


Also Published As

Publication number / Publication date
CN111507090A: 2020-08-07
WO2021169217A1: 2021-09-02


Legal Events

Code / Title
PB01: Publication
REG: Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40030965)
SE01: Entry into force of request for substantive examination
GR01: Patent grant
