Detailed Description
The embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application, without inventive effort, fall within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative; not all of the elements and operations/steps are necessarily included, nor need they be performed in the order described. For example, some operations/steps may be divided, combined, or partially merged, so the actual order of execution may vary according to the actual situation.
The embodiments of the present application provide a summary extraction method, apparatus, device, and computer-readable storage medium. The summary extraction method can be applied to a server or a terminal device, where the server may be a single server or a server cluster consisting of a plurality of servers, and the terminal device may be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer, personal digital assistant, or wearable device. The following description takes a server as an example.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flow chart of a summary extracting method according to an embodiment of the application.
As shown in fig. 1, the summary extraction method includes steps S101 to S106.
Step S101, acquiring a sentence set of a target text, wherein the target text is the text from which a summary is to be extracted.
When a user needs to extract a summary from a text, the text can be uploaded to the server through a terminal device. The server splits the received text into sentences to obtain an initial sentence set, then cleans the initial sentence set to remove punctuation marks, stop words, and other such characters, thereby obtaining the sentence set of the text. The server may acquire the sentence set of the text at fixed intervals or in real time.
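The splitting and cleaning described above can be sketched as follows; the sentence delimiters and the stop-word handling are illustrative assumptions, not part of the claimed method:

```python
import re

def build_sentence_set(text, stop_words=frozenset()):
    """Split a text into sentences, then strip punctuation and stop words."""
    # Split on common sentence-ending punctuation (assumed delimiter set).
    raw = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    cleaned = []
    for sentence in raw:
        tokens = re.findall(r"\w+", sentence.lower())        # drops punctuation
        tokens = [t for t in tokens if t not in stop_words]  # drops stop words
        if tokens:
            cleaned.append(tokens)
    return cleaned
```

For example, `build_sentence_set("The cat sat. The dog ran!", {"the"})` yields `[["cat", "sat"], ["dog", "ran"]]`.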
Step S102, calculating sentence similarity between every two sentences in the sentence set, and screening a first abstract candidate set from the sentence set according to the sentence similarity based on a TextRank algorithm.
After a sentence set of the target text is obtained, sentence similarity between every two sentences in the sentence set is calculated, and based on a TextRank algorithm, a first abstract candidate set is screened from the sentence set according to the sentence similarity between every two sentences in the sentence set.
Specifically, the number of words shared by every two sentences in the sentence set and the number of words contained in each sentence are counted; the sentence similarity of every two sentences is calculated from these counts; based on the TextRank algorithm, a first importance value of each sentence is determined from the pairwise sentence similarities; and the first summary candidate set is screened from the sentence set according to the first importance value of each sentence. The first importance value represents the importance of a sentence within the target text: a sentence with a higher first importance value is more important in the target text, and a sentence with a lower first importance value is less important. The formula for calculating the first importance value of a sentence based on the TextRank algorithm is as follows:
WS(Vi) = (1 − d) + d × Σ_{Vj ∈ In(Vi)} [ wji / Σ_{Vk ∈ Out(Vj)} wjk ] × WS(Vj)

where WS(Vi) on the left side of the equation is the importance value of sentence Vi; wji is the weight of the edge from sentence Vj to sentence Vi; d is the damping coefficient, representing the probability that a given sentence points to any other sentence, optionally 0.85; In(Vi) is the set of sentences with edges pointing to sentence Vi, and Out(Vj) is the set of sentences that sentence Vj points to; the weight wji is the similarity of the two sentences Si and Sj, and the weight wjk is the similarity between sentence Sj and any sentence that sentence Vj points to. The sentence similarity between every two sentences in the sentence set is calculated as follows:
sim(Si, Sj) = |{ tk | tk ∈ Si ∧ tk ∈ Sj }| / ( log|Si| + log|Sj| )

where |{ tk | tk ∈ Si ∧ tk ∈ Sj }| is the number of words occurring in both sentences Si and Sj, each of which consists of a plurality of words; tk is the k-th word; |Si| is the number of words contained in sentence Si; and |Sj| is the number of words contained in sentence Sj. The sentence similarity of every two sentences in the sentence set can be calculated with this similarity formula, and the first importance value of each sentence can then be calculated with the importance formula above.
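A minimal sketch of the two formulas above, assuming every sentence pair is linked in both directions (as is usual when applying TextRank to text); the iteration cutoff of 50 is an assumption:

```python
import math

def overlap_similarity(si, sj):
    """Similarity from the formula above: shared words over the sum of
    log sentence lengths."""
    shared = len(set(si) & set(sj))
    denom = math.log(len(si)) + math.log(len(sj))
    return shared / denom if denom > 0 else 0.0

def textrank_scores(sentences, sim, d=0.85, iterations=50):
    """Iteratively compute WS(Vi).  Every sentence is linked to every other,
    weighted by pairwise similarity."""
    n = len(sentences)
    w = [[sim(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]   # sum of outgoing edge weights per node
    ws = [1.0] * n
    for _ in range(iterations):
        ws = [(1 - d) + d * sum(w[j][i] / out_sum[j] * ws[j]
                                for j in range(n)
                                if j != i and out_sum[j] > 0)
              for i in range(n)]
    return ws
```

Sentences that share vocabulary with many others reinforce each other; a sentence with no shared words settles near 1 − d.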
In one embodiment, the first summary candidate set is screened from the sentence set according to the first importance value of each sentence as follows: the sentences in the sentence set are sorted by their first importance values to obtain the first summary candidate set; or, the sentences are sorted by their first importance values, sentences are then taken from the sentence set in sorted order until a set number of sentences has been obtained, and the obtained sentences together form the first summary candidate set. The set number may be chosen based on the actual situation, and the present application is not limited in this respect.
Step S103, calculating the cosine similarity between every two sentences in the sentence set, and screening a second summary candidate set from the sentence set according to the cosine similarity based on the TextRank algorithm.
After the sentence set of the target text is obtained, the cosine similarity between every two sentences in the sentence set is calculated, and based on the TextRank algorithm, a second summary candidate set is screened from the sentence set according to the cosine similarity between every two sentences.
Specifically, each sentence in the sentence set is encoded to obtain a corresponding sentence vector; the cosine similarity between every two sentences is calculated from these sentence vectors; based on the TextRank algorithm, a second importance value of each sentence is determined from the pairwise cosine similarities; and the second summary candidate set is screened from the sentence set according to the second importance value of each sentence. The second importance value represents the importance of a sentence within the target text: a sentence with a higher second importance value is more important in the target text, and a sentence with a lower second importance value is less important. The formula for calculating the second importance value of a sentence based on the TextRank algorithm is as follows:
WS(Vi) = (1 − d) + d × Σ_{Vj ∈ In(Vi)} [ wji / Σ_{Vk ∈ Out(Vj)} wjk ] × WS(Vj)

where WS(Vi) on the left side of the equation is the importance value of sentence Vi; wji is the weight of the edge from sentence Vj to sentence Vi; d is the damping coefficient, representing the probability that a given sentence points to any other sentence, optionally 0.85; In(Vi) is the set of sentences with edges pointing to sentence Vi, and Out(Vj) is the set of sentences that sentence Vj points to; here the weight wji is the cosine similarity of the two sentences Si and Sj, and the weight wjk is the cosine similarity between sentence Sj and any sentence that sentence Vj points to.
The calculation formula of cosine similarity of the two sentences Si and Sj is as follows:
cos(Si, Sj) = (vi · vj) / (‖vi‖ × ‖vj‖)

where vi is the sentence vector of sentence Si and vj is the sentence vector of sentence Sj. The cosine similarity of every two sentences in the sentence set can be calculated with this formula, and the second importance value of each sentence can be calculated with the importance formula above.
In an embodiment, the sentence vector of a sentence may be determined as follows: each word in the sentence is encoded to obtain a corresponding word vector, the average of these word vectors is calculated, and the average word vector is used as the sentence vector of the sentence.
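The averaging scheme and the cosine formula above can be sketched as follows; the toy embedding table in the usage example is an assumption for illustration:

```python
import math

def sentence_vector(tokens, word_vectors):
    """Average the word vectors of a sentence, per the scheme above.
    `word_vectors` maps word -> vector and is an assumed embedding table."""
    dim = len(next(iter(word_vectors.values())))
    acc = [0.0] * dim
    hits = 0
    for token in tokens:
        if token in word_vectors:
            hits += 1
            for k, component in enumerate(word_vectors[token]):
                acc[k] += component
    return [x / hits for x in acc] if hits else acc

def cosine_similarity(u, v):
    """cos(Si, Sj) = (vi . vj) / (|vi| * |vj|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

With a toy table `{"cat": [1.0, 0.0], "dog": [0.0, 1.0]}`, the sentence ["cat", "dog"] averages to [0.5, 0.5].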
In one embodiment, the second summary candidate set is screened from the sentence set according to the second importance value of each sentence as follows: the sentences in the sentence set are sorted by their second importance values to obtain the second summary candidate set; or, the sentences are sorted by their second importance values, sentences are then taken from the sentence set in sorted order until a set number of sentences has been obtained, and the obtained sentences together form the second summary candidate set. The set number may be chosen based on the actual situation, and the present application is not limited in this respect.
Step S104, screening a third summary candidate set from the first summary candidate set and a fourth summary candidate set from the second summary candidate set based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences.
After the first and second summary candidate sets are obtained, the server screens a third summary candidate set from the first and a fourth summary candidate set from the second based on the maximal marginal relevance (MMR) algorithm and the preset number of sentences. The third summary candidate set is a subset of the first, and the fourth summary candidate set is a subset of the second. It should be noted that the preset number of sentences may be set based on the actual situation, and the present application is not particularly limited in this respect. The MMR algorithm eliminates redundancy among sentences and improves the accuracy of summary extraction.
In one embodiment, as shown in fig. 2, step S104 includes sub-steps S1041 to S1047.
S1041, sorting each sentence in the first abstract candidate set according to the first importance value of each sentence in the first abstract candidate set, and obtaining the sorting number of each sentence.
And sequencing each sentence in the first abstract candidate set according to the first importance value of each sentence in the first abstract candidate set, and acquiring the sequencing number of each sentence. The ranking number of the sentence having the higher first importance value is smaller, and the ranking number of the sentence having the lower first importance value is larger.
S1042, acquiring sentences with the sorting numbers smaller than or equal to the preset sorting numbers from the first abstract candidate set to form a candidate sentence set.
After the sentences in the first summary candidate set are sorted, the sentences whose ranking numbers are smaller than or equal to a preset ranking number are obtained from the first summary candidate set to form a candidate sentence set. It should be noted that the preset ranking number may be set based on the actual situation, and the present application is not particularly limited in this respect. For example, if the preset ranking number is 10, the sentences with ranking numbers less than or equal to 10 are obtained from the first summary candidate set to form the candidate sentence set.
S1043, moving the sentence with the highest first importance value in the candidate sentence set to a blank abstract candidate set so as to update the abstract candidate set and the candidate sentence set.
Specifically, the server acquires the sentence with the highest first importance value from the candidate sentence set, and moves the sentence to a preset blank abstract candidate set to update the abstract candidate set and the candidate sentence set. For example, the candidate sentence set includes 5 sentences, namely, sentence a, sentence B, sentence C, sentence D and sentence E, and the first importance value of sentence C is the highest, and the updated abstract candidate set includes sentence C, and the updated candidate sentence set includes sentence a, sentence B, sentence D and sentence E.
S1044, based on a preset MMR value calculation formula, calculating the MMR value between the summary candidate set and each sentence in the candidate sentence set according to the first importance value of each sentence in the candidate sentence set.
Wherein, the MMR value is used for representing the similarity degree between the sentences in the candidate sentence set and the abstract candidate set, and the preset calculation formula of the MMR value is as follows:
MMRi = α · WS(Vi) − (1 − α) · sim(i, set)
where MMRi is the MMR value of sentence Vi; α is a weight coefficient whose value ranges from 0 to 1; WS(Vi) is the first importance value of sentence Vi; set is the summary candidate set; and sim(i, set) is the semantic similarity between sentence Vi and the summary candidate set. From the first importance value of each sentence in the candidate sentence set and this formula, the MMR value between the summary candidate set and each sentence in the candidate sentence set can be calculated.
Specifically, the summary candidate set is encoded to obtain a corresponding vector; each sentence in the candidate sentence set is encoded to obtain its corresponding vector; the semantic similarity between the vector of the summary candidate set and the vector of each sentence in the candidate sentence set is calculated; and, based on the MMR value calculation formula, the MMR value between the summary candidate set and each sentence is calculated from each semantic similarity and the first importance value of each sentence in the candidate sentence set. For example, if the first importance value of a sentence in the candidate sentence set is x and the similarity between the sentence and the summary candidate set is s, the MMR value between the sentence and the summary candidate set is α·x − (1 − α)·s.
The summary candidate set is encoded into a vector as follows: each sentence in the summary candidate set is encoded to obtain a corresponding sentence vector; the average of these sentence vectors is calculated, and the average vector is used as the vector of the summary candidate set.
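The MMR formula of sub-step S1044 reduces to a one-line computation; α = 0.7 is an illustrative choice within the stated 0 to 1 range:

```python
def mmr_value(importance, similarity_to_summary, alpha=0.7):
    """MMRi = alpha * WS(Vi) - (1 - alpha) * sim(i, set), per the formula
    above.  alpha = 0.7 is an assumed, illustrative weight."""
    return alpha * importance - (1 - alpha) * similarity_to_summary
```

A sentence that is important but similar to what is already selected scores lower than an equally important, dissimilar one.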
S1045, the sentence with the highest MMR value is moved to the abstract candidate set so as to update the abstract candidate set and the candidate sentence set.
After the MMR values of the abstract candidate set and each sentence in the candidate sentence set are obtained through calculation, the server stores the sentence with the highest MMR value into the abstract candidate set so as to update the abstract candidate set and the candidate sentence set. For example, the summary candidate set includes a sentence C, the candidate sentence set includes a sentence a, a sentence B, a sentence D, and a sentence E, and the sentence with the highest MMR value is a sentence E, and the updated summary candidate set includes a sentence C and a sentence E, and the updated candidate sentence set includes a sentence a, a sentence B, and a sentence D.
S1046, determining whether the number of the updated sentences in the abstract candidate set reaches the preset number of sentences.
The server determines whether the number of sentences in the updated summary candidate set reaches the preset number of sentences. If it does not, sub-step S1044 is executed again; that is, based on the preset MMR value calculation formula, the MMR value between the summary candidate set and each sentence in the candidate sentence set is recalculated according to the first importance value of each sentence in the candidate sentence set. It should be noted that the preset number of sentences may be set based on the actual situation, and the present application is not particularly limited in this respect.
S1047, if the number of the updated sentences in the summary candidate set reaches the preset number of sentences, taking the updated summary candidate set as a third summary candidate set.
And if the number of sentences in the updated abstract candidate set reaches the preset number of sentences, taking the updated abstract candidate set as a third abstract candidate set. For example, the number of the preset sentences is 5, the updated abstract candidate set includes sentences a, B, C, D and E, and a total of 5 sentences, and at this time, the number of the sentences in the abstract candidate set reaches the preset number of the sentences, so that the abstract candidate set including the sentences a, B, C, D and E is taken as the third abstract candidate set.
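Sub-steps S1043 to S1047 amount to a greedy selection loop, sketched below; `sim(sentence, summary)` is an assumed callback giving the semantic similarity between a sentence and the current summary candidate set:

```python
def mmr_select(candidates, importance, sim, n_sentences, alpha=0.7):
    """Greedy sketch of sub-steps S1043-S1047: seed the summary with the most
    important sentence, then repeatedly move the highest-MMR sentence over
    until `n_sentences` have been chosen."""
    pool = list(candidates)
    seed = max(pool, key=lambda s: importance[s])   # sub-step S1043
    summary = [seed]
    pool.remove(seed)
    while pool and len(summary) < n_sentences:      # sub-steps S1044-S1047
        pick = max(pool, key=lambda s: alpha * importance[s]
                                       - (1 - alpha) * sim(s, summary))
        summary.append(pick)
        pool.remove(pick)
    return summary
```

With a high-importance sentence B that is nearly identical to the seed A, the penalty term makes the loop prefer a less important but dissimilar sentence C, which is the redundancy elimination the MMR step is for.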
It can be understood that the fourth summary candidate set is extracted in a manner similar to the third, specifically: sorting each sentence in the second summary candidate set according to its second importance value and obtaining the ranking number of each sentence; obtaining, from the second summary candidate set, the sentences whose ranking numbers are smaller than or equal to a preset ranking number to form a candidate sentence set; moving the sentence with the highest second importance value in the candidate sentence set to a blank summary candidate set so as to update the summary candidate set and the candidate sentence set; based on the preset MMR value calculation formula, calculating the MMR value between the summary candidate set and each sentence in the candidate sentence set according to the second importance value of each sentence, where the MMR value represents the degree of similarity between a sentence in the candidate sentence set and the summary candidate set; moving the sentence with the highest MMR value to the summary candidate set so as to update the summary candidate set and the candidate sentence set; determining whether the number of sentences in the updated summary candidate set reaches the preset number of sentences; if it does not, executing again the step of calculating the MMR value between the summary candidate set and each sentence in the candidate sentence set according to the second importance value of each sentence; and if the number of sentences in the updated summary candidate set reaches the preset number of sentences, taking the updated summary candidate set as the fourth summary candidate set.
Step S105, selecting sentences of a preset summary sentence number from the first summary candidate set, the second summary candidate set, the third summary candidate set and the fourth summary candidate set, respectively, so as to form a fused summary candidate set.
After obtaining the four abstract candidate sets of the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set, the server selects sentences with preset abstract sentence numbers from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set respectively to form a fusion abstract candidate set. It should be noted that, the number of the preset summary sentences is smaller than the number of the preset sentences, and the number of the preset summary sentences may be set based on actual situations, which is not particularly limited in the present application.
In an embodiment, the sentences in the first, second, third, and fourth summary candidate sets are each sorted by importance value, and, following each sentence's position in the sorted order, the preset number of summary sentences is selected from each of the four candidate sets and written into the fused summary candidate set. The larger a sentence's importance value, the earlier it ranks; the smaller the value, the later it ranks.
For example, the first summary candidate set is [A, B, C, D, E, F, G, H, I, J], the second summary candidate set is [A, B, C, D, E, G, H, I, J, K], the third summary candidate set is [C, D, E, F, G, H, I], the fourth summary candidate set is [D, E, G, H, I, J, K], and the preset number of summary sentences is 5. The sentences selected from the first summary candidate set are [A, B, C, D, E], from the second [A, B, C, D, E], from the third [C, D, E, F, G], and from the fourth [D, E, G, H, I]; the fused summary candidate set is therefore { [A, B, C, D, E], [A, B, C, D, E], [C, D, E, F, G], [D, E, G, H, I] }.
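Step S105 reduces to taking the first few sentences from each already-sorted candidate set; a minimal sketch:

```python
def fuse_candidate_sets(candidate_sets, n_summary):
    """Step S105 sketch: take the first `n_summary` sentences from each
    candidate set (the sets are assumed already sorted by importance)."""
    return [s[:n_summary] for s in candidate_sets]
```

On the four example sets above with n_summary = 5, this reproduces the four selections of the worked example.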
And S106, counting the occurrence times of each sentence in the fusion abstract candidate set, and screening the abstract result set of the target text from the fusion abstract candidate set according to the occurrence times of each sentence.
After the fused summary candidate set is obtained, the number of occurrences of each sentence in it is counted, and the summary result set of the target text is screened from the fused summary candidate set according to those counts; that is, sentences whose occurrence count is greater than or equal to a preset occurrence count are selected from the fused summary candidate set as the summary result set of the target text. The occurrence count of a sentence is the number of times it appears in the fused summary candidate set.
For example, the fused summary candidate set is { [A, B, C, D, E], [A, B, C, D, E], [C, D, E, F, G], [D, E, G, H, I] }; sentence A occurs 2 times, sentence B 2 times, sentence C 3 times, sentence D 4 times, sentence E 4 times, sentence F 1 time, sentence G 2 times, sentence H 1 time, and sentence I 1 time. With a preset occurrence count of 3, sentences C, D, and E would form the summary result set.
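The counting and thresholding of step S106 can be sketched with a `Counter`; the threshold value in the usage example is illustrative:

```python
from collections import Counter

def summary_by_occurrences(fused_sets, min_occurrences):
    """Step S106 sketch: count how often each sentence appears across the
    fused candidate sets and keep those meeting the threshold."""
    counts = Counter(s for subset in fused_sets for s in subset)
    selected = [s for s, c in counts.items() if c >= min_occurrences]
    return selected, counts
```

On the worked example above, with a threshold of 3, the sentences C, D, and E survive.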
According to the summary extraction method provided by this embodiment, the TextRank algorithm first screens a first summary candidate set from the sentence set according to the sentence similarity between every two sentences, and a second summary candidate set according to the cosine similarity between every two sentences; the maximal marginal relevance (MMR) algorithm and a preset number of sentences are then used to screen a third summary candidate set from the first and a fourth summary candidate set from the second; finally, the four candidate sets are fused to determine the summary result set of the text. This reduces redundancy between the extracted summary sentences and effectively improves the accuracy of text summary extraction.
Referring to fig. 3, fig. 3 is a flow chart of another method for extracting a summary according to an embodiment of the application.
As shown in fig. 3, the summary extraction method includes steps S201 to S208.
Step S201, acquiring a sentence set of a target text, wherein the target text is the text from which a summary is to be extracted.
When a user needs to extract a summary from a text, the text can be uploaded to the server through a terminal device. The server splits the received text into sentences to obtain an initial sentence set, then cleans the initial sentence set to remove punctuation marks, stop words, and other such characters, thereby obtaining the sentence set of the text. The server may acquire the sentence set of the text at fixed intervals or in real time.
Step S202, calculating sentence similarity between every two sentences in the sentence set, and screening a first abstract candidate set from the sentence set according to the sentence similarity based on a TextRank algorithm.
After a sentence set of the target text is obtained, sentence similarity between every two sentences in the sentence set is calculated, and based on a TextRank algorithm, a first abstract candidate set is screened from the sentence set according to the sentence similarity between every two sentences in the sentence set.
Step S203, calculating the cosine similarity between every two sentences in the sentence set, and screening a second summary candidate set from the sentence set according to the cosine similarity based on the TextRank algorithm.
After the sentence set of the target text is obtained, the cosine similarity between every two sentences in the sentence set is calculated, and based on the TextRank algorithm, a second summary candidate set is screened from the sentence set according to the cosine similarity between every two sentences.
Step S204, screening a third summary candidate set from the first summary candidate set and a fourth summary candidate set from the second summary candidate set based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences.
After the first and second summary candidate sets are obtained, the server screens a third summary candidate set from the first and a fourth summary candidate set from the second based on the maximal marginal relevance (MMR) algorithm and the preset number of sentences. The third summary candidate set is a subset of the first, and the fourth summary candidate set is a subset of the second. It should be noted that the preset number of sentences may be set based on the actual situation, and the present application is not particularly limited in this respect. The MMR algorithm eliminates redundancy among sentences and improves the accuracy of summary extraction.
Step S205, selecting sentences with preset abstract sentence numbers from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set respectively to form a fused abstract candidate set.
After obtaining the four abstract candidate sets of the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set, the server selects sentences with preset abstract sentence numbers from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set and the fourth abstract candidate set respectively to form a fusion abstract candidate set. It should be noted that, the number of the preset summary sentences is smaller than the number of the preset sentences, and the number of the preset summary sentences may be set based on actual situations, which is not particularly limited in the present application.
Step S206, counting the occurrence times of each sentence in the fusion abstract candidate set, and determining whether the number of sentences with the occurrence times larger than the preset occurrence times is larger than or equal to the preset abstract sentence number.
After the fused summary candidate set is obtained, the number of occurrences of each sentence in it is counted, and it is determined whether the number of sentences whose occurrence count is greater than the preset occurrence count is greater than or equal to the preset number of summary sentences. The occurrence count of a sentence is the number of times it appears in the fused summary candidate set. It should be noted that the preset number of summary sentences may be set based on the actual situation, and the present application is not particularly limited in this respect.
Step S207, if the number of sentences with the occurrence frequency greater than the preset occurrence frequency is greater than or equal to the preset summary sentence number, sorting the sentences in the fusion summary candidate set according to the occurrence frequency.
And if the number of sentences with the occurrence frequency larger than the preset occurrence frequency is larger than or equal to the preset summary sentence number, sorting the sentences in the fusion summary candidate set according to the occurrence frequency. The higher the number of occurrences, the earlier the ranking of the sentences, and the lower the number of occurrences, the later the ranking of the sentences.
In an embodiment, if the number of sentences whose occurrence count is greater than the preset occurrence count is less than the preset number of summary sentences, the sentences in the fused summary candidate set whose occurrence count exceeds the preset count are moved to the summary result set of the target text so as to update the fused summary candidate set; the importance value of each sentence in the updated fused summary candidate set is obtained, and the sentences are sorted by importance value; then, following that order, sentences are selected from the updated fused summary candidate set and written into the summary result set until the number of sentences in the summary result set reaches the preset number of summary sentences.
Step S208: according to the ranking of the sentences in the fusion summary candidate set, sentences are selected in order from the fusion summary candidate set and written into the summary result set of the target text until the number of sentences in the summary result set reaches the preset number of summary sentences.
After the sentences in the fusion summary candidate set are sorted, sentences are selected in order according to their ranking and written into the summary result set of the target text until the number of sentences in the summary result set reaches the preset number of summary sentences. For example, suppose the fusion summary candidate set is { [A, B, C, D, E], [C, D, E, F, G], [D, E, G, H, I] }: sentence A occurs 2 times, B occurs 2 times, C occurs 3 times, D occurs 4 times, E occurs 4 times, F occurs 1 time, G occurs 2 times, H occurs 1 time, and I occurs 1 time, so the ranking of the sentences is [D, E, C, A, B, G, F, H, I]. With a preset number of 5 summary sentences and a preset occurrence count of 2, the summary result set of the target text is [D, E, C, A, B].
According to the summary extraction method provided in this embodiment, a first summary candidate set is screened from the sentence set using the TextRank algorithm according to the sentence similarity between every two sentences, and a second summary candidate set is screened using the TextRank algorithm according to the cosine similarity between every two sentences; the MMR algorithm is then used to screen a third summary candidate set from the first summary candidate set and a fourth summary candidate set from the second summary candidate set, and a preset number of summary sentences is selected from each of the four candidate sets to form a fusion summary candidate set; finally, the occurrence count of each sentence in the fusion summary candidate set is counted, and when the count is greater than or equal to the preset occurrence count, sentences are selected from the set in order of occurrence count and written into the summary result set of the target text. This reduces redundancy among the extracted summary sentences and effectively improves the accuracy of text summary extraction.
Referring to fig. 4, fig. 4 is a schematic block diagram of a summary extracting apparatus according to an embodiment of the application.
As shown in fig. 4, the digest extracting apparatus 300 includes: an acquisition module 301, a first summary screening module 302, a second summary screening module 303, a third summary screening module 304, a selection module 305, and a summary determination module 306.
The obtaining module 301 is configured to obtain a sentence set of a target text, where the target text is a text of a summary to be extracted;
The first abstract screening module 302 is configured to calculate a sentence similarity between every two sentences in the sentence set, and screen a first abstract candidate set from the sentence set according to the sentence similarity based on a TextRank algorithm;
A second abstract screening module 303, configured to calculate a cosine similarity between every two sentences in the sentence set, and screen a second abstract candidate set from the sentence set according to the cosine similarity based on a TextRank algorithm;
A third abstract screening module 304, configured to screen a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set based on a maximal marginal relevance (MMR) algorithm and a preset number of sentences;
a selecting module 305, configured to select a statement with a preset number of abstract statements from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set, and the fourth abstract candidate set, respectively, so as to form a fused abstract candidate set;
The abstract determining module 306 is configured to count the occurrence times of each sentence in the fused abstract candidate set, and screen the abstract result set of the target text from the fused abstract candidate set according to the occurrence times of each sentence.
In one embodiment, the first summary screening module 302 is further configured to:
Counting the number of words shared by every two sentences in the sentence set and the number of words contained in each sentence in the sentence set;
Calculating the sentence similarity between every two sentences in the sentence set according to the number of shared words and the number of words contained in each sentence;
Based on a TextRank algorithm, determining a first importance value of each sentence according to sentence similarity between every two sentences in the sentence set, wherein the first importance value is used for representing the importance degree of the sentence in the target text;
And screening a first abstract candidate set from the sentence set according to the first importance value of each sentence in the sentence set.
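The source does not fix the exact similarity formula; as one plausible instantiation (an assumption, for illustration only), the steps above can be sketched with the word-overlap similarity from the original TextRank paper and a simple power iteration over the resulting sentence graph:

```python
import math

def overlap_similarity(s1, s2):
    # Shared-word count normalized by the log lengths of the two sentences
    # (the formula from the original TextRank paper; assumed here).
    shared = len(set(s1) & set(s2))
    if shared == 0 or len(s1) == len(s2) == 1:
        return 0.0
    return shared / (math.log(len(s1)) + math.log(len(s2)))

def textrank_scores(sentences, d=0.85, iterations=50):
    # Build the weighted similarity graph and power-iterate the TextRank
    # recurrence to obtain a first importance value per sentence.
    n = len(sentences)
    w = [[overlap_similarity(a, b) if i != j else 0.0
          for j, b in enumerate(sentences)]
         for i, a in enumerate(sentences)]
    out_weight = [sum(row) or 1.0 for row in w]  # avoid division by zero
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(w[j][i] * scores[j] / out_weight[j]
                                    for j in range(n))
                  for i in range(n)]
    return scores

tokenized = [["the", "cat", "sat"], ["the", "cat", "ran"], ["dogs", "bark"]]
scores = textrank_scores(tokenized)
# The two overlapping sentences outrank the isolated one.
```

The first abstract candidate set would then be the top-scoring sentences under these importance values.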
In one embodiment, the second summary screening module 303 is further configured to:
Encoding each sentence in the sentence set to obtain a sentence vector corresponding to each sentence in the sentence set;
according to the sentence vectors corresponding to each sentence in the sentence set, calculating the cosine similarity between every two sentences in the sentence set;
Based on a TextRank algorithm, determining a second importance value of each sentence according to cosine similarity between every two sentences in the sentence set, wherein the second importance value is used for representing the importance degree of the sentence in the target text;
And screening a second abstract candidate set from the sentence set according to the second importance value of each sentence in the sentence set.
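The encoder used to produce the sentence vectors is not specified in the source; purely as an illustration (bag-of-words vectors stand in for whatever encoder is used), the cosine similarity between two encoded sentences can be computed as:

```python
import math
from collections import Counter

def bow_vector(tokens, vocabulary):
    # Encode a tokenized sentence as a bag-of-words count vector over a
    # fixed vocabulary (a stand-in for any sentence encoder; assumption).
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def cosine_similarity(u, v):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vocab = ["the", "cat", "sat", "ran"]
v1 = bow_vector(["the", "cat", "sat"], vocab)
v2 = bow_vector(["the", "cat", "ran"], vocab)
print(round(cosine_similarity(v1, v2), 3))  # -> 0.667
```

The resulting pairwise similarities feed the same TextRank recurrence, yielding the second importance value of each sentence.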
In one embodiment, the third summary screening module 304 is further configured to:
Sorting the sentences in the first abstract candidate set according to their first importance values and obtaining a rank number for each sentence;
acquiring the sentences whose rank number is less than or equal to a preset rank number from the first abstract candidate set to form a candidate sentence set;
moving the sentence with the highest first importance value from the candidate sentence set into an initially empty abstract candidate set, thereby updating the abstract candidate set and the candidate sentence set;
calculating, based on a preset MMR value calculation formula and according to the first importance value of each sentence in the candidate sentence set, an MMR value for each sentence in the candidate sentence set with respect to the abstract candidate set, wherein the MMR value represents the degree of similarity between a sentence in the candidate sentence set and the abstract candidate set;
moving the sentence with the highest MMR value into the abstract candidate set, thereby updating the abstract candidate set and the candidate sentence set;
determining whether the number of sentences in the updated abstract candidate set reaches the preset number of sentences;
if the number of sentences in the updated abstract candidate set does not reach the preset number of sentences, returning to the step of calculating, based on the preset MMR value calculation formula, the MMR value of each sentence in the candidate sentence set with respect to the abstract candidate set according to the first importance value of each sentence in the candidate sentence set;
and if the number of sentences in the updated abstract candidate set reaches the preset number of sentences, taking the updated abstract candidate set as the third abstract candidate set.
In one embodiment, the third summary screening module 304 is further configured to:
encoding the abstract candidate set to obtain a vector corresponding to the abstract candidate set;
encoding each sentence in the candidate sentence set to obtain a vector corresponding to each sentence;
calculating the semantic similarity between the vector corresponding to the abstract candidate set and the vector corresponding to each sentence in the candidate sentence set;
and calculating the MMR value of each sentence in the candidate sentence set with respect to the abstract candidate set according to the semantic similarity and the first importance value of each sentence in the candidate sentence set.
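The preset MMR value calculation formula is not given in the source; a standard MMR form, lam * importance - (1 - lam) * max-similarity-to-selected, is assumed below. The function and parameter names are illustrative only, and the selection loop mirrors the move-highest-MMR-sentence iteration described above:

```python
def mmr_select(sentences, importance, similarity, k, lam=0.5):
    # Illustrative sketch; the exact MMR formula and lam are assumptions.
    # Seed the abstract candidate set with the most important sentence, then
    # repeatedly move in the remaining sentence whose MMR value
    #   lam * importance(s) - (1 - lam) * max similarity(s, selected)
    # is highest, until k sentences have been selected.
    remaining = list(sentences)
    first = max(remaining, key=lambda s: importance[s])
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < k:
        def mmr_value(s):
            redundancy = max(similarity(s, t) for t in selected)
            return lam * importance[s] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: "b" is nearly as important as "c" but redundant with "a",
# so the redundancy penalty makes MMR pick "c" instead.
importance = {"a": 3.0, "b": 2.0, "c": 1.9}
pair_sims = {("b", "a"): 0.95, ("c", "a"): 0.0, ("b", "c"): 0.1}
sim = lambda s, t: pair_sims.get((s, t), pair_sims.get((t, s), 0.0))
print(mmr_select(["a", "b", "c"], importance, sim, 2))  # -> ['a', 'c']
```

Run once with the first importance values to obtain the third abstract candidate set, and once with the second importance values for the fourth.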
Referring to fig. 5, fig. 5 is a schematic block diagram of another summary extracting apparatus according to an embodiment of the present application.
As shown in fig. 5, the digest extracting apparatus 400 includes: an acquisition module 401, a first summary screening module 402, a second summary screening module 403, a third summary screening module 404, a selection module 405, a determination module 406, a ranking module 407, and a summary determination module 408.
An obtaining module 401, configured to obtain a sentence set of a target text, where the target text is a text of a summary to be extracted;
a first abstract screening module 402, configured to calculate a sentence similarity between every two sentences in the sentence set, and screen a first abstract candidate set from the sentence set according to the sentence similarity based on a TextRank algorithm;
a second abstract screening module 403, configured to calculate a cosine similarity between every two sentences in the sentence set, and screen a second abstract candidate set from the sentence set according to the cosine similarity based on a TextRank algorithm;
a third abstract screening module 404, configured to screen a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set based on a maximal marginal relevance (MMR) algorithm and a preset number of sentences;
A selecting module 405, configured to select a statement with a preset number of abstract statements from the first abstract candidate set, the second abstract candidate set, the third abstract candidate set, and the fourth abstract candidate set, respectively, so as to form a fused abstract candidate set;
A determining module 406, configured to determine whether the number of sentences whose occurrence count is greater than a preset occurrence count is greater than or equal to a preset number of summary sentences;
The sorting module 407 is configured to sort the sentences in the fusion summary candidate set by occurrence count if the number of sentences whose occurrence count is greater than the preset occurrence count is greater than or equal to the preset number of summary sentences;
and the summary determining module 408 is configured to select sentences in order from the fusion summary candidate set, according to the ranking of the sentences in the set, and write them into the summary result set of the target text until the number of sentences in the summary result set reaches the preset number of summary sentences.
In an embodiment, the summary determining module 408 is further configured to:
If the number of sentences whose occurrence count is greater than the preset occurrence count is less than the preset number of summary sentences, the sentences in the fusion summary candidate set whose occurrence count is greater than the preset occurrence count are moved to the summary result set of the target text, thereby updating the fusion summary candidate set;
an importance value of each sentence in the updated fusion summary candidate set is obtained, and the sentences in the updated fusion summary candidate set are sorted by importance value;
and according to the ranking of the sentences in the updated fusion summary candidate set, sentences are selected in order from the updated set and written into the summary result set until the number of sentences in the summary result set reaches the preset number of summary sentences.
It should be noted that, for convenience and brevity of description, specific working processes of the above-described apparatus and each module and unit may refer to corresponding processes in the foregoing abstract extraction method embodiment, and will not be described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program which may be run on a computer device as shown in fig. 6.
Referring to fig. 6, fig. 6 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a server or a terminal device.
As shown in fig. 6, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause a processor to perform any one of the summary extraction methods described herein.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when executed by the processor, the computer program causes the processor to perform any one of the summary extraction methods described herein.
The network interface is used for network communication such as transmitting assigned tasks and the like. It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored thereon, where the computer program includes program instructions, where the method implemented when the program instructions are executed may refer to the embodiments of the summary extraction method of the present application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description and do not indicate the relative merits of the embodiments. While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the application shall be subject to the protection scope of the claims.