Disclosure of Invention
The present application provides a data deep-processing method and computer device for patent use rewriting, and aims to solve the problem that conventional patent deep-processing technology cannot support retrieval and analysis targeted specifically at the technical use of patents. Patent use rewriting is, for the first time, treated as a standalone category of patent-data deep processing, so that the technical uses of one or more patent documents can be analyzed quickly and accurately, and a user can quickly and accurately retrieve target documents with a specific technical use (application scenario) from a large number of patent documents.
To this end, the present application provides the following technical solution:
A data deep-processing method for patent use rewriting, comprising the following steps:
A. Use generation model training
a1) obtaining a patent document sample library for model training and evaluation, and a target use text corresponding to each patent document;
a2) performing data preprocessing on each patent document in the sample library, each patent document yielding several documents through preprocessing, corresponding respectively to the patent title, the abstract, and several sections of the description body;
a3) performing preprocessing and word segmentation on the target use texts, then constructing the use text features based on the log-linear law of word frequency and on document-coverage ranking, combined with stop-word techniques;
a4) determining, by mathematical statistics on the use text features, a long-text compression algorithm for generating the compressed text of a patent document as the input of the use generation model;
a5) training and evaluating a use generation model using the compressed texts and their corresponding target use texts;
B. Generating use text with the model
acquiring a patent document to be processed, performing data preprocessing according to step a2), then generating the compressed text of the patent document according to the long-text compression algorithm determined in step a4), and inputting the compressed text into the trained use generation model to obtain the use text of the patent document.
Optionally, the data preprocessing comprises:
patent title processing: removing the leading '一种' ('a kind of') prefix;
abstract processing: extracting the full text and converting non-Chinese punctuation into Chinese punctuation;
description processing: extracting five parts (the technical field, background art, summary of the invention, beneficial effects, and the end of the body text) and converting non-Chinese punctuation into Chinese punctuation;
the processed patent title, the abstract, and the five parts of the description, seven key documents in total, are used for the subsequent text compression;
for the other contents of the description, non-Chinese punctuation is converted into Chinese punctuation and the result is kept in reserve as needed.
Optionally, constructing the use text features based on the log-linear law of word frequency and on document-coverage ranking, combined with stop-word techniques, comprises:
counting the word frequency of the target use texts, generating the curve of the logarithm of word frequency against word rank, taking all high-frequency words to the left of the inflection point where the curve turns from nonlinear decline to linear decline, and then removing stop single-character words; the resulting set is recorded as the first group of high-frequency words, N words in total;
counting, for each word in the target use texts, the number of target use texts it covers, ranking the words by this coverage from high to low, removing stop single-character words among the top-ranked words, and retaining the first N high-frequency words, recorded as the second group of high-frequency words;
taking the intersection of the first group and the second group of high-frequency words to obtain the use text features.
Optionally, obtaining the first group of high-frequency words by counting the word frequency of the target use texts comprises:
a311) counting high-frequency words separately for each of several technical fields;
a312) counting high-frequency words without distinguishing technical fields;
a313) merging the high-frequency words obtained by the two statistical modes a311) and a312) to obtain the first group of high-frequency words;
and obtaining the second group of high-frequency words by counting, for each word in the target use texts, the number of target use texts it covers comprises:
a321) counting high-frequency words separately for each of several technical fields;
a322) counting high-frequency words without distinguishing technical fields;
a323) merging the high-frequency words obtained by the two statistical modes a321) and a322) to obtain the second group of high-frequency words.
Optionally, constructing the use text features based on the log-linear law of word frequency and on document-coverage ranking, combined with stop-word techniques, further comprises:
merging the intersection of the first and second groups of high-frequency words with manual features to obtain the use text features; the manual features are words related to use expressions provided by experts.
Optionally, step a4) comprises:
calculating the weight of each use text feature from its word frequency and document coverage, called the feature weight for short;
assuming that features appearing earlier in a sentence are more important, and modeling the sentence weight from the feature weights and the in-sentence position weights of the use text features;
determining a sentence-to-target-use relevance formula based on the sentence weight and a sentence-length factor;
ranking the candidate sentences by their relevance to the use according to the sentence-to-target-use relevance formula, and extracting key sentences according to the compression length threshold of each key document to obtain the compressed text.
Optionally, the feature weight is calculated by formula (1):

$$w_i = \frac{tf_i + df_i}{\sum_{feature_j \in FSet} \left( tf_j + df_j \right)} \tag{1}$$

where
$w_i$: the weight of the i-th feature;
$tf_i$: the word frequency of the i-th feature;
$df_i$: the document coverage of the i-th feature;
$FSet$: the set of all features;
$feature_j$: the j-th feature of the feature set $FSet$;
the position weight of a use text feature within a sentence is calculated by formula (2):

$$p_i = \frac{1}{index_i + 1} \tag{2}$$

where
$p_i$: the position weight of a feature;
$index_i$: the character index of the feature in the sentence, counting from 0;
the sentence weight is calculated by formula (3):

$$W_{sentence} = \sum_{feature_i \in sentence} w_i \cdot p_i \tag{3}$$

where
$W_{sentence}$: the weight of a patent text sentence;
$feature_i$: the i-th feature of the patent text sentence;
the determined sentence-to-target-use relevance formula is formula (4):

$$Corr(sentence, target) = \frac{W_{sentence} \cdot (k_1 + 1)}{W_{sentence} + k_1 \cdot \left( 1 - b + b \cdot \frac{L_{sentence}}{L_{avg}} \right)} \tag{4}$$

where
$L_{sentence}$: the sentence length;
$L_{avg}$: the average length of all sentences of the text;
$k_1$: a hyperparameter for adjusting the importance of the sentence weight, set to 1.6;
$b$: a hyperparameter for adjusting the influence of the sentence length, set to 0.75.
Optionally, the method for determining the compression length threshold of each key document includes:
calculating, according to formulas (5) and (6), the softmax of the longest common substring LCS between each key document and the target use text, and the softmax of the proportion of the target use words appearing in each key document relative to the full set of target use words:

$$S^{lcs}_i = \frac{e^{LCS(doc_i, target)}}{\sum_j e^{LCS(doc_j, target)}} \tag{5}$$

$$Ratio(doc, target) = \frac{\left| WSet_{doc} \cap WSet_{target} \right|}{\left| WSet_{target} \right|}, \qquad S^{ratio}_i = \frac{e^{Ratio(doc_i, target)}}{\sum_j e^{Ratio(doc_j, target)}} \tag{6}$$

where
$doc$: a key document;
$target$: the target use text;
$WSet_{doc}$: the set of non-duplicated words of a document;
$WSet_{target}$: the set of non-duplicated words of the target use text;
calculating the compression length threshold of each key document according to formula (7):

$$L_i = L_{limit} \cdot \frac{S^{lcs}_i + S^{ratio}_i}{2} \tag{7}$$

where
$L_{limit}$: the length limit of the compressed text input to the use generation model;
$L_i$: the compression length threshold of the i-th key document of the compressed text.
Optionally, the key sentence extraction algorithm is as follows:
establishing a first sentence list and traversing it in descending order of the relevance between the candidate sentences and the use;
putting the currently traversed sentence into a second sentence list, and judging whether the total length of all sentences in the second sentence list exceeds the compression length threshold of the key document they belong to;
if the total length of all sentences in the second sentence list exceeds the compression length threshold of the key document, further judging the number of sentences in the second sentence list;
if the second sentence list contains two or more sentences, splicing all sentences in the second sentence list except the last one; then, if the remaining space is longer than a manually preset minimum sentence length threshold, invoking the extra-long sentence processing algorithm to handle the boundary-crossing sentence; and returning;
if the second sentence list contains only one sentence, invoking the extra-long sentence processing algorithm to obtain a compressed sentence, and returning;
if the total length of all sentences in the second sentence list does not exceed the compression length threshold of the key document, traversing the next sentence in the first sentence list.
Optionally, extracting key sentences to obtain the compressed text further comprises emphasis processing of highly relevant sentences and supplement processing of low-relevance compression results; wherein:
the emphasis processing of highly relevant sentences is: collecting the sentence with the highest relevance to the use from the compression of each key document to form a list of highly contributing sentences, and splicing them into a highly use-relevant paragraph of the patent document; when the length of a compressed key document is smaller than a set proportion of its compression length threshold, extracting key sentences from the highly use-relevant paragraph and appending the extraction result to the end of that compressed key document;
the supplement processing of low-relevance compression results is: summing the relevance to the target use over all sentences of a key document's compressed text to obtain the relevance between that compressed text and the target; summing over all seven key documents' compressed texts of a patent document to obtain the relevance between the patent document's compressed text and the target; computing the average of this relevance over all patent documents; and, when the relevance between a patent document's compressed text and the target is below the average and the compressed length is at least one short-sentence slot away from the compression length threshold, cleaning the claims and the other contents of the description (excluding the technical field, background art, summary of the invention, beneficial effects, and end of the body text), extracting key sentences from them with the compression algorithm, and appending the result to the compressed texts of the key documents.
Optionally, the use generation model employs a model of the T5 series.
The present application further provides a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the above data deep-processing method for patent use rewriting.
The present application further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above data deep-processing method for patent use rewriting.
The application has at least the following beneficial effects:
the method comprises the steps of firstly, independently rewriting patent application as a general category of deep processing of patent data, constructing application text characteristics for application text written by an operator (an expert) corresponding to a sample library patent document based on a word frequency logarithmic linear rule and covering document number sequencing, combining stop word skills, performing mathematical statistics on the application text characteristics, determining a long text compression algorithm, and training and evaluating an application generation model by utilizing compressed texts generated by data preprocessing and the long text compression algorithm of all patent documents and corresponding manually written application target texts; for the patent document to be processed, the compressed text of the patent document is generated according to the same data preprocessing and long text compression algorithm, and the trained purpose generating model is input, so that the purpose text of the patent document can be obtained, a computer can be facilitated to understand the purpose elements of the patent document more deeply, and the research on the calculability of the patent document is promoted. Data service products such as databases constructed based on the method of the present application can support one or more patent documents to be analyzed quickly and accurately to obtain the technical applications, and can also support users to inquire target documents of a certain technical application (application scenario) quickly and accurately from a large amount of patent documents, for example: when the search element relates to the use information, the user can preferentially generate the text (search item) in use to search, thereby improving the search efficiency.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in FIG. 1, a data deep-processing method for patent use rewriting is provided, comprising the following steps:
A. Use generation model training
a1) obtaining a patent document sample library for model training and evaluation, and a target use text corresponding to each patent document;
the sample library here preferably covers all technical fields, with no fewer than 15,000 patent documents;
the target use text here is determined by experts reading each patent document in the sample library; its wording may be extracted directly from the patent document or restated. The target use texts serve as the gold answers in the model training and evaluation stages; the experts should therefore be familiar with different technical/industrial fields and have a certain basic knowledge of patents.
a2) performing data preprocessing on each patent document in the sample library, each patent document yielding several documents through preprocessing, corresponding respectively to the patent title, the abstract of the description, and several sections of the description body (with the patent title at the beginning);
a3) performing data preprocessing and word segmentation on the target use texts, then constructing the use text features based on the log-linear law of word frequency and on document-coverage ranking, combined with stop-word techniques;
constructing features based on the log-linear law of word frequency mainly focuses on the inflection point of the curve of log word frequency against word rank, where nonlinear decline turns into linear decline, and collects all high-frequency words to the left of the inflection point; constructing features based on document-coverage ranking identifies high-frequency words mainly from the number of patent documents each word covers; the intersection of the high-frequency words (features) extracted from these two aspects is then taken;
a4) determining, by mathematical statistics on the use text features, a long-text compression algorithm for generating the compressed text of a patent document as the input of the use generation model;
a5) training and evaluating a use generation model using the compressed texts and their corresponding target use texts.
B. Generating use text with the model
acquiring a patent document to be processed, performing data preprocessing according to step a2), then generating the compressed text of the patent document according to the long-text compression algorithm determined in step a4), and inputting the compressed text into the trained use generation model to obtain the use text of the patent document.
The data deep-processing method for patent use rewriting comprises four aspects, namely use feature construction, long-text compression, model training and evaluation, and use text generation, as shown in FIG. 2, and has at least the following technical features:
Use text feature construction. Features are constructed based on the log-linear law of word frequency combined with stop-word techniques, mainly as follows: count the word frequency of the use texts, visualize the logarithm of the word frequency, take out all high-frequency words to the left of the inflection point where nonlinear decline turns into linear decline, and then remove stop single-character words such as '和' ('and') and '等' ('etc.') to obtain the high-frequency word features of use text generation; likewise, the document-coverage word features of use text generation are constructed by ranking words in descending order of the number of documents they cover and cutting at the inflection value; the intersection of the high-frequency word features and the document-coverage word features yields the use text features.
Feature weight modeling. A weight formula is designed from the use word frequency and the document coverage; its mean, mode, and standard deviation are calculated, and a smoothing value is determined according to the task requirements for smoothing the manual features that lack word frequency and document coverage.
Sentence weight modeling. Assuming that features appearing earlier in a sentence are more valuable than features appearing later, it follows that the earlier a feature's position in the sentence, the more important it is; a position factor is therefore introduced to design the position weight, and the sentence weight is modeled from the feature weights and the in-sentence position weights of the features.
Sentence-to-target-use relevance modeling. A sentence-to-target-use relevance formula is designed based on the sentence weight and a sentence-length factor.
Long-text compression. This comprises a relevance-based key sentence extraction algorithm and, for extra-long sentences, a feature clause-splitting algorithm based on the feature weights.
The implementation of this embodiment is described below, taking the use rewriting of Chinese patent documents as an example:
1. Patent original text preprocessing flow:
patent title processing: removing the leading '一种' ('a kind of') prefix;
abstract processing: extracting the full text and converting non-Chinese punctuation into Chinese punctuation for unified subsequent processing;
description processing: extracting the five parts (technical field, background art, summary of the invention, beneficial effects, and end of the body text) and converting non-Chinese punctuation into Chinese punctuation;
other content processing: converting non-Chinese punctuation into Chinese punctuation and keeping the result in reserve as needed, for the low-relevance compression result supplementation in the long-text compression stage;
in addition, the LCS, the proportion Ratio, and their softmax values can be calculated in advance for computing the length to which each document is to be compressed (the compression length threshold) in the long-text compression stage; specifically:
calculating the LCS and its softmax between each document (patent title, abstract, technical field, background art, summary of the invention, beneficial effects, end of the body text, etc.) and the target use text, detailed in formula (5) and the related description below;
calculating the Ratio and its softmax between each document (patent title, abstract, technical field, background art, summary of the invention, beneficial effects, end of the body text, etc.) and the target use text, detailed in formula (6) and the related description below.
2. Use text feature construction flow:
Features are constructed based on the log-linear law of word frequency combined with stop-word techniques: count the word frequency of the target use texts, visualize the logarithm of the word frequency, take out all high-frequency words to the left of the inflection point where nonlinear decline turns into linear decline (as shown in FIG. 3), and then remove stop single-character words such as '和' ('and') and '等' ('etc.') as the features of use text generation. Likewise, features are constructed by ranking high-frequency words by the number of documents they cover. The construction steps are as follows (a code sketch follows this list):
1) Extracting features from the nonlinear region of the log word frequency curve (excluding stop single-character words such as '和' and '等' and common stop words):
1.1) counting the word-frequency features of the target use texts by category; the categories here, i.e., the technical fields, may for example follow the IPC classes or subclasses to which the patent documents belong, counted separately;
1.2) counting the word-frequency features of the target use texts without distinguishing categories (all target use texts are treated as one category for feature construction);
1.3) finally, merging the high-frequency words (features) obtained from the per-category and the category-agnostic statistics;
2) Extracting, from the document-coverage ranking, the same number of top-ranked features as the word-frequency features (excluding stop single-character words such as '和' and '等'):
2.1) counting the document-coverage word features of the target use text words by category; the categories here, i.e., the technical fields, may for example follow the IPC classes or subclasses to which the patent documents belong, counted separately;
2.2) counting the document-coverage word features of the target use text words without distinguishing categories (all target use texts are treated as one category for feature construction);
2.3) finally, merging the document-coverage words (features) obtained from the per-category and the category-agnostic statistics;
3) Additionally collecting manual features:
words (features) related to use expressions provided by experts, such as 'used for', 'use', 'having', 'efficacy', 'function', 'action', 'achieving', 'effect', and the like;
4) Merging to obtain the use text features:
taking the intersection of the features from the nonlinear region of the log word frequency curve and the top-ranked document-coverage features, and then merging in the manual features to obtain the use text features.
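For illustration, the following is a minimal Python sketch of this feature-construction flow, assuming jieba-style tokenized target use texts as input; the elbow heuristic for locating the inflection point, the stop-word set STOP_SINGLE, the manual-feature word list, and names such as build_use_features are illustrative assumptions rather than the original implementation.

```python
import math
from collections import Counter

STOP_SINGLE = {"的", "和", "等"}  # illustrative stop single-character words
MANUAL_FEATURES = {"用于", "使用", "具有", "功效",
                   "功能", "作用", "实现", "效果"}  # expert-provided use words (illustrative)

def words_left_of_inflection(counts):
    """Keep the words left of the inflection point of the log-frequency vs.
    rank curve, using a simple elbow heuristic: cut where the per-rank drop
    in log frequency first falls below the average drop over the curve."""
    ranked = counts.most_common()
    if len(ranked) < 3:
        return [w for w, _ in ranked]
    logf = [math.log(c) for _, c in ranked]
    avg_drop = (logf[0] - logf[-1]) / (len(logf) - 1)
    cut = len(ranked)
    for i in range(1, len(logf)):
        if logf[i - 1] - logf[i] < avg_drop:
            cut = i
            break
    return [w for w, _ in ranked[:cut]]

def drop_stop_singles(words):
    return [w for w in words if not (len(w) == 1 and w in STOP_SINGLE)]

def build_use_features(tokenized_targets):
    """tokenized_targets: one word list per target use text."""
    # group 1: high-frequency words left of the log word frequency inflection point
    tf = Counter(w for toks in tokenized_targets for w in toks)
    group1 = set(drop_stop_singles(words_left_of_inflection(tf)))
    n = len(group1)
    # group 2: top-N words by number of target use texts covered
    df = Counter(w for toks in tokenized_targets for w in set(toks))
    group2 = set(drop_stop_singles([w for w, _ in df.most_common()])[:n])
    # intersection of the two groups, merged with the expert manual features
    return (group1 & group2) | MANUAL_FEATURES
```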
3. Feature weight modeling
Accumulating the word frequency and the document coverage of all words of the use texts, the feature weight formula is designed as formula (1):

$$w_i = \frac{tf_i + df_i}{\sum_{feature_j \in FSet} \left( tf_j + df_j \right)} \tag{1}$$

where
$w_i$: the weight of the i-th feature;
$tf_i$: the word frequency of the i-th feature;
$df_i$: the document coverage of the i-th feature;
$FSet$: the set of all features;
$feature_j$: the j-th feature of the feature set $FSet$.
The weights of manual features that lack word frequency and document coverage are handled by a smoothing technique. The invention determines a smoothing value according to the task requirements and experiments, from the mean and standard deviation of the use text feature weights:

$$w_{smooth} = \overline{w} + n \cdot \sigma$$

where
$w_{smooth}$: the smoothing value;
$\overline{w}$: the mean of the use text feature weights;
$\sigma$: the standard deviation of the use text feature weights;
$n$: a multiplier, greater than or equal to zero.
In the experiments of this embodiment, n was set to 0, 1, 2, and 3 in turn; the effect was worst at n = 2, and considering computational efficiency, n is taken as 0. A sketch of the weight and smoothing computation follows.
After model training and evaluation, the feature weight values are fixed; when the model is applied to compute use texts, only the already determined feature weight values are needed.
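A short sketch of the feature-weight and smoothing computation, under the reconstructed formula (1) above (the original formula images did not survive reproduction); tf and df are the Counter objects from the feature-construction sketch.

```python
import statistics

def feature_weights(features, tf, df):
    """Reconstructed formula (1): w_i = (tf_i + df_i) / sum over FSet of (tf_j + df_j)."""
    denom = sum(tf[f] + df[f] for f in features)
    return {f: (tf[f] + df[f]) / denom for f in features}

def smoothing_value(weights, n=0):
    """Smoothing value for manual features lacking tf/df: mean + n * std
    (this embodiment tests n in {0, 1, 2, 3} and takes n = 0, i.e. the mean)."""
    vals = list(weights.values())
    return statistics.mean(vals) + n * statistics.pstdev(vals)

def apply_smoothing(weights, manual_features, n=0):
    """Replace the weight of every manual feature with the smoothing value."""
    base = {f: w for f, w in weights.items() if f not in manual_features}
    sm = smoothing_value(base, n)
    return {f: (sm if f in manual_features else w) for f, w in weights.items()}
```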
4. Sentence weight modeling
Assuming that features appearing earlier in a sentence are more valuable than features appearing later, it follows that the earlier a feature's position in the sentence, the more important it is; a position factor is therefore introduced, and the feature position weight is designed as formula (2):

$$p_i = \frac{1}{index_i + 1} \tag{2}$$

where
$p_i$: the position weight of a feature;
$index_i$: the character index of the feature in the sentence, counting from 0;
the sentence weight is calculated as formula (3):

$$W_{sentence} = \sum_{feature_i \in sentence} w_i \cdot p_i \tag{3}$$

where
$W_{sentence}$: the weight of a patent text sentence;
$feature_i$: the i-th feature of the patent text sentence.
5. Sentence-to-target-use relevance modeling
Because the generated text is shorter than the original text, shorter sentences are preferred; the relevance between a sentence and the target use is modeled as formula (4):

$$Corr(sentence, target) = \frac{W_{sentence} \cdot (k_1 + 1)}{W_{sentence} + k_1 \cdot \left( 1 - b + b \cdot \frac{L_{sentence}}{L_{avg}} \right)} \tag{4}$$

where
$L_{sentence}$: the sentence length;
$L_{avg}$: the average length of all sentences of the text;
$k_1$: a hyperparameter for adjusting the importance of the sentence weight, set to 1.6;
$b$: a hyperparameter for adjusting the influence of the sentence length, set to 0.75.
The sentence-to-use relevance algorithm is as follows (a code sketch follows this listing):
1. Define dictionaries:
1.1) a sentence dictionary, id2sent = {id: sentence};
1.2) a sentence length dictionary, id2length = {id: length(sentence)};
1.3) a sentence-to-use relevance dictionary, id2correlation = {id: correlation};
1.4) calculate the average of all sentence lengths;
2. Traverse all sentences:
2.1) traverse all features in the sentence to obtain the sentence weight:
calculate the feature position weight according to formula (2);
substitute the feature weight and the feature position weight into formula (3) to calculate the sentence weight;
2.2) substitute the sentence weight into formula (4) to calculate the sentence-to-target relevance;
2.3) store the sentence relevance into the dictionary id2correlation;
3. Return the results id2sent, id2length, and id2correlation.
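A runnable sketch of this relevance pass under the reconstructed formulas (2) to (4); feat_weights is the weight dict from the previous sketch, and sentences are plain strings. Taking the first character index of each feature as its in-sentence position is an assumption.

```python
def sentence_use_relevance(sentences, feat_weights, k1=1.6, b=0.75):
    """Return id2sent, id2length, id2correlation for a list of sentences."""
    id2sent = dict(enumerate(sentences))
    id2length = {i: len(s) for i, s in id2sent.items()}
    l_avg = (sum(id2length.values()) / len(id2length)) if id2length else 1.0
    l_avg = l_avg or 1.0
    id2correlation = {}
    for i, sent in id2sent.items():
        w_sent = 0.0
        for feat, w in feat_weights.items():
            idx = sent.find(feat)  # character index counting from 0
            if idx >= 0:
                w_sent += w / (idx + 1)  # feature weight * position weight, formulas (2)/(3)
        norm = 1 - b + b * id2length[i] / l_avg  # sentence-length factor of formula (4)
        id2correlation[i] = w_sent * (k1 + 1) / (w_sent + k1 * norm) if w_sent else 0.0
    return id2sent, id2length, id2correlation
```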
6. Long text compression
The long-text compression algorithm of this embodiment has the following features: (i) customizable compression length; (ii) multi-document compression, i.e., the patent text is divided into multiple documents such as the patent title, abstract, technical field, background art, summary of the invention, beneficial effects, and end of the body text; (iii) better interpretability than neural networks.
Customizable compression length means that, given the length limit that the use text generation model imposes on its input, the compression length required for each document can be calculated by formula (7). Multi-document compression means that the LCS (longest common substring) and the use-feature Ratio are calculated for each document's content, substituted into formulas (5) and (6) respectively to compute their softmax values, and the results, together with the generation model's limit on the text length, are substituted into formula (7) to obtain the length to which each document needs to be compressed (each document's compression length threshold).
The softmax of the longest common substring LCS between each document and the target use text is calculated as formula (5):

$$S^{lcs}_i = \frac{e^{LCS(doc_i, target)}}{\sum_j e^{LCS(doc_j, target)}} \tag{5}$$

where
$doc$: a key document, i.e., the patent title, abstract, technical field, background art, summary of the invention, beneficial effects, end of the body text, etc.;
$target$: the target use text.
Each document of the patent text and each target use text are segmented into words, and the proportion of the target use words appearing in each document relative to the full set of target use words is calculated to obtain the target use proportion, as formula (6):

$$Ratio(doc, target) = \frac{\left| WSet_{doc} \cap WSet_{target} \right|}{\left| WSet_{target} \right|}, \qquad S^{ratio}_i = \frac{e^{Ratio(doc_i, target)}}{\sum_j e^{Ratio(doc_j, target)}} \tag{6}$$

where
$doc$: a key document, i.e., the patent title, abstract, technical field, background art, summary of the invention, beneficial effects, end of the body text, etc.;
$target$: the target use text;
$WSet_{doc}$: the set of non-duplicated words of a document;
$WSet_{target}$: the set of non-duplicated words of the target use text.
The length to be compressed of each document of the patent text is calculated as formula (7):

$$L_i = L_{limit} \cdot \frac{S^{lcs}_i + S^{ratio}_i}{2} \tag{7}$$

where
$L_i$: the length to which the i-th document of the patent text needs to be compressed (its compression length threshold);
$L_{limit}$: the input length of the generation model finetune, i.e., the length limit of the compressed patent text.
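A sketch of the compression-length-threshold computation under the reconstructed formulas (5) to (7); the even averaging of the two softmax terms in formula (7) is an assumption, and the LCS is the longest common substring computed by character-level dynamic programming.

```python
import math

def lcs_len(a, b):
    """Length of the longest common substring of a and b (DP over characters)."""
    best, prev = 0, [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # max-shifted, numerically identical
    s = sum(exps)
    return [e / s for e in exps]

def compression_thresholds(docs, doc_tokens, target, target_tokens, l_limit):
    """docs: the key-document strings; doc_tokens / target_tokens: word lists.
    Returns one compression length threshold per key document, formula (7)."""
    t_set = set(target_tokens)
    s_lcs = softmax([lcs_len(d, target) for d in docs])
    s_ratio = softmax([len(set(toks) & t_set) / len(t_set) for toks in doc_tokens])
    return [l_limit * (a + r) / 2 for a, r in zip(s_lcs, s_ratio)]
```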
A key sentence is a sentence whose contribution to generating the target text is higher than that of the other sentences in the original text. The goal of compression is to find the key sentences valuable for generating the target text and to remove redundant, valueless sentences.
The key sentence extraction algorithm is specifically as follows:
1. Split the text at punctuation such as periods and semicolons, select the sentences containing features to obtain a candidate sentence list, and judge the total length of all sentences in the list; if it is smaller than the compression length threshold, return; otherwise, continue;
2. Call the sentence-to-use relevance algorithm, i.e., formula (4), to calculate the relevance between the candidate sentences and the use;
3. Sort the candidate sentences by their relevance to the use from high to low to obtain a sentence ID list id_lst;
4. Traverse the sentence ID list id_lst:
4.1) put the sentence ID into the list selected_id_lst;
4.2) judge whether the total length of the sentences corresponding to all IDs in selected_id_lst exceeds the compression length threshold;
(1) if it exceeds the compression length threshold:
a) if selected_id_lst holds at least 2 IDs, splice the sentences corresponding to all IDs of selected_id_lst except the last one;
if the remaining space is longer than the manually preset minimum sentence length threshold SHORT_SENT_LEN_THETA, call the extra-long sentence processing algorithm to handle the boundary-crossing sentence; the minimum sentence length threshold SHORT_SENT_LEN_THETA is a manually preset hyperparameter meaning the minimum sentence length obtained by compression, taking the value 8 or 16;
return;
b) otherwise, selected_id_lst holds only 1 ID; call the extra-long sentence processing algorithm to obtain a compressed sentence, and return;
(2) otherwise, continue the iteration.
Two lists are involved here: id_lst and selected_id_lst.
id_lst is the list of all sentence IDs sorted by relevance;
when the sentence indicated by an ID of id_lst is traversed, the ID is put into the list selected_id_lst, and the total length of all sentences whose IDs are currently in selected_id_lst is checked against the compression length threshold; the method stops once the compression length threshold is reached, and otherwise continues traversing the next ID of id_lst. A Python sketch of this procedure follows.
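A sketch of the key sentence extraction loop, reusing sentence_use_relevance from the relevance sketch above; the sentence-splitting regex, the joining separator, and the long_sent_handler callback (e.g. the compress_long_sentence sketch below, wrapped as lambda s, sp: compress_long_sentence(s, sp, feat_weights)) are illustrative.

```python
import re

SHORT_SENT_LEN_THETA = 8  # manually preset minimum compressed sentence length (8 or 16)

def extract_key_sentences(text, feat_weights, threshold, long_sent_handler):
    """Relevance-based key sentence extraction (sketch).
    long_sent_handler(sentence, space) must compress a boundary-crossing
    sentence into at most `space` characters."""
    sentences = [s for s in re.split(r"[。；;]", text)
                 if s and any(f in s for f in feat_weights)]
    if sum(len(s) for s in sentences) < threshold:
        return "。".join(sentences)  # already below the compression length threshold
    id2sent, id2length, id2corr = sentence_use_relevance(sentences, feat_weights)
    id_lst = sorted(id2corr, key=id2corr.get, reverse=True)  # by relevance, descending
    selected_id_lst = []
    for sid in id_lst:
        selected_id_lst.append(sid)
        if sum(id2length[i] for i in selected_id_lst) <= threshold:
            continue  # threshold not reached yet: keep iterating
        if len(selected_id_lst) >= 2:
            kept = "。".join(id2sent[i] for i in selected_id_lst[:-1])
            space = threshold - len(kept)
            if space > SHORT_SENT_LEN_THETA:
                # compress the boundary-crossing sentence into the remaining space
                kept += "。" + long_sent_handler(id2sent[selected_id_lst[-1]], space)
            return kept
        # only one (extra-long) sentence selected: compress it directly
        return long_sent_handler(id2sent[sid], threshold)
    return "。".join(id2sent[i] for i in selected_id_lst)
```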
Extra-long sentence processing means that when the extracted most relevant first key sentence is longer than the compression length threshold, it needs to be compressed itself; the same processing also applies to boundary-crossing sentences.
The extra-long sentence processing algorithm is specifically described as follows:
1. Split the extra-long sentence at punctuation such as commas, colons, and enumeration commas to obtain a clause list tmp_sent_lst;
2. Traverse the clause list tmp_sent_lst:
2.1) if the clause contains features:
(1) if the clause length exceeds the compression length, call the feature clause-splitting algorithm and put the resulting feature clauses into short_sent_lst;
(2) otherwise, put the clause directly into the list short_sent_lst;
2.2) otherwise, skip the current clause (it has no value for generating the target text) and traverse the next one;
3. Calculate the total length of all clauses in the clause list short_sent_lst obtained from the extra-long sentence:
3.1) if the total length of all clauses in short_sent_lst does not exceed the compression length threshold, return;
3.2) otherwise, sort the clauses by their relevance to the target use from high to low to obtain a clause ID list short_id_lst;
4. Traverse the ID list short_id_lst:
1) put the ID into the selected clause ID list selected_id_lst;
2) if the total length of the clauses corresponding to all IDs in selected_id_lst does not exceed the compression length threshold, continue adding;
3) otherwise:
(1) if selected_id_lst holds more than one clause, slice the last clause to the substring that exactly fills the remaining space, then splice all clauses in selected_id_lst; break;
(2) otherwise, the first clause alone already exceeds the available space; slice this single clause to the proper length; break;
5. Return the short sentence obtained from processing the extra-long sentence. A code sketch of this algorithm follows.
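A sketch of the extra-long sentence processing, again reusing sentence_use_relevance; the clause separators and the fallback slicing of an over-long feature clause (where the original calls the feature clause-splitting algorithm, sketched after the next listing) are illustrative.

```python
import re

def compress_long_sentence(sentence, space, feat_weights, clause_splitter=None):
    """Extra-long sentence processing (sketch): split at commas, colons, and
    enumeration commas, keep feature-bearing clauses, and fill at most `space`
    characters in relevance order. clause_splitter, if given, cuts an over-long
    feature clause (see the feature clause-splitting sketch); else it is sliced."""
    tmp_sent_lst = [c for c in re.split(r"[，,：:、]", sentence) if c]
    short_sent_lst = []
    for clause in tmp_sent_lst:
        if not any(f in clause for f in feat_weights):
            continue  # no feature: no value for generating the target text
        if len(clause) > space:
            short_sent_lst.extend(clause_splitter(clause) if clause_splitter
                                  else [clause[:space]])
        else:
            short_sent_lst.append(clause)
    if sum(len(c) for c in short_sent_lst) <= space:
        return "，".join(short_sent_lst)
    _, _, id2corr = sentence_use_relevance(short_sent_lst, feat_weights)
    short_id_lst = sorted(id2corr, key=id2corr.get, reverse=True)
    selected = []
    for cid in short_id_lst:
        selected.append(short_sent_lst[cid])
        total = sum(len(c) for c in selected)
        if total > space:  # slice the last clause to exactly fill the remaining space
            overflow = total - space
            last = selected[-1]
            selected[-1] = last[:len(last) - overflow] if overflow < len(last) else ""
            selected = [c for c in selected if c]
            break
    return "，".join(selected)
```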
Feature clause-splitting, mentioned in the extra-long sentence processing algorithm, means that a feature-bearing key clause that still needs to be compressed is cut by the feature clause-splitting algorithm into focused short clauses around the features.
The feature clause-splitting algorithm is as follows (a code sketch follows this listing):
6.1 define a list selected_feat_sent_lst for storing feature clauses;
6.2 calculate the number of cuts, i.e., the maximum number of features max_count;
6.3 determine the max_count features in descending order of feature weight;
6.4 sort these max_count features by their indices in the sentence;
6.5 traverse the sorted feature index list:
1) calculate the left boundary of the current feature cut;
2) calculate the right boundary of the current feature cut;
3) if the boundaries are illegal, continue with the next index;
4) take out the short clause:
(1) if there is only one feature, or the right boundary of the previous feature cut is 0, or the cut does not overlap the boundary of the previous short clause, put the cut short clause directly into the feature clause list selected_feat_sent_lst;
(2) otherwise, the boundary overlaps that of the previous short clause: update the left boundary of the clause to avoid overlap with the previous short clause, and then put the cut short clause into the feature clause list selected_feat_sent_lst;
5) update the right boundary of the feature cut;
6.6 return the feature clause list selected_feat_sent_lst.
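A sketch of the clause-splitting step, cutting a fixed character window around each of the highest-weight features and avoiding overlap with the previous cut; WINDOW and the exact boundary rules are illustrative approximations of the listing above, not the original implementation.

```python
WINDOW = 6  # illustrative half-window, in characters, around a feature

def split_feature_clauses(clause, feat_weights, max_count=3):
    """Cut focused short clauses around the max_count highest-weight features."""
    hits = [(f, clause.find(f)) for f in feat_weights if f in clause]
    hits.sort(key=lambda x: feat_weights[x[0]], reverse=True)  # by weight, descending
    hits = sorted(hits[:max_count], key=lambda x: x[1])  # then by index in the clause
    pieces, prev_right = [], 0
    for feat, idx in hits:
        left = max(idx - WINDOW, 0)
        right = min(idx + len(feat) + WINDOW, len(clause))
        if right <= prev_right:
            continue  # illegal boundary: the cut lies inside the previous one
        left = max(left, prev_right)  # avoid overlapping the previous short clause
        pieces.append(clause[left:right])
        prev_right = right
    return pieces
```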
In the long-text compression stage, the compression flows for the patent title, abstract, technical field, background art, summary of the invention, beneficial effects, and end-of-body documents are identical; as shown in the multi-document compression main flow of FIG. 4, the compression length of each document is calculated from the document length weights and the length supported by the model input sequence, and the key sentence extraction algorithm is called to compress each document, yielding the multi-document compressed text.
As shown in FIG. 4, to further improve the effect, the invention also emphasizes highly relevant sentences and supplements the compressed text with low-relevance compression results; specifically:
1) Emphasizing highly relevant sentences
Based on the common knowledge that humans come to understand important sentences deeply by reading them several times, important sentences need to be emphasized once more before being fed to the generation model. The idea is as follows: gather the sentences with the highest relevance from the compression of each document into a list of highly contributing sentences, and splice them into a highly relevant paragraph of the patent. For a document whose compressed length is smaller than 0.75 of its compression length threshold, extract key sentences from the highly relevant paragraph and append the extraction result to that document's compressed text.
2) Supplementing the compressed text with low-relevance compression results
For a patent whose compressed text has low relevance to the target use, considering that part of the patent's use-related text may not appear in the multi-document set, the compressed text is supplemented from the parts of the patent's original text that have not been used. The idea is as follows: sum the relevance to the target use over all sentences of a key document's compressed text to obtain the relevance between that compressed text and the target; sum over all seven key documents' compressed texts of a patent document to obtain the relevance between the patent document's compressed text and the target; calculate the average of this relevance over all patent documents; and, for a patent document whose compressed-text-to-target relevance is below the average and whose compressed length is still at least one minimum-sentence-length slot (the manually set minimum sentence length parameter) away from the compression length threshold, clean the unused parts of the patent's original text and call the key sentence extraction algorithm to extract relevant content to fill the compressed text, thereby improving the relevance between the compressed text and the target use. A screening sketch follows.
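A minimal sketch of the low-relevance screening, assuming per-sentence relevances and compressed lengths are recorded during the compression pass; the data layout and the min_sent_len default are illustrative assumptions.

```python
def patents_needing_supplement(patent_compressions, thresholds, min_sent_len=8):
    """patent_compressions: {patent_id: [(sentence_relevances, compressed_len), ...]},
    one tuple per key document; thresholds: {patent_id: total compression length}.
    Flags patents whose compressed-text-to-target relevance is below average and
    which still have at least one minimum-sentence-length slot of free space."""
    totals = {pid: sum(sum(rels) for rels, _ in docs)
              for pid, docs in patent_compressions.items()}
    avg = sum(totals.values()) / len(totals)
    flagged = []
    for pid, docs in patent_compressions.items():
        used = sum(length for _, length in docs)
        if totals[pid] < avg and thresholds[pid] - used >= min_sent_len:
            flagged.append(pid)
    return flagged
```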
7. Use text generation model finetuning and target text generation
The compressed text is still longer than the target text and its semantics are not fluently connected, so a generation model is used to obtain a target text that is semantically coherent, concise, and meets the target length requirement.
The T5 model, currently popular in industry for translation and text generation tasks, was proposed by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, et al. in "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"; its encoder-decoder architecture is suitable for text generation and other sequence-to-sequence tasks. This embodiment selects a model of the T5 series to generate the target text from the compressed text.
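A minimal inference sketch using the Hugging Face transformers library, assuming a T5 checkpoint already finetuned on (compressed text, target use text) pairs; the checkpoint path and generation settings are placeholders, not the original configuration.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# placeholder path to a T5 checkpoint finetuned on (compressed text, target use text) pairs
MODEL_PATH = "path/to/finetuned-t5-use-generation"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = T5ForConditionalGeneration.from_pretrained(MODEL_PATH)

def generate_use_text(compressed_text, max_target_len=128):
    """Generate the use text of a patent document from its compressed text."""
    inputs = tokenizer(compressed_text, return_tensors="pt",
                       truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_length=max_target_len, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```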
For a patent document to be processed, its compressed text is generated by the same data preprocessing and long-text compression algorithm and fed into the trained use generation model to obtain the use text of the patent document; this helps computers understand the use elements of patent documents more deeply and advances research on the computability of patent documents. Data service products such as databases built on the method of this embodiment can support quick and accurate analysis of the technical uses of one or more patent documents, and can also support users in quickly and accurately retrieving target documents with a specific technical use (application scenario) from a large number of patent documents. For example:
A scenario of acquiring the use information of a patent document: with a data service product such as a database built on the method of this embodiment, the user imports the target patent document, or inputs the document number of the target patent document and the system automatically retrieves and downloads it; the above processing is then performed, and the use text is output directly by the use text generation function of the data service product.
A scenario of document retrieval based on patent use: when a search element involves use information, the user can first generate the use text (as a search item) and then search with it, improving retrieval efficiency and accuracy.
In one embodiment, a computer device is further provided, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the above data deep-processing method for patent use rewriting.
In one embodiment, a computer-readable storage medium is further provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above data deep-processing method for patent use rewriting.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.