Background
With social and economic development and the growing rights consciousness of the public, basic social contradictions have in recent years shown new trends and new characteristics. Dispute-resolution work must start from preventing civil offenses and reducing the recurrence rate of high-frequency disputes, which calls for building four intelligent early-warning models of contradiction risk: important people, important matters, important industries, and important areas. The 'important matters' early-warning model is clearly the most difficult to build, because an artificial-intelligence algorithm must judge whether two event descriptions are similar.
Existing text similarity calculation methods are mainly based on models pre-trained on large-scale corpora. However, contradiction-mediation data are large in volume and complex in origin: one part is reported uniformly to a platform by grassroots management staff and is characterized by similar formats and a high proportion of event-background text; another part comes from public self-reporting and is characterized by heavy information redundancy and widely varying descriptions of the same event. Models pre-trained on large-scale corpora therefore suffer from the following problems: event matching is slow, identical events described at different text lengths cannot be matched, and different events that share the same event background, with the background occupying too large a proportion of the text, are hard to distinguish.
The Chinese patent application with publication number CN115203365A provides a social-event processing method for the comprehensive-governance field, which computes the similarity between an input event and events in a library through an RE2 model to assist event classification; however, it does not consider the complexity of the input data, and the method performs poorly when input events share similar formats and a high event-background proportion.
The Chinese patent application with publication number CN115982324A provides a method for checking purchasing documents based on improved natural language processing, which computes the similarity between sentences in an input document and sentences in a database through a content-bert model to determine whether the input document is compliant; however, the method only works when sentence lengths are similar, and performs poorly when the two texts in a sample pair differ greatly in length.
Disclosure of Invention
In view of the above, the invention provides a complex-text similarity calculation method for social-governance scenarios. It solves the problem of matching similar texts of different lengths by applying word repetition to positive sample pairs, and solves the problem of distinguishing different events that share the same event background (with an oversized background proportion) by using an entity recognition algorithm to construct pairs of different events with the same background.
A complex text similarity calculation method applied to a social management scene comprises the following steps:
(1) Acquiring a large amount of text similarity training data and entity identification training data;
(2) Acquiring an entity recognition pre-training model and a text similarity pre-training model;
(3) Fine tuning the entity recognition pre-training model by utilizing entity recognition training data to obtain an entity recognition model;
(4) For positive sample pairs of approximately equal length in the text similarity training data, reconstructing the pairs with a word repetition algorithm so that the two sentences in each pair differ in length;
(5) Carrying out data enhancement on negative sample pairs in the text similarity training data;
(6) Fine tuning the text similarity pre-training model by using the text similarity training data obtained in the steps (4) and (5) to obtain a text similarity model;
(7) Accessing data through a contradiction reconciliation platform, and encoding each piece of accessed data with the text similarity model to obtain the corresponding sentence vectors, which are stored in a corpus as event data;
(8) Receiving contradiction reconciliation data, performing similarity calculation between its sentence vectors and the event data in the corpus, and synchronizing the results to an event monitoring database so as to provide early warning for the corresponding events.
Further, the text similarity training data consists of sentence pairs labeled as similar or dissimilar, and the entity recognition training data consists of sentences with sequence labels.
Further, the text similarity pre-training model is paraphrase-multilingual-MiniLM-L12-v2, and the entity recognition pre-training model is chinese_pretrain_mrc_roberta_wwm_ext_large.
Further, the specific implementation of step (4) is as follows: first, positive sample pairs of approximately equal length, i.e. sentence pairs whose texts are similar, are selected from the text similarity training data; for each such pair, one sentence is chosen and segmented into n words {W0, W1, …, Wn-1} with the jieba segmenter; a random function then selects n/4 indices {X1, X2, …, Xn/4} from 0 to n-1; finally, each word in {WX1, WX2, …, WXn/4} is copied and inserted immediately after its original occurrence in the sentence.
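The word-repetition step above can be sketched as follows. This is a minimal illustration, not the invention's actual code: `tokenize` is passed in as a parameter so the sketch stays self-contained (in practice it would be `lambda s: list(jieba.cut(s))`), and the function name is illustrative.

```python
import random

def word_repetition(sentence, tokenize, seed=None):
    """Duplicate roughly n/4 of the words of `sentence` in place, so the
    augmented copy is a longer paraphrase of the original."""
    rng = random.Random(seed)
    words = tokenize(sentence)            # e.g. list(jieba.cut(sentence))
    n = len(words)
    k = max(1, n // 4)                    # n/4 indices, at least one
    picked = set(rng.sample(range(n), k))
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if i in picked:
            out.append(w)                 # copy inserted right after the word
    return "".join(out)
```

Pairing `(sentence, word_repetition(sentence, tokenize))` then yields a positive sample pair whose two texts differ in length.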
Further, the specific implementation manner of the step (5) is as follows: firstly, randomly selecting N sentences from text similarity training data to form a queue, and extracting event background text and event content text of each sentence in the queue by using an entity recognition model; and randomly selecting a sentence s from the queue, splicing the event background text of the sentence s with the event content text of the rest N-1 sentences in the queue to obtain N-1 new sentences, and finally respectively combining the sentence s with the N-1 new sentences to form N-1 negative sample pairs, wherein N is a natural number larger than 1.
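The negative-pair construction of step (5) can be sketched as below. `split_fn` is an illustrative stand-in for the trained entity recognition model: it maps a sentence to its (event background, event content) pair.

```python
import random

def build_hard_negatives(sentences, split_fn, seed=None):
    """Pick one sentence s from the queue; splice its event background with
    the event content of each of the other N-1 sentences, and pair each
    spliced sentence with s to form N-1 hard negative pairs."""
    rng = random.Random(seed)
    s = rng.choice(sentences)
    background, _ = split_fn(s)
    pairs = []
    for other in sentences:
        if other is s:
            continue
        _, content = split_fn(other)
        # same background + different content => hard negative
        pairs.append((s, background + content))
    return pairs
```

Because every constructed sentence shares s's background but not its content, the model is forced to discriminate on event content rather than background.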
Further, the specific implementation of step (7) is as follows: for any piece of accessed data (a text describing a contradictory dispute), its event-background text is first extracted with the entity recognition model; the background text and the complete text are then each encoded by the text similarity model, yielding two sentence vectors, which are stored in the corpus as event data. Each group of event data in the corpus comprises an event number, the complete event text, the event background text, the complete-text sentence vector, and the background-text sentence vector.
Further, the specific process of similarity calculation in step (8) is as follows: the event-background text of the contradiction reconciliation data is first extracted with the entity recognition model, and the background text and the complete text are then each encoded by the text similarity model, yielding two sentence vectors, code1 and code2; for any event data in the corpus, the similarity between its background-text sentence vector and code1 is calculated, and if it reaches a set threshold, the similarity between its complete-text sentence vector and code2 is further calculated; if that also reaches the set threshold, the event data is judged to be similar to the contradiction reconciliation data.
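The two-stage screening above can be sketched as follows, with hypothetical record fields (`event_id`, `bg_vec`, `full_vec`) and a plain cosine implementation; the threshold value is an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_events(code1, code2, corpus, threshold=0.8):
    """Stage 1: screen on background vectors against code1; stage 2: confirm
    surviving candidates on full-text vectors against code2."""
    similar = []
    for ev in corpus:
        if cosine(ev["bg_vec"], code1) >= threshold:
            if cosine(ev["full_vec"], code2) >= threshold:
                similar.append(ev["event_id"])
    return similar
```

The cheap background check prunes most of the corpus before the full-text comparison runs, which is what makes the two-stage order worthwhile.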
Further, the event monitoring database in step (8) contains field information for all existing events, including event number, event content, reporting time, reporting source, repetition count, and early-warning level; when received contradiction reconciliation data is similar to an existing event, the repetition count of that event is incremented by one, and hierarchical early warning is then performed on all existing events according to their repetition counts.
For data reported uniformly to the platform by grassroots management staff, which has similar formats and a high event-background proportion, the invention extracts the event background from the complete text, computes similarity on each separately, and constructs hard negative sample pairs, thereby solving the difficulty of distinguishing different events that share the same background with an oversized background proportion. For data reported through public self-governance, which is heavily redundant and describes the same event in widely varying ways, the invention constructs hard positive sample pairs with a word-repetition algorithm, solving the matching of similar texts of different lengths. With this method, the accuracy of the similarity algorithm improves greatly, and the accuracy on difficult samples improves markedly.
Detailed Description
In order to more particularly describe the present invention, the following detailed description of the technical scheme of the present invention is provided with reference to the accompanying drawings and the specific embodiments.
The invention discloses a complex text similarity calculation method applied to the field of social management, which comprises two stages:
stage one: model training and corpus construction.
As shown in fig. 1, the specific implementation steps of model training and corpus construction are as follows:
s1: and (3) coding texts in the database into sentence vectors by using a Bert-whistening model, storing the sentence vectors into a vector database fasss, finding Top5 similar data of each piece of data by using a fasss rapid matching function to form a pre-marking positive sample pair, randomly taking 2 pieces of data for multiple times to form a pre-marking negative sample pair, judging by manpower, judging whether the labels of the sample pairs are correct, and screening out the correct sample pairs to form a text similarity training data set. And pre-marking event background entities in the entity identification data by using a doccano marking platform, and then forming an entity identification training data set by manually correcting sequence labels of the entity background entities.
S2: invoking the python script uses the auto model_pretrained () and model_save () methods of the transform framework to download the text similarity pretrained model and entity recognition pretrained model at huggingface and save to the local.
In this embodiment, the text similarity pre-training model is paraphrase-multilingual-MiniLM-L12-v2, and the entity recognition pre-training model is chinese_pretrain_mrc_roberta_wwm_ext_large.
S3: the training model of entity recognition is performed on a single-card 4090 server by using a deep KE framework, wherein the main super-parameters are { batch size=8, maxlenth=128, epoch=5, learning rate=0.0001, optimization=adamw }, the entity recognition training data set size is 5000 pieces of data, and the checkpoint with the best effect is selected as the entity recognition model for subsequent use after training.
S4: and selecting positive sample pairs with approximate lengths in the text similarity training data, and forming positive sample pairs with different lengths by using a word repetition algorithm for one sentence.
The word-repetition algorithm in this embodiment is as follows: a sentence is first segmented into n words {W0, W1, …, Wn-1} with jieba; a random function then selects n/4 indices {X1, X2, …, Xn/4} from 0 to n-1, and each word in {WX1, WX2, …, WXn/4} is copied back into the original sentence immediately after its original occurrence. Example: abcdefg → aabccdeefg, where a, c, e are the randomly selected segments.
S5: randomly selecting N sentences to form a candidate queue, extracting event background and event content from each sentence in the queue by using an entity identification model in S3, randomly selecting one sentence, splicing and combining the event background with the event content of the rest N-1 sentences, and then forming N-1 negative sample pairs with the sentences.
S6: the text similarity pre-training model is done with text similarity training data and the data generated in S4, S5 on a single card 4090 server with main super-parameters { batch size=4, maxlenth=128, epoch=3, learning rate= warmup (constant), optimizer=adamw }.
S7: and accessing contradiction reconciliation event data from a data interface of the contradiction reconciliation platform, coding each piece of contradiction reconciliation event data by using the model trained in the step S6, and storing the generated sentence vectors into a corpus database.
The contradiction reconciliation data come from multiple fields, specifically the citizen hotline, the four grassroots-governance platforms, and police incident records. The generated sentence vectors are two different vectors obtained by separately encoding the event background and the complete event text. The database fields used are: event number, complete event text, event background text, complete-text encoding, and background encoding.
Stage two: text similarity calculation.
As shown in fig. 2, the text similarity calculation is specifically implemented as follows:
step 1: integrating information of citizen hotlines, four platforms for basic management and police information, and collecting the information to a contradiction adjustment platform; meanwhile, the data butt joint of the contradiction reconciliation platform and each channel is realized, the real-time updating and sharing of information are ensured, an interface of the contradiction reconciliation platform is called, and the data is accessed into a local database.
The accessed data includes the street where the contradictory dispute occurred, reporting time, type, text, party information, and so on, where the text contains the event background (place of occurrence, time of occurrence, etc.) and the event content.
Step 2: extracting event backgrounds in the contradictory dispute event texts by using the entity recognition model trained in the stage one, respectively inputting the event backgrounds and the complete texts into the text similarity model trained in the stage one for coding to obtain codes 1 and 2 respectively; and calculating the similarity between the event background sentence vector of each piece of data in the corpus and code 1.
The similarity is cosine similarity, and can be replaced by other measures (such as Euclidean distance, Manhattan distance, or geodesic distance) as required. If the background similarity reaches a set threshold, the similarity between that record's complete-text sentence vector and code2 is calculated; if it also reaches the threshold, the two events are regarded as similar. The threshold is a user-defined value between 0 and 1 and can be adjusted as needed.
Step 3: synchronizing the result of the algorithm model to an event monitoring database, and if the input event is similar to the existing event, increasing the repetition number of the corresponding event once, wherein the event monitoring database comprises the following fields: event number, event content, reporting time, reporting source, repetition number, and early warning level. The purpose of this step is to count the same or similar events reported from different sources for subsequent pre-alarm analysis.
Step 4: inputting data in an event monitoring database into an important event early warning algorithm, and carrying out early warning on repeated reporting events according to set rules, wherein the rules are as follows: screening out events with repeated reporting times higher than 5 times, and setting the events as red early warning events; and (3) screening and repeating the events with the reporting times higher than 3 times in the rest events, setting the events as red early warning events if the initial reporting time is approximately 3 months, and setting the events as yellow early warning events if the reporting times are higher than 3 times but the initial reporting time exceeds the range of 3 months. The purpose of this step is to identify those events that may cause significant impact or crisis, and to pre-warn at different levels according to their urgency and severity.
The above describes the processing flow for a single piece of data; multiple pieces of data can be processed in parallel.
The invention was compared with Sentence-BERT, each trained on 10000 samples; the evaluation metrics are the cosine Pearson coefficient and accuracy on difficult samples. As shown in Table 1, the invention raises the cosine Pearson coefficient by about 10 points over the Sentence-BERT baseline and improves performance on difficult samples by nearly 20 points, so its accuracy is greatly improved, especially on difficult samples.
TABLE 1
| Method | Training data | Cosine Pearson coefficient | Difficult-sample accuracy |
| --- | --- | --- | --- |
| Sentence-BERT | 10000 | 73.58 | 62.84 |
| The invention | 10000 | 83.79 | 80.31 |
The embodiments described above are intended to help those skilled in the art understand and apply the present invention. It will be apparent to those skilled in the art that various modifications can be made to these embodiments, and that the general principles described herein can be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the embodiments above; improvements and modifications made by those skilled in the art on the basis of this disclosure fall within the protection scope of the present invention.