Small-sample-based multi-label text classification method for a specific field

Technical Field
The invention relates to a small-sample-based multi-label text classification method for a specific field.
Background
When a system that requires a text classification task first goes online, little data has accumulated, and only a small amount of it is labeled.
Existing work has proposed template-based few-shot learning methods, but these perform well only on the premise that a large amount of in-domain data already exists from the start and merely the labeled portion is scarce. The present scheme builds on such methods and addresses the case where both the labels and the texts in the specific field are scarce.
Disclosure of Invention
In order to achieve the above technical purpose, the technical scheme of the invention is as follows:
a small-sample-based multi-label text classification method for a specific field comprises the following steps:
step one, acquiring an original corpus in a specific field, extracting a small part of the corpus, labeling each sentence in this part, treating identical labels as one class, and recording the total number of label classes;
step two, adding each sentence's label to the front of the sentence, masking the label, adding fixed words at the head and tail of the label to identify it and form a new sentence, and adding specific symbols at the head and tail of the new sentence; then adding an identification label indicating whether the current label is correct, copying the sentence and replacing the original label in turn with each label from other sentences that differs from it, while changing the identification label from correct to wrong, thereby expanding the small part of the corpus extracted in step one;
step three, inputting the expanded corpus into a pre-trained language model and executing a masked language model task, thereby updating the parameters of the pre-trained model;
step four, using the updated model as a semantic feature extractor to convert all of the expanded corpus into semantic vectors, which serve as a query retrieval base;
step five, extracting a further part of the corpus from the original corpus, adding masks before and after each sentence, adding the fixed words of step two before and after the masks, copying each sentence according to the number of label classes recorded in step one to obtain that many sentences, and inputting them into the model to obtain a semantic vector for each sentence;
step six, performing similarity calculation between the resulting semantic vectors and the query retrieval base, and taking the most frequent label among the top N most similar results as the label of the originally unlabeled corpus;
step seven, returning to step three with the corpus labeled in step six as model input, and continuing to update the model parameters until the loss function converges, completing model training;
and step eight, labeling corpora from the same field as in step one with the model trained in step seven, thereby realizing classification.
In the small-sample-based multi-label text classification method for a specific field, in step one, the small part of the corpus is fewer than 200 text sentences.
In step three, executing the masked language model task includes:
inputting each sentence into the pre-trained language model to obtain a mapped low-dimensional vector representation; for each mask position, calculating a loss between the low-dimensional vector and the mask position label mlm_label; for the sentence start position [CLS], calculating a loss between the low-dimensional vector and the identification label eq_label; and adding the two losses as the loss function of the whole pre-trained language model; the corresponding loss function L is formulated as follows:
L = mlm_loss + eq_loss

mlm_loss = -Σ_{i=1}^{V} y_i log(p_i)

eq_loss = -[y_j log(p_j) + (1 - y_j) log(1 - p_j)]

For mlm_loss, V is the number of masked words, y_i is the one-hot representation of the word replaced by the mask, and p_i is the probability of that word predicted by the model; for eq_loss, y_j denotes the value of eq_label and p_j denotes the model's predicted probability that the sample is a positive example; mlm_label is computed through softmax and eq_label through sigmoid;
based on the steps, iteration is repeated until the model loss value continuously decreases until convergence.
In step four, converting all of the expanded corpus into semantic vectors means that the unlabeled original sentences and the labels are each mapped into low-dimensional vectors through the model's multi-layer Transformer output, and the mean of all word vectors is taken as the semantic vector of the sentence.
In step five, the semantic vector of each sentence comprises a low-dimensional vector mean and a predicted mask vector mean, where the low-dimensional vector mean of a sentence is the mean of the vectors of every word in the sentence, and the mask vector mean is the mean of the word vectors at the positions replaced by masks.
In step six, the similarity calculation is realized through cosine similarity; the calculation formula is as follows:

sim = w1 · cos(v_m1, v_m2) + w2 · cos(v_s1, v_s2), w1 + w2 = 1

where w1 and w2 are the weights of the two similarities, v_m1 and v_m2 respectively denote the mask vector predicted by the model and the actual label vector, and v_s1 and v_s2 respectively denote the sentence vector to be predicted and a sentence vector in the retrieval base.
In step six, taking the most frequent label among the top N most similar results as the label of the originally unlabeled corpus means sorting all similarity results from largest to smallest, taking the top N, and using the voting method of kNN: the label that occurs most frequently among the top N results is taken as the label of the nearest sentences.
The technical effect of the method is that, based on a pre-trained language model and under the condition of few manual labels (only about 200) and little in-domain data, the training data set is expanded from the few labeled data through data pre-processing, and the data set is trained in a multi-task manner with the masked language model so that the model fully learns the semantic knowledge of the field; in the prediction stage, a knowledge-base retrieval mode is used and kNN reduces randomness, improving the accuracy of the classification results. After prediction results are obtained, they are repeatedly used as pseudo manual labels, so that the model keeps learning in-domain knowledge, the retrieval knowledge base keeps growing, and the classification results improve accordingly.
The invention will be further explained with reference to the drawings.
Drawings
Fig. 1 is a schematic flow chart of the present embodiment.
Detailed Description
The embodiment is realized by the following steps:
Step one, acquiring an original corpus in a specific field, extracting a small part of the corpus, labeling each sentence in this part, treating identical labels as one class, and recording the total number of label classes. In this embodiment, the small part of the corpus is fewer than 200 text sentences; the number can be adjusted according to the situation.
Step two, adding each sentence's label to the front of the sentence, masking the label, adding fixed words at the head and tail of the label to identify it and form a new sentence, and adding specific symbols at the head and tail of the new sentence; then adding an identification label indicating whether the current label is correct, copying the sentence and replacing the original label in turn with each differing label from other sentences while changing the identification label from correct to wrong, thereby expanding the small part of the corpus extracted in step one. For example, assume the specific field is the financial industry and one sentence in the corpus is "the elderly often transact regular renewal.", with the label "personal fixed deposit". The processing of step two replaces the words of the label with [MASK] and adds the specific symbols [CLS] and [SEP] to the beginning and end of the sentence, finally obtaining the following input sentence: "[CLS] is [MASK][MASK][MASK][MASK][MASK][MASK] business, the elderly often transact regular renewal. [SEP]". Two labels are set for this input at the same time: the label of the [MASK] positions, mlm_label = 'personal fixed deposit', and a label determining whether this is a positive example, eq_label = 1. The expansion process then sets the label of this piece of data to each of the other categories; for example, for another category "teller management" the input sentence becomes "[CLS] is [MASK][MASK][MASK][MASK] business, the elderly often transact regular renewal. [SEP]", with mlm_label = 'teller management' and, since this is not the correct category, eq_label = 0. Assuming the total number of label classes is 11, the sentence is copied 10 times, each copy given a different label, and the total amount of data is expanded 11-fold.
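As a minimal illustration of this expansion, the following Python sketch builds the masked inputs and the two labels; the label set, the fixed words ("is" / "business"), and the one-mask-per-character convention are illustrative assumptions, not part of the claimed method.

ALL_LABELS = ["personal fixed deposit", "teller management"]  # 11 classes in the full example

def expand_sentence(sentence, gold_label, labels):
    # One labeled sentence -> one positive sample plus one negative per other label.
    samples = []
    for label in labels:
        masks = "".join("[MASK]" for _ in label)   # one mask per character of the label
        text = f"[CLS] is {masks} business, {sentence} [SEP]"
        samples.append({
            "input": text,
            "mlm_label": label,                    # the words the masks should recover
            "eq_label": int(label == gold_label),  # 1 = correct label, 0 = wrong
        })
    return samples

expanded = expand_sentence("the elderly often transact regular renewal.",
                           "personal fixed deposit", ALL_LABELS)
# With 11 label classes, each sentence becomes 11 samples: an 11-fold expansion.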
Step three, inputting the expanded corpus into the pre-trained language model and executing the masked language model task, thereby updating the parameters of the pre-trained model.
Step four, using the updated model as a semantic feature extractor, thereby converting all of the expanded corpus into semantic vectors that serve as a query retrieval base.
Step five, extracting a further part of the corpus from the original corpus, adding masks before and after each sentence, adding the fixed words of step two, copying each sentence according to the number of label classes recorded in step one to obtain that many sentences, and inputting them into the model to obtain the semantic vector of each sentence.
Step six, performing similarity calculation between the resulting semantic vectors and the query retrieval base, and taking the most frequent label among the top N most similar results as the label of the originally unlabeled corpus.
Step seven, returning to step three with the corpus labeled in step six as model input, and continuing to update the model parameters until the loss function converges, completing model training.
Step eight, labeling corpora from the same field as in step one with the model trained in step seven, thereby realizing classification.
Specifically, in step three, executing the masked language model task includes:
after each sentence is input into the pre-trained language model, a mapped low-dimensional vector representation is obtained; for each mask position, a loss between the low-dimensional vector and the mask position label mlm_label is calculated; for the sentence start position [CLS], a loss between the low-dimensional vector and the identification label eq_label is calculated; and the two losses are added as the loss function of the whole pre-trained language model. The corresponding loss function L is formulated as follows:
L = mlm_loss + eq_loss

mlm_loss = -Σ_{i=1}^{V} y_i log(p_i)

eq_loss = -[y_j log(p_j) + (1 - y_j) log(1 - p_j)]

For mlm_loss, V is the number of masked words, y_i is the one-hot representation of the word replaced by the mask, and p_i is the probability of that word predicted by the model. For eq_loss, y_j denotes the value of eq_label and p_j denotes the model's predicted probability that the sample is a positive example. mlm_label is computed through softmax and eq_label through sigmoid.
Based on the above steps, iteration is repeated so that the model loss value keeps decreasing until convergence.
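As a minimal sketch of this combined loss, assume a PyTorch-style model that outputs vocabulary logits at the masked positions and a single logit at the [CLS] position; all names and shapes here are illustrative.

import torch
import torch.nn.functional as F

def total_loss(mask_logits, mlm_label_ids, cls_logit, eq_label):
    # mask_logits:   (num_masked, vocab_size) logits at the [MASK] positions
    # mlm_label_ids: (num_masked,) ids of the label words the masks replaced
    # cls_logit:     scalar logit taken from the [CLS] position
    # eq_label:      1.0 if the prepended label is correct, else 0.0
    # mlm_loss is softmax cross-entropy: -sum_i y_i * log(p_i)
    mlm_loss = F.cross_entropy(mask_logits, mlm_label_ids)
    # eq_loss is sigmoid binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)]
    eq_loss = F.binary_cross_entropy_with_logits(cls_logit, eq_label)
    return mlm_loss + eq_loss

# Shape-only usage example, with an arbitrary vocabulary size of 21128:
loss = total_loss(torch.randn(6, 21128), torch.randint(0, 21128, (6,)),
                  torch.randn(()), torch.tensor(1.0))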
Further, in step four, converting all of the expanded corpus into semantic vectors means that the unlabeled original sentences and the labels are each mapped into low-dimensional vectors through the model's multi-layer Transformer output, and the mean of all word vectors is taken as the semantic vector of the sentence.
In step five, the semantic vector of each sentence comprises a low-dimensional vector mean and a predicted mask vector mean, where the low-dimensional vector mean of a sentence is the mean of the vectors of every word in the sentence, and the mask vector mean is the mean of the word vectors at the positions replaced by masks.
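The two pooled vectors can be sketched as follows, assuming hidden is the final-layer Transformer output of shape (seq_len, dim) and mask_positions indexes the tokens replaced by [MASK]; both names are illustrative assumptions.

import numpy as np

def sentence_vector(hidden):
    # Low-dimensional vector mean: average over every token in the sentence.
    return hidden.mean(axis=0)

def mask_vector(hidden, mask_positions):
    # Mask vector mean: average over only the positions replaced by [MASK].
    return hidden[mask_positions].mean(axis=0)

# Example: a 10-token sentence with 768-dim outputs, masks at positions 1-4.
hidden = np.random.randn(10, 768)
v_s = sentence_vector(hidden)            # shape (768,)
v_m = mask_vector(hidden, [1, 2, 3, 4])  # shape (768,)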
In step six, the similarity calculation is realized through cosine similarity; the calculation formula is as follows:

sim = w1 · cos(v_m1, v_m2) + w2 · cos(v_s1, v_s2), w1 + w2 = 1

where w1 and w2 are the weights of the two similarities, v_m1 and v_m2 respectively denote the mask vector predicted by the model and the actual label vector, and v_s1 and v_s2 respectively denote the sentence vector to be predicted and a sentence vector in the retrieval base.
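A sketch of this weighted similarity under the constraint w1 + w2 = 1 follows; the equal default weights are only an illustrative assumption.

import numpy as np

def cos(a, b):
    # Standard cosine similarity.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_similarity(v_m1, v_m2, v_s1, v_s2, w1=0.5, w2=0.5):
    # sim = w1*cos(v_m1, v_m2) + w2*cos(v_s1, v_s2), with w1 + w2 = 1.
    assert abs(w1 + w2 - 1.0) < 1e-9
    return w1 * cos(v_m1, v_m2) + w2 * cos(v_s1, v_s2)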
In step six, taking the most frequent label among the top N most similar results as the label of the originally unlabeled corpus means sorting all similarity results from largest to smallest, taking the top N, and using the voting method of kNN: the label that occurs most frequently among the top N results is taken as the label of the nearest sentences.
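The voting step can be sketched as follows, assuming the retrieval base has already been scored into (similarity, label) pairs by the weighted similarity above; the scores shown are invented for illustration.

from collections import Counter

def vote_label(scored, n):
    # Sort by similarity from largest to smallest, keep the top N,
    # and return the label that occurs most frequently among them (kNN vote).
    top_n = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]
    return Counter(label for _, label in top_n).most_common(1)[0][0]

# Example:
print(vote_label([(0.91, "personal fixed deposit"),
                  (0.88, "personal fixed deposit"),
                  (0.73, "teller management")], n=3))  # "personal fixed deposit"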