Language model pre-training method based on coreference resolution
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a language model pre-training method based on coreference resolution.
Background
The task of coreference resolution is to identify the expressions in a text (including pronouns, named entities, noun phrases, etc.) that refer to the same entity and group them together. Current state-of-the-art end-to-end neural coreference resolution models take word vectors as input, obtain span representations via an attention module, and then score span pairs for coreference, thereby realizing coreference resolution. Performing coreference resolution requires reasoning with contextual information and world knowledge, i.e., an advanced language model is needed to obtain semantically richer word vector representations. BERT (Bidirectional Encoder Representations from Transformers) is a language model built on the Transformer framework that is pre-trained on massive text corpora by randomly masking words (for English corpora, a word here means a single English word) and predicting the masked words; however, this pre-training scheme has some drawbacks. For example, in the sentence "Harry Potter is a wonderful work of magic literature", if only "Harry" is masked, it can easily be predicted from the adjacent "Potter", so the word vector of "Harry Potter" learned by the model does not capture information such as "magic literature", i.e., the contextual information is not rich enough. This matters especially for coreference resolution, which needs linguistic representations with richer semantic information to capture relationships between entities. In addition, in the coreference resolution task, pronouns carry weak semantics, so the pronoun resolution error rate is high; the BERT pre-training method masks pronouns with low probability, and resolving pronouns requires more external knowledge, so the model's learning of pronouns needs to be strengthened.
SpanBERT proposes a pre-training method for span-level tasks such as question answering and named entity recognition, which randomly masks arbitrary contiguous spans, with span lengths L drawn from a geometric distribution Geo(0.2).
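For concreteness, the following is a minimal sketch of sampling masked-span lengths from the Geo(0.2) distribution mentioned above; the clip at 10 tokens follows the SpanBERT paper and is an assumption here rather than part of the present method.

```python
# Minimal sketch: sample masked-span lengths L ~ Geo(0.2), as referenced above.
# The upper clip of 10 tokens follows the SpanBERT paper and is an assumption here.
import numpy as np

rng = np.random.default_rng(0)

def sample_span_length(p: float = 0.2, max_len: int = 10) -> int:
    """Draw one span length from a geometric distribution, clipped to max_len."""
    return min(int(rng.geometric(p)), max_len)

lengths = [sample_span_length() for _ in range(10000)]
print(sum(lengths) / len(lengths))  # empirical mean, roughly 3.8 after clipping
```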
Baidu's ERNIE model uses a three-stage masking mechanism for Chinese pre-training, namely basic-level, phrase-level, and entity-level masking, progressing from single-character to phrase to entity granularity, so that phrase and entity knowledge is implicitly incorporated and the representation ability of the language model is greatly improved. However, in actual use this ERNIE pre-training approach was found to cause basic-level knowledge to be forgotten during the entity-level training stage, thereby degrading the model's word representations.
Accordingly, those skilled in the art are working to develop a language model pre-training method based on coreference resolution.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the present invention aims to solve the technical problem of how a language model pre-training method for English coreference resolution can provide word vectors with richer semantic information, thereby improving the prediction accuracy of coreference resolution.
Humans generally learn a language by first learning basic words, then learning phrases, and finally applying them to sentence- and discourse-level tasks. However, since the knowledge of a neural network language model is stored in the form of network weights, if words are trained first and then phrases, word-granularity information may be forgotten. Therefore, the inventor proposes to adaptively train word blocks of different granularity according to the current loss, and, to address the low accuracy of pronoun resolution in coreference resolution, to strengthen the language model's training on pronouns. In the training stage, the first 20% of the steps use the word_learning mode to learn word-level information and train on words; the remaining 80% of the steps adaptively select the word_learning mode (training words) or the phrase_learning mode (training phrases) according to the loss, and the two modes use different loss functions.
In one embodiment of the invention, a language model pre-training method based on coreference resolution is provided, comprising:
S100, data preprocessing: extracting pronouns in the corpus through character string matching, and extracting named entities, noun phrases, etc. in the corpus with a processing tool, to serve as masking candidate sets in the training data generation stage;
S200, training data generation: masking is performed in a mask_word mode (i.e., word masking mode) and a mask_phrase mode (i.e., phrase masking mode) to generate mask_word training data and mask_phrase training data, respectively;
S300, pre-training: the word_learning mode or the phrase_learning mode is adaptively switched for training according to a training mode selection factor α_t.
Optionally, in the language model pre-training method based on coreference resolution in the above embodiment, step S100 comprises:
S110, acquiring English Wikipedia data;
S120, extracting pronouns in the corpus, and establishing a pronoun set PronounSet;
S130, extracting all named entities in the corpus, and establishing an entity set EntitySet;
S140, extracting noun phrases in the corpus, and removing phrases that overlap with EntitySet to obtain a noun phrase set NounPhraseSet.
Further, in the language model pre-training method based on coreference resolution in the above embodiment, the extraction tool in step S100 is the entity recognition module in the Python natural language processing toolkit spaCy.
Further, in the language model pre-training method based on coreference resolution in the above embodiment, step S140 uses the noun phrase extraction module in spaCy.
Optionally, in the language model pre-training method based on coreference resolution in any of the foregoing embodiments, step S200 comprises:
S210, copying the data into two copies, namely data one and data two;
S220, creating training instances for the texts in data one and data two according to BERT's training data generation procedure, wherein each instance comprises a plurality of sentences;
S230, masking the instances created from data one and data two using the mask_word mode (i.e., word masking mode) and the mask_phrase mode (i.e., phrase masking mode), respectively;
S240, generating mask_word training data: randomly selecting 15% of the words from the sentences in the instances created from data one and putting them into CandidateSet1 (the masking candidate word set); each word in CandidateSet1 is replaced with '[MASK]' with 80% probability, replaced with another random word with 10% probability, and kept unchanged with 10% probability;
S250, generating mask_phrase training data: randomly selecting named entities and noun phrases from the sentences in the instances created from data two and adding them to CandidateSet2 (the second masking candidate set); each word block in CandidateSet2 is replaced with '[MASK]' with 80% probability, replaced with other random words with 10% probability, and kept unchanged with 10% probability, and the replacement behaviour of all words within a word block is consistent, i.e., the words of a block are either all replaced at the same time or all kept unchanged.
Further, in the language model pre-training method based on coreference resolution in the above embodiment, step S220 limits the sentence length to 128 English words: sentences with fewer than 128 English words are padded to 128, and sentences with more than 128 are truncated.
Further, in the language model pre-training method based on coreference resolution in the above embodiment, pronouns account for about one third of CandidateSet1 in step S240 and other words account for the remaining two thirds; when the total number of pronouns is less than one third, ordinary words are used to make up the difference.
Further, in the language model pre-training method based on coreference resolution in the above embodiment, the named entities and noun phrases selected into CandidateSet2 in step S250 account for 15% of the sentence length, with named entities and noun phrases each accounting for 50%.
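By way of illustration only, the following sketch selects CandidateSet1 positions according to the proportions described above; the small pronoun list and the tokenized example sentence are assumptions made for the sake of a runnable example.

```python
# Sketch of selecting CandidateSet1 for mask_word data (S240): about 15% of the
# tokens in a sentence, roughly one third of them pronouns where possible; when
# too few pronouns exist, ordinary words fill the remaining budget.
# The pronoun list below is a small illustrative subset, not the full PronounSet.
import random

PRONOUN_SET = {"he", "she", "it", "they", "him", "her", "them", "his", "their", "its"}

def select_candidate_set1(tokens, mask_ratio=0.15, pronoun_share=1/3, rng=random):
    budget = max(1, round(len(tokens) * mask_ratio))
    pronoun_pos = [i for i, t in enumerate(tokens) if t.lower() in PRONOUN_SET]
    other_pos = [i for i, t in enumerate(tokens) if t.lower() not in PRONOUN_SET]

    n_pron = min(len(pronoun_pos), round(budget * pronoun_share))
    chosen = rng.sample(pronoun_pos, n_pron)
    chosen += rng.sample(other_pos, min(len(other_pos), budget - n_pron))
    return sorted(chosen)

tokens = "She said that Harry Potter is a wonderful work of magic literature and they agree".split()
print(select_candidate_set1(tokens))  # indices of the words to be masked
```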
Optionally, in the language model pre-training method based on coreference resolution in any of the above embodiments, step S300 comprises: in the word_learning mode, inputting the mask_word training data into the BERT network to predict the masked words and compute the corresponding loss; and in the phrase_learning mode, inputting the mask_phrase training data into the BERT network to predict the masked phrases and compute the corresponding loss.
Further, in the language model pre-training method based on coreference resolution in any of the above embodiments, step S300 comprises:
S310, warm-up training: basic words are learned first; the first 20% of the training steps are warm-up training in the word_learning mode, and the initial word_learning prediction loss L_word^0 and the initial phrase_learning prediction loss L_phrase^0 are stored;
S320, adaptive training: in the last 80% of the training steps, the word_learning or phrase_learning mode is adopted at step t+1 according to the selection factor α_t, specifically as follows:
L_word^t and L_phrase^t respectively denote the losses of the two modes at the t-th training step; when α_t > 0, the word_learning mode is adopted at step t+1, and otherwise training continues in the phrase_learning mode.
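The formula defining α_t is not reproduced in this text; the sketch below assumes one natural instantiation (the difference between the two losses normalized by their initial values) purely to illustrate the switching rule that α_t > 0 selects word_learning and α_t ≤ 0 selects phrase_learning.

```python
# Illustration of the switching rule only. The definition of alpha_t used here
# (difference of losses normalized by their initial values) is an assumption,
# since the exact formula is not reproduced in this text.
def select_mode(loss_word_t, loss_phrase_t, loss_word_0, loss_phrase_0):
    alpha_t = loss_word_t / loss_word_0 - loss_phrase_t / loss_phrase_0
    return "word_learning" if alpha_t > 0 else "phrase_learning"

print(select_mode(0.9, 0.4, 1.2, 0.8))  # word loss lags behind -> "word_learning"
```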
The invention adds semantic training on pronouns, phrases, and entities and adaptively switches between learning modes, which enhances the semantic representation ability of the model and makes it better suited to the coreference resolution task.
The conception, specific structure, and technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, features, and effects of the present invention.
Drawings
FIG. 1 is a flow diagram illustrating a coreference-resolution-based language model pre-training method in accordance with an exemplary embodiment.
Detailed Description
The preferred embodiments of the present invention are described below with reference to the accompanying drawings so that the technical content of the invention is clearer and easier to understand. The present invention may be embodied in many different forms, and the scope of the invention is not limited to the embodiments described herein.
In the drawings, structurally identical parts are denoted by the same reference numerals, and parts with similar structures or functions are denoted by similar reference numerals. The dimensions and thicknesses of the components shown in the drawings are arbitrary, and the invention does not limit the dimensions or thickness of any component. For clarity of illustration, the thickness of some components is appropriately exaggerated in places in the drawings.
The inventor has designed a language model pre-training method based on coreference resolution, as shown in FIG. 1, comprising the following steps:
S100, data preprocessing: extracting pronouns in the corpus through character string matching, and extracting named entities, noun phrases, etc. in the corpus with a processing tool, to serve as masking candidate sets in the training data generation stage, wherein the extraction tool is the entity recognition module in the Python natural language processing toolkit spaCy, specifically comprising:
S110, acquiring English Wikipedia data;
S120, extracting pronouns in the corpus, and establishing a pronoun set PronounSet;
S130, extracting all named entities in the corpus, and establishing an entity set EntitySet;
S140, extracting noun phrases in the corpus by using the noun phrase extraction module of spaCy, and removing phrases that overlap with EntitySet to obtain a noun phrase set NounPhraseSet.
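A minimal sketch of steps S120 to S140 using spaCy is given below; it assumes the en_core_web_sm model is installed, and the pronoun list is a small illustrative subset of PronounSet.

```python
# Sketch of S120-S140: pronouns via string matching, named entities via spaCy NER,
# and noun phrases via spaCy noun chunks with entity overlaps removed.
# Assumes the en_core_web_sm model is installed; the pronoun list is illustrative.
import spacy

PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "we", "you", "i"}

nlp = spacy.load("en_core_web_sm")
text = "Harry Potter is a wonderful work of magic literature. Many readers say they love it."
doc = nlp(text)

pronoun_set = {tok.text.lower() for tok in doc if tok.text.lower() in PRONOUNS}      # S120
entity_set = {ent.text for ent in doc.ents}                                          # S130
noun_phrase_set = {np.text for np in doc.noun_chunks if np.text not in entity_set}   # S140

print(pronoun_set, entity_set, noun_phrase_set)
```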
S200, training data generation: masking is performed in the mask_word mode (i.e., word masking mode) and the mask_phrase mode (i.e., phrase masking mode) to generate mask_word training data and mask_phrase training data, respectively, specifically comprising the following steps:
S210, copying the data into two copies, namely data one and data two;
S220, creating training instances for the texts in data one and data two according to BERT's training data generation procedure, wherein each instance comprises a plurality of sentences, the sentence length is limited to 128 English words, sentences with fewer than 128 English words are padded to 128, and sentences with more than 128 are truncated;
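A minimal sketch of the length alignment in S220 follows; using '[PAD]' as the padding token is an assumption (it is BERT's usual choice).

```python
# Sketch of the length alignment in S220: sequences shorter than 128 English words
# are padded and longer ones truncated. Using '[PAD]' as the padding token is an
# assumption (it is BERT's usual choice).
MAX_LEN = 128

def align_to_max_len(tokens, max_len=MAX_LEN, pad_token="[PAD]"):
    if len(tokens) >= max_len:
        return tokens[:max_len]                             # truncate
    return tokens + [pad_token] * (max_len - len(tokens))   # pad

print(len(align_to_max_len(["word"] * 5)), len(align_to_max_len(["word"] * 200)))  # 128 128
```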
S230, masking the instances created from data one and data two using the mask_word mode (i.e., word masking mode) and the mask_phrase mode (i.e., phrase masking mode), respectively;
S240, generating mask_word training data: randomly selecting 15% of the words from the sentences in the instances created from data one and putting them into CandidateSet1 (the masking candidate word set), wherein pronouns account for about one third and other words account for the remaining two thirds, and when the total number of pronouns is less than one third, ordinary words are used to make up the difference; each word in CandidateSet1 is replaced with '[MASK]' with 80% probability, replaced with another random word with 10% probability, and kept unchanged with 10% probability;
S250, generating mask_phrase training data: randomly selecting named entities and noun phrases from the sentences in the instances created from data two and adding them to CandidateSet2 (the second masking candidate set), wherein the selected named entities and noun phrases account for 15% of the sentence length, with named entities and noun phrases each accounting for 50%; each word block in CandidateSet2 is replaced with '[MASK]' with 80% probability, replaced with other random words with 10% probability, and kept unchanged with 10% probability, and the replacement behaviour of all words within a word block is consistent, i.e., the words of a block are either all replaced at the same time or all kept unchanged.
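The following sketch illustrates the whole-block 80/10/10 replacement in S250, drawing a single random decision per word block so that all tokens in the block are treated consistently; the small replacement vocabulary is an assumption for the example.

```python
# Sketch of S250's whole-block masking: one 80/10/10 decision per word block,
# applied to every token inside the block. The small vocabulary used for random
# replacement is an illustrative assumption.
import random

VOCAB = ["book", "city", "music", "river", "story"]

def mask_blocks(tokens, blocks, rng=random):
    tokens = list(tokens)
    for block in blocks:                         # block = list of token positions
        r = rng.random()
        for pos in block:
            if r < 0.8:
                tokens[pos] = "[MASK]"           # 80%: mask the whole block
            elif r < 0.9:
                tokens[pos] = rng.choice(VOCAB)  # 10%: random-word replacement
            # remaining 10%: leave the block unchanged
    return tokens

sentence = "Harry Potter is a wonderful work of magic literature".split()
print(mask_blocks(sentence, blocks=[[0, 1], [7, 8]]))
```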
S300, pre-training: the word_learning mode or the phrase_learning mode is adaptively switched for training according to the training mode selection factor α_t, wherein in the word_learning mode the mask_word training data are input into the BERT network to predict the masked words and compute the corresponding loss, and in the phrase_learning mode the mask_phrase training data are input into the BERT network to predict the masked phrases and compute the corresponding loss, specifically comprising the following steps:
S310, warm-up training: basic words are learned first; the first 20% of the training steps are warm-up training in the word_learning mode, and the initial word_learning prediction loss L_word^0 and the initial phrase_learning prediction loss L_phrase^0 are stored;
S320, adaptive training: in the last 80% of the training steps, the word_learning or phrase_learning mode is adopted at step t+1 according to the selection factor α_t, specifically as follows:
L_word^t and L_phrase^t respectively denote the losses of the two modes at the t-th training step; when α_t > 0, the word_learning mode is adopted at step t+1, and otherwise training continues in the phrase_learning mode.
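To make the control flow of S310 and S320 concrete, the following sketch simulates the training loop; the loss values are dummies and the α_t definition (normalized-loss difference) is an assumption, so only the 20%/80% schedule and the switching rule reflect the text above.

```python
# Control-flow sketch of S310 (warm-up) and S320 (adaptive training). The losses
# are simulated and the alpha_t definition (normalized-loss difference) is an
# assumption; only the 20%/80% schedule and the switching rule come from the text.
import random

TOTAL_STEPS = 1000
WARMUP_STEPS = int(0.2 * TOTAL_STEPS)

def simulated_loss(mode, step):
    # Stand-in for the BERT masked-prediction loss of each training mode.
    base = 6.0 if mode == "word_learning" else 7.0
    return base / (1 + 0.01 * step) + 0.1 * random.random()

loss_word_0 = simulated_loss("word_learning", 0)        # S310: store initial losses
loss_phrase_0 = simulated_loss("phrase_learning", 0)

mode = "word_learning"
for t in range(TOTAL_STEPS):
    current_loss = simulated_loss(mode, t)              # one training step in the current mode
    if t < WARMUP_STEPS:
        mode = "word_learning"                          # S310: warm-up uses word_learning only
        continue
    # S320: choose the mode for step t+1 from the selection factor alpha_t
    loss_word_t = simulated_loss("word_learning", t)
    loss_phrase_t = simulated_loss("phrase_learning", t)
    alpha_t = loss_word_t / loss_word_0 - loss_phrase_t / loss_phrase_0
    mode = "word_learning" if alpha_t > 0 else "phrase_learning"

print("mode selected for the final step:", mode)
```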
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.