Sequence labeling method, system and computer equipment

Technical Field
The embodiments of the present invention relate to the field of sequence annotation, and in particular, to a sequence annotation method, system, computer device, and computer-readable storage medium.
Background
Named entity recognition is one of the most basic and most widely used tasks in natural language processing. It identifies entities with special meaning in text, mainly including person names, place names, organization names, proper nouns, and the like. Named entity recognition is an important basic tool for subsequent application fields such as information extraction, question answering systems, syntactic analysis, machine translation, semantic-web-oriented metadata annotation, and the like. With this tool, a natural language model can be constructed that understands, analyzes, and answers natural language like a human. However, existing models often cannot take long-term context information into account, so the recognition accuracy is limited.
Therefore, how to solve the problem that existing models cannot take the long-term context information into account, so as to further improve the recognition accuracy of sequence labeling, has become one of the technical problems to be solved at present.
Disclosure of Invention
In view of the above, it is necessary to provide a sequence labeling method, a sequence labeling system, a computer device, and a computer-readable storage medium, so as to solve the technical problem that existing models cannot take the long-term context information into account and the recognition accuracy of sequence labeling is therefore limited.
In order to achieve the above object, an embodiment of the present invention provides a sequence tagging method, where the method includes:
receiving a target text sequence, and converting the target text sequence into corresponding sentence vectors, word vectors of all words and position vectors of all words;
inputting sentence vectors of the target text sequence, word vectors of all words and position vectors of all words into a trained BERT model, and outputting a first labeling sequence corresponding to the target text sequence through the BERT model, wherein the first labeling sequence comprises a plurality of first n-dimensional vectors, each first n-dimensional vector corresponds to one word in the target text sequence, and the first n-dimensional vectors represent first probabilities of corresponding words belonging to each first label in n first labels;
inputting the first labeling sequence into a full-connection layer, and outputting a second labeling sequence through the full-connection layer, wherein the second labeling sequence comprises a plurality of second n-dimensional vectors, each second n-dimensional vector corresponds to one word in the target text sequence, and the second n-dimensional vectors represent second probabilities of the corresponding words belonging to each of n second tags;
using the second labeling sequence as an input sequence of a conditional random field (CRF) model to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model; and
and generating a named entity sequence according to the label sequence, and outputting the named entity sequence.
Illustratively, the step of converting the target text sequence into corresponding sentence vectors, word vectors for respective words, and position vectors for respective words includes:
inputting the target text sequence into an embedding layer, and outputting a plurality of word vectors corresponding to the target text sequence through the embedding layer, wherein the word vectors comprise at least one punctuation vector;
inputting the word vectors into a segmentation layer, segmenting the word vectors according to the at least one punctuation vector to obtain n word vector sets, wherein the n word vector sets correspond to n segmentation codes;
performing coding operation on each division code through position coding, and determining the position information coding of each division code to obtain the position vector of each word in the target text sequence; and
and generating a sentence vector of the target text sequence according to the word vector of each word in the target text sequence and the position vector of each word.
Illustratively, the step of outputting, by the BERT model, a first annotation sequence corresponding to the target text sequence includes:
performing feature extraction on the sentence vector of the target text sequence, the word vector of each word and the position vector of each word through the BERT model to obtain a first probability of each first label of each word in the target text sequence;
and generating a first labeling sequence according to the first probability of each first label of each word in the target text sequence.
Illustratively, the step of inputting the first annotation sequence into a fully-connected layer and outputting a second annotation sequence through the fully-connected layer comprises:
inputting the first labeling sequence into a neural network structure of a fully connected layer, and performing additional feature extraction to obtain a second probability of each second label of each word in the target text sequence, wherein the operation formula for the additional feature extraction of the i-th word in the target text sequence is Bi = wXi + b, wherein Xi is the first probability of each first label of the i-th word in the first labeling sequence, and w and b are BERT model learning parameters;
and generating a second labeling sequence according to the second probability of each label of each word in the target text sequence.
Illustratively, the step of using the second tagging sequence as an input sequence of a conditional random field (CRF) model to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model includes:
inputting the second annotation sequence into a CRF model;
performing Viterbi solution on the second labeling sequence through a Viterbi algorithm to obtain an optimal solution path in the second labeling sequence, wherein the optimal solution path is the tag sequence with the highest probability over the whole target text sequence;
and generating a label sequence according to the optimal solving path.
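For ease of understanding only, a minimal sketch of the data flow described above (BERT output, fully connected layer, CRF decoding) is given below; the tensor sizes, variable names, and the use of PyTorch are assumptions for illustration and are not part of the claimed method.

```python
import torch
import torch.nn as nn

m, n = 6, 9                      # m words in the sentence, n candidate labels (assumed values)

# Stand-in for the trained BERT output: the "first labeling sequence",
# one n-dimensional probability vector per word in the target text sequence.
first_seq = torch.softmax(torch.randn(m, n), dim=-1)

# Fully connected layer implementing Bi = w * Xi + b for each first n-dimensional vector.
fc = nn.Linear(n, n)
second_seq = fc(first_seq)       # the "second labeling sequence", shape (m, n)

# A CRF layer would now decode second_seq into the tag sequence Y = (y1, ..., ym);
# a Viterbi decoding sketch is given further below in Example one.
print(second_seq.shape)          # torch.Size([6, 9])
```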
In order to achieve the above object, an embodiment of the present invention further provides a sequence annotation system, including:
the text receiving module is used for receiving a target text sequence and converting the target text sequence into corresponding sentence vectors, word vectors of all the words and position vectors of all the words;
a first labeling module, configured to input a sentence vector of the target text sequence, a word vector of each word, and a position vector of each word into a trained BERT model, and output a first labeling sequence corresponding to the target text sequence through the BERT model, where the first labeling sequence includes a plurality of first n-dimensional vectors, each first n-dimensional vector corresponds to one word in the target text sequence, and the first n-dimensional vector represents a first probability that a corresponding word belongs to each first tag of the n first tags;
a second labeling module, configured to input the first labeling sequence into a full-connected layer, and output a second labeling sequence through the full-connected layer, where the second labeling sequence includes a plurality of second n-dimensional vectors, each second n-dimensional vector corresponds to one word in the target text sequence, and the second n-dimensional vector represents a second probability that a corresponding word belongs to each of n second tags;
an output tag module, configured to use the second tagging sequence as an input sequence of a conditional random field (CRF) model, so as to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model; and
and the output entity module is used for generating a named entity sequence according to the label sequence and outputting the named entity sequence.
Illustratively, the text receiving module is further configured to:
inputting the target text sequence into an embedding layer, and outputting a plurality of word vectors corresponding to the target text sequence through the embedding layer, wherein the word vectors comprise at least one punctuation vector;
inputting the word vectors into a segmentation layer, segmenting the word vectors according to the at least one punctuation vector to obtain n word vector sets, wherein the n word vector sets correspond to n segmentation codes; and
performing coding operation on each division code through position coding, and determining the position information coding of each division code to obtain the position vector of each word in the target text sequence;
and generating a sentence vector of the target text sequence according to the word vector of each word in the target text sequence and the position vector of each word.
Illustratively, the first annotation module is further configured to:
performing feature extraction on the sentence vector of the target text sequence, the word vector of each word and the position vector of each word through the BERT model to obtain a first probability of each first label of each word in the target text sequence;
and generating a first labeling sequence according to the first probability of each first label of each word in the target text sequence.
To achieve the above object, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when executed by the processor, the computer program implements the steps of the sequence tagging method as described above.
To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the sequence tagging method as described above.
The sequence labeling method, system, computer device, and computer-readable storage medium provided by the embodiments of the present invention provide an effective sequence labeling method for a text sequence. The embodiments of the present invention solve the technical problem that prior-art models cannot take long-term context information into account, which limits recognition accuracy; achieve the technical effect that named entities in a sentence can be extracted by directly inputting the original sentence into the model, with strong adaptability and a wide application range; and improve the accuracy of sequence labeling for entity recognition.
Drawings
Fig. 1 is a flowchart illustrating a sequence tagging method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of program modules of a second embodiment of the sequence annotation system.
Fig. 3 is a schematic diagram of a hardware structure of a third embodiment of the computer device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions relating to "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that such combinations can be realized by a person skilled in the art; when the combined technical solutions are contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
In the following embodiments, the computer device 2 will be exemplarily described as the execution subject.
Example one
Referring to fig. 1, a flowchart of steps of a sequence labeling method according to an embodiment of the present invention is shown. It is to be understood that the flowcharts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is made by way of example with the computer device 2 as the execution subject. The details are as follows.
Step S100, receiving a target text sequence, and converting the target text sequence into a corresponding sentence vector, a word vector of each word, and a position vector of each word.
Specifically, the step S100 may further include:
step S100a, inputting the target text sequence into an embedding layer, and outputting a plurality of word vectors corresponding to the target text sequence through the embedding layer, where the plurality of word vectors includes at least one punctuation vector.
Illustratively, when the received target text sequence is [Curie was born in Poland and resides in the United States], each word and special symbol needs to be converted into a word embedding vector, because the neural network can only perform numerical calculations.
Step S100b, inputting the word vectors into a segmentation layer, segmenting the word vectors according to the at least one punctuation vector to obtain n word vector sets, where the n word vector sets correspond to n segmentation codes.
Illustratively, the target text sequence is divided into a sentence A [Curie was born in Poland] and a sentence B [resides in the United States], where a segmentation code A is added to the first half sentence and a segmentation code B is added to the second half sentence.
Step S100c, performing encoding operation on each division code through position encoding, and determining position information encoding of each division code to obtain a position vector of each word in the target text sequence.
Illustratively, the position information encoding may be used to determine the position of each word in the target text sequence.
Step S100d, generating a sentence vector of the target text sequence according to the word vector of each word in the target text sequence and the position vector of each word.
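For illustration, a minimal sketch of step S100 is given below, assuming BERT-style learned word, segmentation, and position embeddings that are summed into the sentence representation; the vocabulary size, dimensions, and token ids are placeholders rather than values prescribed by this embodiment.

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len, n_segments = 1000, 32, 64, 2   # placeholder sizes

word_emb = nn.Embedding(vocab_size, d_model)      # embedding layer (step S100a)
seg_emb  = nn.Embedding(n_segments, d_model)      # segmentation codes A/B (step S100b)
pos_emb  = nn.Embedding(max_len, d_model)         # position information coding (step S100c)

# Toy token ids for "[Curie was born in Poland , resides in the United States]";
# the comma (punctuation vector) splits the sequence into segment 0 and segment 1.
token_ids    = torch.tensor([[11, 12, 13, 14, 15, 5, 16, 17, 18, 19, 20]])
segment_ids  = torch.tensor([[0,  0,  0,  0,  0, 0, 1,  1,  1,  1,  1]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

# Sentence representation built from the word vectors plus the segmentation and
# position vectors, as in step S100d; BERT sums the three embeddings.
sentence_vectors = word_emb(token_ids) + seg_emb(segment_ids) + pos_emb(position_ids)
print(sentence_vectors.shape)    # torch.Size([1, 11, 32])
```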
Step S102, inputting the sentence vector of the target text sequence, the word vector of each word and the position vector of each word into a trained BERT model, and outputting a first labeling sequence corresponding to the target text sequence through the BERT model, wherein the first labeling sequence comprises a plurality of first n-dimensional vectors, each first n-dimensional vector corresponds to one word in the target text sequence, and the first n-dimensional vectors represent first probabilities of the corresponding words belonging to each first label in n first labels.
For example, the n first tags may be a plurality of location tags and a plurality of semantic tags, and the n first tags may also be a plurality of location tags and a plurality of part-of-speech tags.
Illustratively, BERT is an existing pre-training model, whose full name is Bidirectional Encoder Representations from Transformers, i.e., the encoder of a bidirectional Transformer; the Transformer is a method that relies entirely on self-attention to compute input and output representations. BERT aims to pre-train deep bidirectional representations by jointly conditioning on the context in all layers. Therefore, the pre-trained BERT representation can be fine-tuned with an additional output layer and is suitable for building state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architectural modification.
Illustratively, the BERT model may be obtained by capturing word-level representations through a Masked Language Model (MLM) method and sentence-level representations through a "Next Sentence Prediction" method. In the masked language model, some words (tokens) in the input are masked at random, and the goal is to predict the original vocabulary id of each masked word based only on its context; unlike left-to-right language model pre-training, the training objective of the masked language model allows the representation to fuse the left and right context, thereby pre-training a deep bidirectional Transformer. "Next Sentence Prediction" means that when the language model is pre-trained, sentence pairs are selected in one of two ways: either two sentences that are genuinely consecutive in the corpus are selected, or the second sentence is drawn at random from the corpus (as if by throwing dice) and spliced after the first sentence. In addition to the masked language model task, the model is required to perform this sentence-relation prediction to judge whether the second sentence is a true successor of the first sentence.
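As a toy illustration of how the masked-language-model and "Next Sentence Prediction" training data could be constructed (the 15% masking rate follows the published BERT setting; everything else is a placeholder):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly mask some tokens; the model is trained to predict the originals."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)      # original vocabulary item to be predicted
        else:
            masked.append(tok)
            targets.append(None)     # not predicted
    return masked, targets

def make_nsp_pair(sentence_a, next_sentence, corpus, p_real=0.5):
    """Next Sentence Prediction pair: half the time keep the real next sentence,
    half the time splice in a random sentence drawn from the corpus."""
    if random.random() < p_real:
        return sentence_a, next_sentence, True
    return sentence_a, random.choice(corpus), False
```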
Illustratively, the training of the pre-trained BERT model may include: the method comprises the steps of obtaining a plurality of training text sequences, using the training text sequences as a training set of a BERT model, inputting the training set into a pre-trained BERT model, and training the pre-trained BERT model through the training set to obtain the trained BERT model.
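A hypothetical fine-tuning sketch using the Hugging Face transformers package is shown below; the model name, label count, and training loop are assumptions for illustration, not the training procedure prescribed by this embodiment.

```python
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

n_labels = 9   # assumed number of candidate labels
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=n_labels)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def training_step(texts, label_ids):
    """One step over a batch of training text sequences and their gold tag ids
    (label_ids must be shaped like the tokenized input_ids)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, labels=label_ids)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```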
Specifically, the step S102 may further include:
step S102a, performing feature extraction on the sentence vector of the target text sequence, the word vector of each word and the position vector of each word through the BERT model to obtain a first probability of each first label of each word in the target text sequence.
Step S102b, generating a first labeling sequence according to the first probability of each first label of each word in the target text sequence.
Step S104, inputting the first labeling sequence into a full-connection layer, and outputting a second labeling sequence through the full-connection layer, wherein the second labeling sequence comprises a plurality of second n-dimensional vectors, each second n-dimensional vector corresponds to one word in the target text sequence, and the second n-dimensional vectors represent second probabilities of the corresponding words belonging to each second label of the n second labels.
For example, the n second tags may be a plurality of location tags and a plurality of semantic tags, and the n second tags may also be a plurality of location tags and a plurality of part-of-speech tags.
Specifically, the step S104 may further include:
step S104a, inputting the first labeling sequence into a neural network structure of a full connection layer, performing additional feature extraction to obtain a second probability of each label of each word in the target text sequence, wherein an operation formula for the additional feature extraction of the ith word in the target text sequence is Bi=wXi+ b, wherein, XiIs the first probability of each first label of the ith word in the first annotation sequence, and w and b are BERT model learning parameters;
for example, the fully-connected layer neural network structure of this embodiment may be a multi-layer transform structure, where the multi-layer transform structure further includes an attention mechanism, and the first annotation sequence is processed by the attention mechanism and then input to the feed-forward fully-connected neural network structure for additional feature extraction, so as to obtain a second probability of each second label of each word in the target text sequence; and obtaining a second probability of each second label of each word in the target text sequence through the operation of wx + b, wherein x is the sequence, and w and b are model learning parameters.
Step S104b, generating a second labeling sequence according to the second probability of each second label of each word in the target text sequence.
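As a worked illustration of the formula Bi = wXi + b (with assumed sizes; w and b would be learned during training rather than drawn at random):

```python
import torch

n = 9                                   # number of candidate labels (assumed)
w = torch.randn(n, n)                   # learned weight matrix (random stand-in)
b = torch.randn(n)                      # learned bias (random stand-in)

x_i = torch.softmax(torch.randn(n), 0)  # first probabilities Xi of the i-th word
b_i = w @ x_i + b                       # Bi = w * Xi + b, the i-th second n-dimensional vector
print(b_i.shape)                        # torch.Size([9])
```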
Step S106, the second labeling sequence is used as an input sequence of a conditional random field CRF model, and a label sequence Y = (y1, y2, ..., ym) is output through the CRF model.
Specifically, the step S106 may further include:
step S106a, inputting the second annotation sequence into a CRF model;
step S106b, performing Viterbi solution on the second labeling sequence through a Viterbi algorithm to obtain an optimal solution path in the second labeling sequence, wherein the optimal solution path is the tag sequence with the highest probability over the whole target text sequence;
Exemplarily, in this step, the output corresponding to the target text sequence is determined according to the probability values of the second probabilities of the second labels of each word in the target text sequence; this is implemented by the Viterbi algorithm, which does not output, for each word independently, the label with the highest second probability, but outputs the labeled sequence with the highest probability for the entire target text sequence.
Illustratively, the Viterbi algorithm may be understood as follows: if the path with the highest probability over the second probabilities of the second labels of the words in the target text sequence passes through a certain point of the trellis (lattice) network, then the sub-path from the starting point to that point must also be the highest-probability path from the beginning to that point; when there are k states at the i-th time step, there are k shortest paths from the beginning to the k states at the i-th time step, and the final shortest path must pass through one of them.
Step S106c, generating a label sequence according to the optimal solution path.
For example, the Viterbi algorithm is used to calculate the highest-probability labeling sequence of the whole target text sequence; when the shortest path to the (i+1)-th state is calculated, only the shortest paths from the beginning to the current k state values and the paths from those state values to the (i+1)-th state value need to be considered.
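A minimal sketch of Viterbi decoding over per-word emission scores and tag-transition scores is given below; the CRF transition matrix would be learned during training, and the tensor shapes are assumptions.

```python
import torch

def viterbi_decode(emissions, transitions):
    """Find the highest-probability tag path for the whole sequence.
    emissions: (m, n) per-word scores (the second labeling sequence);
    transitions: (n, n) scores, transitions[i, j] = score of tag i followed by tag j."""
    m, n = emissions.shape
    score = emissions[0].clone()                 # best score of paths ending in each tag
    backpointers = []
    for t in range(1, m):
        # total[i, j] = best score ending in tag i at step t-1, then moving to tag j
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)      # keep the best previous tag for each tag j
        backpointers.append(best_prev)
    # follow back-pointers from the best final tag
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return list(reversed(path))                  # the tag sequence Y = (y1, ..., ym)

print(viterbi_decode(torch.randn(6, 9), torch.randn(9, 9)))
```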
And S108, generating a named entity sequence according to the label sequence, and outputting the named entity sequence.
Illustratively, a named entity sequence can be generated according to the tag sequence; the named entity sequence is the prediction of the labeling system for the target text sequence. The named entities include place names, person names, and the like. The sequence labeling adopts the BIOES scheme, where B marks the beginning of an entity, I the middle of an entity, O a non-entity, E the end of an entity, and S a single-character entity; each named entity label also carries an entity category, for example B-place name: the beginning of a place name entity, and so on. Here, place names and person names are taken as examples. Taking the sentence "Curie was born in Warsaw" as an example, the sentence is split into a word sequence: the first character of "Curie" is labeled B-person name, the last character of "Curie" is labeled E-person name, the non-entity words are labeled O, the first character of "Warsaw" is labeled B-place name, and the last character of "Warsaw" is labeled E-place name.
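A minimal sketch of turning a BIOES tag sequence into a named entity sequence is given below; the tag names and example words are placeholders, not the tag set prescribed by this embodiment.

```python
def bioes_to_entities(words, tags):
    """Collect named entities from a BIOES tag sequence, e.g.
    words = ["Cu", "rie", "born", "War", "saw"]
    tags  = ["B-person", "E-person", "O", "B-place", "E-place"]
    -> [("Curie", "person"), ("Warsaw", "place")]"""
    entities, buffer, buffer_type = [], [], None
    for word, tag in zip(words, tags):
        if tag == "O":
            buffer, buffer_type = [], None
        elif tag.startswith("S-"):
            entities.append((word, tag[2:]))           # single-character entity
        elif tag.startswith("B-"):
            buffer, buffer_type = [word], tag[2:]      # entity beginning
        elif tag.startswith("I-") and buffer:
            buffer.append(word)                        # entity middle
        elif tag.startswith("E-") and buffer:
            buffer.append(word)                        # entity end
            entities.append(("".join(buffer), buffer_type))
            buffer, buffer_type = [], None
    return entities

print(bioes_to_entities(["Cu", "rie", "born", "War", "saw"],
                        ["B-person", "E-person", "O", "B-place", "E-place"]))
```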
Example two
FIG. 2 is a schematic diagram of program modules of a second embodiment of the sequence annotation system. The sequence annotation system 20 can include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the present invention and implement the above-described sequence annotation method. The program module referred to in the embodiments of the present invention refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable than the program itself for describing the execution process of the sequence annotation system 20 in the storage medium. The following description will specifically describe the functions of the program modules of the present embodiment:
a text receiving module 200, configured to receive a target text sequence and convert the target text sequence into corresponding sentence vectors, word vectors of each word, and position vectors of each word.
Illustratively, the text receiving module 200 is further configured to: inputting the target text sequence into an embedding layer, and outputting a plurality of word vectors corresponding to the target text sequence through the embedding layer, wherein the word vectors comprise at least one punctuation vector; inputting the word vectors into a segmentation layer, segmenting the word vectors according to the at least one punctuation vector to obtain n word vector sets, wherein the n word vector sets correspond to n segmentation codes; performing coding operation on each division code through position coding, and determining the position information coding of each division code to obtain the position vector of each word in the target text sequence; and generating sentence vectors of the target text sequence according to the word vector of each word in the target text sequence and the position vector of each word.
A first labeling module 202, configured to input a sentence vector of the target text sequence, a word vector of each word, and a position vector of each word into a trained BERT model, and output a first labeling sequence corresponding to the target text sequence through the BERT model, where the first labeling sequence includes a plurality of first n-dimensional vectors, each first n-dimensional vector corresponds to one word in the target text sequence, and the first n-dimensional vector represents a first probability that the corresponding word belongs to each first tag in n first tags.
Illustratively, the first annotation module 202 is further configured to: performing feature extraction on the sentence vector of the target text sequence, the word vector of each word and the position vector of each word through the BERT model to obtain a first probability of each first label of each word in the target text sequence; and generating a first labeling sequence according to the first probability of each first label of each word in the target text sequence.
A second labeling module 204, configured to input the first labeling sequence into a full-connected layer, and output a second labeling sequence through the full-connected layer, where the second labeling sequence includes a plurality of second n-dimensional vectors, each second n-dimensional vector corresponds to one word in the target text sequence, and the second n-dimensional vector represents a second probability that the corresponding word belongs to each second tag of the n second tags.
Illustratively, the second annotation module 204 is further configured to: inputting the first labeling sequence into a neural network structure of a fully connected layer, performing additional feature extraction to obtain a second probability of each second label of each word in the target text sequence, wherein the operation formula for the additional feature extraction of the i-th word in the target text sequence is Bi = wXi + b, wherein Xi is the first probability of each first label of the i-th word in the first labeling sequence, and w and b are BERT model learning parameters; and generating a second labeling sequence according to the second probability of each second label of each word in the target text sequence.
An output tag module 206, configured to use the second tagging sequence as an input sequence of a conditional random field CRF model, so as to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model.
Illustratively, the output tag module 206 is further configured to: inputting the second annotation sequence into a CRF model; performing Viterbi solution on the second labeling sequence through a Viterbi algorithm to obtain an optimal solution path in the second labeling sequence, wherein the optimal solution path is a highest probability sequence of the label sequence being a whole target text sequence; and generating a label sequence according to the optimal solving path.
And the output entity module 208 is configured to generate a named entity sequence according to the tag sequence, and output the named entity sequence.
EXAMPLE III
Fig. 3 is a schematic diagram of a hardware architecture of a computer device according to a third embodiment of the present invention. In the present embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing in accordance with preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of a plurality of servers), and the like. As shown, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23, and a sequence annotation system 20, which are communicatively coupled to each other via a system bus.
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) provided on the computer device 2. Of course, the memory 21 may also comprise both internal and external storage units of the computer device 2. In this embodiment, the memory 21 is generally used for storing an operating system installed on the computer device 2 and various application software, such as the program code of the sequence annotation system 20 of the second embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is configured to execute the program code stored in the memory 21 or to process data, for example, to execute the sequence annotation system 20, so as to implement the sequence annotation method of the first embodiment.
The network interface 23 may comprise a wireless network interface or a wired network interface, and the network interface 23 is generally used for establishing a communication connection between the computer device 2 and other electronic apparatuses. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an Intranet, the Internet, the Global System for Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like.
It is noted that fig. 3 only shows the computer device 2 with components 20-23, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In this embodiment, the sequence annotation system 20 stored in the memory 21 can be further divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the present invention.
For example, fig. 2 is a schematic diagram of the program modules for implementing the sequence annotation system 20 according to the second embodiment of the present invention, in which the sequence annotation system 20 can be divided into a text receiving module 200, a first annotation module 202, a second annotation module 204, an output tag module 206, and an output entity module 208. The program modules referred to herein are a series of computer program instruction segments that can perform specific functions, and are more suitable than programs for describing the execution process of the sequence annotation system 20 in the computer device 2. The specific functions of the program modules 200-208 have been described in detail in the second embodiment and are not described herein again.
Example four
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application mall, etc., on which a computer program is stored, which when executed by a processor implements corresponding functions. The computer-readable storage medium of this embodiment is used for the sequence annotation system 20, and when executed by a processor, implements the sequence annotation method of the first embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.