Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Exemplary Method
Fig. 1 is a flowchart illustrating an entity recognition method based on deep learning according to an exemplary embodiment of the present application. As shown in Fig. 1, the deep-learning-based entity recognition method includes:
Step 110: splitting an input natural sentence into a plurality of word vectors, wherein the plurality of word vectors form the natural sentence.
Because a natural sentence may contain words that contribute little to its semantics, the sentence is split into a plurality of word vectors that together form the natural sentence, so that each word vector can be analyzed and recognized individually.
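As a rough illustration only, the following sketch splits a sentence into tokens and looks up a vector for each one. The whitespace tokenizer, the toy vocabulary, and the 128-dimensional embedding are assumptions made for illustration, not the segmenter actually used by the application.

```python
# Minimal sketch: split a sentence into words and map each to a word vector.
# The vocabulary and the naive whitespace split are illustrative assumptions.
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "abdominal": 1, "pain": 2, "3": 3, "days": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=128)

def split_into_word_vectors(sentence: str) -> torch.Tensor:
    tokens = sentence.lower().split()                        # naive word segmentation
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return embedding(torch.tensor(ids))                      # shape: (num_words, 128)

word_vectors = split_into_word_vectors("abdominal pain 3 days")
print(word_vectors.shape)  # torch.Size([4, 128])
```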
Step 120: performing feature extraction on each of the plurality of word vectors to obtain the feature information of each word vector, wherein the feature information comprises category information of the word vector.
After the plurality of word vectors is obtained, features are extracted from each word vector to obtain its feature information, such as category information and semantic information.

In Bidirectional Encoder Representations from Transformers (BERT), the word embedding dimension and the dimension of the encoder output are both 768. Word-level embeddings are context-independent representations, whereas the hidden-layer output carries not only the everyday meaning of the word but also contextual information, so the hidden-layer representation should hold more information. For this reason, in the ALBERT (A Lite BERT) model used in the present application, the word-vector dimension is smaller than the dimension of the encoder output. In recognition tasks the dictionary is usually large, so the embedding matrix contains a large number of parameters, and its updates are sparse during back-propagation. Combining these two points, the present application adopts factorization to reduce the number of parameters: the word vector is first mapped to a low-dimensional space and then mapped to the high-dimensional space, that is, a very low-dimensional embedding matrix is applied first and a high-dimensional matrix then changes the dimension to that of the hidden layer, thereby reducing the parameters.

The present application also uses parameter sharing. Several sharing schemes exist for the Transformer, for example sharing only the fully connected layers or only the attention layers; the present application shares both the fully connected layers and the attention layers, that is, all parameters in the encoder are shared. For a Transformer of the same size, this scheme slightly reduces accuracy, but it greatly reduces the number of parameters and greatly increases the training speed.

BERT's next sentence prediction (NSP) task is a binary classification: positive training samples are two consecutive sentences sampled from the same document, and negative samples are sentences taken from two different documents. To keep only the coherence task and remove the influence of topic recognition, ALBERT replaces NSP with a new sentence-order prediction (SOP) task. In NSP, positive samples are two adjacent sentences and negative samples are two random sentences; in SOP, positive samples are two adjacent sentences in their normal order and negative samples are the same two adjacent sentences with their order swapped, a task related to natural language inference (NLI). Studies have found that the NSP task is not effective, mainly because it is too easy: NSP actually contains two subtasks, topic prediction and coherence prediction, and topic prediction is much simpler, because the model can score well as soon as it notices that the two sentences have different topics. Since SOP selects both sentences from the same document, it concerns only sentence order and is unaffected by topic.
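The factorization described above can be pictured with a short sketch: the vocabulary is first embedded into a low-dimensional space and then projected into the hidden space, so the large vocabulary-by-hidden matrix is never materialized. The sizes below (vocabulary 30,000, embedding dimension 128, hidden dimension 768) are illustrative assumptions, not values mandated by the application.

```python
# Sketch of ALBERT-style factorized embedding parameterization: two small
# matrices (V x E and E x H) replace one large V x H embedding matrix.
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, embedding_dim=128, hidden_dim=768):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, embedding_dim)  # V x E
        self.projection = nn.Linear(embedding_dim, hidden_dim)         # E x H

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.projection(self.word_embedding(token_ids))         # (..., H)

# Parameter count: V*E + E*H = 30000*128 + 128*768 ≈ 3.9M,
# versus V*H = 30000*768 ≈ 23M for an unfactorized embedding.
emb = FactorizedEmbedding()
hidden = emb(torch.tensor([[1, 5, 42]]))
print(hidden.shape)  # torch.Size([1, 3, 768])
```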
Step 130: performing bidirectional encoding on each of the plurality of word vectors to obtain the bidirectional coding information of each word vector, wherein the bidirectional coding information comprises correspondence information between the current word vector and the word vectors before and after it.
After the plurality of word vectors is obtained, bidirectional encoding is performed on each word vector to obtain its bidirectional coding information, that is, the relation information between each word vector and the word vectors before and after it, including the probability of the word vector being combined with other word vectors, the semantics expressed, and so on.

Specifically, a Word-Character LSTM (WC-LSTM) is adopted, which uses four different strategies to encode word information into vectors of fixed size, so that the model can be trained in batches and adapted to various application scenarios. On the basis of the LSTM, the WC-LSTM uses dictionary information to convert the chain structure into a graph structure; the extra nodes carry the dictionary information, and their weights are updated during training. The WC-LSTM follows the same idea as the lattice LSTM but modifies it to address its shortcomings. The lattice LSTM cannot be trained in batches because the number of nodes added for each word is inconsistent and may be zero or more; the WC-LSTM instead rigidly specifies exactly one word-information node between every two words, represented by a null value when no word information exists between them. This modification makes the structure uniform and therefore allows batch training. Finally, each word vector is concatenated with its word-information vector to produce the output vector. The four strategies for encoding word information are: shortest word first, longest word first, average (the mean of the first two), and self-attention.
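A minimal sketch of this encoding scheme follows, assuming the "average" strategy and toy dimensions. It only illustrates the idea of one fixed-size word-information vector per position (a zero vector when no dictionary word matches) concatenated with the word vector before a BiLSTM; it is not the application's exact implementation.

```python
# WC-LSTM-style sketch: compress dictionary-word matches at each position into
# one fixed-size vector, concatenate with the word vector, run a BiLSTM.
import torch
import torch.nn as nn
from typing import List

word_dim, lex_dim, hidden_dim = 128, 64, 100
bilstm = nn.LSTM(word_dim + lex_dim, hidden_dim, bidirectional=True, batch_first=True)

def encode_lexicon(matches: List[torch.Tensor]) -> torch.Tensor:
    """Average strategy: mean of matched dictionary-word embeddings,
    or a zero placeholder when no word matches this position."""
    if not matches:
        return torch.zeros(lex_dim)
    return torch.stack(matches).mean(dim=0)

# one sentence of 4 positions; positions 0 and 2 have dictionary matches
word_vecs = torch.randn(4, word_dim)
lexicon_matches = [[torch.randn(lex_dim)], [], [torch.randn(lex_dim), torch.randn(lex_dim)], []]
lex_vecs = torch.stack([encode_lexicon(m) for m in lexicon_matches])

inputs = torch.cat([word_vecs, lex_vecs], dim=-1).unsqueeze(0)  # (1, 4, 192)
outputs, _ = bilstm(inputs)                                      # (1, 4, 200)
```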
Step 140: obtaining a recognition result according to the feature information and the bidirectional coding information of the plurality of word vectors.
After the feature information and the bidirectional coding information of each word vector are obtained, the semantics of each word vector and information such as the probability of combining with its context can be synthesized, and some of the word vectors in the natural sentence are combined in a certain order to obtain the recognition result. Specifically, a Conditional Random Field (CRF) is used: an emission probability matrix and a transition probability matrix are used in the calculation, the Viterbi algorithm is used for prediction to solve for the optimal path, and the optimal path of the output sequence is the final recognition result.
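The following is a minimal Viterbi decoding sketch over an emission matrix and a transition matrix; the scores are random toy values, and a real CRF layer would also learn these matrices during training.

```python
# Sketch of Viterbi decoding over a CRF's emission and transition scores.
import numpy as np
from typing import List

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> List[int]:
    """emissions: (seq_len, num_tags) scores; transitions: (num_tags, num_tags)."""
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()                  # best score ending in each tag
    backpointers = []
    for t in range(1, seq_len):
        # score[i] + transitions[i, j] + emissions[t, j] for every pair (i, j)
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    best_tag = int(score.argmax())
    best_path = [best_tag]
    for bp in reversed(backpointers):            # trace the optimal path backwards
        best_tag = int(bp[best_tag])
        best_path.append(best_tag)
    return best_path[::-1]

emissions = np.random.randn(5, 4)     # 5 tokens, 4 tags
transitions = np.random.randn(4, 4)
print(viterbi_decode(emissions, transitions))
```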
According to the deep-learning-based entity recognition method, an input natural sentence is split into a plurality of word vectors; feature extraction is performed on the word vectors to obtain the feature information of each word vector; bidirectional encoding is performed on the word vectors to obtain the bidirectional coding information of each word vector; and finally a recognition result is obtained by synthesizing the feature information and the bidirectional coding information of the word vectors. By extracting features from each word in a natural sentence and bidirectionally encoding each word, the semantic features and context features of each word are obtained, so that named entities can be identified accurately.
Fig. 2 is a flowchart illustrating an entity recognition method based on deep learning according to another exemplary embodiment of the present application. As shown in Fig. 2, after step 120, the entity recognition method may further include:
Step 150: performing dimension reduction processing on the feature information to obtain dimension-reduced feature information.
Correspondingly, step 140 is adjusted to: obtaining a recognition result according to the dimension-reduced feature information and the bidirectional coding information. Performing dimension reduction on the feature information simplifies the feature information of each word vector, which increases the operation speed and improves the recognition efficiency.
In an embodiment, a specific implementation of step 150 may be: sharing the global parameter information and the attention parameter information for the plurality of word vectors, that is, sharing the parameters across the encoder layers as described above.
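A sketch of such cross-layer parameter sharing, under the assumption that a single standard Transformer encoder layer is simply reused at every depth, might look as follows; the layer sizes are illustrative.

```python
# Cross-layer parameter sharing sketch: one encoder layer (attention +
# feed-forward) is reused for every "layer" of the stack, so the parameter
# count does not grow with depth.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, d_model=768, nhead=12, num_repeats=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_repeats = num_repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_repeats):   # same weights applied at every depth
            x = self.layer(x)
        return x

encoder = SharedEncoder()
out = encoder(torch.randn(2, 16, 768))      # (batch, seq_len, d_model)
```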
Fig. 3 is a flowchart illustrating a bidirectional encoding method according to an exemplary embodiment of the present application. As shown in Fig. 3, step 130 may include:
step 131: converting the chain structure of the plurality of word vectors into a graph structure.
Step 132: weights are set for the encoded information between every two word vectors in the graph structure.
The chain structure of the plurality of word vectors is converted into a graph structure, and a weight is set for the coding information between every two word vectors in the graph structure to obtain the probability of the two word vectors being combined, which provides a reference for the subsequent combination that yields the recognition result.
In an embodiment, a specific implementation of step 131 may be: setting an information node between every two word vectors, wherein the information node comprises bidirectional coding information and the byte length of the information node is a preset fixed value. Encoding the word information into a vector of fixed size allows batch training and adaptation to various application scenarios. Specifically, when there is no bidirectional coding information between two word vectors, the information node between them is set to a null vector of the preset byte length. Representing the absence of word information between word vectors with a null vector makes the structure uniform, so batch training can be used and training efficiency is improved.
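A minimal sketch of these fixed-length information nodes, assuming a 64-dimensional node and zero vectors as the "null" placeholder, could look like this; it only shows why the uniform structure makes batching possible.

```python
# Fixed-size information nodes: exactly one node per gap between word vectors,
# with a zero ("null") vector where no word information exists, so sequences
# of equal length stack into a single batch tensor.
import torch
from typing import List, Optional

node_dim = 64

def build_nodes(word_info: List[Optional[torch.Tensor]]) -> torch.Tensor:
    """One fixed-size node per gap; None means no word information exists there."""
    return torch.stack([info if info is not None else torch.zeros(node_dim)
                        for info in word_info])

# two sentences, each with three gaps between four word vectors
sent_a = build_nodes([torch.randn(node_dim), None, torch.randn(node_dim)])
sent_b = build_nodes([None, None, torch.randn(node_dim)])
batch = torch.stack([sent_a, sent_b])   # (2, 3, 64): uniform shape, batch-trainable
```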
Fig. 4 is a flowchart illustrating a method for obtaining a recognition result according to an exemplary embodiment of the present application. As shown in Fig. 4, step 140 may include:
Step 141: obtaining a plurality of predicted paths according to the feature information and the bidirectional coding information of the plurality of word vectors, wherein a predicted path characterizes an arrangement order of the plurality of word vectors.
According to the feature information and the bidirectional coding information of the plurality of word vectors, the word vectors are combined in certain orders to obtain a plurality of predicted paths, that is, a plurality of candidate recognition results.
Step 142: evaluating the plurality of predicted paths to obtain a plurality of evaluation results.
The plurality of predicted paths are evaluated, for example by scoring each predicted path according to a preset rule, to obtain a plurality of evaluation results.
Step 143: selecting the predicted path corresponding to the optimal result among the plurality of evaluation results as the recognition result.
The predicted path corresponding to the optimal result (that is, the evaluation result with the highest score) among the plurality of evaluation results is selected as the recognition result, so as to ensure recognition precision.
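For illustration, these three steps can be pictured with a deliberately brute-force sketch that enumerates candidate paths, scores each one, and keeps the best; the Viterbi sketch shown earlier finds the same optimal path without enumeration. The scores here are random toy values.

```python
# Brute-force illustration of steps 141-143: candidate paths, evaluation, best result.
import itertools
import numpy as np

def score_path(path, emissions, transitions):
    s = emissions[0, path[0]]
    for t in range(1, len(path)):
        s += transitions[path[t - 1], path[t]] + emissions[t, path[t]]
    return s

emissions = np.random.randn(4, 3)          # 4 tokens, 3 candidate tags
transitions = np.random.randn(3, 3)
paths = list(itertools.product(range(3), repeat=4))              # step 141: candidate paths
scores = [score_path(p, emissions, transitions) for p in paths]  # step 142: evaluate
best = paths[int(np.argmax(scores))]                             # step 143: best-scoring path
```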
Taking the chief complaint "abdominal pain 3 days" as an example, it is labeled as "abdominal pain" (clinical manifestation) + "3 days" (duration) and input into the model; the model learns its semantic structure and meaning, and training is completed. The specific training process is shown in Fig. 5. When entity recognition is performed, a new input "headache 1 hour" is recognized by the model as "headache" (clinical manifestation) + "1 hour" (duration); the specific recognition process is shown in Fig. 6.
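Purely as an illustration, the chief-complaint example above might be represented as the following labeled sample; the BIO-style tag names are assumptions and not a labeling scheme prescribed by the application.

```python
# How the chief-complaint example above might look as a labeled training sample.
# The BIO-style tag names are illustrative assumptions only.
train_sample = {
    "tokens": ["abdominal", "pain", "3", "days"],
    "labels": ["B-CLINICAL_MANIFESTATION", "I-CLINICAL_MANIFESTATION",
               "B-DURATION", "I-DURATION"],
}

# expected prediction for the new input "headache 1 hour"
test_input = ["headache", "1", "hour"]
expected = ["B-CLINICAL_MANIFESTATION", "B-DURATION", "I-DURATION"]
```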
The WC-LSTM + CRF portion of the ALBERT + WC-LSTM + CRF model in this application may also be replaced with LM-LSTM + CRF. In the traditional BiLSTM + CRF process, word/token-level syntactic and semantic features are extracted by the BiLSTM, and a CRF layer is then attached to ensure the legality and global optimality of the sequence-labeling transitions. The LM-LSTM scheme mainly introduces a character-level (char-level) language model for joint training on top of BiLSTM + CRF, and the overall structure of the model comprises the following:
(1) Char-level language model: an LM is built with a BiLSTM at the char level, so that latent features can be extracted from the corpus through a self-supervised task. Unlike the simple approach of letting each char predict the next char (which may lead the model to merely memorize the spelling of words), a space marker is inserted after the last char of each word, and the next word is predicted at that position.
(2) Word-level sequence labeling model: the embedding of each word is formed by splicing a pre-trained word vector with the char-LM output at the space marker, and the result is fed into BiLSTM + CRF to predict the sequence; a minimal sketch of this composition follows.
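Under illustrative dimensions, the composition of items (1) and (2) might be sketched as follows; the space-marker positions and all sizes are assumptions, and the CRF layer is omitted.

```python
# Sketch of items (1)-(2): a char-level BiLSTM LM runs over the characters,
# its hidden state at the space marker after each word is spliced with that
# word's pretrained embedding, and the result feeds the word-level BiLSTM
# (the CRF layer is omitted here).
import torch
import torch.nn as nn

char_dim, char_hidden, word_dim, word_hidden = 30, 50, 100, 128

char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
word_lstm = nn.LSTM(word_dim + 2 * char_hidden, word_hidden,
                    bidirectional=True, batch_first=True)

# one sentence: 20 characters (spaces included), 4 words
char_embs = torch.randn(1, 20, char_dim)
space_positions = [4, 9, 15, 19]              # index of the space marker after each word
pretrained_word_vecs = torch.randn(1, 4, word_dim)

char_out, _ = char_lstm(char_embs)            # (1, 20, 2*char_hidden)
space_states = char_out[:, space_positions]   # (1, 4, 2*char_hidden)

word_inputs = torch.cat([pretrained_word_vecs, space_states], dim=-1)
word_out, _ = word_lstm(word_inputs)          # (1, 4, 2*word_hidden) -> CRF layer
```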
The LM-BiLSTM + CRF model is carefully designed and trained; its main points include:
(1) Use of information at different granularities. The text sequence labeling task is inherently a word-level task, but with a char-level language model (or other joint training tasks), structural and semantic features between chars and words can be learned from the task text in a self-supervised manner.
(2) Highway trick. Considering the weak correlation between the language model and the labeling task, when the char-level output is used both for LM prediction and for splicing with the word embedding, highway layers are introduced to map the char-level output into different semantic spaces. This lets the underlying BiLSTM focus on extracting general features among chars (while still guaranteeing the flow of information during gradient back-propagation), and lets the highway parameters focus on the labeling task; a minimal sketch of such a highway layer appears after this list.
(3) Fusion and fine-tuning of word vectors. Considering that the corpus is not large and computation is time-consuming, the word vectors in the paper are GloVe vectors pre-trained on a massive corpus and then fine-tuned, rather than vectors pre-trained directly on the task corpus or randomly initialized and jointly trained within LM-BiLSTM + CRF. Meanwhile, besides GloVe, the word-level word vectors fully fuse the information obtained from the char level, so the top-level BiLSTM + CRF model can access that information.
(4) Alignment of tokens. Generally, when a BiLSTM is used, the output at each token (the concatenation of the two directions) is used directly. But when word vectors are fused at the word level, token alignment requires particular attention: for the forward LSTM, the hidden state after each token is used; for the backward LSTM, the hidden state before each token (in terms of absolute position) is used.
(5) Differences between the training and inference phases. In the training phase, the model must simultaneously consider the cross-entropy loss of the char-level LM and the Viterbi loss of the word-level sequence labeling task. In the prediction phase, only the output of the sequence labels is needed. Therefore, the vocabulary of the char-level LM and the word-level embedding vocabulary can differ: the char-level LM vocabulary only needs to cover the training samples, while the embedding vocabulary should be larger so that the model can run inference on more corpora and overcome the OOV problem beyond the training text. However, the char-level task is not well suited to Chinese, so this model is difficult to apply directly to Chinese corpora and needs to be retrained at the word (Chinese character) level.
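A minimal sketch of the highway layer mentioned in point (2), using the common gating formulation, is given below; the dimensions are illustrative and the exact mapping used in the paper may differ.

```python
# Highway layer sketch: a learned gate mixes a nonlinear transform of the
# char-level output with the original input, so shared char features can be
# mapped into different semantic spaces for the LM and the tagging task.
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.transform(x))      # candidate nonlinear mapping
        g = torch.sigmoid(self.gate(x))        # how much of h to let through
        return g * h + (1 - g) * x             # carry the rest of x unchanged

char_features = torch.randn(8, 100)            # char-level BiLSTM outputs
for_lm = Highway(100)(char_features)           # mapped for the LM objective
for_tagging = Highway(100)(char_features)      # separately mapped for tagging
```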
Exemplary Apparatus
Fig. 7 is a schematic structural diagram of an entity recognition apparatus based on deep learning according to an exemplary embodiment of the present application. As shown in Fig. 7, the entity recognition apparatus 50 includes: a splitting module 51, configured to split an input natural sentence into a plurality of word vectors, wherein the plurality of word vectors form the natural sentence; an extraction module 52, configured to perform feature extraction on the plurality of word vectors respectively to obtain the feature information of each word vector, wherein the feature information comprises category information of the word vector; an encoding module 53, configured to perform bidirectional encoding on the plurality of word vectors respectively to obtain the bidirectional coding information of each word vector, wherein the bidirectional coding information comprises correspondence information between the current word vector and the word vectors before and after it; and a recognition module 54, configured to obtain a recognition result according to the feature information and the bidirectional coding information of the plurality of word vectors.
According to the deep-learning-based entity recognition apparatus, the splitting module 51 splits an input natural sentence into a plurality of word vectors; the extraction module 52 performs feature extraction on the word vectors to obtain the feature information of each word vector; the encoding module 53 performs bidirectional encoding on the word vectors to obtain the bidirectional coding information of each word vector; and finally the recognition module 54 synthesizes the feature information and the bidirectional coding information of the word vectors to obtain a recognition result. By extracting features from each word in a natural sentence and bidirectionally encoding each word, the semantic features and context features of each word are obtained, so that named entities can be identified accurately.
Fig. 8 is a schematic structural diagram of an entity recognition apparatus based on deep learning according to another exemplary embodiment of the present application. As shown in Fig. 8, the entity recognition apparatus 50 may further include a dimension reduction module 55, configured to perform dimension reduction processing on the feature information to obtain the dimension-reduced feature information.
In an embodiment, as shown in Fig. 8, the encoding module 53 may include: a conversion unit 531, configured to convert the chain structure of the plurality of word vectors into a graph structure; and a weight setting unit 532, configured to set a weight for the coding information between every two word vectors in the graph structure.
In one embodiment, as shown in Fig. 8, the recognition module 54 may include: a predicted path obtaining unit 541, configured to obtain a plurality of predicted paths from the feature information and the bidirectional coding information of the plurality of word vectors, wherein a predicted path characterizes an arrangement order of the plurality of word vectors; an evaluation unit 542, configured to evaluate the plurality of predicted paths to obtain a plurality of evaluation results; and a result determining unit 543, configured to select the predicted path corresponding to the optimal result among the plurality of evaluation results as the recognition result.
Exemplary Electronic Device
Next, an electronic device according to an embodiment of the present application is described with reference to Fig. 9. The electronic device may be either or both of the first device and the second device, or a stand-alone device separate from them, which stand-alone device may communicate with the first device and the second device to receive the acquired input signals therefrom.
Fig. 9 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in Fig. 9, the electronic device 10 includes one or more processors 11 and memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 11 to implement the deep learning based entity identification methods of the various embodiments of the present application described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
When the electronic device is a stand-alone device, the input means 13 may be a communication network connector for receiving the acquired input signals from the first device and the second device.
The input device 13 may also include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 9, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary Computer Program Product and Computer-Readable Storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the deep learning based entity recognition method according to various embodiments of the present application described in the "exemplary methods" section of this specification, supra.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the deep learning based entity recognition method according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, or configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.