Disclosure of Invention
The main purpose of the present application is to provide a named entity labeling method, a named entity labeling device, a computer device and a storage medium, so as to overcome the defect of low recognition accuracy for short-distance key features when labeling named entities.
In order to achieve the above purpose, the application provides a named entity labeling method, which comprises the following steps:
acquiring sentences in a resume text, and constructing word vectors of the sentences;
performing multi-layer convolution operation on the word vector through a multi-layer convolution layer of the TextCNN model obtained through pre-training to obtain a word vector matrix;
calculating the word vector matrix based on the full connection layer of the TextCNN model to obtain a query vector, a key vector and a value vector;
calculating an attention weight matrix between every two characters in the sentence according to the query vector and the key vector and in combination with a Gaussian deviation matrix, and adjusting the attention weight matrix based on the value vector;
and performing full connection layer processing through the TextCNN model based on the word vector matrix and the adjusted attention weight matrix, and then inputting the result into a softmax classification layer fused with a Gaussian error for classification, to obtain a first named entity label for each character in the sentence.
Further, the step of calculating the word vector matrix based on the full connection layer of the TextCNN model to obtain a query vector, a key vector and a value vector includes:
calculating the word vector matrix based on query vector calculation parameters obtained in advance by training in the full connection layer of the TextCNN model, to obtain the query vector;
calculating the word vector matrix based on key vector calculation parameters obtained in advance by training in the full connection layer of the TextCNN model, to obtain the key vector;
and calculating the word vector matrix based on value vector calculation parameters obtained in advance by training in the full connection layer of the TextCNN model, to obtain the value vector.
Further, after the step of performing full connection layer processing through the TextCNN model based on the word vector matrix and the adjusted attention weight matrix and then inputting the result into a softmax classification layer fused with a Gaussian error for classification to obtain a first named entity label for each character in the sentence, the method comprises the following steps:
adding the classified named entity labels to each character in the sentence to generate a first training sample;
sampling the first training samples with replacement to obtain a plurality of groups of training sample sets, and training an initial TextCNN model based on each group of training sample sets to obtain a corresponding number of TextCNN submodels;
inputting the same unlabeled resume text into all the TextCNN submodels to output a named entity labeling result predicted by each TextCNN submodel;
and judging whether the named entity labeling results predicted by all the TextCNN submodels are the same; if so, verifying that the training of the TextCNN submodels is completed, and verifying that the first named entity labels of the characters in the sentence are correct.
Further, the step of obtaining the sentence in the resume text and constructing the word vector of the sentence comprises the following steps:
obtaining resume text;
inputting the resume text into a preset text detection model to detect each text area in the resume text, wherein the text detection model is trained based on a natural scene text detection model;
respectively adding a marking frame outside each text area;
identifying each marking frame based on an image recognition technology, and performing character recognition on the character content in each marking frame through a character recognition model so as to recognize the character information in each marking frame, taking each piece of recognized character information as a sentence;
and constructing a word vector corresponding to each character in each sentence based on a preset word embedding model.
Further, the step of calculating an attention weight matrix between every two characters in the sentence according to the query vector and the key vector and in combination with a Gaussian deviation matrix, and adjusting the attention weight matrix based on the value vector, comprises the following steps:
calculating a weight matrix according to the query vector and the key vector, based on corresponding weight matrix calculation parameters;
calculating the Gaussian deviation matrix according to the query vector and the key vector, based on corresponding Gaussian deviation matrix calculation parameters;
adding the weight matrix and the Gaussian deviation matrix, and normalizing the sum to obtain the attention weight matrix;
and multiplying the attention weight matrix with the value vector to adjust the attention weight matrix.
The application also provides a named entity labeling device, which comprises:
the acquiring unit is used for acquiring sentences in the resume text and constructing word vectors of the sentences;
the first calculation unit is used for performing a multi-layer convolution operation on the word vectors through the multi-layer convolution layers of the pre-trained TextCNN model to obtain a word vector matrix;
the second calculation unit is used for calculating the word vector matrix based on the full connection layer of the TextCNN model to obtain a query vector, a key vector and a value vector;
the third calculation unit is used for calculating an attention weight matrix between every two characters in the sentence according to the query vector and the key vector and in combination with a Gaussian deviation matrix, and adjusting the attention weight matrix based on the value vector;
and the classifying unit is used for performing full connection layer processing through the TextCNN model based on the word vector matrix and the adjusted attention weight matrix, and then inputting the result into a softmax classification layer fused with a Gaussian error for classification to obtain a first named entity label of each character in the sentence.
Further, the second calculation unit includes:
The first calculating subunit is used for calculating the word vector matrix based on query vector calculation parameters obtained by training in advance in the full-connection layer of the TextCNN model to obtain the query vector;
the second calculation subunit is used for calculating the word vector matrix based on key vector calculation parameters which are obtained by training in advance in the full-connection layer of the TextCNN model to obtain the key vector;
and the third calculation subunit is used for calculating the word vector matrix based on the value vector calculation parameters obtained by pre-training in the full-connection layer of the TextCNN model to obtain the value vector.
Further, the named entity labeling device further comprises:
the generation unit is used for adding the named entity labels obtained by classification to each character in the sentence to generate first training samples;
the training unit is used for sampling the first training samples with replacement to obtain a plurality of groups of training sample sets, and training an initial TextCNN model based on each group of training sample sets to obtain a corresponding number of TextCNN submodels;
the output unit is used for inputting the same unlabeled resume text into all the TextCNN submodels so as to output a named entity labeling result predicted by each TextCNN submodel;
and the verification unit is used for judging whether the named entity labeling results predicted by all the TextCNN submodels are the same, and if so, verifying that the training of the TextCNN submodels is completed and that the first named entity labels of the characters in the sentence are correct.
Further, the acquisition unit includes:
the acquisition subunit is used for acquiring the resume text;
the detection subunit is used for inputting the resume text into a preset text detection model to detect each text area in the resume text, wherein the text detection model is trained based on a natural scene text detection model;
the adding subunit is used for adding a marking frame outside each text area;
the identification subunit is used for identifying each marking frame based on an image recognition technology, performing character recognition on the character content in each marking frame through a character recognition model so as to recognize the character information in each marking frame, and taking each piece of recognized character information as a sentence;
and the construction subunit is used for constructing a word vector corresponding to each character in each sentence based on a preset word embedding model.
Further, the third computing unit includes:
the fourth calculation subunit is used for calculating a weight matrix according to the query vector and the key vector based on corresponding weight matrix calculation parameters;
the fifth calculation subunit is used for calculating the Gaussian deviation matrix according to the query vector and the key vector based on corresponding Gaussian deviation matrix calculation parameters;
the adding subunit is used for adding the weight matrix and the Gaussian deviation matrix and normalizing the sum to obtain the attention weight matrix;
and the adjustment subunit is used for multiplying the attention weight matrix by the value vector so as to adjust the attention weight matrix.
The application also provides a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of any of the methods described above when the computer program is executed.
The application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of any of the preceding claims.
The named entity labeling method, device, computer equipment and storage medium comprise: obtaining sentences in a resume text and constructing word vectors of the sentences; performing a multi-layer convolution operation on the word vectors through the multi-layer convolution layers of a pre-trained TextCNN model to obtain a word vector matrix; calculating the word vector matrix based on a full connection layer of the TextCNN model to obtain a query vector, a key vector and a value vector; calculating an attention weight matrix between every two characters in the sentence according to the query vector and the key vector in combination with a Gaussian deviation matrix, and adjusting the attention weight matrix based on the value vector; and performing full connection layer processing on the word vector matrix and the adjusted attention weight matrix through the TextCNN model, and then inputting the result into a softmax classification layer fused with a Gaussian error for classification, to obtain a first named entity label for each character in the sentence. The application introduces a learnable Gaussian deviation matrix as a weight, introduces the center position of a local range and a moving window to calculate the Gaussian deviation, and puts the Gaussian deviation into the softmax function to correct the locally enhanced weight distribution, thereby enhancing the capability of capturing local context.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Referring to fig. 1, in one embodiment of the present application, a named entity labeling method is provided, including the following steps:
Step S1, acquiring sentences in a resume text, and constructing word vectors of the sentences;
Step S2, performing a multi-layer convolution operation on the word vectors through the multi-layer convolution layers of a pre-trained TextCNN model to obtain a word vector matrix;
Step S3, calculating the word vector matrix based on the full connection layer of the TextCNN model to obtain a query vector, a key vector and a value vector;
Step S4, calculating an attention weight matrix between every two characters in the sentence according to the query vector and the key vector and in combination with a Gaussian deviation matrix, and adjusting the attention weight matrix based on the value vector;
Step S5, performing full connection layer processing through the TextCNN model based on the word vector matrix and the adjusted attention weight matrix, and then inputting the result into a softmax classification layer fused with a Gaussian error for classification, to obtain a first named entity label for each character in the sentence.
In this embodiment, the named entity labeling method is applied to a scenario of automatically extracting named entities such as school names, company names, professional information and the like in resume texts.
As described in step S1 above, the resume text generally includes a plurality of sentences. In this embodiment, each sentence in the resume text is obtained, and a corresponding word vector is constructed for each sentence. It should be understood that, before the word vectors are constructed, the sentence may be further preprocessed; the preprocessing includes removing characters such as special symbols and stop words, and converting the unformatted text into a format that the algorithm can operate on. After the preprocessing is completed, the sentence is input into the embedding layer of a word embedding model, which converts each character in the sentence into a corresponding word vector (typically 300-dimensional). The word vector dictionary in the embedding layer is obtained in advance by training a Word2Vec or GloVe algorithm, which will not be described here.
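For illustration, a minimal sketch of this embedding step in Python (PyTorch); the toy vocabulary and the randomly initialized embedding weights below are placeholder assumptions standing in for the Word2Vec/GloVe dictionary trained in advance, not the application's actual data:

```python
import torch
import torch.nn as nn

# Toy vocabulary; a real system would load the word vector dictionary
# trained in advance with Word2Vec or GloVe (placeholder assumption).
vocab = {"<pad>": 0, "张": 1, "三": 2, "毕": 3, "业": 4}

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)

sentence = ["张", "三", "毕", "业"]
ids = torch.tensor([[vocab[ch] for ch in sentence]])  # shape (1, m)
word_vectors = embedding(ids)                         # shape (1, m, 300)
```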
As described in step S2 above, the TextCNN model is an algorithm that classifies text using a convolutional neural network and can extract the information in the text well. It has a forward propagation layer and a backward propagation layer for learning the preceding and following context respectively, both connected to the output layer. In this embodiment, the TextCNN model passes the input word vectors through a plurality of convolution layers to obtain a word vector matrix. The convolution kernels of the TextCNN model are 1-dimensional, the kernel length can be set to 2-3, the number of convolution channels is set to 128 in this scheme, and the activation function of the convolution layers is ReLU. In this embodiment, taking a sentence of length m as an example, it is converted into an m×300 matrix after being processed by the embedding layer, and an m×128 word vector matrix is then output after the processing of the multi-layer convolution layers of the TextCNN model.
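A sketch of this convolution stage under the stated hyperparameters (1-D kernels, length 3 here, 128 channels, ReLU); the number of stacked layers and the padding that preserves the sequence length m are assumptions, since the text only says "multi-layer":

```python
import torch
import torch.nn as nn

class ConvStack(nn.Module):
    """Multi-layer 1-D convolution stage: maps an m x 300 embedding
    matrix to an m x 128 word vector matrix, as described above."""
    def __init__(self, embed_dim=300, channels=128, num_layers=2):
        super().__init__()
        layers, in_ch = [], embed_dim
        for _ in range(num_layers):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=1),
                       nn.ReLU()]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, m, 300)
        x = x.transpose(1, 2)                # Conv1d expects (batch, channels, m)
        return self.net(x).transpose(1, 2)   # (batch, m, 128)
```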
As described in step S3 above, after the word vector matrix is obtained, three vectors, namely a query vector Q, a key vector K and a value vector V, are obtained by operating on the word vector matrix through a full connection layer of the TextCNN model; all three are m×n matrices. The query vector Q, the key vector K and the value vector V are all obtained by operating on the same word vector matrix through the full connection layer; the only difference lies in the calculation parameters. The purpose of constructing the query vector Q and the key vector K is to calculate the influence weights among the characters in the same sentence: when a named entity is identified, the judgment needs to refer to the characters at other positions of the sentence, so the influence weights of the other characters must be considered. The query vector Q and the key vector K construct a similarity weight matrix for calculating the weights between the characters in the sentence so as to quantify the influence relationship.
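A sketch of the three projections, sharing one input and differing only in their trained parameters; the projection width n is an assumption, set to 128 here so that the later combination L1 = C + ATT×V stays dimension-consistent:

```python
import torch.nn as nn

d_model = 128  # width of the word vector matrix from the convolution stage
d_k = 128      # assumed n; kept equal to d_model for the later C + ATT x V sum

# Three fully connected projections over the same word vector matrix;
# only the trained calculation parameters differ, as noted above.
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

def qkv(word_matrix):  # word_matrix: (batch, m, d_model)
    return W_q(word_matrix), W_k(word_matrix), W_v(word_matrix)
```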
It should be understood that, unlike existing models, the TextCNN model in this embodiment introduces in its full connection layer the calculation parameters of the query vector Q, the key vector K and the value vector V, as well as the calculation parameters of the weight matrix and the Gaussian deviation matrix; the optimal values of all these calculation parameters are obtained through iterative training when the TextCNN model is trained.
As described in step S4 above, the attention weight matrix between every two characters in the sentence is obtained according to the query vector and the key vector in combination with the Gaussian deviation matrix. In this embodiment, a learnable Gaussian deviation matrix is introduced as a weight: the center position of a local range and a moving window are introduced to calculate the Gaussian deviation, and the Gaussian deviation is put into the softmax function to correct the locally enhanced weight distribution, thereby enhancing the capability of capturing local context.
The attention weight matrix between every two characters means that, for each character, every character in the whole sentence scores that character, and the score determines the importance of that character to the other characters in the sentence. Specifically, the attention weight matrix is obtained by multiplying the query vector by the key vector and normalizing the result. More specifically, after the query vector is multiplied by the transpose of the key vector, the result is divided by √d in order to control the distribution range of the result, so that an extreme value does not cause the gradient update amount to become too large; normalization is then performed, which makes the gradient more stable. Here d is the dimension of the key vector K.
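A sketch of this scoring step; note that in the present scheme the softmax is applied only after the Gaussian deviation G has been added (see below), so only the scaled weight matrix M is computed here:

```python
import math
import torch

def scaled_scores(Q, K):
    """Weight matrix M: Q multiplied by the transpose of K, divided by
    sqrt(d) so that extreme scores do not inflate the gradient update."""
    d = K.size(-1)
    return Q @ K.transpose(-2, -1) / math.sqrt(d)  # (batch, m, m)

# e.g.: ATT = torch.softmax(scaled_scores(Q, K) + G, dim=-1)
```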
As described in step S5 above, the word vector matrix and the adjusted attention weight matrix are first added and processed by the full connection layer to obtain a classification matrix; the classification matrix is then input into the softmax classification layer, and the probability of each character's BIOES label is output through the classification calculation of the softmax function. The label with the highest probability can then be output directly as the first named entity label corresponding to each character, or a CRF algorithm can be stacked on top for label output processing.
In this embodiment, a BIOES labeling scheme is adopted: B represents the beginning of an entity, I represents the inside of an entity, O represents a non-entity, E represents the end of an entity, and S represents a single-character entity. For example, a character may be labeled as the beginning of a person name, or as the inside of a place-name entity.
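An illustrative (hypothetical) BIOES tagging of a short resume-style sentence, assuming PER (person) and ORG (organization) entity types:

```python
# "张三毕业于北京大学" - "Zhang San graduated from Peking University"
chars = ["张", "三", "毕", "业", "于", "北", "京", "大", "学"]
tags  = ["B-PER", "E-PER", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "E-ORG"]
for ch, tag in zip(chars, tags):
    print(ch, tag)
```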
In an embodiment, the TextCNN model in this embodiment differs from existing models in that the calculation parameters of the query vector Q, the key vector K and the value vector V are introduced in the full connection layer, and the optimal calculation parameters of the query vector Q, the key vector K and the value vector V are obtained through iterative training when the TextCNN model is trained.
Therefore, the step S3 of calculating the word vector matrix based on the full connection layer of the TextCNN model to obtain the query vector, the key vector and the value vector includes:
calculating the word vector matrix based on query vector calculation parameters obtained in advance by training in the full connection layer of the TextCNN model, to obtain the query vector;
calculating the word vector matrix based on key vector calculation parameters obtained in advance by training in the full connection layer of the TextCNN model, to obtain the key vector. The query vector Q and the key vector K are constructed to calculate the influence weights between the characters in the same sentence;
and calculating the word vector matrix based on value vector calculation parameters obtained in advance by training in the full connection layer of the TextCNN model, to obtain the value vector. The value vector is constructed to adjust the attention weight matrix.
In an embodiment, after the step S5 of performing full connection layer processing through the TextCNN model based on the word vector matrix and the adjusted attention weight matrix and then inputting the result into the softmax classification layer fused with a Gaussian error for classification to obtain the first named entity label of each character in the sentence, the method includes:
Step S6, adding the named entity labels obtained by classification to each character in the sentence to generate first training samples;
Step S7, sampling the first training samples with replacement to obtain a plurality of groups of training sample sets, and training an initial TextCNN model based on each group of training sample sets to obtain a corresponding number of TextCNN submodels;
Step S8, inputting the same unlabeled resume text into all the TextCNN submodels to output a named entity labeling result predicted by each TextCNN submodel;
Step S9, judging whether the named entity labeling results predicted by all the TextCNN submodels are the same; if so, verifying that the training of the TextCNN submodels is completed, and verifying that the first named entity labels of the characters in the sentence are correct.
It can be understood that the training sets adopted by the TextCNN submodels are texts in the resume field, so after the iterative training the submodels are more targeted at this professional field. Meanwhile, a plurality of groups of TextCNN submodels are trained simultaneously, and only when all their results are the same can the training be verified as finally completed. Likewise, when all the results of the simultaneously trained TextCNN submodels are the same, it also indicates that the first named entity label of each character in the sentence is correct.
When the TextCNN submodels are subsequently used for named entity labeling, the same resume text can be input into a plurality of TextCNN submodels for prediction, and a named entity labeling result is taken as the labeling result of the resume text only when the results predicted by all the TextCNN submodels are the same.
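A minimal sketch of the sampling-with-replacement and agreement check of steps S6-S9; the submodel objects and their predict method are hypothetical placeholders for the trained TextCNN submodels:

```python
import random

def bootstrap_sets(samples, n_models):
    """Sampling with replacement: each training set has the same size
    as the original pool of first training samples."""
    return [[random.choice(samples) for _ in samples] for _ in range(n_models)]

def labels_agree(predictions):
    """Training (and the first labeling) is verified only when every
    submodel predicts the identical named entity labeling result."""
    return all(p == predictions[0] for p in predictions)

# usage sketch: one submodel is trained per bootstrap set, then
#   predictions = [m.predict(unlabeled_resume) for m in submodels]
#   verified = labels_agree(predictions)
```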
In one embodiment, the step S1 of obtaining the sentence in the resume text and constructing the word vector of the sentence includes:
Step S11, acquiring the resume text, where the resume text may be a Word electronic document or a picture.
It should be emphasized that, to further ensure the privacy and security of the resume text, the resume text may also be stored in a blockchain node.
Step S12, inputting the resume text into a preset text detection model to detect each text region in the resume text, wherein the text detection model is trained based on a natural scene text detection model. The text detection model is used to detect the regions where text appears in the resume text; it only locates the regions and does not recognize what the text in a region is.
Step S13, adding a marking frame outside each text region. After the marking frames are added, the corresponding text regions can be conveniently identified, which reduces the subsequent recognition workload.
Step S14, identifying each marking frame based on an image recognition technology, and performing character recognition on the character content in each marking frame through a character recognition model so as to recognize the character information in each marking frame, taking each piece of recognized character information as a sentence. That is, after each marking frame has been identified, the character recognition model can directly recognize the character content in the frame, and the content of each marking frame is taken as one sentence.
Step S15, constructing a word vector corresponding to each character in each sentence based on a preset word embedding model. The word embedding model is obtained by training a Word2Vec or GloVe algorithm and is used to convert the characters in each sentence into corresponding word vectors.
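A structural sketch of steps S11-S15; detect_text_regions and recognize_characters are hypothetical stubs standing in for the preset text detection model and the character recognition model, which the application does not specify further:

```python
def detect_text_regions(resume_image):
    """Stub for the preset text detection model (trained from a natural
    scene text detector); would return the marking-frame boxes."""
    raise NotImplementedError

def recognize_characters(resume_image, box):
    """Stub for the character recognition (OCR) model; would return the
    character information inside one marking frame."""
    raise NotImplementedError

def sentences_from_resume(resume_image):
    boxes = detect_text_regions(resume_image)  # steps S12-S13: marking frames
    # step S14: each recognized marking frame yields one sentence
    return [recognize_characters(resume_image, box) for box in boxes]
```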
In an embodiment, the step S4 of calculating the attention weight matrix between every two characters in the sentence according to the query vector and the key vector in combination with the Gaussian deviation matrix, and adjusting the attention weight matrix based on the value vector, includes:
Step S41, calculating a weight matrix according to the query vector and the key vector based on corresponding weight matrix calculation parameters, and calculating the Gaussian deviation matrix according to the query vector and the key vector based on corresponding Gaussian deviation matrix calculation parameters. The calculation parameters adopted for the weight matrix and for the Gaussian deviation matrix are different; it should be understood that these calculation parameters are obtained by iteratively training the TextCNN model.
Step S42, adding the weight matrix and the Gaussian deviation matrix, and normalizing to obtain the attention weight matrix;
Step S43, multiplying the attention weight matrix by the value vector so as to adjust the attention weight matrix.
Here the calculation parameters used for calculating M and G are different; it should be understood that these calculation parameters are obtained by iteratively training the TextCNN model.
The Gaussian deviation matrix G is used to adjust the weight matrix M: the center position of a local range and a moving window are introduced to calculate a Gaussian deviation, and the Gaussian deviation is put into the softmax function to correct the locally enhanced weight distribution, so that the capability of capturing local context is enhanced.
Further, an attention weight matrix ATT is calculated from the weight matrix M and the Gaussian deviation matrix G, where ATT(Q, K) = Softmax(M + G).
In order to adjust the attention weight matrix, the obtained attention weight matrix is multiplied by the value vector, that is, ATT×V. In this embodiment, the obtained attention weight matrix is used as the weight in the calculation of the value vector V. It can be understood that, during model training under the supervised learning task, the optimization algorithm automatically optimizes the parameters according to the results to obtain the optimal calculation parameters, so that the optimal matrices Q and K can be found during the model's specific prediction process and an accurate attention weight matrix can be obtained.
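A sketch of steps S41-S43 combined, assuming the same √d scaling for the weight matrix M as in the plain formulation described earlier:

```python
import torch

def gaussian_attention(Q, K, V, G):
    """ATT(Q, K) = Softmax(M + G), followed by the adjustment ATT x V.
    G is the Gaussian deviation matrix (its computation is sketched
    further below)."""
    d = K.size(-1)
    M = Q @ K.transpose(-2, -1) / d ** 0.5  # weight matrix M, (batch, m, m)
    ATT = torch.softmax(M + G, dim=-1)      # Gaussian-corrected weights
    return ATT @ V                          # adjusted output, (batch, m, d)
```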
In an embodiment, the calculation process of performing full connection layer processing through the TextCNN model based on the word vector matrix and the adjusted attention weight matrix and then inputting the result into the softmax classification layer fused with a Gaussian error for classification, to obtain the first named entity label of each character in the sentence, includes:
combining the word vector matrix with the adjusted attention weight matrix to obtain L1 = C + ATT×V, where C is the word vector matrix; obtaining L2 = FC(L1) through the full connection layer processing; and finally classifying through the softmax classification layer to obtain the BIOES labeling probability of each character in the sentence, namely L3 = Softmax(L2). The label with the highest probability is usually taken as the labeling result corresponding to the character.
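A sketch of this classification head; the tag-set size and the 128-wide FC input are assumptions carried over from the earlier sketches:

```python
import torch
import torch.nn as nn

num_tags = 13  # assumed BIOES tag-set size, e.g. 3 entity types x B/I/E/S + O
fc = nn.Linear(128, num_tags)  # the FC in L2 = FC(L1)

def classify(C, att_v):
    """L1 = C + ATT x V; L2 = FC(L1); L3 = Softmax(L2); the index of the
    highest probability is each character's first named entity label."""
    L1 = C + att_v                   # (batch, m, 128)
    L2 = fc(L1)                      # (batch, m, num_tags)
    L3 = torch.softmax(L2, dim=-1)   # BIOES probability per character
    return L3.argmax(dim=-1)         # (batch, m) label indices
```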
In this embodiment, a Gaussian deviation matrix G is added to the softmax activation function in the softmax classification layer. The Gaussian deviation matrix G is an L×L matrix, where L is the character length of the sentence; Gij measures the closeness between the character xj and the predicted center position Pi, and the standard deviation σi of the Gaussian deviation is set to half the window size Di (that is, Di is twice the standard deviation).
The attention weight matrix between every two characters is as follows:
ATT(Q,K)=Softmax(M+G)
Gij is calculated as:
Gij = -(j - Pi)^2 / (2σi^2), where σi = Di/2.
Pi and Di are calculated as follows: in order to keep Pi and Di between 0 and L, a scaling factor L is applied, namely Pi = L·sigmoid(pi) and Di = L·sigmoid(zi). Because each center position depends on the corresponding query vector, a feed-forward mechanism is applied to convert the query vector into a hidden state, which is then mapped to a scalar with a linear mapping, namely pi = Up^T·tanh(Wp·Qi); zi is obtained in the same way with its own parameters.
Here Up^T and Wp are trainable linear mapping parameters, and Qi is the query vector.
In this embodiment, a learnable Gaussian error weight is introduced: the center position of a local range and a moving window are introduced to calculate the Gaussian error, and the Gaussian error is put into the softmax function to correct the locally enhanced weight distribution, so that short-range neighbor relations are learned while long-distance dependencies are still captured, enhancing the capability of capturing local context.
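A sketch of the Gaussian deviation computation reconstructed above; the text only names Wp, so the separate W_d/U_d branch for the window scalar zi is an assumption made by analogy with the center-position branch:

```python
import torch
import torch.nn as nn

d_k = 128
W_p, U_p = nn.Linear(d_k, d_k), nn.Linear(d_k, 1)  # center position branch
W_d, U_d = nn.Linear(d_k, d_k), nn.Linear(d_k, 1)  # window size branch (assumed)

def gaussian_bias(Q):
    """Gij = -(j - Pi)^2 / (2 * sigma_i^2) with sigma_i = Di / 2,
    Pi = L * sigmoid(pi), Di = L * sigmoid(zi), pi = Up^T tanh(Wp Qi)."""
    L = Q.size(1)                               # character length of the sentence
    p = U_p(torch.tanh(W_p(Q))).squeeze(-1)     # (batch, L) center scalars pi
    z = U_d(torch.tanh(W_d(Q))).squeeze(-1)     # (batch, L) window scalars zi
    P = L * torch.sigmoid(p)                    # predicted centers Pi in [0, L]
    D = L * torch.sigmoid(z)                    # predicted window sizes Di
    sigma = D / 2                               # standard deviation
    j = torch.arange(L, dtype=Q.dtype, device=Q.device)
    diff = j.view(1, 1, L) - P.unsqueeze(-1)    # (batch, L, L): j - Pi
    return -diff.pow(2) / (2 * sigma.unsqueeze(-1).pow(2))
```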
Referring to fig. 2, in an embodiment of the present application, there is further provided a named entity labeling device, including:
an obtaining unit 10, configured to obtain a sentence in a resume text, and construct a word vector of the sentence;
the first calculation unit 20 is configured to perform a multi-layer convolution operation on the word vector through a multi-layer convolution layer of the TextCNN model obtained by training in advance, so as to obtain a word vector matrix;
the second calculation unit 30 is configured to calculate the word vector matrix based on the full connection layer of the TextCNN model to obtain a query vector, a key vector and a value vector;
the third calculation unit 40 is configured to calculate an attention weight matrix between every two characters in the sentence according to the query vector and the key vector and in combination with a Gaussian deviation matrix, and to adjust the attention weight matrix based on the value vector;
and the classifying unit 50 is configured to perform full connection layer processing through the TextCNN model based on the word vector matrix and the adjusted attention weight matrix, and then input the result into a softmax classification layer fused with a Gaussian error for classification, so as to obtain a first named entity label for each character in the sentence.
In an embodiment, the second computing unit 30 includes:
The first calculating subunit is used for calculating the word vector matrix based on query vector calculation parameters obtained by training in advance in the full-connection layer of the TextCNN model to obtain the query vector;
the second calculation subunit is used for calculating the word vector matrix based on key vector calculation parameters which are obtained by training in advance in the full-connection layer of the TextCNN model to obtain the key vector;
and the third calculation subunit is used for calculating the word vector matrix based on the value vector calculation parameters obtained by pre-training in the full-connection layer of the TextCNN model to obtain the value vector.
In an embodiment, the named entity labeling device further includes:
the generation unit is used for adding the named entity labels obtained by classification to each character in the sentence to generate first training samples;
the training unit is used for sampling the first training samples with replacement to obtain a plurality of groups of training sample sets, and training an initial TextCNN model based on each group of training sample sets to obtain a corresponding number of TextCNN submodels;
the output unit is used for inputting the same unlabeled resume text into all the TextCNN submodels so as to output a named entity labeling result predicted by each TextCNN submodel;
and the verification unit is used for judging whether the named entity labeling results predicted by all the TextCNN submodels are the same, and if so, verifying that the training of the TextCNN submodels is completed and that the first named entity labels of the characters in the sentence are correct.
In an embodiment, the acquiring unit 10 includes:
the acquisition subunit is used for acquiring the resume text;
the detection subunit is used for inputting the resume text into a preset text detection model to detect each text area in the resume text, wherein the text detection model is trained based on a natural scene text detection model;
the adding subunit is used for adding a marking frame outside each text area;
the identification subunit is used for identifying each marking frame based on an image recognition technology, performing character recognition on the character content in each marking frame through a character recognition model so as to recognize the character information in each marking frame, and taking each piece of recognized character information as a sentence;
and the construction subunit is used for constructing a word vector corresponding to each character in each sentence based on a preset word embedding model.
In an embodiment, the third computing unit 40 includes:
the fourth calculation subunit is used for calculating a weight matrix according to the query vector and the key vector based on corresponding weight matrix calculation parameters;
the fifth calculation subunit is used for calculating the Gaussian deviation matrix according to the query vector and the key vector based on corresponding Gaussian deviation matrix calculation parameters;
the adding subunit is used for adding the weight matrix and the Gaussian deviation matrix and normalizing the sum to obtain the attention weight matrix;
and the adjustment subunit is used for multiplying the attention weight matrix by the value vector so as to adjust the attention weight matrix.
In this embodiment, the specific implementation of the above units/sub-units refers to the corresponding parts in the above method embodiments, and will not be described herein again.
Referring to fig. 3, in an embodiment of the present application, there is further provided a computer device, which may be a server, and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store text data, training data and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the named entity labeling method.
It will be appreciated by those skilled in the art that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture in connection with the present inventive arrangements and is not intended to limit the computer devices to which the present inventive arrangements are applicable.
An embodiment of the present application further provides a computer readable storage medium having a computer program stored thereon, where the computer program when executed by a processor implements a named entity labeling method. It is understood that the computer readable storage medium in this embodiment may be a volatile readable storage medium or a nonvolatile readable storage medium.
In summary, the named entity labeling method, device, computer equipment and storage medium provided by the embodiments of the application comprise: obtaining sentences in a resume text and constructing word vectors of the sentences; performing a multi-layer convolution operation on the word vectors through the multi-layer convolution layers of a pre-trained TextCNN model to obtain a word vector matrix; calculating the word vector matrix based on the full connection layer of the TextCNN model to obtain a query vector, a key vector and a value vector; calculating an attention weight matrix between every two characters in the sentence according to the query vector and the key vector in combination with a Gaussian deviation matrix, and adjusting the attention weight matrix based on the value vector; and performing full connection layer processing through the TextCNN model based on the word vector matrix and the adjusted attention weight matrix, and then inputting the result into a softmax classification layer fused with a Gaussian error for classification, to obtain a first named entity label for each character in the sentence. The application introduces a learnable Gaussian deviation matrix as a weight, introduces the center position of a local range and a moving window to calculate the Gaussian deviation, and puts the Gaussian deviation into the softmax function to correct the locally enhanced weight distribution, thereby enhancing the capability of capturing local context.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided by the present application and used in the embodiments may include non-volatile and/or volatile memory. The non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in the process, apparatus, article or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application or direct or indirect application in other related technical fields are included in the scope of the present application.