Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings of the specific embodiments of the present invention. Like reference symbols in the various drawings indicate like elements. It should be noted that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, fall within the scope of protection of the invention.
The invention relates to a method for predicting opinion expressions in text. The opinion target extraction task is interpreted as locating the opinion target in the text, where the target consists of one contiguous segment of the text; the task is thus modeled as a boundary prediction task in which two position indexes in the text are predicted to indicate the start position and the end position of the answer. For Chinese short texts, this Chinese opinion target boundary prediction method based on a dual-model structure avoids tedious sequence labeling operations and effectively improves the accuracy of Chinese opinion target extraction. As shown in fig. 1 to 3, the method includes:
Step 1, acquiring a Chinese opinion text data set, dividing the data set into a training set, a validation set and a test set, and then performing data preprocessing on the training set to obtain training sample data, wherein the training sample data comprises word vectors corresponding to the opinion text word sequences and sequences representing the true target boundary probability distributions of the opinion texts, the word vectors representing the words and word positions in the opinion text word sequences, and the data preprocessing comprises: converting each character in the training data set into a dictionary index through the dictionary vocab.txt shipped with the BERT-wwm-ext model, constructing a new dictionary token_dict, and representing characters that are not in token_dict with the [unused1] and [UNK] tokens, wherein the [unused1] token marks whitespace-like untrained characters; shuffling the order of the training data set according to random numbers, and slicing the opinion texts of the training data according to the maxlen parameter value to obtain an opinion target label sequence and an opinion text sequence; tokenizing the opinion text sequence and the opinion target label sequence to obtain an opinion text word sequence and an opinion target label word sequence, and adding [CLS] and [SEP] tokens to the head and tail of the opinion text word sequence respectively; acquiring the position of the first word of the opinion target label word sequence within the opinion text word sequence, and recording the index value of this position as the start value; obtaining the end position of the opinion target label word sequence from the position of its first word within the opinion text word sequence and the length of the opinion target label word sequence; setting the value at the start position of the opinion target label word sequence within the opinion text word sequence to 1 and filling the remaining positions with 0 to obtain a sequence s1 of the same length as the opinion text word sequence, and setting the value at the end position of the opinion target label word sequence within the opinion text word sequence to 1 and filling the remaining positions with 0 to obtain a sequence s2 of the same length as the opinion text word sequence; splitting the opinion text into characters according to token_dict to obtain the opinion text word vector x1 and the corresponding sequence x2 produced by the padding technique; performing the above operations on every 32 records of the training data set as one batch, and collecting the x1, x2, s1 and s2 of the 32 records in turn into vectors X1, X2, S1 and S2; and performing a padding operation on the batch data according to the maximum length of a single record in the batch to obtain new X1n, X2n, S1n and S2n as the training sample data.
In this embodiment, the Chinese opinion target boundary prediction method based on the dual-model structure provided by the present invention may be executed by a server or an intelligent terminal device, and the specific process of the data preprocessing may be implemented by the following steps: first, a data set is obtained, consisting of network reviews from three internet companies, Baidu (baidu), Dianping (dianping) and Mafengwo (mafengwo).
Second, the data set is partitioned. One data instance can be represented as: ('Belong Lake is the premier scenic spot of the park; the lake water is clear, a cool breeze can be felt when strolling along the lake, and the lush green trees make it a good place', 'Belong Lake'); this instance is an opinion expression about the target 'Belong Lake'.
Third, the whole-word-masking Chinese BERT pre-trained model (BERT-wwm-ext), jointly released by Harbin Institute of Technology and iFLYTEK, is called.
Fourth, the text is preprocessed. Every 32 records are processed as one batch, specifically comprising the following steps:
A1. Load the contents of the vocab.txt file and assign them to token_dict.
A2. Perform a shuffle operation, disordering the original data order according to random numbers to prevent the training data order from causing overfitting.
A3. The training data has two columns: the first column is the comment text d[0], and the second column is the opinion target label d[1]. The first column is sliced (truncated) according to the maxlen parameter value.
For example, for the sample ('Water crab porridge is a Macau snack.', 'Water crab porridge'):
d[0] = 'Water crab porridge is a Macau snack.'
d[1] = 'Water crab porridge'
A4. Construct a tokenizer; for characters that are not in token_dict, whitespace-like untrained characters are marked with [unused1], and the remaining such characters are represented with [UNK].
A5. Tokenize d[0] and d[1] to obtain the lists text_tokens and tag_tokens. After tokenization of d[0], [CLS] and [SEP] tokens are added at the first and last positions of the sentence respectively (a code sketch follows the example below).
For the above sample (whose original Chinese text is '水蟹粥是澳门小吃。' with target '水蟹粥'):
text_tokens = ['[CLS]', '水', '蟹', '粥', '是', '澳', '门', '小', '吃', '。', '[SEP]']
tag_tokens = ['水', '蟹', '粥']
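Step A5 can be illustrated with a minimal character-level sketch in Python (assuming a plain per-character split; the actual implementation uses the tokenizer constructed in step A4):
d0 = '水蟹粥是澳门小吃。'   # comment text d[0] of the above sample (original Chinese)
d1 = '水蟹粥'               # opinion target label d[1]
text_tokens = ['[CLS]'] + list(d0) + ['[SEP]']   # one token per character, with [CLS]/[SEP] added
tag_tokens = list(d1)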
A6. Use the zeros function of Python's numpy module to create two arrays s1 and s2 of the given length, filled with 0, where the length is the length of text_tokens and the dtype defaults to numpy's float type.
For the above sample:
s1=[0,0,0,0,0,0,0,0,0,0,0]
s2=[0,0,0,0,0,0,0,0,0,0,0]
A7. Write a list_find function that returns the position where tag_tokens[0] occurs in text_tokens, and assign that index value to start, i.e., the position where the first word of the opinion target appears in the comment text.
A8. Obtain the end position from the start position; the len function returns the length of tag_tokens:
end=start+len(tag_tokens)-1
For the above sample:
len(tag_tokens)=3
start=1
end=3
A9. Set the value at index start in array s1 to 1 to obtain a new array s1, and set the value at index end in array s2 to 1 to obtain a new array s2 (steps A6 to A9 are sketched in code after the following example).
For the above sample:
s1=[0,1,0,0,0,0,0,0,0,0,0]
s2=[0,0,0,1,0,0,0,0,0,0,0]
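Steps A6 to A9 can be sketched as follows; note that this illustrative list_find matches the whole tag sequence rather than only its first character, which is an assumption made here for robustness and not necessarily the exact implementation:
import numpy as np

def list_find(seq, sub):
    # return the index where sub first occurs as a contiguous sub-list of seq, or -1 if absent
    n = len(sub)
    for i in range(len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1

s1 = np.zeros(len(text_tokens))   # A6: zero array as long as text_tokens
s2 = np.zeros(len(text_tokens))
start = list_find(text_tokens, tag_tokens)   # A7: start position of the opinion target
end = start + len(tag_tokens) - 1            # A8: end position
s1[start] = 1                                 # A9: mark the true start boundary
s2[end] = 1                                   # A9: mark the true end boundary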
A10. With reference to token_dict, which stores the mapping between words and ids, use the encode function of the Tokenizer module to split d[0] into words and generate the corresponding id array x1 and the array x2 produced by the padding technique (a code sketch follows the example below).
For the above sample:
x1=[101,3717,6101,5114,3221,4078,7305,2207,1391,511,102]
x2=[0,0,0,0,0,0,0,0,0,0,0]
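Assuming the Tokenizer class of the keras_bert package (the same package that provides the load_trained_model_from_checkpoint function used in step B1), step A10 could be sketched as follows; for a single sentence the returned segment array x2 contains only zeros:
from keras_bert import Tokenizer

tokenizer = Tokenizer(token_dict)   # token_dict maps characters to ids (step A1)
x1, x2 = tokenizer.encode(d0)       # x1: token id array, x2: segment array (all zeros here)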
A11. Every 32 records form a batch. The x1, x2, s1 and s2 values of the 32 processed records are stored in turn into the lists X1, X2, S1 and S2.
For the batch containing the above sample:
X1 = [ [x1 of data 1], [x1 of data 2], [101, 3717, 6101, 5114, 3221, 4078, 7305, 2207, 1391, 511, 102], …, [x1 of data 32] ]
X2 = [ [x2 of data 1], [x2 of data 2], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], …, [x2 of data 32] ]
S1 = [ [s1 of data 1], [s1 of data 2], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], …, [s1 of data 32] ]
S2 = [ [s2 of data 1], [s2 of data 2], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], …, [s2 of data 32] ]
A12. Perform the padding operation on the batch data according to the maximum length of a single record in the batch, i.e., records whose length is smaller than the maximum length in the batch are filled with 0, obtaining new X1, X2, S1 and S2 (namely X1n, X2n, S1n and S2n); a code sketch follows the example below.
For the batch containing the above sample, if the length of the longest record is 15:
X1 = [ [x1 of data 1 after padding], [x1 of data 2 after padding], [101, 3717, 6101, 5114, 3221, 4078, 7305, 2207, 1391, 511, 102, 0, 0, 0, 0], …, [x1 of data 32 after padding] ]
X2 = [ [x2 of data 1 after padding], [x2 of data 2 after padding], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], …, [x2 of data 32 after padding] ]
S1 = [ [s1 of data 1 after padding], [s1 of data 2 after padding], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], …, [s1 of data 32 after padding] ]
S2 = [ [s2 of data 1 after padding], [s2 of data 2 after padding], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], …, [s2 of data 32 after padding] ], and the new X1, X2, S1, S2 are used as the sample data.
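Step A12 can be sketched with a simple per-batch padding helper (the function name seq_padding is illustrative and not necessarily the name used in the actual implementation):
import numpy as np

def seq_padding(batch, padding=0):
    # pad every sequence in the batch with zeros up to the length of the longest sequence
    max_len = max(len(seq) for seq in batch)
    return np.array([list(seq) + [padding] * (max_len - len(seq)) for seq in batch])

X1n, X2n = seq_padding(X1), seq_padding(X2)
S1n, S2n = seq_padding(S1), seq_padding(S2)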
Step 2, constructing a BERT_BiGRU-based Chinese opinion target extraction model using the Keras framework, wherein the Chinese opinion target extraction model comprises model one and model two; as shown in fig. 2, in this embodiment, the specific implementation of the model comprises: B1. Call the BERT-wwm-ext model, specifically:
bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)
B2. The BERT_BiGRU-based Chinese opinion target boundary prediction model is described in detail below in conjunction with fig. 2.
B3. Add 4 Input layers for receiving the sample data X1, X2, S1 and S2 (the Keras layer and model classes used below, such as Input, Lambda, Dense, Bidirectional, GRU and Model, are assumed to be imported from keras.layers and keras.models), specifically:
x1_in = Input(shape=(None,))  # receives data X1 (variable-length sequence)
x2_in = Input(shape=(None,))  # receives data X2
s1_in = Input(shape=(None,))  # receives data S1
s2_in = Input(shape=(None,))  # receives data S2
X1, X2, S1, S2 = x1_in, x2_in, s1_in, s2_in
B4. Add a Lambda1 layer that takes X1 as input, assigns its output to x_mask, and performs the mask operation, specifically:
import keras.backend as K
That is, the backend package under keras is imported and renamed K.
x_mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(X1)
Data X1 is processed using Python's built-in lambda function in combination with the cast, greater and expand_dims functions of the K package.
B5. Take [X1, X2] as input, process the data with the BERT-wwm-ext model, and assign the output to x, specifically:
x = bert_model([X1, X2])
B6. Take x as input, process the data with the BiGRU model, and assign the output to x, specifically:
x = Bidirectional(GRU(char_size // 2, return_sequences=True))(x)
B7. Add a Lambda2 layer with [x, x_mask] as input and assign the output to x, specifically:
x = Lambda(lambda x: x[0] * x[1])([x, x_mask])
B8. Add a Dense1 layer with x as input and assign the output to x, specifically:
x = Dense(char_size, use_bias=False, activation='tanh')(x)
B9. Add a Dense2 layer with x as input and assign the output to ps1, specifically:
ps1 = Dense(1, use_bias=False)(x)
B10. Add a Lambda3 layer with [ps1, x_mask] as input and assign the output to ps1, specifically:
ps1 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10)([ps1, x_mask])
B11. Add a Dense3 layer with x as input and assign the output to ps2, specifically:
ps2 = Dense(1, use_bias=False)(x)
B12. Add a Lambda4 layer with [ps2, x_mask] as input and assign the output to ps2, specifically:
ps2 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10)([ps2, x_mask])
Since the input of the final model is only text-related information and no label information is available, model one is declared: model, used for predicting the positions of the first and last words of the opinion target, with inputs [x1_in, x2_in] and outputs [ps1, ps2], specifically:
model=Model([x1_in,x2_in],[ps1,ps2])
Because learning from the label information is needed during training in order to reduce the loss value, model two is declared: train_model; model one is trained while model two is trained. Its inputs are [x1_in, x2_in, s1_in, s2_in] and its outputs are [ps1, ps2], specifically:
train_model=Model([x1_in,x2_in,s1_in,s2_in],[ps1,ps2])
Finally, the loss function of train_model is defined; a cross-entropy loss function is used to evaluate the difference between the boundary probability distribution obtained by the current training and the true target boundary distribution. The difference is calculated using the following equation:
Loss = Loss1 + Loss2,
where Loss1 is the cross-entropy between s1_in and ps1 and Loss2 is the cross-entropy between s2_in and ps2; s1_in and s2_in indicate the true opinion target boundaries, ps1 represents the probability of each word of the text being the start of the opinion target, and ps2 represents the probability of each word of the text being the end of the opinion target.
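As an illustrative sketch only (not necessarily the exact code of this embodiment), the loss described above can be attached to train_model using Keras backend operations, treating ps1 and ps2 as logits over token positions:
loss1 = K.mean(K.categorical_crossentropy(s1_in, ps1, from_logits=True))   # Loss1: start boundary
loss2 = K.mean(K.categorical_crossentropy(s2_in, ps2, from_logits=True))   # Loss2: end boundary
train_model.add_loss(loss1 + loss2)    # Loss = Loss1 + Loss2
train_model.compile(optimizer='adam')  # the optimizer choice here is an assumption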
Step 3, inputting the training sample data into model two to be trained, and predicting the relationship between the boundary of the opinion target label and the opinion text according to the output result of model two. As shown in fig. 2 and the model implementation process above, model two comprises a plurality of cascaded data processing layers, including 4 Input layers, 4 Lambda layers, 1 BERT-wwm-ext layer, 1 BiGRU layer and 3 Dense layers; among these cascaded data processing layers, the result of each data processing unit in the previous layer is input into each data processing unit in the next layer.
Step 4, obtaining validation sample information corresponding to the training sample data from the validation set, training model two based on the output result and the validation sample information to obtain a trained model two, and optimizing model one through the training of model two to obtain an optimal model one for predicting the opinion target label boundary. The training process of model two is as follows: acquiring the training sample data, inputting it into model two, and predicting the relationship between the opinion target boundary and the text, wherein this relationship indicates the probability distribution of the opinion target boundary over the opinion text word sequence; and optimizing model two according to the predicted probability distribution and the true target boundary probability distribution of the opinion text to obtain the trained model two. As shown in fig. 2, model one is optimized while model two is trained, yielding the optimal model one for predicting the opinion target label boundary; the input of the optimal model one is only the test data word vector, and the output is the probability ps1 that each word in the opinion text is the start of the opinion target item and the probability ps2 that each word in the opinion text is the end of the opinion target item, from which the predicted boundary score vectors _ps1 and _ps2 are obtained. The predicted boundary score vectors _ps1 and _ps2 are input into the decoding layer, and the final opinion target is obtained using a decoding algorithm.
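A rough sketch of the training and prediction flow of Step 4 (the epoch count, the direct call to fit, and the names of the test arrays are assumptions for illustration; the actual implementation may feed batches through a generator):
# Train model two on the preprocessed batches; model one shares its layers and is optimized at the same time.
train_model.fit([X1n, X2n, S1n, S2n], epochs=10, batch_size=32)
# Model one needs only the text inputs and outputs the boundary score vectors.
_ps1, _ps2 = model.predict([X1n_test, X2n_test])   # X1n_test, X2n_test: preprocessed test data (hypothetical names)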
Step 5, obtaining a test sample word sequence by tokenizing the test set data, obtaining the word vector corresponding to the test sample word sequence, inputting this word vector into the optimal model one to obtain the predicted boundary score vectors, converting the predicted boundary score vectors into the final opinion target using the decoding algorithm, and outputting it, wherein the test data in the test set only comprises opinion texts. The decoding algorithm comprises: normalizing the boundary score vectors _ps1 and _ps2 with two softmax functions respectively, acquiring the index of the word with the maximum probability, and obtaining the opinion target entity segment by a slicing operation according to the index values.
In this embodiment, since the OTE task needs to output a specific target entity segment, and after the processing of model one only two sets of start and end score vectors _ps1 and _ps2 are obtained, a decoding algorithm is needed to convert the score vectors into the final target entity output. Specifically: the start score vector _ps1 and the end score vector _ps2 are each normalized with a softmax function (softmax(z)_i = exp(z_i) / Σ_j exp(z_j)), the index of the word with the maximum probability is returned with the argmax function of the numpy module, and the target entity segment is computed with a slicing operation, taking _ps1 as the example.
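A minimal sketch of the decoding algorithm, normalizing the two score vectors with softmax, taking the argmax indexes, and slicing the token sequence (here the end index is not forced to follow the start index, which is an assumption of this sketch):
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decode(text_tokens, _ps1, _ps2):
    start = int(np.argmax(softmax(_ps1)))   # most probable start position
    end = int(np.argmax(softmax(_ps2)))     # most probable end position
    return ''.join(text_tokens[start:end + 1])   # slice out the opinion target entity segment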
Model two is trained on the training set and the validation set; by comparing the predicted results with the true results, it determines the learning direction so as to reduce the error, thereby optimizing model two and, at the same time, model one. The Chinese network-review opinion target boundary prediction model based on the dual-model structure constructed by the invention positions the opinion target extraction task as a target segment boundary prediction task, avoids tedious part-of-speech tagging operations, and, as a multi-input multi-output model, is highly efficient and can compile and train the two models simultaneously.
The specific experimental process is as follows:
The experiment used the same data set as document [1] by Li et al., with 3 groups totaling more than 100,000 records [2] from the three internet companies Baidu (baidu), Dianping (dianping) and Mafengwo (mafengwo); the specific data set settings are shown in Table 1.
Document [1]: Yanzeng Li, Tingwen Liu, Diying Li, et al. Character-based BiLSTM-CRF Incorporating POS and Dictionaries for Chinese Opinion Target Extraction. Asian Conference on Machine Learning, ACML 2018: 518–.
[2] https://github.com/kdsec/chinese-opinion-target-extraction
TABLE 1 Experimental data set
| Data set | Training set | Validation set | Test set | Total |
| baidu | 7500 | 1033 | 3658 | 12191 |
| dianping | 24000 | 1258 | 10825 | 36083 |
| mafengwo | 40000 | 1253 | 17681 | 58934 |
The evaluation metrics used in the experiment are Accuracy, Precision, Recall and F1; the higher their values, the better the extraction capability of the model. TP is defined as the number of entities identified by the model completely correctly; FP as the number of results identified by the model that contain the correct entity but whose boundary is determined incorrectly; and FN as the number of incorrectly identified results. The evaluation metrics are given by the following formulas:
Accuracy=TP/(TP+FP+FN),
Precision=TP/(TP+FP),
Recall=TP/(TP+FN),
F1=2*(Precision*Recall)/(Precision+Recall),
By observing the extraction results, no case of an empty extraction occurred during the experiments of the model. When calculating FP, extraction results that differ from the original sample are considered, with a fault tolerance of fewer than 10 characters; to avoid differences in understanding the metric calculation, the specific scoring algorithm is given as follows:
for i in range(len(test_data)):
    # predict_label and true_label denote the predicted and true opinion targets of the i-th sample
    if predict_label == true_label:
        TP += 1
    if predict_label != true_label and (true_label in predict_label and (len(predict_label) - len(true_label) < 10)):
        FP += 1
FN = len(test_data) - (FP + TP)
The work in document [1] set up multiple groups of detailed comparison experiments on the same data set, including the most popular extraction framework, the BiLSTM_CRF model, and demonstrated that its method was optimal; therefore, the present application compares against it directly.
BILSTM _ CRF: modeling as a sequence tagging task. Firstly, generating character position information characteristics ([ CP-POS ] @ C) and constructing dictionary characteristics (DictFeature), and finally integrating [ CP-POS ] @ C and DictFeature into a BILSTM _ CRF model embedded based on Word2vec characters.
BERT (bimodal structure): modeling as a boundary prediction task. The method is different from the method in that the neural network model is BERT-wwm-ext model and common Dense layer.
The proposed method adopts BERT_BiGRU (dual-model structure), modeled as a boundary prediction task; its neural network consists of the BERT-wwm-ext model and a BiGRU layer. To increase the reliability of the test results, the running environments of all models were kept as consistent as possible during the experiments. The results on the test sets are shown in Table 2.
TABLE 2 model comparison results
Table 2 shows the comparison results of the 3 groups of models on the test sets. As can be seen from the comparison between group 1 and group 2, relative to the BiLSTM_CRF model of the sequence labeling task, the BERT network with the dual-model structure removes the preprocessing work of generating character position information features ([CP-POS]@C) and constructing dictionary features (DictFeature) as well as the part-of-speech tagging operation required by the sequence labeling task, thereby largely avoiding complicated semi-automatic feature engineering. Because the BERT network with the dual-model structure is a multi-input multi-output model, the two models are compiled and trained synchronously: model one is used for predicting the opinion target, and model two is used for learning the relationship between the opinion target and the text. Model one is optimized while model two is trained, and finally the boundary score vectors predicted by model one are converted into the final answer by the decoding algorithm and output. As can be seen from the comparison of the comprehensive evaluation metrics Accuracy and F1, the BERT network with the dual-model structure obtains 91.36% Accuracy and 95.49% F1 on the baidu data set, 92.06% Accuracy and 95.87% F1 on the dianping data set, and 89.45% Accuracy and 94.43% F1 on the mafengwo data set; the experimental results are superior to those of the BiLSTM_CRF model, showing that the accuracy of the OTE task can be effectively improved without relying on sequence labels.
From the comparison between group 3 and group 2, for the three data sets, the proposed method obtains an Accuracy of 91.53% and an F1 of 95.58% on the baidu data set, an Accuracy of 91.99% and an F1 of 95.83% on the dianping data set, and an Accuracy of 89.76% and an F1 of 94.61% on the mafengwo data set; the experimental results are superior to the control group on 2 of the data sets, indicating that, to some extent, adding the BiGRU network to learn the contextual semantic features of the text helps improve the accuracy of the model's text boundary prediction.
In order to further quantify the model comparison results, the prediction statistics on the test sets are given. The predicted value statistics are shown in Table 3.
TABLE 3 predicted value statistics
Right is the total number of samples the model extracts completely correctly, and Wrong is the total number of samples the model extracts incorrectly. The above results illustrate the feasibility and effectiveness of the proposed dual-model-structure-based target boundary prediction method.
It should be noted that the described embodiments of the invention are only preferred ways of implementing the invention, and that all obvious modifications, which belong to the overall concept of the invention, should fall within the scope of protection of the invention.