Disclosure of Invention
Aiming at the defects existing in the prior art, the invention aims to provide a non-translated element identification method based on editing behaviors and rules, which solves the following technical problems: rules alone easily cause misjudgment; a rule base must be maintained continuously in an attempt to exhaust all non-translated elements; and the regular expression code of the rule base becomes more and more complicated, so that maintenance cost increases.
The aim of the invention can be achieved by the following technical scheme:
a method for identifying non-translated elements based on editing behaviors and rules comprises the following steps:
step one: acquiring an original text, and segmenting the original text;
step two: obtaining vector characterization of an original word, generating a word vector and a position vector, and adding and regularizing the word vector and the position vector;
step three: obtaining vector representation of an original text, constructing an encoder and a pooling layer, inputting a vector representation set of original words, and outputting vector representation representing the whole original text;
step four: obtaining the language direction vector representation, splicing the source language and the target language into a language direction feature with the underscore character, and mapping the language direction feature into a 256-dimensional embedded vector;
step five: obtaining a second-order feature interaction vector, and performing outer product operation and column summation operation on the language direction vector representation and the vector representation of the original text to generate the second-order feature interaction vector;
step six: splicing language direction vector representation, original text vector representation and second-order feature interaction vector, and generating a new spliced feature vector through MLP;
step seven: predicting the post-translation editing time, inputting the spliced feature vector, and outputting a scalar as the post-translation editing time;
step eight: predicting the post-translation editing distance, inputting the spliced feature vector, and outputting a scalar as the post-translation editing distance;
step nine: applying the non-translated element thresholds, the criteria for a non-translated element being: the post-translation editing time is less than 10 seconds; the post-translation editing distance is less than 4; and the original text is composed entirely of English letters, numbers, punctuation marks and spaces;
step ten: using the post-translation editing data, designing an MSE loss function and training the model weights with a stochastic gradient descent method and an error back propagation method.
Further, the original text is acquired and segmented using spaces and punctuation marks; whether each word or character exists in a multilingual dictionary is judged; when the word or character exists in the multilingual dictionary, it is recorded as token_src; when the word or character does not exist in the multilingual dictionary, a longest-string greedy matching algorithm is used for segmentation, namely, characters are cut off one by one from the last character of the word toward the first character, splitting the word into two sub-words recorded respectively as a front string and a rear string, until the front string exists in the multilingual dictionary; the front string is then regarded as a qualified sub-word of the word and recorded as token_src; the cutting operation is repeated on the rear string until the whole original text is composed of qualified sub-word tokens token_src.
Further, the vector representation of each original word is composed of a word vector and a position vector: each word or character of the original text is embedded and mapped into a 256-dimensional word vector, the position of the word or character is mapped into a 256-dimensional position vector, and the word vector and the position vector are added and regularized.
Further, an encoder of 6 layers of 8-head self-attention is constructed; the set of vector representations of the words or characters of the original text is input, and a vector representation of the entire original text is output.
Further, the language direction feature is composed of the source language and the target language: the source language code and the target language code are obtained and spliced with the underscore character to form the language direction feature, and the language direction feature is mapped into a 256-dimensional embedded vector.
In the fifth step, after obtaining the vector representation of the original text and the language direction vector representation, performing an outer product operation on the two features to generate a second-order feature interaction matrix, performing a column-wise summation operation on the feature interaction matrix, and finally generating an explicit second-order feature interaction vector.
Further, the language direction vector representation, the original text vector representation and the second-order feature interaction vector are spliced to form a spliced vector, and a new spliced feature vector is output after a series of calculation operations comprising two fully connected layers, an activation function, a residual block and regularization.
Further, the new spliced feature vector is used as the input vector of the post-translation editing time prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing time.
Further, the new spliced feature vector is used as the input vector of the post-translation editing distance prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing distance.
Further, when the post-translation editing time and the post-translation editing distance are both smaller than their respective non-translated element thresholds and the original text is composed entirely of English letters, numbers, punctuation marks and spaces, the text is judged to be a non-translated element; the machine translation engine does not need to be called, and the original text is directly used as the translation.
Further, the translated editing data includes an original text, a source language, a target language, an editing distance, and an editing time.
Compared with the prior art, the invention has the beneficial effects that:
according to the non-translated element identification method based on editing behavior and rules, the original text, the source language and the target language are input, the post-translation editing distance and the post-translation editing time are output, and whether the original text is a non-translated element is judged through the post-translation editing time threshold, the post-translation editing distance threshold and a simple regular expression rule, so that the identification accuracy of non-translated elements in the original text is improved, non-translated elements do not need to be exhausted, and maintenance cost is reduced.
Detailed Description
The invention will be described in further detail with reference to the drawings and the detailed description. The embodiments of the invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Referring to FIG. 1, the invention is a method for identifying non-translated elements based on editing behavior and rules, comprising the steps of:
step one: acquiring an original text, and segmenting the original text;
step two: obtaining vector characterization of an original word, generating a word vector and a position vector, and adding and regularizing the word vector and the position vector;
step three: obtaining vector representation of an original text, constructing an encoder and a pooling layer, inputting a vector representation set of original words, and outputting vector representation representing the whole original text;
step four: obtaining the language direction vector representation, splicing the source language and the target language into a language direction feature with the underscore character, and mapping the language direction feature into a 256-dimensional embedded vector;
step five: obtaining a second-order feature interaction vector, and performing outer product operation and column summation operation on the language direction vector representation and the vector representation of the original text to generate the second-order feature interaction vector;
step six: splicing language direction vector representation, original text vector representation and second-order feature interaction vector, and generating a new spliced feature vector through MLP;
step seven: predicting the post-translation editing time, inputting the spliced feature vector, and outputting a scalar as the post-translation editing time;
step eight: predicting the post-translation editing distance, inputting the spliced feature vector, and outputting a scalar as the post-translation editing distance;
step nine: applying the non-translated element thresholds, the criteria for a non-translated element being: the post-translation editing time is less than 10 seconds; the post-translation editing distance is less than 4; and the original text is composed entirely of English letters, numbers, punctuation marks and spaces;
step ten: using the post-translation editing data, designing an MSE loss function and training the model weights with a stochastic gradient descent method and an error back propagation method.
Specifically, before the original text of a document is translated, the original text is acquired and segmented into words using spaces and punctuation marks; whether each word or character exists in a multilingual dictionary is judged; if the word or character exists in the multilingual dictionary, it is recorded as token_src; if the word or character does not exist in the multilingual dictionary, the longest-string greedy matching algorithm is used for segmentation, namely, characters are cut off one by one from the last character of the word toward the first character, splitting the word into two sub-words recorded respectively as a front string and a rear string, until the front string exists in the multilingual dictionary; the front string is then regarded as a qualified sub-word of the word and recorded as token_src; the cutting operation is repeated on the rear string until the whole original text is composed of qualified sub-word tokens token_src. The set of words or characters after segmentation of the original text is obtained by the formula:
TOKENS_src = {token_src,1, token_src,2, ..., token_src,t};
wherein TOKENS_src is the set of words or characters after segmentation of the original text, t is the number of words or characters after segmentation, and the number of words or characters of the original text is limited to 64.
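To make the segmentation concrete, the following is a minimal Python sketch of the longest-string greedy matching described above; the toy dictionary, the function names and the single-character fallback are illustrative assumptions, not the patented implementation.

```python
import re

def greedy_segment(word, dictionary):
    """Split a word into qualified sub-words (token_src) by cutting characters
    one by one from the end until the front string is found in the dictionary."""
    tokens = []
    rest = word
    while rest:
        # Try the longest front string first, then cut one character at a time.
        for end in range(len(rest), 0, -1):
            front = rest[:end]
            # Assumption: a lone character is accepted if nothing else matches.
            if front in dictionary or end == 1:
                tokens.append(front)   # qualified sub-word token_src
                rest = rest[end:]      # repeat the cutting on the rear string
                break
    return tokens

def segment_text(text, dictionary, max_tokens=64):
    words = re.findall(r"\w+|[^\w\s]", text)  # split on spaces and punctuation
    tokens = []
    for w in words:
        tokens.extend([w] if w in dictionary else greedy_segment(w, dictionary))
    return tokens[:max_tokens]  # the text is limited to 64 tokens

# Illustrative usage with a toy dictionary:
print(segment_text("firewall", {"fire", "wall"}))  # ['fire', 'wall']
```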
The vector representation of each original word is obtained. It consists of a word vector and a position vector: each word or character of the original text is embedded and mapped into a 256-dimensional word vector, the position of the word or character is mapped into a 256-dimensional position vector, and the word vector and the position vector are added and regularized. The set of vector representations of the original words or characters is obtained by the formulas:
x_i = Layernorm(Embedding_word(token_src,i) + Embedding_pos(pos_i));
E_src = {x_1, x_2, ..., x_t};
wherein token_src,i is the i-th word or character of the original text, pos_i is the position of the i-th original word or character, Embedding_word(token_src,i) is the word vector of the i-th original word or character, Embedding_pos(pos_i) is its position vector, x_i is the vector representation of the i-th original word or character, and E_src is the regularized set of the t original word or character vector representations. The embedding matrices above are trainable weights.
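A minimal sketch of this embedding step, assuming a PyTorch implementation; the vocabulary size and the 64-position maximum are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SourceEmbedding(nn.Module):
    """Word vector + position vector, added and layer-normalized (256-dim)."""
    def __init__(self, vocab_size=32000, max_len=64, dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)  # token -> 256-dim word vector
        self.pos_emb = nn.Embedding(max_len, dim)      # position -> 256-dim position vector
        self.norm = nn.LayerNorm(dim)                  # regularization of the sum

    def forward(self, token_ids):                      # token_ids: shape (t,)
        positions = torch.arange(token_ids.size(0))
        return self.norm(self.word_emb(token_ids) + self.pos_emb(positions))

# E_src for a 5-token text:
E_src = SourceEmbedding()(torch.tensor([3, 17, 9, 42, 5]))  # shape (5, 256)
```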
The invention constructs an encoder of 6 layers of 8-head self-attention, inputs the set of vector representations of the words or characters of the original text, and outputs a vector representation of the whole original text. Each self-attention layer is composed of 4 parts, namely the input vector, the 8-head QKV vector generation, the attention calculation and the output hidden vector.
In terms of input vectors, the input vector of the first self-attention layer is the set of vector representations of the original words or characters, while the input vector of each other self-attention layer (layers 2 to 6) is the hidden vector output by the preceding self-attention layer:
I_k = E_src, when k = 1; I_k = H_(k-1), when 1 < k ≤ 6;
wherein I_k is the input vector of the k-th self-attention layer; when k = 1, the input vector is the set E_src of original word or character vector representations, and when 1 < k ≤ 6, the input vector is the hidden vector H_(k-1) output by the (k-1)-th layer.
In terms of 8-head QKV vector generation, a query vector, a key vector and a value vector are each generated from the input vector: the query vector of a word is used to calculate the degree of association between the word and the key vectors of other words; the key vector of a word is used to calculate the degree of association between the query vectors of other words and the word; and the value vector of a word is used to construct new vector representations of other words according to the attention weights. In order to learn different text features, 8 attention heads are set, and the query vectors, key vectors and value vectors are respectively mapped into 8 different subspaces, each subspace representing a different text feature.
Q_src = I_k W_src_query; K_src = I_k W_src_key; V_src = I_k W_src_value;
Q_j = Q_src W_query_j; K_j = K_src W_key_j; V_j = V_src W_value_j, j = 1, 2, ..., 8;
wherein Q_src is the query vector and W_src_query is the weight mapping the input vector to the query vector; K_src is the key vector and W_src_key is the weight mapping the input vector to the key vector; V_src is the value vector and W_src_value is the weight mapping the input vector to the value vector; Q_j, K_j and V_j are the query vector, key vector and value vector of the j-th attention head, and W_query_j, W_key_j and W_value_j are the weights mapping the query, key and value vectors to the j-th attention head subspace. All the above W are trainable weights.
In terms of attention calculation, in order to learn context information, the similarity between the query vectors and the key vectors of the different attention heads is calculated, attention weight scores among the original words or characters are generated through a softmax function, and the attention weight scores are multiplied with the value vectors to obtain the context vectors of the different attention heads. Finally, the context vectors of the different attention heads are spliced and mapped, and a residual block is added and regularized to generate a new context vector representation.
head_j = softmax(Q_j K_j^T / sqrt(d)) V_j;
M_src = Layernorm(Concat(head_1, ..., head_8) W_concat + I_k);
wherein head_j is the context vector representation set of the j-th attention head; M_src is the context vector representation set obtained by splicing the 8 attention heads, mapping them with W_concat and adding the residual block I_k; d is 32 by default, and the scaling factor 1/sqrt(d) is used to prevent the attention weights Q_j K_j^T from becoming too large. All the above W are trainable weights.
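The 8-head QKV generation and attention calculation above can be sketched as follows, assuming PyTorch; the class name and the shared projection layout are illustrative, with d = 256/8 = 32 dimensions per head as stated.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """8-head QKV generation and attention calculation with residual + Layernorm."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.heads, self.d = heads, dim // heads        # d = 32 per head
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.concat = nn.Linear(dim, dim)               # maps spliced heads back to 256
        self.norm = nn.LayerNorm(dim)

    def forward(self, I_k):                             # I_k: shape (t, 256)
        t = I_k.size(0)
        # Generate Q, K, V and split each into 8 subspaces of 32 dims.
        Q, K, V = (proj(I_k).view(t, self.heads, self.d).transpose(0, 1)
                   for proj in (self.q, self.k, self.v))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d)  # avoid large weights
        heads = torch.softmax(scores, dim=-1) @ V             # context per head
        M_src = heads.transpose(0, 1).reshape(t, -1)          # splice the 8 heads
        return self.norm(self.concat(M_src) + I_k)            # residual + Layernorm
```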
In terms of outputting hidden vectors, the new context vector representation passes through two fully connected layers, the activation function GELU, the residual block M_src and regularization, and the hidden vector of the self-attention layer is output:
H_k = Layernorm(GELU(M_src W_hidden1) W_hidden2 + M_src);
wherein W_hidden1 is the first fully connected layer, W_hidden2 is the second fully connected layer, and H_k is the hidden vector set of the original words output by the k-th self-attention layer after the fully connected layers and the residual block are superimposed. All the above W are trainable weights.
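A sketch of the output hidden vector computation under the same PyTorch assumptions; the inner dimension of the first fully connected layer is an assumption, since the patent does not state it.

```python
import torch.nn as nn

class OutputHidden(nn.Module):
    """H_k = Layernorm(GELU(M_src W_hidden1) W_hidden2 + M_src)."""
    def __init__(self, dim=256, inner=1024):  # inner width is an assumption
        super().__init__()
        self.hidden1 = nn.Linear(dim, inner)  # W_hidden1
        self.hidden2 = nn.Linear(inner, dim)  # W_hidden2
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(dim)

    def forward(self, M_src):
        return self.norm(self.hidden2(self.act(self.hidden1(M_src))) + M_src)
```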
The original word hidden vector set of the last layer, H_last_hidden_state, is the set of hidden vectors of the t original words or characters. In order to obtain a sentence-level hidden vector of the original text, the hidden vectors of all the words or characters are pooled; the pooling operation is an addition operation, and a 256-dimensional vector H_src is generated as the text vector representation of the original text:
H_src = h_1 + h_2 + ... + h_t;
wherein h_i is the last-layer hidden vector of the i-th original word or character, and H_src is the text vector representation of the original text.
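The additive pooling reduces the t last-layer hidden vectors to one sentence-level vector; a minimal illustration, with random tensors standing in for real hidden states:

```python
import torch

H_last = torch.randn(5, 256)  # illustrative last-layer hidden vectors, t = 5 tokens
H_src = H_last.sum(dim=0)     # additive pooling -> sentence-level 256-dim vector
print(H_src.shape)            # torch.Size([256])
```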
The language direction vector representation is acquired. The language direction feature lang is composed of the source language lang_src and the target language lang_tgt: the source language code and the target language code are obtained and spliced with the underscore character "_" to form the language direction feature. For example, if the source language is English (United States), coded en-US, and the target language is simplified Chinese, coded zh-CN, the language direction feature is en-US_zh-CN. The invention involves a plurality of language directions, and each language direction feature is mapped into a 256-dimensional embedded vector:
lang = lang_src + "_" + lang_tgt;
L = Embedding(lang_i);
wherein L is the language direction vector representation and lang_i is the i-th language direction type; the Embedding above is a trainable weight.
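A minimal sketch of the language direction embedding; the inventory of language directions shown here is an illustrative assumption, as the real system covers many more.

```python
import torch
import torch.nn as nn

# Illustrative inventory of language directions.
directions = ["en-US_zh-CN", "en-US_de-DE", "ja-JP_en-US"]
dir_index = {d: i for i, d in enumerate(directions)}
lang_emb = nn.Embedding(len(directions), 256)  # trainable 256-dim embedding

lang = "en-US" + "_" + "zh-CN"                 # lang = lang_src + "_" + lang_tgt
L = lang_emb(torch.tensor(dir_index[lang]))    # language direction vector, shape (256,)
```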
The second-order feature interaction vector is obtained from the text vector representation of the original text and the language direction vector representation. After the text vector representation H_src of the original text and the language direction vector representation L^T are obtained, an outer product operation is performed on the two features to generate a second-order feature interaction matrix LH, a column-wise summation operation is performed on the feature interaction matrix, and an explicit second-order feature interaction vector P is finally generated:
LH = L^T H_src;
P = LH_1 + LH_2 + ... + LH_256;
wherein L is the language direction vector representation, H_src is the text vector representation of the original text, LH is the second-order feature interaction matrix, LH_i is the i-th row second-order feature interaction vector, and P is the explicit second-order feature interaction vector.
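A minimal illustration of the interaction step, treating L and H_src as 256-dimensional vectors as defined above; the random values are placeholders:

```python
import torch

L = torch.randn(256)        # language direction vector representation
H_src = torch.randn(256)    # text vector representation of the original text

LH = torch.outer(L, H_src)  # second-order feature interaction matrix, (256, 256)
P = LH.sum(dim=0)           # column-wise summation -> interaction vector, (256,)
```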
The language direction vector representation, the text vector representation of the original text and the second-order feature interaction vector are spliced to form a spliced vector C, and after a series of calculation operations comprising two fully connected layers, the activation function GELU, the residual block C and regularization, namely an MLP, a new spliced feature vector H_o is output:
C = Concat(L, H_src, P);
H_o = Layernorm(GELU(C W_c) W_o + C);
wherein C is the feature vector obtained by directly splicing the language direction feature, the original text feature and the interaction feature, W_c is the first fully connected layer, W_o is the second fully connected layer, and H_o is the output new spliced feature vector. All the above W are trainable weights.
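A sketch of this fusion MLP under PyTorch assumptions; keeping both layers at 3 × 256 = 768 dimensions is an assumption made so that the residual addition with C is well-defined:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionMLP(nn.Module):
    """H_o = Layernorm(GELU(C W_c) W_o + C), with C = Concat(L, H_src, P)."""
    def __init__(self, dim=256):
        super().__init__()
        self.w_c = nn.Linear(3 * dim, 3 * dim)  # W_c: first fully connected layer
        self.w_o = nn.Linear(3 * dim, 3 * dim)  # W_o: second fully connected layer
        self.norm = nn.LayerNorm(3 * dim)

    def forward(self, L, H_src, P):
        C = torch.cat([L, H_src, P])            # spliced vector, shape (768,)
        return self.norm(self.w_o(F.gelu(self.w_c(C))) + C)
```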
The post-translation editing time is predicted. The new spliced feature vector H_o is used as the input vector of the post-translation editing time prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing time:
time = Layernorm(H_o W_time1) W_time2;
wherein W_time1 is the first fully connected layer, W_time2 is the second fully connected layer, and time is the scalar of the post-translation editing time. All the above W are trainable weights.
The post-translation editing distance is predicted. The new spliced feature vector H_o is used as the input vector of the post-translation editing distance prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing distance:
distance = Layernorm(H_o W_distance1) W_distance2;
wherein W_distance1 is the first fully connected layer, W_distance2 is the second fully connected layer, and distance is the scalar of the post-translation editing distance. All the above W are trainable weights.
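Both prediction modules share the same two-layer pattern, sketched below under PyTorch assumptions; the inner dimension and the 768-dimensional input are illustrative:

```python
import torch.nn as nn

class RegressionHead(nn.Module):
    """Two fully connected layers with a regularization layer, outputting a scalar;
    one instance serves as the editing-time head, one as the editing-distance head."""
    def __init__(self, dim=768, inner=256):  # widths are assumptions
        super().__init__()
        self.fc1 = nn.Linear(dim, inner)     # W_time1 / W_distance1
        self.norm = nn.LayerNorm(inner)      # regularization layer
        self.fc2 = nn.Linear(inner, 1)       # W_time2 / W_distance2 -> scalar

    def forward(self, H_o):
        return self.fc2(self.norm(self.fc1(H_o))).squeeze(-1)

time_head, distance_head = RegressionHead(), RegressionHead()
```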
Regarding the non-translated element thresholds, the method sets a post-translation editing time of less than 10 seconds and a post-translation editing distance of less than 4 as the editing behavior dividing criteria, i.e., 10 seconds is the post-translation editing time threshold and 4 is the post-translation editing distance threshold; these two thresholds serve as the non-translated element thresholds and are obtained through statistics on real data of non-translated elements. At the same time, a simple regular expression rule is applied: whether the original text is composed entirely of English letters, numbers, punctuation marks and spaces. If the post-translation editing time requirement, the post-translation editing distance requirement and the rule are all satisfied at the same time, the text is judged to be a non-translated element; the machine translation engine does not need to be called, and the original text is directly adopted as the translation.
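A minimal sketch of this final decision rule combining the two predicted scalars with the regular expression check; the threshold defaults follow the values stated above, while the function and pattern names are illustrative:

```python
import re
import string

# Regular expression rule: entirely English letters, digits, punctuation and spaces.
NON_TRANSLATED_PATTERN = re.compile(
    "[A-Za-z0-9 " + re.escape(string.punctuation) + "]+"
)

def is_non_translated(source_text, predicted_time, predicted_distance,
                      time_threshold=10.0, distance_threshold=4.0):
    """Judge a non-translated element: both predicted scalars fall below their
    thresholds and the character rule holds; if so, the machine translation
    engine is skipped and the source text is used directly as the translation."""
    return (predicted_time < time_threshold
            and predicted_distance < distance_threshold
            and NON_TRANSLATED_PATTERN.fullmatch(source_text) is not None)

print(is_non_translated("HTTP 404", 2.1, 0.0))  # True -> use source text as-is
```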
The model weights are trained. In order to train the above Embedding and W weights, the present patent uses tens of thousands of samples of post-translation editing data containing the features of original text, source language, target language, editing distance and editing time, wherein the post-translation editing data and the corresponding features are described in Table 1.
TABLE 1
The method takes the mean square error function MSE as the loss function, specified as follows, and trains the Embedding and W weights using a stochastic gradient descent algorithm and an error back propagation algorithm, wherein the number of training iterations (epochs) is 20, the batch size is 128, the learning rate is 0.00005, and the weight decay rate is 0.01.
loss = 0.5 × loss_distance + 0.5 × loss_time;
loss_distance = (ŷ_d - y_d)^2; loss_time = (ŷ_t - y_t)^2;
wherein the total loss function loss is composed of the editing distance loss function loss_distance and the editing time loss function loss_time; ŷ_d is the predicted value of the editing distance, y_d is the true value of the editing distance, ŷ_t is the predicted value of the editing time, and y_t is the true value of the editing time.
In the description of the present invention, it should be understood that the terms "upper," "lower," "left," "right," and the like indicate an orientation or positional relationship based on that shown in the drawings, are merely for convenience of description and for simplifying the description, and do not indicate or imply that the apparatus or element in question must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
The foregoing describes one embodiment of the present invention in detail, but the description is only a preferred embodiment of the present invention and should not be construed as limiting the scope of the invention. All equivalent changes and modifications within the scope of the present invention are intended to be covered by the present invention.