Disclosure of Invention
Aiming at the defects existing in the prior art, the invention aims to provide a non-translated element identification method based on editing behaviors and rules, which solves the following technical problems: rules alone easily cause misjudgment; a rule base must be maintained continuously in an attempt to exhaust all non-translated elements; and the regular expression code of the rule base becomes more and more complicated, so that maintenance cost increases.
The aim of the invention can be achieved by the following technical scheme:
a method for identifying non-translated elements based on editing behaviors and rules comprises the following steps:
step one: acquiring an original text, and segmenting the original text;
step two: obtaining vector characterization of an original word, generating a word vector and a position vector, and adding and regularizing the word vector and the position vector;
step three: obtaining vector representation of an original text, constructing an encoder and a pooling layer, inputting a vector representation set of original words, and outputting vector representation representing the whole original text;
step four: obtaining the language direction vector representation, splicing the source language and the target language into a language direction feature with the underscore character, and mapping the language direction feature into a 256-dimensional embedded vector;
step five: obtaining a second-order feature interaction vector, and performing outer product operation and column summation operation on the language direction vector representation and the vector representation of the original text to generate the second-order feature interaction vector;
step six: splicing language direction vector representation, original text vector representation and second-order feature interaction vector, and generating a new spliced feature vector through MLP;
step seven: predicting the post-translation editing time, inputting the spliced feature vector, and outputting a scalar as the post-translation editing time;
step eight: predicting the post-translation editing distance, inputting the spliced feature vector, and outputting a scalar as the post-translation editing distance;
step nine: applying the non-translated element thresholds, the criteria for a non-translated element being: the post-translation editing time is less than 10 seconds; the post-translation editing distance is less than 4; and the original text is composed entirely of English letters, numbers, punctuation marks and spaces;
step ten: using the post-translation editing data, designing an MSE loss function and training the model weights with a stochastic gradient descent method and an error back propagation method.
Further, the original text is acquired and segmented using spaces and punctuation marks; whether each word or character exists in a multilingual dictionary is judged; when the word or character exists in the multilingual dictionary, it is recorded as token_src; when the word or character does not exist in the multilingual dictionary, a longest-string greedy matching algorithm is used for segmentation, namely, characters are cut off one by one from the last character of the word toward the first character, splitting the word into two sub-words recorded respectively as a front string and a rear string, until the front string exists in the multilingual dictionary; the front string is then regarded as a qualified sub-word of the word and recorded as token_src; the cutting operation is repeated on the rear string until the whole original text is composed of qualified sub-word tokens token_src.
Further, the vector representation of each original word is composed of a word vector and a position vector: each word or character of the original text is embedded and mapped into a 256-dimensional word vector, the position of the word or character is mapped into a 256-dimensional position vector, and the word vector and the position vector are added and regularized.
Further, an encoder of 6 layers of 8-head self-attention is constructed; the set of vector representations of the words or characters of the original text is input, and a vector representation of the entire original text is output.
Further, the language direction feature is composed of the source language and the target language: the source language code and the target language code are obtained and spliced with the underscore character to form the language direction feature, and the language direction feature is mapped into a 256-dimensional embedded vector.
In the fifth step, after obtaining the vector representation of the original text and the language direction vector representation, performing an outer product operation on the two features to generate a second-order feature interaction matrix, performing a column-wise summation operation on the feature interaction matrix, and finally generating an explicit second-order feature interaction vector.
Further, the language direction vector representation, the original text vector representation and the second-order feature interaction vector are spliced to form a spliced vector, and a new spliced feature vector is output after a series of calculation operations comprising two fully connected layers, an activation function, a residual block and regularization.
Further, the new spliced feature vector is used as the input vector of the post-translation editing time prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing time.
Further, the new spliced feature vector is used as the input vector of the post-translation editing distance prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing distance.
Further, when the post-translation editing time and the post-translation editing distance are both smaller than their respective non-translated element thresholds and the original text is composed entirely of English letters, numbers, punctuation marks and spaces, the text is judged to be a non-translated element; the machine translation engine does not need to be called, and the original text is directly used as the translation.
Further, the translated editing data includes an original text, a source language, a target language, an editing distance, and an editing time.
Compared with the prior art, the invention has the beneficial effects that:
according to the non-translated element identification method based on editing behavior and rules, the original text, the source language and the target language are input, the post-translation editing distance and the post-translation editing time are output, and whether the original text is a non-translated element is judged through the post-translation editing time threshold, the post-translation editing distance threshold and a simple regular expression rule, so that the identification accuracy of non-translated elements in the original text is improved, non-translated elements do not need to be exhausted, and maintenance cost is reduced.
Detailed Description
The invention will be described in further detail with reference to the drawings and the detailed description. The embodiments of the invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Referring to FIG. 1, the invention is a method for identifying non-translated elements based on editing behavior and rules, comprising the steps of:
step one: acquiring an original text, and segmenting the original text;
step two: obtaining vector characterization of an original word, generating a word vector and a position vector, and adding and regularizing the word vector and the position vector;
step three: obtaining vector representation of an original text, constructing an encoder and a pooling layer, inputting a vector representation set of original words, and outputting vector representation representing the whole original text;
step four: obtaining the language direction vector representation, splicing the source language and the target language into a language direction feature with the underscore character, and mapping the language direction feature into a 256-dimensional embedded vector;
step five: obtaining a second-order feature interaction vector, and performing outer product operation and column summation operation on the language direction vector representation and the vector representation of the original text to generate the second-order feature interaction vector;
step six: splicing language direction vector representation, original text vector representation and second-order feature interaction vector, and generating a new spliced feature vector through MLP;
step seven: predicting the post-translation editing time, inputting the spliced feature vector, and outputting a scalar as the post-translation editing time;
step eight: predicting the post-translation editing distance, inputting the spliced feature vector, and outputting a scalar as the post-translation editing distance;
step nine: applying the non-translated element thresholds, the criteria for a non-translated element being: the post-translation editing time is less than 10 seconds; the post-translation editing distance is less than 4; and the original text is composed entirely of English letters, numbers, punctuation marks and spaces;
step ten: using the post-translation editing data, designing an MSE loss function and training the model weights with a stochastic gradient descent method and an error back propagation method.
Specifically, before the original text of a document is translated, the original text is acquired and segmented into words using spaces and punctuation marks; whether each word or character exists in a multilingual dictionary is judged; if the word or character exists in the multilingual dictionary, it is recorded as token_src; if the word or character does not exist in the multilingual dictionary, the longest-string greedy matching algorithm is used for segmentation, namely, characters are cut off one by one from the last character of the word toward the first character, splitting the word into two sub-words recorded respectively as a front string and a rear string, until the front string exists in the multilingual dictionary; the front string is then regarded as a qualified sub-word of the word and recorded as token_src; the cutting operation is repeated on the rear string until the whole original text is composed of qualified sub-word tokens token_src. The set of words or characters after segmentation of the original text is obtained by the formula:
TOKENS_src = {token_src,1, token_src,2, ..., token_src,t};
wherein TOKENS_src is the set of words or characters after segmentation of the original text, t is the number of words or characters after segmentation, and the number of words or characters of the original text is limited to 64.
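To make the segmentation concrete, the following is a minimal Python sketch of the longest-string greedy matching described above; the toy dictionary, the function names and the single-character fallback are illustrative assumptions, not the patented implementation.

```python
import re

def greedy_segment(word, dictionary):
    """Split a word into qualified sub-words (token_src) by cutting characters
    one by one from the end until the front string is found in the dictionary."""
    tokens = []
    rest = word
    while rest:
        # Try the longest front string first, then cut one character at a time.
        for end in range(len(rest), 0, -1):
            front = rest[:end]
            # Assumption: a lone character is accepted if nothing else matches.
            if front in dictionary or end == 1:
                tokens.append(front)   # qualified sub-word token_src
                rest = rest[end:]      # repeat the cutting on the rear string
                break
    return tokens

def segment_text(text, dictionary, max_tokens=64):
    words = re.findall(r"\w+|[^\w\s]", text)  # split on spaces and punctuation
    tokens = []
    for w in words:
        tokens.extend([w] if w in dictionary else greedy_segment(w, dictionary))
    return tokens[:max_tokens]  # the text is limited to 64 tokens

# Illustrative usage with a toy dictionary:
print(segment_text("firewall", {"fire", "wall"}))  # ['fire', 'wall']
```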
The vector representation of each original word is obtained. It consists of a word vector and a position vector: each word or character of the original text is embedded and mapped into a 256-dimensional word vector, the position of the word or character is mapped into a 256-dimensional position vector, and the word vector and the position vector are added and regularized. The set of vector representations of the original words or characters is obtained by the formulas:
x_i = Layernorm(Embedding_word(token_src,i) + Embedding_pos(pos_i));
E_src = {x_1, x_2, ..., x_t};
wherein token_src,i is the i-th word or character of the original text, pos_i is the position of the i-th original word or character, Embedding_word(token_src,i) is the word vector of the i-th original word or character, Embedding_pos(pos_i) is its position vector, x_i is the vector representation of the i-th original word or character, and E_src is the regularized set of the t original word or character vector representations. The embedding matrices above are trainable weights.
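A minimal sketch of this embedding step, assuming a PyTorch implementation; the vocabulary size and the 64-position maximum are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SourceEmbedding(nn.Module):
    """Word vector + position vector, added and layer-normalized (256-dim)."""
    def __init__(self, vocab_size=32000, max_len=64, dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)  # token -> 256-dim word vector
        self.pos_emb = nn.Embedding(max_len, dim)      # position -> 256-dim position vector
        self.norm = nn.LayerNorm(dim)                  # regularization of the sum

    def forward(self, token_ids):                      # token_ids: shape (t,)
        positions = torch.arange(token_ids.size(0))
        return self.norm(self.word_emb(token_ids) + self.pos_emb(positions))

# E_src for a 5-token text:
E_src = SourceEmbedding()(torch.tensor([3, 17, 9, 42, 5]))  # shape (5, 256)
```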
The invention constructs an encoder of 6 layers of 8-head self-attention, inputs the set of vector representations of the words or characters of the original text, and outputs a vector representation of the whole original text. Each self-attention layer is composed of 4 parts, namely the input vector, the 8-head QKV vector generation, the attention calculation and the output hidden vector.
In terms of input vectors, the input vector of the first self-attention layer is the set of vector representations of the original words or characters, while the input vector of each other self-attention layer (layers 2 to 6) is the hidden vector output by the preceding self-attention layer:
I_k = E_src, when k = 1; I_k = H_(k-1), when 1 < k ≤ 6;
wherein I_k is the input vector of the k-th self-attention layer; when k = 1, the input vector is the set E_src of original word or character vector representations, and when 1 < k ≤ 6, the input vector is the hidden vector H_(k-1) output by the (k-1)-th layer.
In terms of 8-head QKV vector generation, a query vector, a key vector and a value vector are each generated from the input vector: the query vector of a word is used to calculate the degree of association between the word and the key vectors of other words; the key vector of a word is used to calculate the degree of association between the query vectors of other words and the word; and the value vector of a word is used to construct new vector representations of other words according to the attention weights. In order to learn different text features, 8 attention heads are set, and the query vectors, key vectors and value vectors are respectively mapped into 8 different subspaces, each subspace representing a different text feature.
Q_src = I_k W_src_query; K_src = I_k W_src_key; V_src = I_k W_src_value;
Q_j = Q_src W_query_j; K_j = K_src W_key_j; V_j = V_src W_value_j, j = 1, 2, ..., 8;
wherein Q_src is the query vector and W_src_query is the weight mapping the input vector to the query vector; K_src is the key vector and W_src_key is the weight mapping the input vector to the key vector; V_src is the value vector and W_src_value is the weight mapping the input vector to the value vector; Q_j, K_j and V_j are the query vector, key vector and value vector of the j-th attention head, and W_query_j, W_key_j and W_value_j are the weights mapping the query, key and value vectors to the j-th attention head subspace. All the above W are trainable weights.
In terms of attention calculation, in order to learn context information, the similarity between the query vectors and the key vectors of the different attention heads is calculated, attention weight scores among the original words or characters are generated through a softmax function, and the attention weight scores are multiplied with the value vectors to obtain the context vectors of the different attention heads. Finally, the context vectors of the different attention heads are spliced and mapped, and a residual block is added and regularized to generate a new context vector representation.
head_j = softmax(Q_j K_j^T / sqrt(d)) V_j;
M_src = Layernorm(Concat(head_1, ..., head_8) W_concat + I_k);
wherein head_j is the context vector representation set of the j-th attention head; M_src is the context vector representation set obtained by splicing the 8 attention heads, mapping them with W_concat and adding the residual block I_k; d is 32 by default, and the scaling factor 1/sqrt(d) is used to prevent the attention weights Q_j K_j^T from becoming too large. All the above W are trainable weights.
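The 8-head QKV generation and attention calculation above can be sketched as follows, assuming PyTorch; the class name and the shared projection layout are illustrative, with d = 256/8 = 32 dimensions per head as stated.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """8-head QKV generation and attention calculation with residual + Layernorm."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.heads, self.d = heads, dim // heads        # d = 32 per head
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.concat = nn.Linear(dim, dim)               # maps spliced heads back to 256
        self.norm = nn.LayerNorm(dim)

    def forward(self, I_k):                             # I_k: shape (t, 256)
        t = I_k.size(0)
        # Generate Q, K, V and split each into 8 subspaces of 32 dims.
        Q, K, V = (proj(I_k).view(t, self.heads, self.d).transpose(0, 1)
                   for proj in (self.q, self.k, self.v))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d)  # avoid large weights
        heads = torch.softmax(scores, dim=-1) @ V             # context per head
        M_src = heads.transpose(0, 1).reshape(t, -1)          # splice the 8 heads
        return self.norm(self.concat(M_src) + I_k)            # residual + Layernorm
```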
In terms of outputting hidden vectors, the new context vector representation passes through two fully connected layers, the activation function GELU, the residual block M_src and regularization, and the hidden vector of the self-attention layer is output:
H_k = Layernorm(GELU(M_src W_hidden1) W_hidden2 + M_src);
wherein W_hidden1 is the first fully connected layer, W_hidden2 is the second fully connected layer, and H_k is the hidden vector set of the original words output by the k-th self-attention layer after the fully connected layers and the residual block are superimposed. All the above W are trainable weights.
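A sketch of the output hidden vector computation under the same PyTorch assumptions; the inner dimension of the first fully connected layer is an assumption, since the patent does not state it.

```python
import torch.nn as nn

class OutputHidden(nn.Module):
    """H_k = Layernorm(GELU(M_src W_hidden1) W_hidden2 + M_src)."""
    def __init__(self, dim=256, inner=1024):  # inner width is an assumption
        super().__init__()
        self.hidden1 = nn.Linear(dim, inner)  # W_hidden1
        self.hidden2 = nn.Linear(inner, dim)  # W_hidden2
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(dim)

    def forward(self, M_src):
        return self.norm(self.hidden2(self.act(self.hidden1(M_src))) + M_src)
```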
The original word hidden vector set of the last layer, H_last_hidden_state, is the set of hidden vectors of the t original words or characters. In order to obtain a sentence-level hidden vector of the original text, the hidden vectors of all the words or characters are pooled; the pooling operation is an addition operation, and a 256-dimensional vector H_src is generated as the text vector representation of the original text:
H_src = h_1 + h_2 + ... + h_t;
wherein h_i is the last-layer hidden vector of the i-th original word or character, and H_src is the text vector representation of the original text.
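The additive pooling reduces the t last-layer hidden vectors to one sentence-level vector; a minimal illustration, with random tensors standing in for real hidden states:

```python
import torch

H_last = torch.randn(5, 256)  # illustrative last-layer hidden vectors, t = 5 tokens
H_src = H_last.sum(dim=0)     # additive pooling -> sentence-level 256-dim vector
print(H_src.shape)            # torch.Size([256])
```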
The language direction vector representation is acquired. The language direction feature lang is composed of the source language lang_src and the target language lang_tgt: the source language code and the target language code are obtained and spliced with the underscore character "_" to form the language direction feature. For example, if the source language is English (United States), coded en-US, and the target language is simplified Chinese, coded zh-CN, the language direction feature is en-US_zh-CN. The invention involves a plurality of language directions, and each language direction feature is mapped into a 256-dimensional embedded vector:
lang = lang_src + "_" + lang_tgt;
L = Embedding(lang_i);
wherein L is the language direction vector representation and lang_i is the i-th language direction type; the Embedding above is a trainable weight.
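A minimal sketch of the language direction embedding; the inventory of language directions shown here is an illustrative assumption, as the real system covers many more.

```python
import torch
import torch.nn as nn

# Illustrative inventory of language directions.
directions = ["en-US_zh-CN", "en-US_de-DE", "ja-JP_en-US"]
dir_index = {d: i for i, d in enumerate(directions)}
lang_emb = nn.Embedding(len(directions), 256)  # trainable 256-dim embedding

lang = "en-US" + "_" + "zh-CN"                 # lang = lang_src + "_" + lang_tgt
L = lang_emb(torch.tensor(dir_index[lang]))    # language direction vector, shape (256,)
```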
The second-order feature interaction vector is obtained from the text vector representation of the original text and the language direction vector representation. After the text vector representation H_src of the original text and the language direction vector representation L^T are obtained, an outer product operation is performed on the two features to generate a second-order feature interaction matrix LH, a column-wise summation operation is performed on the feature interaction matrix, and an explicit second-order feature interaction vector P is finally generated:
LH = L^T H_src;
P = LH_1 + LH_2 + ... + LH_256;
wherein L is the language direction vector representation, H_src is the text vector representation of the original text, LH is the second-order feature interaction matrix, LH_i is the i-th row second-order feature interaction vector, and P is the explicit second-order feature interaction vector.
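A minimal illustration of the interaction step, treating L and H_src as 256-dimensional vectors as defined above; the random values are placeholders:

```python
import torch

L = torch.randn(256)        # language direction vector representation
H_src = torch.randn(256)    # text vector representation of the original text

LH = torch.outer(L, H_src)  # second-order feature interaction matrix, (256, 256)
P = LH.sum(dim=0)           # column-wise summation -> interaction vector, (256,)
```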
The language direction vector representation, the text vector representation of the original text and the second-order feature interaction vector are spliced to form a spliced vector C, and after a series of calculation operations comprising two fully connected layers, the activation function GELU, the residual block C and regularization, namely an MLP, a new spliced feature vector H_o is output:
C = Concat(L, H_src, P);
H_o = Layernorm(GELU(C W_c) W_o + C);
wherein C is the feature vector obtained by directly splicing the language direction feature, the original text feature and the interaction feature, W_c is the first fully connected layer, W_o is the second fully connected layer, and H_o is the output new spliced feature vector. All the above W are trainable weights.
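A sketch of this fusion MLP under PyTorch assumptions; keeping both layers at 3 × 256 = 768 dimensions is an assumption made so that the residual addition with C is well-defined:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionMLP(nn.Module):
    """H_o = Layernorm(GELU(C W_c) W_o + C), with C = Concat(L, H_src, P)."""
    def __init__(self, dim=256):
        super().__init__()
        self.w_c = nn.Linear(3 * dim, 3 * dim)  # W_c: first fully connected layer
        self.w_o = nn.Linear(3 * dim, 3 * dim)  # W_o: second fully connected layer
        self.norm = nn.LayerNorm(3 * dim)

    def forward(self, L, H_src, P):
        C = torch.cat([L, H_src, P])            # spliced vector, shape (768,)
        return self.norm(self.w_o(F.gelu(self.w_c(C))) + C)
```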
The post-translation editing time is predicted. The new spliced feature vector H_o is used as the input vector of the post-translation editing time prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing time:
time = Layernorm(H_o W_time1) W_time2;
wherein W_time1 is the first fully connected layer, W_time2 is the second fully connected layer, and time is the scalar of the post-translation editing time. All the above W are trainable weights.
The post-translation editing distance is predicted. The new spliced feature vector H_o is used as the input vector of the post-translation editing distance prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing distance:
distance = Layernorm(H_o W_distance1) W_distance2;
wherein W_distance1 is the first fully connected layer, W_distance2 is the second fully connected layer, and distance is the scalar of the post-translation editing distance. All the above W are trainable weights.
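Both prediction modules share the same two-layer pattern, sketched below under PyTorch assumptions; the inner dimension and the 768-dimensional input are illustrative:

```python
import torch.nn as nn

class RegressionHead(nn.Module):
    """Two fully connected layers with a regularization layer, outputting a scalar;
    one instance serves as the editing-time head, one as the editing-distance head."""
    def __init__(self, dim=768, inner=256):  # widths are assumptions
        super().__init__()
        self.fc1 = nn.Linear(dim, inner)     # W_time1 / W_distance1
        self.norm = nn.LayerNorm(inner)      # regularization layer
        self.fc2 = nn.Linear(inner, 1)       # W_time2 / W_distance2 -> scalar

    def forward(self, H_o):
        return self.fc2(self.norm(self.fc1(H_o))).squeeze(-1)

time_head, distance_head = RegressionHead(), RegressionHead()
```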
Regarding the non-translated element thresholds, the method sets a post-translation editing time of less than 10 seconds and a post-translation editing distance of less than 4 as the editing behavior dividing criteria, i.e., 10 seconds is the post-translation editing time threshold and 4 is the post-translation editing distance threshold; these two thresholds serve as the non-translated element thresholds and are obtained through statistics on real data of non-translated elements. At the same time, a simple regular expression rule is applied: whether the original text is composed entirely of English letters, numbers, punctuation marks and spaces. If the post-translation editing time requirement, the post-translation editing distance requirement and the rule are all satisfied at the same time, the text is judged to be a non-translated element; the machine translation engine does not need to be called, and the original text is directly adopted as the translation.
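A minimal sketch of this final decision rule combining the two predicted scalars with the regular expression check; the threshold defaults follow the values stated above, while the function and pattern names are illustrative:

```python
import re
import string

# Regular expression rule: entirely English letters, digits, punctuation and spaces.
NON_TRANSLATED_PATTERN = re.compile(
    "[A-Za-z0-9 " + re.escape(string.punctuation) + "]+"
)

def is_non_translated(source_text, predicted_time, predicted_distance,
                      time_threshold=10.0, distance_threshold=4.0):
    """Judge a non-translated element: both predicted scalars fall below their
    thresholds and the character rule holds; if so, the machine translation
    engine is skipped and the source text is used directly as the translation."""
    return (predicted_time < time_threshold
            and predicted_distance < distance_threshold
            and NON_TRANSLATED_PATTERN.fullmatch(source_text) is not None)

print(is_non_translated("HTTP 404", 2.1, 0.0))  # True -> use source text as-is
```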
The model weights are trained. In order to train the above Embedding and W weights, the present patent uses tens of thousands of samples of post-translation editing data containing the features of original text, source language, target language, editing distance and editing time, wherein the post-translation editing data and the corresponding features are described in Table 1.
TABLE 1
The method takes the mean square error function MSE as the loss function, specified as follows, and trains the Embedding and W weights using a stochastic gradient descent algorithm and an error back propagation algorithm, wherein the number of training iterations (epochs) is 20, the batch size is 128, the learning rate is 0.00005, and the weight decay rate is 0.01.
loss = 0.5 × loss_distance + 0.5 × loss_time;
loss_distance = (ŷ_d - y_d)^2; loss_time = (ŷ_t - y_t)^2;
wherein the total loss function loss is composed of the editing distance loss function loss_distance and the editing time loss function loss_time; ŷ_d is the predicted value of the editing distance, y_d is the true value of the editing distance, ŷ_t is the predicted value of the editing time, and y_t is the true value of the editing time.
In the description of the present invention, it should be understood that the terms "upper," "lower," "left," "right," and the like indicate an orientation or positional relationship based on that shown in the drawings, are merely for convenience of description and for simplifying the description, and do not indicate or imply that the apparatus or element in question must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
The foregoing describes one embodiment of the present invention in detail, but the description is only a preferred embodiment of the present invention and should not be construed as limiting the scope of the invention. All equivalent changes and modifications within the scope of the present invention are intended to be covered by the present invention.