
Non-translated element identification method based on editing behavior and rule

Info

Publication number
CN116069901A
Authority
CN
China
Prior art keywords
vector
word
original text
editing
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310092192.2A
Other languages
Chinese (zh)
Other versions
CN116069901B (en)
Inventor
陈件
潘丽婷
张井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yizhe Information Technology Co ltd
Original Assignee
Shanghai Yizhe Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yizhe Information Technology Co ltd
Priority to CN202310092192.2A
Publication of CN116069901A
Application granted
Publication of CN116069901B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a non-translated element identification method based on editing behaviors and rules, belonging to the technical field of non-translated element identification. The method comprises: obtaining an original text and segmenting it into words; obtaining vector representations of the original words by generating word vectors and position vectors, adding them, and regularizing the result; and obtaining a vector representation of the whole original text by constructing an encoder and a pooling layer, inputting the set of word vector representations, and outputting a text vector representation. The method takes the original text, the source language and the target language as input and outputs the post-translation editing distance and the post-translation editing time; whether the original text is a non-translated element is then judged through a post-translation editing time threshold, a post-translation editing distance threshold and a simple regular-expression rule. This improves the accuracy of identifying non-translated elements in the original text, removes the need to exhaustively enumerate non-translated elements, and reduces maintenance cost.

Description

Non-translated element identification method based on editing behavior and rule
Technical Field
The invention relates to the technical field of non-translated element recognition, in particular to a non-translated element recognition method based on editing behaviors and rules.
Background
In a document translation scenario, the document is first parsed and segmented into sentences, and a machine translation engine is then called to translate the original text sentence by sentence. However, some original sentences do not actually need the machine translation engine and can be copied directly into the translation, such as product serial numbers, e-mail addresses, URLs, numbers, and chart data. An original sentence consisting of such character strings is called a non-translated element; if the original contains such strings but also other meaningful content that still needs translation, it is not a non-translated element. Because commercial machine translation engines generally charge by the number of characters, screening out original sentences that are non-translated elements before calling the engine, and using the original text directly as the translation, saves the translator's engine-call cost and improves document translation efficiency.
The text of a non-translated element is characterized by consisting entirely of English letters, numbers, punctuation marks and spaces. In addition, compared with other original text that needs translation, it shows editing-behavior characteristics such as short editing time and small editing distance. Non-translated element identification is usually done by exhaustive enumeration, writing corresponding rules as regular expressions. This is simple and effective, but the rules easily cause misjudgment, the rule base must be maintained continuously, non-translated elements must be exhaustively enumerated, the regular-expression code of the rule base becomes more and more complicated, and maintenance cost increases.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a non-translated element identification method based on editing behaviors and rules, which solves the following technical problems: rules easily cause misjudgment, the rule base must be maintained continuously, non-translated elements must be exhaustively enumerated, the regular-expression code of the rule base becomes more and more complicated, and maintenance cost increases.
The aim of the invention can be achieved by the following technical scheme:
A method for identifying non-translated elements based on editing behaviors and rules comprises the following steps:
step one: acquiring an original text, and segmenting the original text;
step two: obtaining vector characterization of an original word, generating a word vector and a position vector, and adding and regularizing the word vector and the position vector;
step three: obtaining vector representation of an original text, constructing an encoder and a pooling layer, inputting a vector representation set of original words, and outputting vector representation representing the whole original text;
step four: obtaining language direction vector characterization, splicing a source language and a target language into language direction features by using underline characters, and mapping the language direction features into 256-dimensional embedded vectors;
step five: obtaining a second-order feature interaction vector, and performing outer product operation and column summation operation on the language direction vector representation and the vector representation of the original text to generate the second-order feature interaction vector;
step six: splicing language direction vector representation, original text vector representation and second-order feature interaction vector, and generating a new spliced feature vector through MLP;
step seven: predicting the post-translation editing time, inputting the spliced feature vector, and outputting a scalar as the post-translation editing time;
step eight: predicting the post-translation editing distance, inputting the spliced feature vector, and outputting a scalar as the post-translation editing distance;
step nine: non-translated element threshold; the criteria for a non-translated element are that the post-translation editing time is less than 10 seconds, the post-translation editing distance is less than 4, and the original text consists of English letters, numbers, punctuation marks and spaces;
step ten: using the post-translation editing data, an MSE loss function is designed and the model weights are trained using the stochastic gradient descent method and the error back-propagation method.
Further, the original text is obtained and segmented using spaces and punctuation marks; whether each word or character exists is judged against a multilingual dictionary. When a word or character exists in the multilingual dictionary, it is recorded as token_src. When a word or character does not exist in the multilingual dictionary, the longest-string greedy matching algorithm is used for segmentation: characters are cut one by one from the last character of the word towards the first, splitting the word into two sub-words recorded as a front string and a rear string, until the front string exists in the multilingual dictionary; the front string is then regarded as a qualified sub-word of the word and recorded as token_src. The cutting operation is repeated on the rear string until the original text consists entirely of qualified sub-word tokens token_src.
Further, the vector representation of the original word is composed of a word vector and a position vector, the word or character of the original word is embedded, each word or character is mapped into a 256-dimensional word vector, the position of the word or character is mapped into the 256-dimensional position vector, and the word vector and the position vector are added and regularized.
Further, a 6-layer 8-head self-attention layer encoder is constructed, a set of vector tokens of words or characters of an original text is input, and a vector token representing the entire original text is output.
Further, the language direction feature is composed of a source language and a target language, a source language code and a target language code are obtained, the source language code and the target language code are spliced by using underlined characters to form the language direction feature, and the language direction feature is mapped into a 256-dimensional embedded vector.
In the fifth step, after obtaining the vector representation of the original text and the language direction vector representation, performing an outer product operation on the two features to generate a second-order feature interaction matrix, performing a column-wise summation operation on the feature interaction matrix, and finally generating an explicit second-order feature interaction vector.
Further, the language direction vector representation, the original text vector representation and the second-order feature interaction vector are spliced to form a spliced vector, and a new spliced feature vector is output after a series of calculation operations of two full connection layers, an activation function, a residual block and regularization.
Furthermore, the new spliced feature vector is used as the input vector of the post-translation editing time prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing time.
Further, the new spliced feature vector is used as the input vector of the post-translation editing distance prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing distance.
Further, when the post-translation editing time and the post-translation editing distance are smaller than the non-translation element threshold and the original text is composed of English letters, numbers, punctuations and spaces, the non-translation element is judged, a machine translation engine is not required to be called, and the original text is directly used as the translation.
Further, the post-translation editing data includes the original text, the source language, the target language, the editing distance, and the editing time.
Compared with the prior art, the invention has the beneficial effects that:
according to the non-translated element identification method based on the editing behavior and the rule, the original text, the source language and the target language are input, the translated editing distance and the translated editing time are output, and whether the original text is the non-translated element is judged through the threshold value of the translated editing time, the threshold value of the translated editing distance and the simple regular expression rule, so that the identification accuracy of the non-translated element in the original text is improved, the non-translated element is not required to be exhausted, and the maintenance cost is reduced.
Drawings
FIG. 1 is a flow chart of a non-translated element recognition method based on editing behavior and rules according to the present invention.
Detailed Description
The invention will be described in further detail with reference to the drawings and the detailed description. The embodiments of the invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Referring to FIG. 1, the invention provides a method for identifying non-translated elements based on editing behaviors and rules, comprising the following steps:
step one: acquiring an original text, and segmenting the original text;
step two: obtaining vector characterization of an original word, generating a word vector and a position vector, and adding and regularizing the word vector and the position vector;
step three: obtaining vector representation of an original text, constructing an encoder and a pooling layer, inputting a vector representation set of original words, and outputting vector representation representing the whole original text;
step four: obtaining language direction vector characterization, splicing a source language and a target language into language direction features by using underline characters, and mapping the language direction features into 256-dimensional embedded vectors;
step five: obtaining a second-order feature interaction vector, and performing outer product operation and column summation operation on the language direction vector representation and the vector representation of the original text to generate the second-order feature interaction vector;
step six: splicing language direction vector representation, original text vector representation and second-order feature interaction vector, and generating a new spliced feature vector through MLP;
step seven: predicting the post-translation editing time, inputting the spliced feature vector, and outputting a scalar as the post-translation editing time;
step eight: predicting the post-translation editing distance, inputting the spliced feature vector, and outputting a scalar as the post-translation editing distance;
step nine: non-translated element threshold; the criteria for a non-translated element are that the post-translation editing time is less than 10 seconds, the post-translation editing distance is less than 4, and the original text consists of English letters, numbers, punctuation marks and spaces;
step ten: using the post-translation editing data, an MSE loss function is designed and the model weights are trained using the stochastic gradient descent method and the error back-propagation method.
Specifically, before the original text of the document is translated, the original text is obtained and segmented into words using spaces and punctuation marks; whether each word or character exists in the multilingual dictionary is judged. If a word or character exists in the multilingual dictionary, it is recorded as token_src. If a word or character does not exist in the multilingual dictionary, the longest-string greedy matching algorithm is used for segmentation: characters are cut one by one from the last character of the word towards the first, splitting the word into two sub-words recorded as a front string and a rear string, until the front string exists in the multilingual dictionary; the front string is then regarded as a qualified sub-word of the word and recorded as token_src. The cutting operation is repeated on the rear string until the original text consists entirely of qualified sub-word tokens token_src. The set of words or characters after segmentation of the original text is obtained by the following formula:
TOKENS_src = {token_src^1, token_src^2, …, token_src^t};
wherein TOKENS_src is the set of words or characters after segmentation of the original text, t denotes that there are t original words or characters, and the number of words or characters of the original text is limited to 64.
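A minimal sketch of this segmentation step, assuming a plain Python set stands in for the multilingual dictionary; the vocabulary contents, the helper name and the 64-token cap handling are illustrative, not taken from the patent:

```python
import re

def segment(text, vocab, max_tokens=64):
    """Split `text` on spaces/punctuation, then greedily split unknown words
    into dictionary sub-words by cutting characters off the end."""
    tokens = []
    # Coarse split on whitespace and punctuation, keeping punctuation as tokens.
    for word in re.findall(r"\w+|[^\w\s]", text, flags=re.UNICODE):
        rest = word
        while rest:
            if rest in vocab:                      # whole remainder is a known token
                tokens.append(rest)
                break
            # Longest-string greedy match: drop characters from the end
            # until the front string is found in the dictionary.
            for end in range(len(rest) - 1, 0, -1):
                if rest[:end] in vocab:
                    tokens.append(rest[:end])      # qualified sub-word (token_src)
                    rest = rest[end:]              # repeat on the rear string
                    break
            else:
                tokens.append(rest)                # no known prefix: keep as-is
                break
        if len(tokens) >= max_tokens:
            break
    return tokens[:max_tokens]

# Example with a hypothetical vocabulary:
vocab = {"trans", "lation", "memory", "2023", "-", "."}
print(segment("translation memory 2023.", vocab))
# ['trans', 'lation', 'memory', '2023', '.']
```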
A vector representation of each original word is obtained. The vector representation of an original word consists of a word vector and a position vector: each word or character of the original text is embedded and mapped into a 256-dimensional word vector, the position of the word or character is mapped into a 256-dimensional position vector, and the word vector and the position vector are added and regularized. The set of vector representations of the original words or characters is obtained by the following formulas:
e_word^i = Embedding_word(token_src^i);
e_pos^i = Embedding_pos(pos_src^i);
e_src^i = Layernorm(e_word^i + e_pos^i);
E_src = {e_src^1, e_src^2, …, e_src^t};
wherein token_src^i is the i-th original word or character, pos_src^i is the position of the i-th original word or character, e_word^i is the word vector of the i-th original word or character, e_pos^i is the position vector of the i-th original word or character, e_src^i is the vector representation of the i-th original word or character, and E_src is the regularized set of the t original word or character vector representations. The Embedding weights above are trainable.
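A sketch of this word-plus-position embedding in PyTorch; the vocabulary size, class name and toy input are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class SourceEmbedding(nn.Module):
    """Map token ids and positions to 256-d vectors, add them, then LayerNorm
    (e_src^i = Layernorm(e_word^i + e_pos^i))."""
    def __init__(self, vocab_size, max_len=64, dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)   # Embedding_word
        self.pos_emb = nn.Embedding(max_len, dim)       # Embedding_pos
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_ids):                       # token_ids: (batch, t)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        e_word = self.word_emb(token_ids)               # (batch, t, 256)
        e_pos = self.pos_emb(positions)                 # (t, 256), broadcast over batch
        return self.norm(e_word + e_pos)                # E_src: (batch, t, 256)

emb = SourceEmbedding(vocab_size=30000)
E_src = emb(torch.randint(0, 30000, (1, 10)))           # toy batch of 10 token ids
print(E_src.shape)                                       # torch.Size([1, 10, 256])
```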
The invention constructs a 6-layer 8-head self-attention layer encoder, inputs the vector characterization set of the words or characters of the original text, and outputs the vector characterization representing the whole original text. Each self-attention layer is composed of 4 parts, namely an input vector part, an 8-head QKV vector generation part, an attention calculation part and an output hidden vector part;
in terms of input vectors, the input vectors of the first layer of attention layer
Figure SMS_15
For vector characterization of textual words or characters, input vectors of other attention layers (layers 2-6)>
Figure SMS_16
The hidden vector H output for the upper attention layerk-1
Figure SMS_17
Wherein I isk Is the input vector for the k-th layer self-attention layer,
Figure SMS_18
when k=1, the input vector is set E of textual words or character vector representationssrc When 1<When k is less than or equal to 6, the input vector is the hidden vector H output by the k-1 layerk-1
Figure SMS_19
For 8-head QKV vector generation, a query vector, a key vector and a value vector are each generated from the input vector. The query vector of a word is used to calculate the degree of association between that word and the key vectors of other words; the key vector of a word is used to calculate the degree of association between the query vectors of other words and that word; and the value vector of a word is used to construct new vector representations of other words according to the attention weights. In order to learn different text features, 8 attention heads are set, and the query, key and value vectors are respectively mapped into 8 different subspaces, each subspace representing a different text feature.
Q_src = I_k W_src_query;
K_src = I_k W_src_key;
V_src = I_k W_src_value;
Q_src^j = Q_src W_query^j;
K_src^j = K_src W_key^j;
V_src^j = V_src W_value^j;
wherein Q_src is the query vector and W_src_query is the weight mapping the input vector to the query vector; K_src is the key vector and W_src_key is the weight mapping the input vector to the key vector; V_src is the value vector and W_src_value is the weight mapping the input vector to the value vector; Q_src^j is the query vector of the j-th attention head and W_query^j is the weight mapping the query vector to the j-th attention head; K_src^j is the key vector of the j-th attention head and W_key^j is the weight mapping the key vector to the j-th attention head; V_src^j is the value vector of the j-th attention head and W_value^j is the weight mapping the value vector to the j-th attention head. The above W weights are trainable.
For attention calculation, in order to learn context information, the similarity between the query vectors and key vectors of the different attention heads is calculated, attention weight scores among the original words or characters are generated through a softmax function, and the attention weight scores are multiplied with the value vectors to obtain the context vectors of the different attention heads. Finally, the context vectors of the different attention heads are concatenated and mapped, the residual block I_k is added, and a new set of context vector representations is generated after regularization.
A_src^j = softmax(Q_src^j (K_src^j)^T / sqrt(d));
context_src^j = A_src^j V_src^j;
M_src = Layernorm(Concate(context_src^1, …, context_src^8) W_concat + I_k);
wherein A_src^j is the attention weight score of the j-th attention head, context_src^j is the set of context vector representations of the j-th attention head, M_src is the set of context vector representations obtained by concatenating the 8 attention heads, mapping them with W_concat and adding the residual block I_k, d is 32 by default, and dividing by sqrt(d) is used to prevent the attention weights from becoming too large. The above W weights are trainable.
In terms of the output hidden vector, the new context vector representation passes through a series of calculation operations consisting of two fully connected layers, the GELU activation function, the residual block M_src and regularization, and the hidden vector of the self-attention layer is output:
H_k = Layernorm(GELU(M_src W_hidden1) W_hidden2 + M_src);
wherein W_hidden1 is the first fully connected layer, W_hidden2 is the second fully connected layer, and after the fully connected layers and the residual block are superimposed, the k-th self-attention layer outputs the hidden vector set H_k of the original words. The above W weights are trainable.
The original word hidden vector set H_last_hidden_state of the last layer is the set of hidden vectors of the t original words or characters. In order to obtain a sentence-level hidden vector of the original text, the hidden vectors of all the words or characters are pooled; the pooling operation is addition, and a 256-dimensional vector H_src is generated as the text vector representation of the original text:
H_src = h_last_hidden_state^1 + h_last_hidden_state^2 + … + h_last_hidden_state^t;
wherein h_last_hidden_state^i is the last-layer hidden vector of the i-th original word or character, and H_src is the text hidden vector representation of the original text.
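A compact sketch of the 6-layer, 8-head self-attention encoder and the additive pooling, following the structure of the formulas above (per-head attention, concatenation with a residual connection and LayerNorm, then a two-layer GELU feed-forward block); the dimensions stated in the text (256-d, 8 heads, d = 32, 6 layers, sum pooling) are kept, while class names, bias settings and the toy input are assumptions:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionLayer(nn.Module):
    """One self-attention layer: 8-head attention -> residual + LayerNorm ->
    two-layer GELU feed-forward -> residual + LayerNorm."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.heads, self.d = heads, dim // heads            # d = 32 per head
        self.w_q = nn.Linear(dim, dim, bias=False)           # W_src_query (all heads at once)
        self.w_k = nn.Linear(dim, dim, bias=False)           # W_src_key
        self.w_v = nn.Linear(dim, dim, bias=False)           # W_src_value
        self.w_concat = nn.Linear(dim, dim, bias=False)      # mapping of concatenated heads
        self.norm1 = nn.LayerNorm(dim)
        self.w_hidden1 = nn.Linear(dim, dim)                 # W_hidden1
        self.w_hidden2 = nn.Linear(dim, dim)                 # W_hidden2
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                                    # x = I_k, shape (batch, t, 256)
        b, t, _ = x.shape
        def split(z):                                        # (batch, t, 256) -> (batch, heads, t, d)
            return z.view(b, t, self.heads, self.d).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)  # A_src^j
        ctx = (att @ v).transpose(1, 2).reshape(b, t, -1)    # concatenated context vectors
        m = self.norm1(self.w_concat(ctx) + x)               # M_src
        h = self.w_hidden2(F.gelu(self.w_hidden1(m)))
        return self.norm2(h + m)                             # H_k

class SourceEncoder(nn.Module):
    """Stack of 6 self-attention layers followed by additive pooling to H_src."""
    def __init__(self, layers=6, dim=256):
        super().__init__()
        self.layers = nn.ModuleList([SelfAttentionLayer(dim) for _ in range(layers)])

    def forward(self, e_src):                                # e_src: (batch, t, 256)
        h = e_src
        for layer in self.layers:
            h = layer(h)                                     # H_1 ... H_6
        return h.sum(dim=1)                                  # additive pooling -> H_src: (batch, 256)

encoder = SourceEncoder()
H_src = encoder(torch.randn(1, 10, 256))                     # toy batch of 10 tokens
print(H_src.shape)                                           # torch.Size([1, 256])
```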
A language direction vector representation is acquired. The language direction feature lang consists of the source language lang_src and the target language lang_tgt: the source language code and the target language code are obtained and spliced with the underline character "_" to form the language direction feature. For example, if the source language is English (US) with code en-US and the target language is simplified Chinese with code zh-CN, the language direction feature is en-US_zh-CN. The invention involves multiple language directions, and the language direction feature is mapped into a 256-dimensional embedding vector.
lang = lang_src + "_" + lang_tgt;
L = Embedding(lang_i);
wherein L is the language direction vector representation and lang_i is the i-th language direction type. The Embedding weights above are trainable.
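A small sketch of the language-direction embedding; the inventory of supported language directions is hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical inventory of supported language directions.
directions = ["en-US_zh-CN", "zh-CN_en-US", "en-US_ja-JP", "ja-JP_en-US"]
dir2id = {d: i for i, d in enumerate(directions)}

lang_emb = nn.Embedding(len(directions), 256)        # maps a direction to a 256-d vector

lang = "en-US" + "_" + "zh-CN"                        # lang = lang_src + "_" + lang_tgt
L = lang_emb(torch.tensor([dir2id[lang]]))            # L: (1, 256)
print(L.shape)
```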
A second-order feature interaction vector is obtained from the vector representation of the original text and the language direction vector representation. After the text vector representation H_src and the language direction vector representation L are obtained, an outer product operation is performed on the two features to generate a second-order feature interaction matrix LH, a column-wise summation operation is performed on the feature interaction matrix, and an explicit second-order feature interaction vector P is finally generated.
LH = L^T H_src;
P = LH_1 + LH_2 + … + LH_256;
wherein L is the language direction vector representation, H_src is the text vector representation of the original text, LH is the second-order feature interaction matrix, LH_i is the i-th row second-order feature interaction vector, and P is the explicit second-order feature interaction vector.
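A sketch of this second-order interaction (outer product of L and H_src followed by a column-wise sum), assuming both are 256-d row vectors as described above; the random inputs are placeholders:

```python
import torch

L = torch.randn(1, 256)          # language direction vector representation
H_src = torch.randn(1, 256)      # text vector representation of the original text

LH = L.transpose(0, 1) @ H_src   # outer product: (256, 256) interaction matrix
P = LH.sum(dim=0, keepdim=True)  # column-wise sum -> explicit interaction vector (1, 256)
print(LH.shape, P.shape)
```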
The language direction vector representation, the vector representation of the original text and the second-order feature interaction vector are spliced to form a spliced vector C, and after a series of calculation operations comprising two fully connected layers, the GELU activation function, the residual block C and regularization, i.e. an MLP, a new spliced feature vector H_o is output.
C = Concate(L, H_src, P);
H_o = Layernorm(GELU(C W_c) W_o + C);
wherein C is the feature vector obtained by directly splicing the language direction feature, the original text feature and the interaction feature, W_c is the first fully connected layer, W_o is the second fully connected layer, and H_o is the newly output spliced feature vector. The above W weights are trainable.
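A sketch of the splicing and the residual MLP (H_o = Layernorm(GELU(C W_c) W_o + C)); the hidden widths of W_c and W_o are not given in the text and are assumed here to equal the concatenated width so that the residual addition is well defined:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpliceMLP(nn.Module):
    """C = Concate(L, H_src, P); H_o = Layernorm(GELU(C W_c) W_o + C)."""
    def __init__(self, dim=256):
        super().__init__()
        self.w_c = nn.Linear(3 * dim, 3 * dim)    # first fully connected layer W_c
        self.w_o = nn.Linear(3 * dim, 3 * dim)    # second fully connected layer W_o
        self.norm = nn.LayerNorm(3 * dim)

    def forward(self, L, H_src, P):
        C = torch.cat([L, H_src, P], dim=-1)                    # spliced vector (1, 768)
        return self.norm(self.w_o(F.gelu(self.w_c(C))) + C)     # H_o

mlp = SpliceMLP()
H_o = mlp(torch.randn(1, 256), torch.randn(1, 256), torch.randn(1, 256))
print(H_o.shape)                                                 # torch.Size([1, 768])
```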
The post-translation editing time is predicted: the new spliced feature vector H_o is used as the input vector of the post-translation editing time prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing time.
time = Layernorm(H_o W_time1) W_time2;
wherein W_time1 is the first fully connected layer, W_time2 is the second fully connected layer, and time is the scalar post-translation editing time. The above W weights are trainable.
The post-translation editing distance is predicted: the new spliced feature vector H_o is used as the input vector of the post-translation editing distance prediction module, two fully connected layers and a regularization layer are applied, and finally a scalar is output as the post-translation editing distance.
distance = Layernorm(H_o W_distance1) W_distance2;
wherein W_distance1 is the first fully connected layer, W_distance2 is the second fully connected layer, and distance is the scalar post-translation editing distance. The above W weights are trainable.
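A sketch of the two prediction heads, each built from two fully connected layers with a regularization (LayerNorm) step, producing a scalar (time = Layernorm(H_o W_time1) W_time2, and analogously for distance); the intermediate width is an assumption:

```python
import torch
import torch.nn as nn

class ScalarHead(nn.Module):
    """Two fully connected layers with LayerNorm, outputting one scalar."""
    def __init__(self, in_dim=768, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)   # W_time1 / W_distance1
        self.norm = nn.LayerNorm(hidden)
        self.fc2 = nn.Linear(hidden, 1)        # W_time2 / W_distance2

    def forward(self, h_o):
        return self.fc2(self.norm(self.fc1(h_o))).squeeze(-1)

time_head = ScalarHead()       # predicted post-translation editing time (seconds)
dist_head = ScalarHead()       # predicted post-translation editing distance

H_o = torch.randn(1, 768)
print(time_head(H_o).item(), dist_head(H_o).item())
```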
The method sets a post-translation editing time of less than 10 seconds and a post-translation editing distance of less than 4 as the editing-behavior criterion: 10 seconds is the post-translation editing time threshold, 4 is the post-translation editing distance threshold, and together they form the non-translated element threshold, which is obtained by statistics on real data of non-translated elements. At the same time, a simple regular-expression rule is applied: whether the original text consists entirely of English letters, numbers, punctuation marks and spaces. If the post-translation editing time, the post-translation editing distance and the rule requirements are all met, the original text is judged to be a non-translated element, the machine translation engine does not need to be called, and the original text is directly adopted as the translation.
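An illustrative sketch of this final decision, combining the two predicted values (thresholds 10 seconds and 4 from the text) with the character-class rule; the exact punctuation set and the function name are assumptions:

```python
import re
import string

# English letters, digits, ASCII punctuation and spaces only (punctuation set assumed).
NON_TRANSLATED_RE = re.compile(r"^[A-Za-z0-9%s ]+$" % re.escape(string.punctuation))

def is_non_translated(source_text, pred_time, pred_distance,
                      time_threshold=10.0, distance_threshold=4.0):
    """True if predicted post-editing time/distance fall below the thresholds
    and the text matches the character-class rule."""
    return (pred_time < time_threshold
            and pred_distance < distance_threshold
            and bool(NON_TRANSLATED_RE.match(source_text)))

print(is_non_translated("SN-20230203-0091", 2.1, 0.0))          # True: copy source as translation
print(is_non_translated("Press the red button.", 25.0, 18.0))   # False: call the MT engine
```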
Model weights are trained. In order to train the Embedding and W weights of the above steps, this patent uses tens of thousands of samples of post-translation editing data containing the original text, the source language, the target language, the editing distance and the editing time; the post-translation editing data and their corresponding features are described in Table 1.
Table 1: fields of the post-translation editing data (original text, source language, target language, editing distance, editing time).
The mean square error function MSE is taken as the loss function, as specified below, and Embedding and W are trained using the stochastic gradient descent algorithm and the error back-propagation algorithm, where the number of training epochs is 20, the batch size is 128, the learning rate is 0.00005, and the weight decay rate is 0.01.
loss = 0.5 × loss_distance + 0.5 × loss_time;
loss_distance = (ŷ_d − y_d)^2;
loss_time = (ŷ_t − y_t)^2;
wherein the total loss function loss consists of the edit distance loss function loss_distance and the editing time loss function loss_time, ŷ_d is the predicted value of the editing distance, y_d is the true value of the editing distance, ŷ_t is the predicted value of the editing time, and y_t is the true value of the editing time.
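A sketch of a training loop with the combined MSE loss (loss = 0.5 × loss_distance + 0.5 × loss_time) and SGD with the stated hyperparameters; the stand-in model and synthetic data below are placeholders, since in the patent's setup the model is the full network sketched above and the data are real post-translation editing records:

```python
import torch
import torch.nn as nn

# Stand-in model and synthetic data, purely to make the loop concrete.
model = nn.Sequential(nn.Linear(256, 128), nn.GELU(), nn.Linear(128, 2))  # -> (time, distance)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-5, weight_decay=0.01)

features = torch.randn(1024, 256)            # placeholder sentence features
targets = torch.rand(1024, 2) * 60           # [editing time, editing distance]

for epoch in range(20):                      # 20 training epochs
    for i in range(0, len(features), 128):   # batch size 128
        x, y = features[i:i + 128], targets[i:i + 128]
        pred = model(x)
        loss_time = torch.mean((pred[:, 0] - y[:, 0]) ** 2)   # MSE on editing time
        loss_dist = torch.mean((pred[:, 1] - y[:, 1]) ** 2)   # MSE on editing distance
        loss = 0.5 * loss_dist + 0.5 * loss_time              # combined loss
        optimizer.zero_grad()
        loss.backward()                      # error back-propagation
        optimizer.step()                     # stochastic gradient descent update
```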
In the description of the present invention, it should be understood that the terms "upper," "lower," "left," "right," and the like indicate an orientation or a positional relationship based on that shown in the drawings, and are merely for convenience of description and for simplifying the description, and do not indicate or imply that the apparatus or element in question must have a specific orientation, as well as a specific orientation configuration and operation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
The foregoing describes one embodiment of the present invention in detail, but the description is only a preferred embodiment of the present invention and should not be construed as limiting the scope of the invention. All equivalent changes and modifications within the scope of the present invention are intended to be covered by the present invention.

Claims (11)

1. An identification method of non-translated elements based on editing behaviors and rules is characterized by comprising the following steps:
step one: acquiring an original text, and segmenting the original text;
step two: obtaining vector characterization of an original word, generating a word vector and a position vector, and adding and regularizing the word vector and the position vector;
step three: obtaining vector representation of an original text, constructing an encoder and a pooling layer, inputting a vector representation set of original words, and outputting vector representation representing the whole original text;
step four: obtaining language direction vector characterization, splicing a source language and a target language into language direction features by using underline characters, and mapping the language direction features into 256-dimensional embedded vectors;
step five: obtaining a second-order feature interaction vector, and performing outer product operation and column summation operation on the language direction vector representation and the vector representation of the original text to generate the second-order feature interaction vector;
step six: splicing language direction vector representation, original text vector representation and second-order feature interaction vector, and generating a new spliced feature vector through MLP;
step seven: predicting the post-translation editing time, inputting the spliced feature vector, and outputting a scalar as the post-translation editing time;
step eight: predicting the post-translation editing distance, inputting the spliced feature vector, and outputting a scalar as the post-translation editing distance;
step nine: non-translated element threshold; the criteria for a non-translated element are that the post-translation editing time is less than 10 seconds, the post-translation editing distance is less than 4, and the original text consists of English letters, numbers, punctuation marks and spaces;
step ten: using the post-translation editing data, an MSE loss function is designed and the model weights are trained using the stochastic gradient descent method and the error back-propagation method.
2. The method for identifying non-translated elements based on editing behaviors and rules according to claim 1, wherein the original text is obtained and segmented using spaces and punctuation marks; whether each word or character exists is judged against a multilingual dictionary; when a word or character exists in the multilingual dictionary, it is recorded as token_src; when a word or character does not exist in the multilingual dictionary, the longest-string greedy matching algorithm is used for segmentation, namely characters are cut one by one from the last character of the word towards the first, splitting the word into two sub-words recorded as a front string and a rear string, until the front string exists in the multilingual dictionary; the front string is regarded as a qualified sub-word of the word and recorded as token_src; the cutting operation is repeated on the rear string until the original text consists entirely of qualified sub-word tokens token_src.
3. The method for recognizing non-translated elements based on editing behavior and rules according to claim 1, wherein the vector representation of the original word is composed of a word vector and a position vector, the word or character of the original word is word-embedded, each word or character is mapped into a 256-dimensional word vector, the position of the word or character is mapped into a 256-dimensional position vector, and the word vector and the position vector are added and regularized.
4. The method of claim 1, wherein a 6-layer 8-head self-attention layer encoder is constructed to input a set of vector representations of words or characters of an original text and output a vector representation representing the entire original text.
5. The method for recognizing non-translated elements based on editing behavior and rules according to claim 1, wherein the language direction feature is composed of a source language and a target language, the source language code and the target language code are acquired, the source language code and the target language code are spliced by using underlined characters to form the language direction feature, and the language direction feature is mapped into a 256-dimensional embedded vector.
6. The method for recognizing non-translated elements based on editing behavior and rules according to claim 1, wherein in the fifth step, after obtaining the vector representation of the original text and the language direction vector representation, performing an outer product operation on the two features to generate a second-order feature interaction matrix, performing a column-wise summation operation on the feature interaction matrix, and finally generating an explicit second-order feature interaction vector.
7. The method for recognizing non-translated elements based on editing behaviors and rules according to claim 1, wherein language direction vector representation, original text vector representation and second-order feature interaction vector are spliced to form a spliced vector, and after a series of calculation operations of two full connection layers, an activation function, a residual block and regularization, a new spliced feature vector is output.
8. The method for recognizing non-translated elements based on editing behaviors and rules according to claim 1, wherein a new spliced feature vector is used as an input vector of a post-translation editing time prediction module, a 2-layer full-connection layer and a regularization layer are accessed, and a scalar is finally output as the post-translation editing time.
9. The method for recognizing non-translated elements based on editing behaviors and rules according to claim 1, wherein a new spliced feature vector is used as an input vector of a post-translation editing distance prediction module, a 2-layer full-connection layer and a regularization layer are accessed, and a scalar is finally output as a post-translation editing distance.
10. The method for recognizing non-translated elements based on editing behavior and rule according to claim 1, wherein when the post-translation editing time and the post-translation editing distance are less than the non-translated element threshold and the original text is composed of english letters, numbers, punctuations and spaces, the non-translated element is judged, and the machine translation engine is not required to be called, and the original text is directly adopted as the translation.
11. The method of claim 1, wherein the post-translation editing data comprises an original text, a source language, a target language, an editing distance, and an editing time.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310092192.2A | 2023-02-03 | 2023-02-03 | Non-translated element identification method based on editing behavior and rule (granted as CN116069901B)

Publications (2)

Publication Number | Publication Date
CN116069901A | 2023-05-05
CN116069901B (en) | 2023-08-11

Family

ID=86179859

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310092192.2A (Active, granted as CN116069901B) | Non-translated element identification method based on editing behavior and rule | 2023-02-03 | 2023-02-03

Country Status (1)

Country | Link
CN | CN116069901B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US20140358518A1 (en)* | 2013-06-02 | 2014-12-04 | Jianqing Wu | Translation Protocol for Large Discovery Projects
JP2019082981A (en)* | 2017-10-30 | 2019-05-30 | 株式会社テクノリンク | Inter-different language communication assisting device and system
CN109783826A (en)* | 2019-01-15 | 2019-05-21 | 四川译讯信息科技有限公司 | A kind of document automatic translating method
CN109635269A (en)* | 2019-01-31 | 2019-04-16 | 苏州大学 | A kind of post-editing method and device of machine translation text
CN114201976A (en)* | 2021-10-28 | 2022-03-18 | 上海一者信息科技有限公司 | Automatic optimization system for machine translation

Also Published As

Publication number | Publication date
CN116069901B (en) | 2023-08-11


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
