CN109670178A - Sentence-level bilingual alignment method and device, computer readable storage medium - Google Patents

Sentence-level bilingual alignment method and device, computer readable storage medium

Info

Publication number
CN109670178A
CN109670178A
Authority
CN
China
Prior art keywords
text
sentence
punctuate
handled
aligned
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811562126.2A
Other languages
Chinese (zh)
Other versions
CN109670178B (en)
Inventor
聂镭
李睿
聂颖
郑权
张峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Original Assignee
Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dragon Horse Zhixin (Zhuhai Hengqin) Technology Co Ltd
Priority to CN201811562126.2A
Publication of CN109670178A
Application granted
Publication of CN109670178B
Status: Active
Anticipated expiration

Abstract

The invention discloses a sentence-level bilingual alignment method and device, and a computer-readable storage medium. The method includes: step S1: obtaining Z trained convolution kernels, where Z is an integer greater than or equal to 1; step S2: segmenting the two texts to be aligned into sentences and establishing a text similarity matrix U of the two texts to be aligned; step S3: convolving the text similarity matrix U with each of the Z trained convolution kernels to obtain Z optimized text similarity matrices; step S4: obtaining the sentence alignment result of the two texts to be aligned from the Z optimized text similarity matrices. The invention helps improve the efficiency of sentence alignment between texts.

Description

Sentence-level bilingual alignment method and device, computer readable storage medium
Technical field
The present invention relates to the field of natural language processing, and in particular to a sentence-level bilingual alignment method and device, and a computer-readable storage medium.
Background art
Parallel corpora are important data for translation algorithms based on natural language processing. A parallel/corresponding corpus consists of a source text and its parallel bilingual/multilingual translation, and the alignment granularity can be word level, sentence level, paragraph level or chapter level, among which sentence-level parallel corpora are the most commonly used; paragraph-level and chapter-level parallel corpora therefore usually have to be converted into sentence-level parallel corpora. In a corpus, however, the original and the translation are not necessarily in one-to-one correspondence: because of differences in text structure and in authors' writing habits, 15 Chinese sentences may correspond to 22 English sentences, or 16 Chinese sentences to 50 English sentences, so complicated and varied sentence matching situations must be handled. At present, paragraph- and chapter-level corpora are mainly split and combined into one-to-one sentence pairs manually, which consumes a great deal of manpower and time and is therefore unfavorable to improving sentence alignment efficiency.
Summary of the invention
In view of this, one object of the present invention is to provide a sentence-level bilingual alignment method and device, and a computer-readable storage medium, which help improve the efficiency of sentence alignment.
To achieve the above objects, the technical solution of the present invention provides a sentence-level bilingual alignment method, comprising:
Step S1: obtain Z trained convolution kernels, where Z is an integer greater than or equal to 1, and each trained convolution kernel is obtained through steps S11 to S15;
Step S11: perform sentence segmentation on two training texts respectively, and establish the text similarity matrix B = (K_ij) of the two training texts;
where n is the number of sentences obtained by segmenting one of the two training texts, m is the number of sentences obtained by segmenting the other training text, and the element K_ij of the text similarity matrix B is the text similarity between the i-th sentence obtained by segmenting the one training text and the j-th sentence obtained by segmenting the other training text;
Step S12: initialize a convolution kernel;
Step S13: convolve the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and compute the loss value loss; if the loss value meets a preset requirement, execute step S14, otherwise execute step S16;
where the loss value is computed by comparing P with a target matrix whose element L_ij is 1 if the i-th sentence obtained by segmenting the one training text matches the j-th sentence obtained by segmenting the other training text, and 0 otherwise;
Step S14: verify the current convolution kernel on a validation set, and judge whether the verification result meets a preset requirement; if so, execute step S15; if not, execute step S16;
Step S15: take the current convolution kernel as a trained convolution kernel;
Step S16: adjust the weights of the current convolution kernel according to the loss value, and judge whether the current number of training iterations has reached a preset number; if so, execute step S15; if not, return to step S13;
Step S2: perform sentence segmentation on two texts to be aligned respectively, and establish the text similarity matrix U = (K_ij) of the two texts to be aligned;
where a is the number of sentences obtained by segmenting one of the two texts to be aligned, b is the number of sentences obtained by segmenting the other text to be aligned, and the element K_ij of the text similarity matrix U is the text similarity between the i-th sentence obtained by segmenting the one text to be aligned and the j-th sentence obtained by segmenting the other text to be aligned;
Step S3: convolve the text similarity matrix U with each of the Z trained convolution kernels to obtain Z optimized text similarity matrices;
Step S4: obtain the sentence alignment result of the two texts to be aligned from the Z optimized text similarity matrices.
Further, Z is an integer greater than or equal to 2, and the trained convolution kernels differ in size and weights.
Further, the step S4 includes:
Step S41: compute a text matching degree matrix T from the Z optimized text similarity matrices, where the element Y_ij of the text matching degree matrix T is the text matching degree between the i-th sentence obtained by segmenting the one text to be aligned and the j-th sentence obtained by segmenting the other text to be aligned, and the value of each element of T is the average of the elements at the same position in the Z optimized text similarity matrices;
Step S42: traverse each row of the text matching degree matrix T in turn, select the element with the largest value in each row, and pair the two sentences corresponding to the selected element.
Further, after the step S42 the method further includes:
Step S43: judge whether any of the b sentences obtained by segmenting the other text to be aligned remains unpaired; if so, look up the sentence with the largest text matching degree to it in the text matching degree matrix T, and pair the found sentence with it.
Further, after the step S4 the method further includes:
Step S5: check the sentence alignment result according to the positional order, in the other text to be aligned, of the b sentences obtained by segmenting that text, and the positional order, in the one text to be aligned, of the a sentences obtained by segmenting that text.
Further, the step S5 includes:
Step S51: sort the a sentences according to the positional order of the b sentences in the other text to be aligned and the sentence alignment result;
Step S52: if there are two sentences among the a sentences whose positional order obtained by the sorting is opposite to their positional order in the one text to be aligned, judge that an error exists.
Further, the two training texts and the two texts to be aligned each comprise one English text and one non-English text, and the text similarity K between each sentence obtained by segmenting the English text and each sentence obtained by segmenting the non-English text is computed as follows:
translate each sentence obtained by segmenting the non-English text to obtain the corresponding English text;
for the two sentences whose text similarity is to be computed, compare the number of words in the sentence obtained by segmenting the English text with the number of words in the English text obtained by translating the sentence segmented from the non-English text;
then compute K = (sum over v of N_v) / E,
where E is the word count of whichever of the two compared word sequences has more words, and N_v is the value for the v-th word of the sequence with more words: if the sequence with fewer words contains a word whose root is identical to that of the v-th word, N_v is 1, otherwise 0.
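As an illustration of the root-matching formula above, the following Python sketch computes K for one English sentence and the English translation of one non-English sentence. It is a minimal sketch, not the patent's reference implementation: crude_root is a hypothetical stand-in for a real stemmer, and both inputs are assumed to already be in English (the non-English sentence having been machine-translated beforehand).

```python
# Minimal sketch of the root-matching text similarity K described above.

def crude_root(word: str) -> str:
    """Very rough root extraction: lowercase and strip a few common suffixes.
    A hypothetical placeholder for a proper stemmer."""
    w = word.lower().strip(".,!?;:()")
    for suffix in ("ing", "ed", "es", "s"):
        if len(w) > len(suffix) + 2 and w.endswith(suffix):
            return w[: -len(suffix)]
    return w


def text_similarity(english_sentence: str, translated_sentence: str) -> float:
    """K = (number of root matches) / (word count of the longer sentence)."""
    a = english_sentence.split()
    b = translated_sentence.split()
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    shorter_roots = {crude_root(w) for w in shorter}
    e = len(longer)  # E: word count of the longer of the two sentences
    if e == 0:
        return 0.0
    matches = sum(1 for w in longer if crude_root(w) in shorter_roots)  # sum of N_v
    return matches / e


print(text_similarity("The cat sits on the mat", "A cat sat on the mat"))
```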
To achieve the above objects, the technical solution of the present invention further provides a sentence-level bilingual alignment device, comprising:
an obtaining module, configured to obtain Z trained convolution kernels, where Z is an integer greater than or equal to 1, and each trained convolution kernel is obtained through steps S11 to S15;
Step S11: perform sentence segmentation on two training texts respectively, and establish the text similarity matrix B = (K_ij) of the two training texts;
where n is the number of sentences obtained by segmenting one of the two training texts, m is the number of sentences obtained by segmenting the other training text, and the element K_ij of the text similarity matrix B is the text similarity between the i-th sentence obtained by segmenting the one training text and the j-th sentence obtained by segmenting the other training text;
Step S12: initialize a convolution kernel;
Step S13: convolve the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and compute the loss value loss; if the loss value meets a preset requirement, execute step S14, otherwise execute step S16;
where the loss value is computed by comparing P with a target matrix whose element L_ij is 1 if the i-th sentence obtained by segmenting the one training text matches the j-th sentence obtained by segmenting the other training text, and 0 otherwise;
Step S14: verify the current convolution kernel on a validation set, and judge whether the verification result meets a preset requirement; if so, execute step S15; if not, execute step S16;
Step S15: take the current convolution kernel as a trained convolution kernel;
Step S16: adjust the weights of the current convolution kernel according to the loss value, and judge whether the current number of training iterations has reached a preset number; if so, execute step S15; if not, return to step S13;
a first processing module, configured to perform sentence segmentation on two texts to be aligned respectively, and to establish the text similarity matrix U = (K_ij) of the two texts to be aligned;
where a is the number of sentences obtained by segmenting one of the two texts to be aligned, b is the number of sentences obtained by segmenting the other text to be aligned, and the element K_ij of the text similarity matrix U is the text similarity between the i-th sentence obtained by segmenting the one text to be aligned and the j-th sentence obtained by segmenting the other text to be aligned;
a second processing module, configured to convolve the text similarity matrix U with each of the Z trained convolution kernels respectively to obtain Z optimized text similarity matrices;
a third processing module, configured to obtain the sentence alignment result of the two texts to be aligned from the Z optimized text similarity matrices.
To achieve the above objects, the technical solution of the present invention further provides a sentence-level bilingual alignment device, including a processor and a memory coupled with the processor, where the processor is configured to execute instructions in the memory to implement the above sentence-level bilingual alignment method.
To achieve the above objects, the technical solution of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above sentence-level bilingual alignment method.
In the sentence-level bilingual alignment method provided by the invention, the text similarity matrix of two texts to be aligned is convolved with trained convolution kernels, and sentence alignment of the two texts to be aligned is performed according to the convolution result. This not only reduces manual involvement and achieves automatic sentence alignment, but also improves alignment accuracy, which helps improve the efficiency of sentence alignment between texts.
Description of the drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of a sentence-level bilingual alignment method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of training a convolution kernel provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a text similarity matrix provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of a target matrix provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of computing a text matching degree matrix provided by an embodiment of the present invention.
Specific embodiments
The present invention is described below on the basis of embodiments, but the present invention is not limited to these embodiments. Some specific details are given in the following detailed description of the invention; to avoid obscuring the essence of the invention, well-known methods, processes, flows and elements are not described in detail.
In addition, those skilled in the art should understand that the drawings provided herein are for the purpose of illustration and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, words such as "include" and "comprise" throughout the specification and claims should be construed in an inclusive sense rather than an exclusive or exhaustive sense, that is, in the sense of "including but not limited to".
In the description of the present invention, it should be understood that terms such as "first" and "second" are used for descriptive purposes only and cannot be interpreted as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise indicated, "multiple" means two or more.
Referring to Fig. 1, Fig. 1 is a flowchart of a sentence-level bilingual alignment method provided by an embodiment of the present invention. The method includes:
Step S1: obtain Z trained convolution kernels, where Z is an integer greater than or equal to 1, and each trained convolution kernel is obtained through steps S11 to S15;
Step S11: perform sentence segmentation on two training texts respectively, and establish the text similarity matrix B = (K_ij) of the two training texts;
where n is the number of sentences obtained by segmenting one of the two training texts, m is the number of sentences obtained by segmenting the other training text, and the element K_ij of the text similarity matrix B is the text similarity between the i-th sentence obtained by segmenting the one training text and the j-th sentence obtained by segmenting the other training text;
Step S12: initialize a convolution kernel;
Step S13: convolve the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and compute the loss value loss; if the loss value meets a preset requirement, execute step S14, otherwise execute step S16;
where the loss value is computed by comparing P with a target matrix whose element L_ij is 1 if the i-th sentence obtained by segmenting the one training text matches the j-th sentence obtained by segmenting the other training text, and 0 otherwise;
Step S14: verify the current convolution kernel on a validation set, and judge whether the verification result meets a preset requirement; if so, execute step S15; if not, execute step S16;
Step S15: take the current convolution kernel as a trained convolution kernel;
Step S16: adjust the weights of the current convolution kernel according to the loss value, and judge whether the current number of training iterations has reached a preset number; if so, execute step S15; if not, return to step S13;
Step S2: perform sentence segmentation on two texts to be aligned respectively, and establish the text similarity matrix U = (K_ij) of the two texts to be aligned;
where a is the number of sentences obtained by segmenting one of the two texts to be aligned, b is the number of sentences obtained by segmenting the other text to be aligned, and the element K_ij of the text similarity matrix U is the text similarity between the i-th sentence obtained by segmenting the one text to be aligned and the j-th sentence obtained by segmenting the other text to be aligned;
Step S3: convolve the text similarity matrix U with each of the Z trained convolution kernels to obtain Z optimized text similarity matrices;
Step S4: obtain the sentence alignment result of the two texts to be aligned from the Z optimized text similarity matrices.
In the sentence-level bilingual alignment method provided by the embodiment of the present invention, the text similarity matrix of two texts to be aligned is convolved with trained convolution kernels, and sentence alignment of the two texts to be aligned is performed according to the convolution result. This not only reduces manual involvement and achieves automatic sentence alignment, but also improves alignment accuracy, which helps improve the efficiency of sentence alignment between texts.
Each trained convolution kernel in the embodiment of the present invention can be obtained by convolutional neural network training. As shown in Fig. 2, the text similarity matrix B of two training texts whose sentence alignment result is known is used as the training-set input, and a target matrix is also provided; the target matrix (i.e. the model answer) is compared with the matrix returned by the neural network, so that the output of the network approaches the target matrix as closely as possible, thereby obtaining the required convolution kernel. The detailed process is as follows:
Step A1: obtain two training texts from the training set; for example, one training text is an English text (the original) and the other is a Chinese text (the translation), and the sentence alignment result of the two training texts is known;
Step A2: perform sentence segmentation on the two training texts respectively;
Sentence segmentation can be performed at the punctuation marks that divide sentences in the text. Taking Chinese-English bilingual alignment as an example, Chinese sentences end with "。" or "!" and English sentences end with ".", so the text is split wherever such punctuation appears. Two lists are obtained after segmentation: an English (original) sentence list containing n English sentences and a Chinese (translation) sentence list containing m Chinese sentences; each sentence in the English sentence list is an independent sentence of the original, and each sentence in the Chinese sentence list is an independent sentence of the translation. In addition, for ease of processing, the sentences in each sentence list can be numbered according to their order in the text (i.e. the positional order of the sentences in the text) and the numbers used as sentence indices; for example, in the English sentence list the sentence at the beginning of the English text is numbered 1, ..., and the sentence at the end is numbered n, and in the Chinese sentence list the sentence at the beginning of the Chinese text is numbered 1, ..., and the sentence at the end is numbered m;
Step A3: establish the text similarity matrix B of the two training texts, i.e. compare each of the m sentences in the Chinese list with each of the n sentences in the English list for similarity. The detailed process is as follows:
First, use a translation tool to translate the Chinese into the same language as the source (English), i.e. translate each sentence in the Chinese sentence list to obtain the English text corresponding to each sentence;
For the two sentences whose text similarity is to be computed (one Chinese sentence and one English sentence), compare the number of words in the English sentence with the number of words in the English text obtained by translating the Chinese sentence;
Then compute K = (sum over v of N_v) / E,
where E is the word count of whichever of the two compared word sequences has more words, and N_v is the value for the v-th word of the sequence with more words: if the sequence with fewer words contains a word whose root is identical to that of the v-th word, N_v is 1, otherwise 0;
It should be noted that if the two word counts in the comparison are equal, either one may be taken as the sequence with more words and the other as the sequence with fewer words;
That is, the words in the two sentences are matched exactly by their roots, and the text similarity between the two sentences is computed with the above formula: whenever the roots are identical, the match count is increased by 1; the sum of matches is used as the numerator, and the length of the sentence (i.e. the number of words in it) is used as the denominator; if the lengths differ, the word count of the longer sentence is taken as the denominator;
In this way m*n text similarities are obtained and represented as a matrix, namely the text similarity matrix B;
where the element K_ij of the text similarity matrix B is the text similarity between the i-th sentence in the English sentence list (i.e. the sentence numbered i) and the j-th sentence in the Chinese sentence list (i.e. the sentence numbered j);
For example, the text similarity matrix obtained after processing two training texts is shown in Fig. 3. It can be seen that the larger-valued elements of the matrix (i.e. the higher text similarities) cluster along the diagonal running from the upper-left corner to the lower-right corner, because the Chinese and English texts have the same sentence ordering;
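A minimal sketch of step A3 under the same assumptions: each sentence of the translation-side list is first translated and then scored against every source-side sentence with the root-matching text_similarity function sketched earlier; translate_to_english is a hypothetical placeholder for whatever translation tool is used.

```python
import numpy as np

def build_similarity_matrix(english_sentences, chinese_sentences, translate_to_english):
    """Build the text similarity matrix B: one row per English sentence (index i),
    one column per Chinese sentence (index j), B[i, j] = K(i-th English, j-th Chinese).
    translate_to_english is a hypothetical callable wrapping a translation tool."""
    translated = [translate_to_english(s) for s in chinese_sentences]
    n, m = len(english_sentences), len(chinese_sentences)
    B = np.zeros((n, m), dtype=np.float32)
    for i, en in enumerate(english_sentences):
        for j, zh_en in enumerate(translated):
            B[i, j] = text_similarity(en, zh_en)
    return B
```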
Step A4: initialize a convolution kernel, take the initialized convolution kernel as the current convolution kernel, and execute step A5;
Step A5: establish a target matrix J according to the sentence alignment result of the above two training texts;
where the element L_ij of the target matrix J corresponds to the i-th sentence in the above English sentence list and the j-th sentence in the above Chinese sentence list, and the value of the element is determined by the known sentence alignment result: if the i-th sentence in the English sentence list matches the j-th sentence in the Chinese sentence list, the value of L_ij is 1, otherwise 0;
For example, the target matrix J established from the above two training texts is shown in Fig. 4;
Step A6: convolve the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and compute the loss value loss using the established target matrix J; if the loss value meets a preset requirement (e.g. is smaller than a threshold), execute step A7, otherwise execute step A9;
Step A7: verify the current convolution kernel on a validation set, and judge whether the verification result meets a preset requirement; if so, execute step A8; if not, execute step A9;
where the validation set includes several validation text pairs, each validation text pair including an English text (the original) and a Chinese text (the translation);
The validation process is essentially the same as the training process and is not repeated here; when the loss value on the validation set is smaller than a certain threshold and the accuracy on the validation set is greater than a certain threshold, the verification result is judged to meet the preset requirement;
Step A8: take the current convolution kernel as a trained convolution kernel;
Step A9: adjust the weights of the current convolution kernel according to the loss value, and judge whether the current number of training iterations has reached a preset number; if so, execute step A8; if not, return to step A6.
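The training loop A4-A9 can be sketched as a tiny convolutional model. The PyTorch code below is an illustration only: it assumes B and the target matrix J have already been built as described above, and uses mean squared error as a stand-in for the loss value, whose exact form the patent does not fix in this text.

```python
import torch
import torch.nn as nn

def train_kernel(B, J, kernel_size=3, max_iters=500, loss_threshold=1e-3):
    """Train one convolution kernel so that conv(B) approaches the target matrix J.
    B, J: 2-D numpy arrays of identical shape; an odd kernel_size keeps the
    convolution output the same shape as B."""
    x = torch.tensor(B, dtype=torch.float32).unsqueeze(0).unsqueeze(0)       # (1, 1, n, m)
    target = torch.tensor(J, dtype=torch.float32).unsqueeze(0).unsqueeze(0)

    conv = nn.Conv2d(1, 1, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
    optimizer = torch.optim.SGD(conv.parameters(), lr=0.5)
    criterion = nn.MSELoss()                      # assumed stand-in for the loss value

    for step in range(max_iters):                 # step A6: convolve and compute loss
        P = torch.sigmoid(conv(x))
        loss = criterion(P, target)
        if loss.item() < loss_threshold:          # preset requirement met -> step A7/A8
            break
        optimizer.zero_grad()                     # step A9: adjust the kernel weights
        loss.backward()
        optimizer.step()
    return conv                                   # current kernel taken as trained (A8)
```

Training Z kernels that differ in size and weights, as in the preferred embodiment described next, then amounts to calling train_kernel Z times with different kernel_size values and random initializations.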
Preferably, in one embodiment, Z is an integer greater than or equal to 2, and the trained convolution kernels differ in size and weights; for example, Z may be 3, 5 or 6.
To obtain multiple trained convolution kernels, multiple convolution kernels can be initialized separately (the initialized kernels differ in size and weights), and each kernel is then used to convolve the text similarity matrix B of the above two training texts, yielding several matrices with changed values. Each resulting matrix is compared with the target matrix to obtain the loss value of the corresponding convolutional neural network. A larger loss value indicates that the network performs worse and needs a larger parameter adjustment, while a smaller loss value indicates that the network performs better and needs a smaller adjustment. The different loss values are therefore back-propagated to the corresponding convolutional neural networks, and each network adjusts its parameters layer by layer according to its own loss value, i.e. adjusts the weights of its convolution kernel; the weight adjustment of each kernel differs at each back-propagation, until the loss value meets the expected requirement.
It should be noted that after the trained convolution kernels have been obtained in the above manner, they can be stored in a memory and read directly from the memory when needed.
For example, in one embodiment, of the two texts to be aligned, one text to be aligned is an English text (the original) and the other text to be aligned is a Chinese text (the translation). The method of establishing the text similarity matrix U of the two texts to be aligned is the same as that used to establish the text similarity matrix B of the above two training texts (i.e. steps A1, A2 and A3) and is not repeated here.
In the above step S3, the text similarity matrix U of the two texts to be aligned is convolved with the trained convolution kernels, so that the text similarity matrix U is optimized and corrected, yielding the optimized text similarity matrices.
For example, in one embodiment, the above step S4 includes:
Step S41: compute a text matching degree matrix T from the Z optimized text similarity matrices, where the element Y_ij of the text matching degree matrix T is the text matching degree between the i-th sentence obtained by segmenting the one text to be aligned and the j-th sentence obtained by segmenting the other text to be aligned, and the value of each element of T is the average of the elements at the same position in the Z optimized text similarity matrices;
That is, the Z optimized text similarity matrices are added position by position and the elements at each position are averaged, giving the text matching degree matrix T;
It should be noted that if Z is 1, the optimized text similarity matrix can be used directly as the text matching degree matrix;
For example, referring to Fig. 5, the text similarity matrix U of two texts to be aligned is convolved with 3 trained convolution kernels to obtain 3 optimized text similarity matrices, from which the text matching degree matrix is then computed;
Step S42: traverse each row of the text matching degree matrix T in turn, select the element with the largest value in each row, and pair the two sentences corresponding to the selected element;
For example, for the text matching degree matrix obtained in Fig. 5, the element with the largest value is selected from each row and the two sentences corresponding to the selected element are paired, giving three pairing results: row 1 (i.e. the 1st sentence obtained by segmenting the one text to be aligned) is paired with column 1 (i.e. the 1st sentence obtained by segmenting the other text to be aligned), row 2 (i.e. the 2nd sentence obtained by segmenting the one text to be aligned) is paired with column 3 (i.e. the 3rd sentence obtained by segmenting the other text to be aligned), and row 3 (i.e. the 3rd sentence obtained by segmenting the one text to be aligned) is paired with column 3 (i.e. the 3rd sentence obtained by segmenting the other text to be aligned);
In this step, if a row of the text matching degree matrix T contains several elements with the largest value (i.e. several elements in the same row equal the maximum), the maximum value of that row is first determined and taken as the current lookup value; the elements at the same positions as these maximal elements are then looked up in the above Z optimized text similarity matrices, the position at which the current lookup value appears most often is determined, and the two sentences corresponding to that position are paired. For example, for the text matching degree matrix in Fig. 5, the elements of the first row are [0.7, 0.7, 0.3], where the elements at row 1 column 1 and row 1 column 2 both equal the maximum value 0.7; the elements at row 1 column 1 and row 1 column 2 of the 3 optimized text similarity matrices are then looked up, and since the first-row elements of the 3 optimized text similarity matrices are [0.7, 0.6, 0.3], [0.7, 0.6, 0.2] and [0.7, 0.9, 0.4] respectively, it can be seen that 0.7 appears most often at row 1 column 1; therefore row 1 (i.e. the 1st sentence obtained by segmenting the one text to be aligned) is paired with column 1 (i.e. the 1st sentence obtained by segmenting the other text to be aligned). Alternatively, if a row of the text matching degree matrix T contains several elements with the largest value, one of them may simply be chosen at random as the maximal element;
Through the above step S42, every sentence of the one text to be aligned can be paired, but one or more sentences of the other text to be aligned may remain unpaired. Preferably, in step S4, after the step S42 the method further includes:
Step S43: judge whether any of the b sentences obtained by segmenting the other text to be aligned remains unpaired; if so, look up the sentence with the largest text matching degree to it in the text matching degree matrix T and pair the found sentence with it, thereby catching the columns of the matrix that were missed;
For example, after the pairing of step S42, column 2 of the text matching degree matrix obtained in Fig. 5 still has no paired row (i.e. the 2nd sentence obtained by segmenting the other text to be aligned is unpaired), so the element with the largest value in column 2 of the text matching degree matrix T is looked up; the result is the element at row 1 column 2, so row 1 (i.e. the 1st sentence obtained by segmenting the one text to be aligned) is paired with column 2 (i.e. the 2nd sentence obtained by segmenting the other text to be aligned). Through the above steps S42-S43, the pairing result obtained from the text matching degree matrix in Fig. 5 is: row 1 with column 1, row 1 with column 2, row 2 with column 3, and row 3 with column 3;
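A minimal sketch of steps S41-S43 under the same assumptions: the Z optimized matrices are averaged into the text matching degree matrix T, each row is paired with its maximal column, and any still-unpaired column is paired with its best row. Ties are resolved here by taking the first maximum, a simplification of the frequency-based rule described above.

```python
import numpy as np

def align_sentences(optimized_matrices):
    """optimized_matrices: list of Z 2-D arrays of shape (a, b).
    Returns a sorted list of (row, column) sentence pairs."""
    T = np.mean(np.stack(optimized_matrices), axis=0)   # step S41: matching degree matrix
    a, b = T.shape

    pairs = []
    for i in range(a):                                  # step S42: row-wise maximum
        j = int(np.argmax(T[i]))                        # ties: first maximum (simplified)
        pairs.append((i, j))

    matched_cols = {j for _, j in pairs}
    for j in range(b):                                  # step S43: catch unmatched columns
        if j not in matched_cols:
            i = int(np.argmax(T[:, j]))
            pairs.append((i, j))
    return sorted(pairs)
```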
Preferably, in one embodiment, after the step S4 the method further includes:
Step S5: check the sentence alignment result according to the positional order, in the other text to be aligned, of the b sentences obtained by segmenting that text, and the positional order, in the one text to be aligned, of the a sentences obtained by segmenting that text;
For example, the step S5 can specifically include:
Step S51: sort the a sentences according to the positional order of the b sentences in the other text to be aligned and the sentence alignment result;
Step S52: if there are two sentences among the a sentences whose positional order obtained by the sorting is opposite to their positional order in the one text to be aligned, judge that an error exists. It should be noted that an opposite positional order here means: for two sentences of the one text to be aligned, if the positional order obtained by the sorting in step S51 places one sentence before the other, but in the one text to be aligned that sentence is located after the other, the positional order is determined to be opposite.
For example, suppose the one text to be aligned is an English text and the other text to be aligned is a Chinese text. After sentence alignment of the two Chinese-English texts, matching pairs of the form [Chinese 20, English 25] are usually obtained. To further improve the pairing accuracy, the matching result can be checked: first, the obtained matching pairs are sorted in ascending order of the numbers of the Chinese sentences (i.e. the positional order, in the Chinese text, of all Chinese sentences obtained by segmenting the Chinese text), which sorts all English sentences obtained by segmenting the English text accordingly; then, according to the sorting result, the variation of the numbers of the English sentences (i.e. the positional order, in the English text, of all English sentences obtained by segmenting the English text) is checked to judge whether it is monotonically increasing, where monotonically increasing means that, within a sorted sequence, a number in a later position is greater than a number in an earlier position. Matching pairs that do not satisfy the monotonic increase can be marked, so as to prompt the user about possible errors.
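A minimal sketch of the monotonicity check in steps S51-S52, assuming the alignment result is given as a list of (Chinese index, English index) pairs like the [Chinese 20, English 25] example above.

```python
def find_order_errors(pairs):
    """pairs: list of (chinese_index, english_index) tuples.
    Sort by the Chinese sentence index (step S51) and flag every pair whose
    English index breaks the monotonic increase (step S52)."""
    ordered = sorted(pairs, key=lambda p: p[0])
    suspicious = []
    prev_en = None
    for zh, en in ordered:
        if prev_en is not None and en < prev_en:
            suspicious.append((zh, en))   # English order reversed -> possible mis-pairing
        prev_en = en if prev_en is None else max(prev_en, en)
    return suspicious

# Example: the pair (zh=3, en=2) is flagged because the English indices run 1, 4, 2.
print(find_order_errors([(1, 1), (2, 4), (3, 2)]))
```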
The sentence-level bilingual alignment method provided by the embodiment of the present invention takes into account that, during sentence alignment, differences in text structure and in authors' writing habits lead to complicated and varied sentence pairing situations. By convolving the text similarity matrix of the two texts to be aligned with multiple trained convolution kernels, the text similarity matrix is optimized and corrected, so that the optimized matrix takes into account the temporal order (i.e. positional order) in which sentences appear in the text. This not only avoids the interference caused by identical sentences during sentence matching, but also avoids the interference caused by complicated and varied sentence pairing situations, ensuring the accuracy of sentence matching and greatly improving the robustness of the algorithm.
An embodiment of the present invention also provides a sentence-level bilingual alignment device, comprising:
an obtaining module, configured to obtain Z trained convolution kernels, where Z is an integer greater than or equal to 1, and each trained convolution kernel is obtained through steps S11 to S15;
Step S11: perform sentence segmentation on two training texts respectively, and establish the text similarity matrix B = (K_ij) of the two training texts;
where n is the number of sentences obtained by segmenting one of the two training texts, m is the number of sentences obtained by segmenting the other training text, and the element K_ij of the text similarity matrix B is the text similarity between the i-th sentence obtained by segmenting the one training text and the j-th sentence obtained by segmenting the other training text;
Step S12: initialize a convolution kernel;
Step S13: convolve the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and compute the loss value loss; if the loss value meets a preset requirement, execute step S14, otherwise execute step S16;
where the loss value is computed by comparing P with a target matrix whose element L_ij is 1 if the i-th sentence obtained by segmenting the one training text matches the j-th sentence obtained by segmenting the other training text, and 0 otherwise;
Step S14: verify the current convolution kernel on a validation set, and judge whether the verification result meets a preset requirement; if so, execute step S15; if not, execute step S16;
Step S15: take the current convolution kernel as a trained convolution kernel;
Step S16: adjust the weights of the current convolution kernel according to the loss value, and judge whether the current number of training iterations has reached a preset number; if so, execute step S15; if not, return to step S13;
a first processing module, configured to perform sentence segmentation on two texts to be aligned respectively, and to establish the text similarity matrix U = (K_ij) of the two texts to be aligned;
where a is the number of sentences obtained by segmenting one of the two texts to be aligned, b is the number of sentences obtained by segmenting the other text to be aligned, and the element K_ij of the text similarity matrix U is the text similarity between the i-th sentence obtained by segmenting the one text to be aligned and the j-th sentence obtained by segmenting the other text to be aligned;
a second processing module, configured to convolve the text similarity matrix U with each of the Z trained convolution kernels respectively to obtain Z optimized text similarity matrices;
a third processing module, configured to obtain the sentence alignment result of the two texts to be aligned from the Z optimized text similarity matrices.
In one embodiment, Z is an integer greater than or equal to 2, and the trained convolution kernels differ in size and weights.
In one embodiment, the third processing module includes:
a computing unit, configured to compute a text matching degree matrix T from the Z optimized text similarity matrices, where the element Y_ij of the text matching degree matrix T is the text matching degree between the i-th sentence obtained by segmenting the one text to be aligned and the j-th sentence obtained by segmenting the other text to be aligned, and the value of each element of T is the average of the elements at the same position in the Z optimized text similarity matrices;
a first pairing unit, configured to traverse each row of the text matching degree matrix T in turn, select the element with the largest value in each row, and pair the two sentences corresponding to the selected element.
In one embodiment, the third processing module further includes:
a second pairing unit, configured to judge whether any of the b sentences obtained by segmenting the other text to be aligned remains unpaired and, if so, to look up the sentence with the largest text matching degree to it in the text matching degree matrix T and pair the found sentence with it.
In one embodiment, the sentence-level bilingual alignment device further includes:
a result detection module, configured to check the sentence alignment result according to the positional order, in the other text to be aligned, of the b sentences obtained by segmenting that text, and the positional order, in the one text to be aligned, of the a sentences obtained by segmenting that text.
In one embodiment, the result detection module includes:
a sorting unit, configured to sort the a sentences according to the positional order of the b sentences in the other text to be aligned and the sentence alignment result;
a detection unit, configured to judge that an error exists if there are two sentences among the a sentences whose positional order obtained by the sorting is opposite to their positional order in the one text to be aligned.
In one embodiment, the two training texts and the two texts to be aligned each comprise one English text and one non-English text, and the text similarity K between each sentence obtained by segmenting the English text and each sentence obtained by segmenting the non-English text is computed as follows:
translate each sentence obtained by segmenting the non-English text to obtain the corresponding English text;
for the two sentences whose text similarity is to be computed, compare the number of words in the sentence obtained by segmenting the English text with the number of words in the English text obtained by translating the sentence segmented from the non-English text;
then compute K = (sum over v of N_v) / E,
where E is the word count of whichever of the two compared word sequences has more words, and N_v is the value for the v-th word of the sequence with more words: if the sequence with fewer words contains a word whose root is identical to that of the v-th word, N_v is 1, otherwise 0.
An embodiment of the present invention also provides a sentence-level bilingual alignment device, including a processor and a memory coupled with the processor, where the processor is configured to execute instructions in the memory to implement the above sentence-level bilingual alignment method.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above sentence-level bilingual alignment method.
Those skilled in the art will readily recognize that the above preferred embodiments can be freely combined and superimposed on one another provided they do not conflict.
It should be appreciated that the above embodiments are merely exemplary and not restrictive; without departing from the basic principles of the invention, those skilled in the art can make various obvious or equivalent modifications or replacements to the above details, all of which are included within the scope of the claims of the present invention.

Claims (10)

CN201811562126.2A | 2018-12-20 (priority) | 2018-12-20 (filing) | Sentence-level bilingual alignment method and device, computer readable storage medium | Active | CN109670178B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201811562126.2A | 2018-12-20 | 2018-12-20 | Sentence-level bilingual alignment method and device, computer readable storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201811562126.2A | 2018-12-20 | 2018-12-20 | Sentence-level bilingual alignment method and device, computer readable storage medium

Publications (2)

Publication Number | Publication Date
CN109670178A (en) | 2019-04-23
CN109670178B (en) | 2019-10-08

Family

ID=66144024

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201811562126.2A | Sentence-level bilingual alignment method and device, computer readable storage medium | 2018-12-20 | 2018-12-20 | Active | CN109670178B (en)

Country Status (1)

Country | Link
CN (1) | CN109670178B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111723587A * | 2020-06-23 | 2020-09-29 | 桂林电子科技大学 | A Chinese-Thai entity alignment method for cross-language knowledge graph
CN112906371A * | 2021-02-08 | 2021-06-04 | 北京有竹居网络技术有限公司 | Parallel corpus acquisition method, device, equipment and storage medium
CN113657421A * | 2021-06-17 | 2021-11-16 | 中国科学院自动化研究所 | Convolutional neural network compression method and device, image classification method and device
CN114564932A * | 2021-11-25 | 2022-05-31 | 阿里巴巴达摩院(杭州)科技有限公司 | Chapter alignment method, apparatus, computer device and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN105868187A * | 2016-03-25 | 2016-08-17 | 北京语言大学 | A multi-translation version parallel corpus establishing method
US20170004121A1 * | 2015-06-30 | 2017-01-05 | Facebook, Inc. | Machine-translation based corrections
CN108897740A * | 2018-05-07 | 2018-11-27 | 内蒙古工业大学 | Mongolian-Chinese machine translation method based on adversarial neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20170004121A1 * | 2015-06-30 | 2017-01-05 | Facebook, Inc. | Machine-translation based corrections
CN105868187A * | 2016-03-25 | 2016-08-17 | 北京语言大学 | A multi-translation version parallel corpus establishing method
CN108897740A * | 2018-05-07 | 2018-11-27 | 内蒙古工业大学 | Mongolian-Chinese machine translation method based on adversarial neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WU HONGLIN et al.: "A Sentence Alignment Model Based on Combined Clues and Kernel Extensional Matrix Matching Method", AASRI Procedia *
丁颖 et al.: "Research on sentence alignment based on word-pair modeling" (基于词对建模的句子对齐研究), Computer Engineering (计算机工程), online first *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111723587A * | 2020-06-23 | 2020-09-29 | 桂林电子科技大学 | A Chinese-Thai entity alignment method for cross-language knowledge graph
CN112906371A * | 2021-02-08 | 2021-06-04 | 北京有竹居网络技术有限公司 | Parallel corpus acquisition method, device, equipment and storage medium
CN112906371B * | 2021-02-08 | 2024-03-01 | 北京有竹居网络技术有限公司 | Parallel corpus acquisition method, device, equipment and storage medium
CN113657421A * | 2021-06-17 | 2021-11-16 | 中国科学院自动化研究所 | Convolutional neural network compression method and device, image classification method and device
CN113657421B * | 2021-06-17 | 2024-05-28 | 中国科学院自动化研究所 | Convolutional neural network compression method and device, and image classification method and device
CN114564932A * | 2021-11-25 | 2022-05-31 | 阿里巴巴达摩院(杭州)科技有限公司 | Chapter alignment method, apparatus, computer device and medium
CN114564932B * | 2021-11-25 | 2024-12-03 | 阿里巴巴达摩院(杭州)科技有限公司 | Chapter alignment method, device, computer equipment and medium

Also Published As

Publication number | Publication date
CN109670178B (en) | 2019-10-08


Legal Events

Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
CP02 | Change in the address of a patent holder
    Address after: 519031 office 1316, No. 1, lianao Road, Hengqin new area, Zhuhai, Guangdong
    Patentee after: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.
    Address before: 519031 room 417, building 20, creative Valley, Hengqin New District, Zhuhai City, Guangdong Province
    Patentee before: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.
PP01 | Preservation of patent right
    Effective date of registration: 2024-07-18
    Granted publication date: 2019-10-08
PD01 | Discharge of preservation of patent
    Date of cancellation: 2024-11-25
    Granted publication date: 2019-10-08
