An automatic document translation method
Technical field
The present invention relates to the field of machine translation, and in particular to an automatic document translation method.
Background art
Machine translation is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). Because it is markedly more efficient than human translation and helps users obtain information more quickly, it has significant practical value. According to a forecast by Research & Markets, the world's largest market research store, the global machine translation market will reach USD 195 million in 2023, with a compound annual growth rate above 6.0% from 2017 to 2023. Existing machine translation software, such as Translation Dog and TransGod, uses a machine translation (MT) engine to translate the full text.
With engines such as TransGod, the translation must be downloaded after it completes before it can be viewed. Moreover, current machine translation treats every document alike: it translates all the text in the document in full, without regard to the readability of the translation or how it is presented, that is, without considering whether the translation matches people's reading habits.
At a time when the machine translation market is growing rapidly, the accuracy and efficiency of machine translation are the primary drivers of a translation product's competitiveness. User experience is likewise an important point of comparison for translation software in market competition.
Summary of the invention
The present invention aims to provide users with a simpler mode of operation and a more intuitive presentation of translation results. The object of the invention is, in view of all or some of the above problems, to provide an automatic document translation method that translates flexibly and produces a translation suited to people's reading habits.
The technical solution adopted by the invention is as follows:
An automatic document translation method, comprising:
uploading a document and specifying a target language;
analyzing the document type and, based on the result, selecting a corresponding text extraction tool to extract the text in the document;
detecting the language of the document text from the extracted text;
analyzing and recording the layout information of the extracted text;
dividing the extracted text into a number of minimal-unit entries;
marking the non-translatable elements in the extracted text, the non-translatable elements being pre-marked characters;
selecting, according to the language of the document text, a corresponding translation mode to translate the extracted text, the result of the translation being that the marked non-translatable elements are retained and the remaining parts are translated into the target language;
reconstructing the format of the translation result using the recorded layout information.
Apart from where an order is evidently required, the steps of the above method need not be performed in any particular sequence.
Detecting the language of the original text and choosing a specific translation mode accordingly allows targeted translation and can improve accuracy. Choosing the text extraction tool according to the document type ensures that the extraction approach suits the document, so that its characters are extracted accurately. Locking the non-translatable elements in the document and outputting them as in the original means that characters unsuited to translation are left untranslated, which better matches people's reading habits. Reconstructing the translation against the layout of the original keeps translation and original comparable, which facilitates side-by-side reading and checking of the translation. Because translation engines segment sentences automatically, translating minimal unit by minimal unit rather than the text as a whole avoids the translation errors caused by an engine's segmentation mistakes, making the result more accurate.
Further, the above translation method also includes:
The original text consists of several parts; when the user selects bilingual parallel display, each part of the original is displayed alongside the corresponding part of the reconstructed translation.
Current machine translation technology does not support paragraph-by-paragraph bilingual comparison; users can only read the two side by side by opening the original and the translation separately. Displaying each part of the original together with its translation lets users read, check, and understand the original against the translation, and supports proofreading and the study of sentences and expressions. In addition, because the display is a selectable function, users can switch quickly between parallel and non-parallel display, which simplifies operation.
Further, displaying the translation against the original is specifically: according to the comparison display mode specified by the user, displaying the original and the translation together with either the original first or the translation first.
This setting accommodates the reading habits of different users and improves the reading experience.
Further, the original-first display process is: duplicate the original text, placing each duplicated paragraph immediately after the corresponding paragraph of the original, and then replace each duplicated paragraph with the corresponding translation. The translation-first display process is: duplicate the reconstructed translation, placing each duplicated paragraph immediately after the corresponding paragraph of the translation, and then replace each duplicated paragraph with the corresponding original text.
Because the original is translated entry by entry, the translation results correspond one-to-one with the entries of the original. When the original (or the translation) is duplicated, the number of entries and their contents are already fixed, so the replacement step simply replaces each minimal unit with its translation (or each translation with its original). Compared with inserting the translation directly after each part of the original (or the reverse), this is faster and, because of the strong correspondence, less prone to errors in format or content; the parts can also be processed simultaneously, so the comparison processing is more efficient. Furthermore, with duplicate-then-replace, the layout of the displayed result is essentially determined once the original or translation has been duplicated, and the replacement does not change it, so processing is more stable; by contrast, when translated paragraphs are inserted one by one, the final layout is known only after all insertions are complete.
Further, dividing the extracted text into a number of minimal-unit entries is: performing sentence segmentation on the extracted text to divide it into a number of single sentences.
Using the single sentence as the minimal unit allows units to be divided quickly and accurately, and also matches the way translation engines process text, preventing translation errors caused by an engine's segmentation mistakes.
Further, selecting a corresponding translation mode according to the language of the document text and translating the extracted text is:
calling the corpus corresponding to the language of the document text and translating the divided single sentences in order.
The corpus stores the correspondences between a large number of phrases and single sentences and their translations. Matching against the corpus completes the translation of a single sentence quickly and renders identical phrases and sentences consistently; and because the entries in the corpus are existing, accurate translations, the translation is more accurate.
Further, calling the corresponding corpus according to the language of the document text and translating each divided single sentence in order is: for each single sentence, removing the non-translatable elements, translating the remainder by calling the corpus, and then inserting the non-translatable elements at the corresponding positions in the translation; or: for each complete single sentence, translating it by calling the corpus and then replacing the translations of the non-translatable elements with the original non-translatable elements.
In this way, the output already has the non-translatable elements locked, so it can be used directly without further processing.
Further, detecting the language of the document text from the extracted text is: passing all or a fragment of the text extracted from the document to a third-party language detection application and receiving the language it returns.
Calling a third-party language detection application directly allows the language to be determined quickly. Sending only a fragment of the text reduces the computation required for language identification and improves detection efficiency.
Further, uploading the document is: uploading the document to a pre-built translation area by dragging it in or by specifying its path.
Dragging a document into the work area simplifies the steps a user takes to upload it and improves the user experience.
Further, the translation area is a floating window built on the PC desktop. This makes it easy for users to call up quickly and does not get in the user's way at other times.
In conclusion by adopting the above-described technical solution, the beneficial effects of the present invention are:
1, interpretation method of the invention selects optimal text based on the automatic identification to languages documents to be translated and typeExtraction and translation scheme, can effectively improve the accuracy of translation result.Meanwhile non-element of translating is locked (i.e. not automaticallyTranslated), translation result can be made more to meet the use habit of people, that improves translation can be readability.
2, the present invention carries out control paragraph by paragraph to original text and translation and shows for user's function that actively selection control is shown,The effect that control is read, verifies, learnt can be increased, meanwhile, control is shown and is switched fast, the operation stream of user is savedJourney.
3, in control displaying processing, by first unifying duplication, then the mode of unified replacement, on the one hand, it is based on minimumCorresponding relationship between unit and translation can fast implement the operation for showing result, on the other hand, in duplication original text (duplication translationWhen similarly), to show that document is in a relatively steady state, then when row replacement, document status will not be caused to arrangeInfluence in version improves the stability that control shows processing.
4, the present invention is based on corpus translates document sentence by sentence, and translation is quicker, and translation result is more quasi-, meanwhile,Corpus can be enriched again, provide corpus with the translation for other subsequent documents and support.
Brief description of the drawings
The invention will be described by way of example with reference to the accompanying drawings, in which:
Figs. 1 and 2 show two different embodiments of the automatic document translation method.
Detailed description of the embodiments
All features disclosed in this specification, and all steps of any method or process disclosed, may be combined in any way, except for mutually exclusive features and/or steps.
Any feature disclosed in this specification (including any appended claims and the abstract) may, unless specifically stated otherwise, be replaced by other equivalent or similarly purposed alternative features. That is, unless specifically stated otherwise, each feature is only one example of a series of equivalent or similar features.
Embodiment one
As shown in Figs. 1 and 2, this embodiment discloses an automatic document translation method that filters out non-translatable elements by itself so as to produce a translation that matches people's communicative habits. The method is as follows:
Upload the document and specify the target language.
Analyze the document type and, based on the result, select the corresponding text extraction tool to extract the text in the document.
Detect the language of the document text from the extracted text.
Analyze and record the layout information of the extracted text.
Divide the extracted text into a number of minimal-unit entries. A minimal unit may be one or more of: a character, a word, a phrase, or a single sentence.
Mark the non-translatable elements in the extracted text; the non-translatable elements are pre-marked characters. A non-translatable element is content that fits its context and common communicative habits better when left untranslated. For example, in an English questionnaire, labels such as Q1 and A1 would be rendered in Chinese as the equivalents of "Question 1" and "Answer 1"; such renderings do not match customary usage, so the invention filters these labels out and leaves them untranslated.
According to the language of the document text, select the corresponding translation mode and translate the extracted text. The result of the translation is that the non-translatable elements are retained and the remaining parts are translated into the target language.
Reconstruct the format of the translation result using the recorded layout information, so that the translation follows the typesetting format of the original.
The original text consists of several parts. When the user selects bilingual parallel display (that is, the original and the translation shown against each other), each part of the original is displayed alongside the corresponding part of the reconstructed translation.
Bilingual parallel display is specifically: according to the comparison display mode specified by the user, the original and the translation are displayed together with either the original first or the translation first.
In one embodiment, if the user selects the original-first mode, the original text is duplicated, each duplicated paragraph is placed immediately after the corresponding paragraph of the original, and each duplicated paragraph is then replaced with the corresponding translation. If the user selects the translation-first mode, the reconstructed translation is duplicated, each duplicated paragraph is placed immediately after the corresponding paragraph of the translation, and each duplicated paragraph is then replaced with the corresponding original text.
Embodiment two
Referring to Fig. 1 or Fig. 2, this embodiment discloses an automatic document translation method of the machine translation type, comprising the following steps:
A translation area is built in a region such as the desktop, a browser, or a window that invokes a plug-in, and the document to be translated is uploaded to the translation platform by dragging or by specifying its path; the target language is specified at the same time. On a PC, a floating window can be built on the desktop as the work area, and dragging a document directly onto the floating window completes the upload.
The type of the document is determined from its file suffix, and the corresponding text extraction tool is selected according to the type to extract the text in the document. For example, Office documents with suffixes such as .doc(x), .ppt(x), and .xls(x) are text-based files whose text can be extracted directly via Open XML. Documents such as scanned PDFs and pictures are image-based files whose text must be extracted with a suitable tool (such as an OCR tool). For text-based PDFs (for example, a Word document converted to PDF), the text is extracted with tools such as PDFBox.
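The suffix-based dispatch described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `choose_extractor`, the route labels, the image suffix list, and the `pdf_has_text_layer` flag are all assumptions introduced here, and deciding whether a PDF is scanned is left to the caller.

```python
from pathlib import Path

# Suffix groups follow the examples in the text; the image list is an assumption.
OFFICE_SUFFIXES = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def choose_extractor(filename: str, pdf_has_text_layer: bool = True) -> str:
    """Return a label for the extraction route (hypothetical labels)."""
    suffix = Path(filename).suffix.lower()
    if suffix in OFFICE_SUFFIXES:
        return "open_xml"   # text-based Office documents
    if suffix in IMAGE_SUFFIXES:
        return "ocr"        # pictures need character recognition
    if suffix == ".pdf":
        # A scanned PDF has no text layer and is treated like a picture.
        return "pdf_text" if pdf_has_text_layer else "ocr"
    raise ValueError(f"unsupported document type: {suffix or filename}")
```

Returning a label rather than calling a tool directly keeps the type analysis separate from the extraction itself, so either can be changed independently.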
The extracted text is input to a language detection tool such as Xinyi ("New Translation"), the language it returns is received, and that result is taken as the language of the document to be translated. Such language detection tools are existing applications, so the language of the document can be determined by calling them directly. Note that the text passed to the third-party tool may be the whole text extracted from the document or only a fragment. As for the applications called, if no single third-party application can detect every language, several applications can be called to achieve detection across all languages (including distinguishing Traditional from Simplified Chinese).
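A minimal Python sketch of this step, assuming each third-party detector is wrapped as a callable that returns a language code or `None`. The name `detect_language`, the `sample_chars` parameter, and the two toy detectors are hypothetical stand-ins for illustration only; they are not real detection services and do not distinguish Traditional from Simplified Chinese.

```python
from typing import Callable, Optional, Sequence

# A detector takes a text sample and returns a language code, or None if unsure.
Detector = Callable[[str], Optional[str]]


def detect_language(text: str, detectors: Sequence[Detector],
                    sample_chars: int = 500) -> Optional[str]:
    """Send a fragment of the text to each detector in turn."""
    sample = text[:sample_chars]  # a fragment keeps the request small
    for detect in detectors:
        language = detect(sample)
        if language is not None:
            return language
    return None  # no detector recognised the language


def toy_cjk_detector(sample: str) -> Optional[str]:
    """Stand-in detector: reports 'zh' if any CJK ideograph is present."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in sample) else None


def toy_latin_detector(sample: str) -> Optional[str]:
    """Stand-in detector: reports 'en' if every letter is ASCII."""
    letters = [ch for ch in sample if ch.isalpha()]
    return "en" if letters and all(ch.isascii() for ch in letters) else None
```

Trying the detectors in order mirrors the idea of calling several applications when no single one covers every language.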
A file parsing algorithm is used to analyze the layout information of the extracted text within the document, such as its position, style, and format, and this layout information is recorded.
All of the extracted text is divided into single sentences and paragraphs. This step can be carried out by detecting commas, full stops, and carriage-return marks in the text.
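The division can be sketched with regular expressions as below. This is a deliberately naive illustration under stated assumptions: paragraphs are separated by line breaks, sentences end at full stops, question marks, or exclamation marks (ASCII or full-width), and commas are not used as split points, although the text allows them. It will also split at decimal points and abbreviations.

```python
import re

# One sentence = a run of non-terminator characters plus its closing marks.
SENTENCE = re.compile(r"[^.!?。!?]+[.!?。!?]*")


def split_paragraphs(text: str) -> list[str]:
    """Split on line breaks (carriage returns) and drop empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Split one paragraph into single sentences, keeping end punctuation."""
    return [s.strip() for s in SENTENCE.findall(paragraph) if s.strip()]
```

Keeping the closing punctuation attached to each sentence means the translated sentences can be joined again without re-inserting marks.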
Based on a pre-built non-translatable element library, the non-translatable elements in each single sentence are marked. The library consists of characters selected from a large number of historical documents by experience or by machine learning, such as "Q:", "F:", "A:", logos, and company names. Existing machine translation applications translate such characters (taking English-to-Chinese translation as an example, they become the Chinese for "Question:", "Answer:", and so on), but the result makes it harder to locate a question or answer quickly and does not match how people read such documents. In the present invention, therefore, these characters are not translated and the original text is used directly.
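Marking against a pre-built library can be sketched as a simple pattern search. The library contents below are only the three labels quoted in the text; the function name `mark_non_translatable` and the choice to return character spans are assumptions made for illustration, and a real library built by experience or machine learning would be far larger.

```python
import re
from typing import Sequence

# Hypothetical library holding only the labels quoted in the text.
ELEMENT_LIBRARY = ("Q:", "A:", "F:")


def mark_non_translatable(sentence: str,
                          library: Sequence[str] = ELEMENT_LIBRARY
                          ) -> list[tuple[int, int, str]]:
    """Return (start, end, element) for each library element in the sentence."""
    if not library:
        return []
    # Longest entries first, so a longer element wins over its own prefix.
    alternation = "|".join(
        re.escape(e) for e in sorted(library, key=len, reverse=True))
    # The lookbehind avoids matching inside a longer word such as "FAQ:".
    pattern = re.compile(rf"(?<!\w)(?:{alternation})")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sentence)]
```

Recording spans rather than editing the sentence leaves the original text untouched, so a later step can decide how to keep the marked characters out of translation.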
For each single sentence obtained by the division, the corpus corresponding to the original language is called and the sentences are translated in order. The corpus is the set of correspondences between originals and translations recorded in past translations; by matching the original against the corpus and retrieving the corresponding translation directly, a sentence can be translated. Compared with machine translation services such as Baidu Translate or Google Translate, this yields more accurate and fluent results. When no corresponding corpus entry is matched, a machine translation application is called to supply the translation. Note that translation tasks are monitored in real time against the corpus: when a phrase not contained in the corpus is detected, the phrase and its translation are stored automatically, so the corpus is continuously enriched. The document is split into single sentences for two reasons: 1. A translation engine cannot translate a whole passage at once and automatically breaks the passage into sentences; in doing so it may place a split point in the middle of a sentence and translate parts of two sentences as one translation unit, which produces an erroneous result. Dividing the text into sentences in advance avoids this. 2. Corpus matching uses the single sentence or a smaller unit as its translation unit, so dividing the text into sentences in advance also meets the needs of translation.
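The corpus-first lookup with a machine translation fallback can be sketched as follows. The sketch assumes the corpus for one language pair is an in-memory dictionary with exact-match lookup and that the machine translation application is passed in as a callable; both are simplifications introduced here, as is the function name `translate_sentences`.

```python
from typing import Callable, Sequence


def translate_sentences(sentences: Sequence[str],
                        corpus: dict[str, str],
                        machine_translate: Callable[[str], str]) -> list[str]:
    """Translate sentences in order, preferring existing corpus entries."""
    translations = []
    for sentence in sentences:
        translation = corpus.get(sentence)  # exact match against the corpus
        if translation is None:
            # No corpus entry: fall back to the machine translation callable
            # and store the new pair so that the corpus keeps growing.
            translation = machine_translate(sentence)
            corpus[sentence] = translation
        translations.append(translation)
    return translations
```

Because new pairs are written back to the corpus, a sentence that recurs later in the same document is served from the corpus instead of triggering another machine translation call.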
In one embodiment, each single sentence is translated as follows: the non-translatable elements are removed, the remainder is translated by calling the corpus or a machine translation application, and the non-translatable elements are then inserted at the corresponding positions in the translation. That is, the non-translatable elements are taken out and not translated; after the other parts have been translated, the elements taken out are filled back into the positions in the translation that correspond to their positions in the original. In another embodiment, each complete single sentence is translated by calling the corpus or a machine translation application, and the translations of the non-translatable elements are then replaced with the original elements. That is, the full text is translated first, and the non-translatable elements that were translated are then restored to their original text.
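Both variants can be sketched in Python. The first splits the sentence around the elements and translates only the text between them; the second translates the whole sentence and then swaps the rendered elements back. The function names and the dictionary-based `toy_translate` are hypothetical stand-ins for a corpus or machine translation call. The second variant also assumes the engine renders an element the same way in isolation as in context, which a real engine does not guarantee.

```python
import re
from typing import Callable, Sequence

Translate = Callable[[str], str]


def translate_around_elements(sentence: str, elements: Sequence[str],
                              translate: Translate) -> str:
    """First variant: take the elements out, translate the rest, put them back."""
    if not elements:
        return translate(sentence)
    alternation = "|".join(
        re.escape(e) for e in sorted(elements, key=len, reverse=True))
    # The capturing group makes re.split keep the elements at odd indices.
    parts = re.split(f"({alternation})", sentence)
    for i in range(0, len(parts), 2):  # even indices hold translatable text
        core = parts[i].strip()
        if core:
            # Translate the text but keep its surrounding whitespace.
            parts[i] = parts[i].replace(core, translate(core), 1)
    return "".join(parts)


def translate_then_restore(sentence: str, elements: Sequence[str],
                           translate: Translate) -> str:
    """Second variant: translate everything, then restore the elements."""
    result = translate(sentence)
    for element in elements:
        rendered = translate(element)  # how the engine renders the element
        if rendered and rendered != element:
            result = result.replace(rendered, element)
    return result


# Toy stand-in for a corpus or machine translation call.
TOY_TABLE = {"Q:": "问题:", "What is your name?": "你叫什么名字?"}


def toy_translate(text: str) -> str:
    for source, target in TOY_TABLE.items():
        text = text.replace(source, target)
    return text
```

The first variant never shows the elements to the translator, so it does not depend on how the engine would have rendered them; the second is simpler to wire up but inherits the assumption noted above.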
Using the layout information recorded earlier for the original (that is, the text of the document), the translation is typeset and its format adjusted so that its layout corresponds to that of the original. The format can be reconstructed with an existing document reconstruction algorithm. For example, building on the first single-sentence translation embodiment above, the non-translatable elements are first placed at their recorded positions in the original (such as their coordinates within the original), and then, according to the recorded format of the original, the translation of everything other than the non-translatable elements is filled into the corresponding positions in the recorded format and style. This completes the translation of the document.
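One way to picture the recorded layout information and its reuse is a list of immutable records, one per text unit, whose text is swapped for the translation while every layout field is kept. The `TextRun` fields (page, coordinates, font, size) and the function name `rebuild` are assumptions chosen for illustration; the source does not specify what the record contains beyond position, style, and format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TextRun:
    """Hypothetical layout record for one unit of extracted text."""
    page: int
    x: float
    y: float
    font: str
    size: float
    text: str


def rebuild(runs: list[TextRun], translations: list[str]) -> list[TextRun]:
    """Fill each translation into the position and style of its original."""
    if len(runs) != len(translations):
        raise ValueError("each recorded run needs exactly one translation")
    # dataclasses.replace copies every layout field and swaps only the text.
    return [replace(run, text=t) for run, t in zip(runs, translations)]
```

Because the records are frozen, the original layout is never modified, so the same records can later serve a side-by-side view of original and translation.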
To make it easier to view the original and the translation together, the user can choose to have the result displayed with the original and translation side by side. The translation is usually displayed above, below, to the left of, or to the right of each paragraph of the original. Taking display of the translation below the original as an example (the other arrangements are similar), the original is first duplicated below each of its paragraphs, giving the sequence paragraph A, paragraph A, paragraph B, paragraph B, and so on; each duplicated paragraph (the later of each repeated pair) is then replaced with the corresponding reconstructed translation. In this embodiment, the original and the translation are displayed in a newly constructed document. If the user does not select comparison display, the reconstructed translation is output directly.
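The duplicate-then-replace procedure can be sketched on plain lists of paragraphs. The function name `bilingual_layout` and the `original_first` flag are hypothetical; a real implementation would operate on document paragraphs with their formatting rather than on strings.

```python
def bilingual_layout(original: list[str], translated: list[str],
                     original_first: bool = True) -> list[str]:
    """Build the side-by-side paragraph sequence by duplicate-then-replace."""
    if len(original) != len(translated):
        raise ValueError("original and translation must align paragraph by paragraph")
    base, other = (original, translated) if original_first else (translated, original)
    # Step 1: duplicate each paragraph directly after itself: A, A, B, B, ...
    doubled = [copy for paragraph in base for copy in (paragraph, paragraph)]
    # Step 2: replace the later copy of each pair with its counterpart.
    for i, counterpart in enumerate(other):
        doubled[2 * i + 1] = counterpart
    return doubled
```

After step 1 the length and order of the output are already fixed, which reflects the stability argument made for this approach: step 2 only overwrites existing slots and never shifts anything.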
The invention is not limited to the specific embodiments described above. It extends to any new feature, or any new combination of features, disclosed in this specification, and to the steps of any new method or process, or any new combination of steps, so disclosed.