CN102654867B

Info

Publication number: CN102654867B
Application number: CN2011100498831A
Authority: CN
Inventors: 吴华; 王海峰
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2011-03-02
Filing date: 2011-03-02
Publication date: 2013-12-11
Anticipated expiration: 2031-03-02
Also published as: WO2012116561A1; CN102654867A

Abstract

Translated from Chinese

本发明提供了一种跨语言搜索中的网页排序方法及系统。该跨语言搜索中的网页排序方法包括：获取第一语言搜索请求；将第一语言搜索请求翻译成第二语言搜索请求；利用第二语言搜索请求搜索多个第二语言网页；将多个第二语言网页翻译成多个第一语言网页；根据多个第二语言网页的翻译置信度对多个第一语言网页进行排序。通过上述方式，根据翻译置信度对翻译后的搜索结果进行排序，进而提高了用户体验。

The invention provides a method and system for ranking web pages in cross-language search. The method includes: obtaining a first-language search request; translating the first-language search request into a second-language search request; searching for multiple second-language web pages using the second-language search request; translating the multiple second-language web pages into multiple first-language web pages; and ranking the multiple first-language web pages according to the translation confidence of the multiple second-language web pages. Because the translated search results are ranked according to translation confidence, the user experience is improved.

Description

Method and system for ranking web pages in cross-language search

Technical field

The present invention relates to the Internet field, and in particular to a method and system for ranking web pages in cross-language search.

Background

With the development of web search technology, cross-language search has emerged to help users overcome language barriers. In a cross-language search (for example, searching English web pages with a Chinese query), the user first enters a Chinese search request, the Chinese search request is translated into an English search request, and the English search request is used to search for English web pages. The content of the retrieved English web pages is then translated into Chinese and presented to the reader. The search results generally need to be ranked when they are presented. Existing cross-language search techniques rank mainly by the relevance between the English search request and the English web pages. However, because cross-language search involves a translation step, results with poor translation quality may be ranked near the top, which degrades the user experience.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method and system for ranking web pages in cross-language search, so as to improve the user experience.

To solve this technical problem, the present invention provides a method for ranking web pages in cross-language search, comprising: a. obtaining a first-language search request; b. translating the first-language search request into a second-language search request; c. searching for a plurality of second-language web pages using the second-language search request; d. translating the plurality of second-language web pages into a plurality of first-language web pages; e. ranking the plurality of first-language web pages according to the translation confidence of the plurality of second-language web pages.

According to one preferred embodiment of the present invention, in step e, a first-language web page with higher translation confidence is placed nearer the front of the ranked plurality of first-language web pages.

According to one preferred embodiment of the present invention, step e comprises: e1. obtaining the source-language corpus of the bilingual corpus used when translating the second-language web pages; e2. generating a language model from the source-language corpus; e3. calculating the translation perplexity of each second-language web page using the language model; e4. ranking the plurality of first-language web pages according to the translation perplexity.

According to one preferred embodiment of the present invention, in step e4, the translation perplexity is calculated by the following formula:

P = 2^{-\sum_{i=1}^{I} p(x_i) \log p(x_i)}

where P is the translation perplexity, x_i is the i-th sentence in the second-language web page, 1 ≤ i ≤ I, I is the number of sentences in the second-language web page, and p(x_i) is the probability of occurrence of x_i calculated by the language model.

According to one preferred embodiment of the present invention, the language model is an n-gram language model.

According to one preferred embodiment of the present invention, step e comprises: e1. counting the number of reorderings of each second-language web page during translation; e2. ranking the plurality of first-language web pages according to the number of reorderings.

According to one preferred embodiment of the present invention, step e comprises: e1. obtaining the source-language corpus of the bilingual corpus used when translating the second-language web pages; e2. clustering the source-language corpus into a plurality of documents; e3. calculating the maximum similarity between each second-language web page and the plurality of documents; e4. ranking the plurality of first-language web pages according to the maximum similarity.

According to one preferred embodiment of the present invention, in step e2, a plurality of topics are obtained from the plurality of documents, and the probability that each document belongs to each topic is calculated to form a plurality of first vectors; in step e3, the probability that the second-language web page belongs to each topic is calculated to form a second vector, the similarity between each of the first vectors and the second vector is calculated, and the largest of these similarities is selected as the maximum similarity.

According to one preferred embodiment of the present invention, in step e3, the maximum similarity is calculated according to the following formula:

H = \max_{m=1}^{M} \frac{\sum_{n=1}^{N} p(t_n | d_s) \times p(t_n | d_m)}{\sqrt{\sum_{n=1}^{N} (p(t_n | d_s))^2} \sqrt{\sum_{n=1}^{N} (p(t_n | d_m))^2}}

where H is the maximum similarity, t_n is the n-th topic, 1 ≤ n ≤ N, N is the number of topics, d_m is the m-th document, 1 ≤ m ≤ M, M is the number of documents, p(t_n | d_m) is the probability that d_m belongs to t_n, d_s is the second-language web page, and p(t_n | d_s) is the probability that d_s belongs to t_n.

According to one preferred embodiment of the present invention, step e comprises: e1. counting the number of out-of-vocabulary words (unregistered words) that each second-language web page contains during translation; e2. ranking the plurality of first-language web pages according to the number of out-of-vocabulary words.

According to one preferred embodiment of the present invention, step e comprises: e1. calculating the average translation score of each second-language web page during translation; e2. ranking the plurality of first-language web pages according to the average translation score.

According to one preferred embodiment of the present invention, in step e1, the average translation score is calculated according to the following formula:

A = \frac{1}{K} \sum_{k=1}^{K} \mathrm{score}_k

where A is the average translation score, score_k is the translation score of the k-th sentence in the second-language web page, 1 ≤ k ≤ K, and K is the number of sentences in the second-language web page.

According to one preferred embodiment of the present invention, step e comprises: e1. counting the number of times translation rules are used for each second-language web page during translation; e2. ranking the plurality of first-language web pages according to the rule usage count.

To solve this technical problem, the present invention also provides a system for ranking web pages in cross-language search, comprising: a search request acquiring unit, configured to obtain a first-language search request; a first translation unit, configured to translate the first-language search request into a second-language search request; a search unit, configured to search for a plurality of second-language web pages using the second-language search request; a second translation unit, configured to translate the plurality of second-language web pages into a plurality of first-language web pages; and a ranking unit, configured to rank the plurality of first-language web pages according to the translation confidence of the plurality of second-language web pages.

According to one preferred embodiment of the present invention, among the plurality of first-language web pages ranked by the ranking unit, a first-language web page with higher translation confidence is placed nearer the front.

According to one preferred embodiment of the present invention, the ranking unit comprises: a source-language corpus acquisition module, configured to obtain the source-language corpus of the bilingual corpus used when translating the second-language web pages; a language model generation module, configured to generate a language model from the source-language corpus; a perplexity calculation module, configured to calculate the translation perplexity of each second-language web page using the language model; and a ranking module, configured to rank the plurality of first-language web pages according to the translation perplexity.

According to one preferred embodiment of the present invention, the perplexity calculation module calculates the translation perplexity by the following formula:

P = 2^{-\sum_{i=1}^{I} p(x_i) \log p(x_i)}

According to one preferred embodiment of the present invention, the ranking unit comprises: a reordering count module, configured to count the number of reorderings of each second-language web page during translation; and a ranking module, configured to rank the plurality of first-language web pages according to the number of reorderings.

According to one preferred embodiment of the present invention, the ranking unit comprises: a source-language corpus acquisition module, configured to obtain the source-language corpus of the bilingual corpus used when translating the second-language web pages; a clustering module, configured to cluster the source-language corpus into a plurality of documents; a similarity calculation module, configured to calculate the maximum similarity between each second-language web page and the plurality of documents; and a ranking module, configured to rank the plurality of first-language web pages according to the maximum similarity.

According to one preferred embodiment of the present invention, the clustering module obtains a plurality of topics from the plurality of documents and calculates the probability that each document belongs to each topic to form a plurality of first vectors; the similarity calculation module calculates the probability that the second-language web page belongs to each topic to form a second vector, calculates the similarity between each of the first vectors and the second vector, and selects the largest of these similarities as the maximum similarity.

According to one preferred embodiment of the present invention, the similarity calculation module calculates the maximum similarity according to the following formula:

H = \max_{m=1}^{M} \frac{\sum_{n=1}^{N} p(t_n | d_s) \times p(t_n | d_m)}{\sqrt{\sum_{n=1}^{N} (p(t_n | d_s))^2} \sqrt{\sum_{n=1}^{N} (p(t_n | d_m))^2}}

According to one preferred embodiment of the present invention, the ranking unit comprises: an out-of-vocabulary word count module, configured to count the number of out-of-vocabulary words (unregistered words) that each second-language web page contains during translation; and a ranking module, configured to rank the plurality of first-language web pages according to the number of out-of-vocabulary words.

According to one preferred embodiment of the present invention, the ranking unit comprises: a translation score calculation module, configured to calculate the average translation score of each second-language web page during translation; and a ranking module, configured to rank the plurality of first-language web pages according to the average translation score.

According to one preferred embodiment of the present invention, the translation score calculation module calculates the average translation score according to the following formula:

A = \frac{1}{K} \sum_{k=1}^{K} \mathrm{score}_k

According to one preferred embodiment of the present invention, the ranking unit comprises: a rule usage count module, configured to count the number of times translation rules are used for each second-language web page during translation; and a ranking module, configured to rank the plurality of first-language web pages according to the rule usage count.

As can be seen from the above technical solutions, the method and system for ranking web pages in cross-language search provided by the present invention rank the translated search results according to translation confidence, thereby improving the user experience.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method for ranking web pages in cross-language search according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a first embodiment of the ranking step of the method shown in Fig. 1;

Fig. 3 is a schematic flowchart of a second embodiment of the ranking step of the method shown in Fig. 1;

Fig. 4 is a schematic flowchart of a third embodiment of the ranking step of the method shown in Fig. 1;

Fig. 5 is a schematic flowchart of a fourth embodiment of the ranking step of the method shown in Fig. 1;

Fig. 6 is a schematic flowchart of a fifth embodiment of the ranking step of the method shown in Fig. 1;

Fig. 7 is a schematic flowchart of a sixth embodiment of the ranking step of the method shown in Fig. 1;

Fig. 8 is a schematic block diagram of the system for ranking web pages in cross-language search according to an embodiment of the present invention;

Fig. 9 is a schematic block diagram of a first embodiment of the ranking unit of the system shown in Fig. 8;

Fig. 10 is a schematic block diagram of a second embodiment of the ranking unit of the system shown in Fig. 8;

Fig. 11 is a schematic block diagram of a third embodiment of the ranking unit of the system shown in Fig. 8;

Fig. 12 is a schematic block diagram of a fourth embodiment of the ranking unit of the system shown in Fig. 8;

Fig. 13 is a schematic block diagram of a fifth embodiment of the ranking unit of the system shown in Fig. 8;

Fig. 14 is a schematic block diagram of a sixth embodiment of the ranking unit of the system shown in Fig. 8.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and embodiments.

Referring to Fig. 1, Fig. 1 is a schematic flowchart of the method for ranking web pages in cross-language search according to an embodiment of the present invention. In this embodiment, the method mainly comprises the following steps:

In step S101, a first-language search request is obtained. In this step, the user may enter the first-language search request (query) to be searched, for example a Chinese search request, in the search box of a browser and click the search button. The first-language search request is transmitted over the Internet to the search engine, which obtains it.

In step S102, the first-language search request is translated into a second-language search request. In this step, the first-language search request may be translated by any machine translation method known in the art; for example, when searching English web pages in Chinese, the Chinese search request is translated into an English search request. Specific machine translation methods may include word-based, phrase-based, or syntax-based statistical machine translation.

In step S103, a plurality of second-language web pages are searched for using the second-language search request. In this step, the search engine retrieves a plurality of second-language web pages, for example English web pages, that are relevant to the second-language search request.

In step S104, the plurality of second-language web pages are translated into a plurality of first-language web pages. In this step, the content of each second-language web page may be translated into the first language, for example Chinese, by the machine translation methods mentioned above, thereby achieving cross-language search.

In step S105, the plurality of first-language web pages are ranked according to the translation confidence of the plurality of second-language web pages. In this step, a first-language web page with higher translation confidence is placed nearer the front of the ranked results, so that results with good translation quality are offered to the user first and the user experience is improved. Several embodiments for obtaining the translation confidence of a second-language web page are described in detail below; those skilled in the art can also apply other translation confidence estimation methods known in the art to step S105.
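Steps S101 to S105 can be sketched as a small pipeline. This is only an illustrative sketch: the function name `rank_cross_language_results` and the three callables passed to it (`translate_query`, `search`, `translate_page`) are hypothetical placeholders for whatever translation and search components are actually used, and `translate_page` is assumed to return the translated text together with a confidence value.

```python
from typing import Callable, List, Tuple


def rank_cross_language_results(
    query: str,
    translate_query: Callable[[str], str],
    search: Callable[[str], List[str]],
    translate_page: Callable[[str], Tuple[str, float]],
) -> List[str]:
    """Return translated pages ordered by descending translation confidence.

    All three callables are hypothetical stand-ins, not part of the source.
    """
    # S101: `query` is the first-language search request.
    # S102: translate the request into the second language.
    second_language_query = translate_query(query)
    # S103: retrieve second-language pages for the translated request.
    second_language_pages = search(second_language_query)
    # S104: translate each page; each result is (translated_text, confidence).
    translated = [translate_page(page) for page in second_language_pages]
    # S105: higher confidence first. sorted() is stable, so pages with equal
    # confidence keep the order in which the search step returned them.
    ranked = sorted(translated, key=lambda pair: pair[1], reverse=True)
    return [text for text, _confidence in ranked]
```

Keeping the confidence computation behind `translate_page` means any of the embodiments described below could supply the confidence value without changing the pipeline itself.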

Referring to Fig. 2, Fig. 2 is a schematic flowchart of a first embodiment of the ranking step of the method shown in Fig. 1. This embodiment mainly comprises the following steps:

In step S201, the source-language corpus of the bilingual corpus used when translating the second-language web pages is obtained. In machine translation, a bilingual corpus is generally used to train the translation model. The bilingual corpus contains a plurality of bilingual sentence pairs, each consisting of a source-language example sentence and the corresponding target-language example sentence. When translating the second-language web pages, the source language is the second language and the target language is the first language. Bilingual corpora are commonly used in the machine translation field and can be obtained in various ways, which are not described further here.

In step S202, a language model, for example an n-gram language model, is generated from the source-language corpus.

In step S203, the translation perplexity of the second-language web page is calculated using the language model. Specifically, for a sentence x_i consisting of L words w_1, w_2, ..., w_L, the probability of occurrence of the sentence can be calculated by the language model:

p(x_i) = p(w_1, w_2, \ldots, w_L) = \prod_{l=1}^{L} p(w_l | w_{l-n}, \ldots, w_{l-1})

where p(w_l | w_{l-n}, ..., w_{l-1}) denotes the probability that word w_l occurs together with the n preceding words w_{l-n}, ..., w_{l-1}, and n is a positive integer. For example, n = 2 in a 2-gram language model and n = 3 in a 3-gram language model.

For a second-language web page containing I sentences, the translation perplexity of the page can be calculated by the following formula:

P = 2^{-\sum_{i=1}^{I} p(x_i) \log p(x_i)}

where P is the translation perplexity of the second-language web page, x_i is the i-th sentence in the page, 1 ≤ i ≤ I, I is the number of sentences in the page, and p(x_i) is the probability of occurrence of sentence x_i calculated by the language model above. The higher the translation perplexity, the more complex the translation and the lower the translation confidence.

In step S204, the plurality of first-language web pages are ranked according to the translation perplexity. A first-language web page with higher translation perplexity is placed further back in the ranked results.
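Steps S201 to S204 can be sketched as follows. This is a minimal illustration under several assumptions that the text does not specify: the language model conditions each word on one preceding word, add-one smoothing is used so that unseen word pairs keep a non-zero probability, the logarithm is taken to base 2, and pages arrive already split into tokenized sentences. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Sentence = List[str]
Model = Tuple[Counter, Counter, int]


def train_bigram_model(source_corpus: List[Sentence]) -> Model:
    """S201-S202: count contexts and word pairs in the source-language corpus."""
    context_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    vocabulary = set()
    for sentence in source_corpus:
        tokens = ["<s>"] + sentence  # "<s>" marks the sentence start
        vocabulary.update(tokens)
        for previous, word in zip(tokens, tokens[1:]):
            context_counts[previous] += 1
            pair_counts[(previous, word)] += 1
    # One extra vocabulary slot reserves probability mass for unseen words.
    return context_counts, pair_counts, len(vocabulary) + 1


def sentence_probability(sentence: Sentence, model: Model) -> float:
    """p(x_i): product of add-one smoothed conditional word probabilities."""
    context_counts, pair_counts, vocabulary_size = model
    tokens = ["<s>"] + sentence
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        probability *= (pair_counts[(previous, word)] + 1) / (
            context_counts[previous] + vocabulary_size
        )
    return probability


def translation_perplexity(page: List[Sentence], model: Model) -> float:
    """S203: P = 2^(-sum_i p(x_i) * log2 p(x_i)), following the formula literally."""
    exponent = 0.0
    for sentence in page:
        p = sentence_probability(sentence, model)
        exponent -= p * math.log2(p)  # p > 0 thanks to smoothing
    return 2.0 ** exponent


def rank_by_perplexity(pages: Dict[str, List[Sentence]], model: Model) -> List[str]:
    """S204: pages with lower perplexity come first."""
    return sorted(pages, key=lambda name: translation_perplexity(pages[name], model))
```

Because the exponent weights each log-probability by p(x_i) itself, very improbable sentences contribute little to P; a sketch that follows the formula literally inherits that behavior.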

Referring to Fig. 3, Fig. 3 is a schematic flowchart of a second embodiment of the ranking step of the method shown in Fig. 1. This embodiment mainly comprises the following steps:

In step S301, the number of reorderings of the second-language web page during translation is counted. During translation, the order of the translated words or phrases of a source-language sentence often has to be adjusted; each such adjustment is a reordering. The more reorderings there are, the more complex the translation and the lower the translation confidence.

In step S302, the plurality of first-language web pages are ranked according to the number of reorderings. A first-language web page with more reorderings is placed further back in the ranked results.
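Steps S301 and S302 can be sketched as below. The text does not say how a reordering is detected, so this sketch makes an assumption: each translated sentence comes with a hypothetical alignment list whose j-th entry is the source position of the j-th target phrase, and a reordering is counted whenever consecutive target phrases are out of source order. The function names are likewise illustrative.

```python
from typing import Dict, List


def count_reorderings(alignment: List[int]) -> int:
    """Count adjacent target phrases whose source positions go backwards.

    `alignment[j]` is the source position of the j-th target phrase
    (an assumed format, not one defined in the source).
    """
    return sum(1 for left, right in zip(alignment, alignment[1:]) if right < left)


def rank_by_reorderings(page_alignments: Dict[str, List[List[int]]]) -> List[str]:
    """S301-S302: total the reorderings per page; fewer reorderings rank first."""
    totals = {
        page: sum(count_reorderings(alignment) for alignment in alignments)
        for page, alignments in page_alignments.items()
    }
    return sorted(totals, key=lambda page: totals[page])
```

A decoder that already logs its reordering operations could supply the per-page totals directly, in which case only the final sort is needed.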

Referring to Fig. 4, Fig. 4 is a schematic flowchart of a third embodiment of the ranking step of the method shown in Fig. 1. This embodiment mainly comprises the following steps:

In step S401, the source-language corpus of the bilingual corpus used when translating the second-language web pages is obtained.

In step S402, the source-language corpus is clustered into a plurality of documents. Specifically, a clustering algorithm is applied to the sentences of the source-language corpus, and the sentences of each cluster are gathered into one document, thereby forming a plurality of documents. Probabilistic Latent Semantic Analysis (PLSA) or another algorithm is then used to obtain a plurality of topics from the plurality of documents, and the probability that each document belongs to each topic is calculated to form a plurality of first vectors:

Vec(d_m) = (p(t_1|d_m), p(t_2|d_m), ..., p(t_n|d_m), ..., p(t_N|d_m)),

where t_n is the n-th topic, 1 ≤ n ≤ N, N is the number of topics, d_m is the m-th document, 1 ≤ m ≤ M, M is the number of documents, and p(t_n|d_m) is the probability that document d_m belongs to topic t_n.

In step S403, the maximum similarity between the second-language web page and the plurality of documents is calculated. Specifically, the probability that the second-language web page belongs to each topic is calculated to form a second vector:

Vec(d_s) = (p(t_1|d_s), p(t_2|d_s), ..., p(t_n|d_s), ..., p(t_N|d_s))

where d_s is the second-language web page and p(t_n|d_s) is the probability that the second-language web page d_s belongs to topic t_n.

Subsequently, the similarity between each of the first vectors and the second vector is calculated, and the largest of these similarities is selected as the maximum similarity. A specific similarity formula may be:

H = \max_{m=1}^{M} \frac{\sum_{n=1}^{N} p(t_n | d_s) \times p(t_n | d_m)}{\sqrt{\sum_{n=1}^{N} (p(t_n | d_s))^2} \sqrt{\sum_{n=1}^{N} (p(t_n | d_m))^2}}

where H is the maximum similarity. The higher the maximum similarity, the higher the translation quality and the higher the translation confidence.

In step S404, the plurality of first-language web pages are ranked according to the maximum similarity H. A first-language web page with higher maximum similarity is placed nearer the front of the ranked results.
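The similarity computation of steps S403 and S404 can be sketched as follows. The sketch assumes that the topic distributions have already been estimated (for example by PLSA) and are passed in as plain lists of probabilities; the clustering and topic modelling of step S402 are not shown, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two topic-probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_similarity(
    page_vector: Sequence[float], document_vectors: Sequence[Sequence[float]]
) -> float:
    """H: the largest cosine similarity between Vec(d_s) and any Vec(d_m)."""
    return max(cosine_similarity(page_vector, d) for d in document_vectors)


def rank_by_max_similarity(
    page_vectors: Dict[str, Sequence[float]],
    document_vectors: Sequence[Sequence[float]],
) -> List[str]:
    """S404: pages with higher maximum similarity come first."""
    return sorted(
        page_vectors,
        key=lambda name: max_similarity(page_vectors[name], document_vectors),
        reverse=True,
    )
```

Taking the maximum over documents means that a page only has to resemble one cluster of the training corpus to be considered well covered by it.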

Referring to Fig. 5, Fig. 5 is a schematic flowchart of a fourth embodiment of the ranking step of the method shown in Fig. 1. This embodiment mainly comprises the following steps:

In step S501, the number of out-of-vocabulary words (unregistered words) that the second-language web page contains during translation is counted. An out-of-vocabulary word is a word that does not appear in the source-language corpus, including proper nouns of all kinds (person names, place names, organization names, etc.), abbreviations, newly coined words, and so on. In machine translation, the more out-of-vocabulary words there are, the poorer the translation quality and the lower the translation confidence.

In step S502, the plurality of first-language web pages are ranked according to the number of out-of-vocabulary words. A first-language web page with more out-of-vocabulary words is placed further back in the ranked results.
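Steps S501 and S502 can be sketched as below, assuming that both the source-language corpus and the pages are already tokenized into word lists and that a word counts as out of vocabulary exactly when it never occurs in the corpus. The function names are illustrative.

```python
from typing import Dict, Iterable, List, Set


def build_vocabulary(source_corpus: Iterable[List[str]]) -> Set[str]:
    """Collect every word that occurs in the source-language corpus."""
    return {token for sentence in source_corpus for token in sentence}


def count_oov(page_tokens: List[str], vocabulary: Set[str]) -> int:
    """S501: count the page tokens that are absent from the vocabulary."""
    return sum(1 for token in page_tokens if token not in vocabulary)


def rank_by_oov(pages: Dict[str, List[str]], vocabulary: Set[str]) -> List[str]:
    """S502: pages with fewer out-of-vocabulary words come first."""
    return sorted(pages, key=lambda name: count_oov(pages[name], vocabulary))
```

This sketch counts every occurrence of an unknown word; counting distinct unknown words, or normalizing by page length, would be equally plausible readings of the text.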

Referring to Fig. 6, Fig. 6 is a schematic flowchart of a fifth embodiment of the ranking step of the method shown in Fig. 1. This embodiment mainly comprises the following steps:

In step S601, the average translation score of the second-language web page during translation is calculated. Specifically, the average translation score of the second-language web page is calculated according to the following formula:

A = \frac{1}{K} \sum_{k=1}^{K} \mathrm{score}_k

where A is the average translation score of the second-language web page, score_k is the translation score of the k-th sentence in the page, 1 ≤ k ≤ K, and K is the number of sentences in the page. In this step, the translation score of each sentence can be determined by a translation evaluation method known in the art, for example an automatic measure such as the normalized sentence translation probability. The higher the average translation score, the higher the translation quality and the higher the translation confidence.

In step S602, the plurality of first-language web pages are ranked according to the average translation score. A first-language web page with a higher average translation score is placed nearer the front of the ranked results.
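Steps S601 and S602 can be sketched as follows, assuming the per-sentence translation scores have already been produced by some evaluation method and are supplied as a list of numbers per page. The function names are hypothetical, and rejecting an empty page is a choice made for this sketch only.

```python
from typing import Dict, List, Sequence


def average_translation_score(sentence_scores: Sequence[float]) -> float:
    """S601: A = (1/K) * sum of the K sentence-level translation scores."""
    if not sentence_scores:
        # The text does not cover empty pages; this sketch simply rejects them.
        raise ValueError("a page must contain at least one scored sentence")
    return sum(sentence_scores) / len(sentence_scores)


def rank_by_average_score(page_scores: Dict[str, Sequence[float]]) -> List[str]:
    """S602: pages with a higher average translation score come first."""
    return sorted(
        page_scores,
        key=lambda name: average_translation_score(page_scores[name]),
        reverse=True,
    )
```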

Referring to Fig. 7, Fig. 7 is a schematic flowchart of a sixth embodiment of the ranking step of the method shown in Fig. 1. This embodiment mainly comprises the following steps:

In step S701, the number of times translation rules are used for the second-language web page during translation is counted. Translation rules are often formulated in the machine translation field, for example rules for translating particular phrases. In machine translation, the more times rules are used, the poorer the translation quality and the lower the translation confidence.

In step S702, the plurality of first-language web pages are ranked according to the rule usage count. A first-language web page with a higher rule usage count is placed further back in the ranked results.

The first to fourth embodiments above obtain features indicating translation confidence from the source-language side of the second-language web page, while the fifth and sixth embodiments obtain such features from the translation model or the translation result of the second-language web page. Of course, those skilled in the art can obtain other features indicating translation confidence by other means.

Further, after reading the above, those skilled in the art can combine the various features indicating translation confidence described above, for example by using a regression learning method to map a feature vector containing these features to a real number, thereby forming a translation confidence that integrates the features. This can be done with known tools, for example SVM-light.
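The mapping from a feature vector to a single real-valued confidence can be sketched as below. This sketch does not call SVM-light and does not perform any learning: it only applies a linear model whose weights are assumed to have been learned elsewhere. The feature names, weight values, and function names are all made-up placeholders.

```python
from typing import Dict, List


def combined_confidence(
    features: Dict[str, float], weights: Dict[str, float], bias: float = 0.0
) -> float:
    """Map a feature vector to one real number with a linear model.

    `weights` and `bias` stand in for the output of a regression learner;
    the values used with this function are illustrative, not learned.
    """
    return bias + sum(weight * features[name] for name, weight in weights.items())


def rank_by_combined_confidence(
    page_features: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    bias: float = 0.0,
) -> List[str]:
    """Pages with a higher combined confidence come first."""
    return sorted(
        page_features,
        key=lambda name: combined_confidence(page_features[name], weights, bias),
        reverse=True,
    )
```

In such a model, features for which larger values indicate lower confidence (perplexity, reordering count, out-of-vocabulary count, rule usage count) would be expected to receive negative weights, while the maximum similarity and the average translation score would receive positive ones.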

In addition, once the translation confidence has been obtained, it can also be used as one feature in combination with other ranking methods known in the art, for example learning to rank or PageRank.

Referring to Fig. 8, Fig. 8 is a schematic block diagram of a web page sorting system in cross-language search according to an embodiment of the present invention. In this embodiment, the web page sorting system mainly comprises a search request acquiring unit 801, a first translation unit 802, a search unit 803, a second translation unit 804 and a sorting unit 805.

The search request acquiring unit 801 is configured to obtain a first-language search request. The user may enter the first-language search request (query) to be searched, for example a Chinese search request, in the search box of a browser and click the search button. The first-language search request is transmitted over the Internet to the search request acquiring unit 801, which obtains it.

The first translation unit 802 is configured to translate the first-language search request into a second-language search request. The first translation unit 802 may do so by any machine translation means known in the art; for example, when English web pages are searched in Chinese, the Chinese search request is translated into an English search request. Specific machine translation means may include word-based, phrase-based or syntax-based statistical machine translation.

The search unit 803 is configured to search for a plurality of second-language web pages using the second-language search request. The search unit 803 retrieves, by any search engine technique known in the art, a plurality of second-language web pages (for example English web pages) relevant to the second-language search request.

The second translation unit 804 is configured to translate the plurality of second-language web pages into a plurality of first-language web pages. The second translation unit 804 may translate the content of each second-language web page into the first language, for example Chinese, by any of the machine translation means mentioned above, thereby realizing cross-language search. In this embodiment, the first translation unit 802 and the second translation unit 804 may be implemented by the same translation unit or by different translation units.

The sorting unit 805 is configured to sort the plurality of first-language web pages according to the translation confidence of the plurality of second-language web pages. After sorting by the sorting unit 805, a first-language web page with higher translation confidence is placed nearer the front, so that results with good translation quality are presented to the user first, improving the user experience. Several embodiments for obtaining the translation confidence of a second-language web page are described in detail below; those skilled in the art may equally apply other translation confidence acquisition methods known in the art to the sorting unit 805.
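The cooperation of units 801 to 805 can be illustrated with a minimal Python sketch. The names `TranslatedPage` and `cross_language_search` and the callable parameters are hypothetical and do not appear in the patent; the translators, search engine and confidence measure are passed in as plain functions because the description leaves their implementation to means known in the art.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TranslatedPage:
    source_text: str      # content of the second-language web page
    translated_text: str  # first-language translation shown to the user
    confidence: float     # translation confidence (higher is better)


def cross_language_search(
    query: str,
    translate_query: Callable[[str], str],
    search: Callable[[str], List[str]],
    translate_page: Callable[[str], str],
    confidence: Callable[[str], float],
) -> List[TranslatedPage]:
    """Chain the five units: translate the query, search, translate pages, sort."""
    second_language_query = translate_query(query)  # first translation unit 802
    pages = search(second_language_query)           # search unit 803
    results = [
        TranslatedPage(p, translate_page(p), confidence(p))  # second translation unit 804
        for p in pages
    ]
    # Sorting unit 805: pages with higher translation confidence come first.
    return sorted(results, key=lambda r: r.confidence, reverse=True)
```

Any of the confidence measures described in the embodiments below could be supplied as `confidence`, with the sign flipped for measures where a larger value means lower confidence.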

Referring to Fig. 9, Fig. 9 is a schematic block diagram of a first embodiment of the sorting unit 805 of the web page sorting system in cross-language search shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises a source-language corpus acquisition module 901, a language model generation module 902, a perplexity calculation module 903 and a sorting module 904.

The source-language corpus acquisition module 901 is configured to obtain the source-language corpus of the bilingual corpus used in translating the second-language web pages. In machine translation, a bilingual corpus is generally used to train the translation model. The bilingual corpus comprises a plurality of bilingual example sentence pairs, each comprising a source-language example sentence and the corresponding target-language example sentence. In the translation of the second-language web pages, the source language is the second language and the target language is the first language. Bilingual corpora are commonly used in the machine translation field and can be obtained in various ways, which are not described further here.

The language model generation module 902 is configured to generate a language model, for example an n-gram language model, from the source-language corpus.

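The perplexity calculation module 903 is configured to calculate the translation perplexity of the second-language web page using the language model. Specifically, for a sentence x_i consisting of L words w_1, w_2, ..., w_L, the probability of occurrence of the sentence can be calculated by the language model:

p(x_i) = p(w_1, w_2, \ldots, w_L) = \prod_{l=1}^{L} p(w_l \mid w_{l-n}, \ldots, w_{l-1})

The translation perplexity of the second-language web page is then calculated by the following formula:

P = 2^{-\sum_{i=1}^{I} p(x_i) \log p(x_i)}

where P is the translation perplexity, x_i is the i-th sentence in the second-language web page, 1 ≤ i ≤ I, I is the number of sentences in the second-language web page, and p(x_i) is the probability of occurrence of x_i calculated by the language model.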

The sorting module 904 is configured to sort the plurality of first-language web pages according to the translation perplexity. After sorting by the sorting module 904, a first-language web page with higher translation perplexity is placed further back.
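The perplexity-based embodiment can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent names an n-gram model without fixing n or a smoothing scheme, so the sketch uses a bigram model with add-one smoothing, and it takes the logarithm in the formula P = 2^{-Σ p(x_i) log p(x_i)} to be base 2. The names `BigramModel`, `translation_perplexity` and `sort_by_perplexity` are hypothetical.

```python
import math
from collections import Counter
from typing import List

Sentence = List[str]


class BigramModel:
    """Add-one smoothed bigram model trained on the source-language corpus."""

    def __init__(self, corpus: List[Sentence]) -> None:
        self.contexts: Counter = Counter()  # how often each word occurs as a history
        self.bigrams: Counter = Counter()   # how often each (history, word) pair occurs
        vocab = set()
        for sent in corpus:
            tokens = ["<s>"] + sent
            vocab.update(sent)
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab) + 1    # one extra slot for unseen words

    def sentence_prob(self, sent: Sentence) -> float:
        """p(x_i): product of the smoothed conditional word probabilities."""
        tokens = ["<s>"] + sent
        prob = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            prob *= (self.bigrams[(prev, word)] + 1) / (self.contexts[prev] + self.vocab_size)
        return prob


def translation_perplexity(model: BigramModel, page: List[Sentence]) -> float:
    """P = 2 ** (-sum_i p(x_i) * log2 p(x_i)) over the sentences of one page."""
    total = 0.0
    for sentence in page:
        p = model.sentence_prob(sentence)  # always > 0 thanks to smoothing
        total += p * math.log2(p)
    return 2.0 ** (-total)


def sort_by_perplexity(model: BigramModel, pages: List[List[Sentence]]) -> List[List[Sentence]]:
    """Pages with lower translation perplexity come first."""
    return sorted(pages, key=lambda page: translation_perplexity(model, page))
```

In a full system the pages passed to `sort_by_perplexity` would be the tokenized second-language pages, and the resulting order would be applied to their first-language translations.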

Referring to Fig. 10, Fig. 10 is a schematic block diagram of a second embodiment of the sorting unit 805 of the web page sorting system in cross-language search shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises a reordering count statistics module 1001 and a sorting module 1002.

The reordering count statistics module 1001 is configured to count the number of reorderings of the second-language web page in the translation process. During translation, the order of the translated words or phrases of a source-language sentence has to be adjusted; this adjustment is called reordering. The more reorderings there are, the more complex the translation is and the lower its translation confidence.

The sorting module 1002 is configured to sort the plurality of first-language web pages according to the number of reorderings. After sorting by the sorting module 1002, a first-language web page with more reorderings is placed further back.
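A possible way to count and use reorderings is sketched below in Python. The patent does not specify how reorderings are counted, so the sketch assumes that the translation decoder reports, for each sentence, the source positions of the translated phrases in target order, and it counts every adjacent pair that is out of source order. The function names and this input format are hypothetical.

```python
from typing import Dict, List


def count_reorderings(source_positions: List[int]) -> int:
    """Count adjacent target-side phrases whose source order is inverted."""
    return sum(1 for a, b in zip(source_positions, source_positions[1:]) if b < a)


def page_reordering_count(sentences: List[List[int]]) -> int:
    """Total number of reorderings over all sentences of one page."""
    return sum(count_reorderings(s) for s in sentences)


def sort_by_reordering(pages: Dict[str, List[List[int]]]) -> List[str]:
    """Return page identifiers, pages with fewer reorderings first."""
    return sorted(pages, key=lambda page_id: page_reordering_count(pages[page_id]))
```

A monotone translation such as `[0, 1, 2]` contributes no reorderings, whereas `[2, 0, 1]` contributes one under this counting rule.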

Referring to Fig. 11, Fig. 11 is a schematic block diagram of a third embodiment of the sorting unit 805 of the web page sorting system in cross-language search shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises a source-language corpus acquisition module 1101, a clustering module 1102, a similarity calculation module 1103 and a sorting module 1104.

The source-language corpus acquisition module 1101 is configured to obtain the source-language corpus of the bilingual corpus used in translating the second-language web pages.

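The clustering module 1102 is configured to cluster the source-language corpus into a plurality of documents. Specifically, the clustering module 1102 applies a clustering algorithm to the sentences in the source-language corpus and gathers the sentences of each cluster into one document, thereby forming a plurality of documents. The clustering module 1102 then uses probabilistic latent semantic analysis (PLSA) or another algorithm to obtain a plurality of topics from the plurality of documents, and calculates the probability that each document belongs to each topic, to form a plurality of first vectors:

Vec(d_m) = (p(t_1 \mid d_m), p(t_2 \mid d_m), \ldots, p(t_n \mid d_m), \ldots, p(t_N \mid d_m)),

where d_m is the m-th document, 1 ≤ m ≤ M, M is the number of documents, t_n is the n-th topic, 1 ≤ n ≤ N, and N is the number of topics.

The similarity calculation module 1103 is configured to calculate the maximum similarity between the second-language web page and the plurality of documents. Specifically, the similarity calculation module 1103 calculates the probability that the second-language web page d_s belongs to each topic, to form a second vector:

Vec(d_s) = (p(t_1 \mid d_s), p(t_2 \mid d_s), \ldots, p(t_n \mid d_s), \ldots, p(t_N \mid d_s))

It then calculates the similarity between each first vector and the second vector and takes the largest value as the maximum similarity H:

H = \max_{m=1}^{M} \frac{\sum_{n=1}^{N} p(t_n \mid d_s) \times p(t_n \mid d_m)}{\sqrt{\sum_{n=1}^{N} (p(t_n \mid d_s))^2} \sqrt{\sum_{n=1}^{N} (p(t_n \mid d_m))^2}}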

The sorting module 1104 is configured to sort the plurality of first-language web pages according to the maximum similarity H. After sorting by the sorting module 1104, a first-language web page with higher maximum similarity is placed nearer the front.
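The maximum similarity H is the largest cosine similarity between the topic vector of the web page and the topic vectors of the clustered documents. The Python sketch below shows only that final step; it assumes the topic distributions p(t_n | d) have already been estimated (for example by PLSA) and are supplied as plain lists of floats. The names `cosine`, `max_similarity` and `sort_by_max_similarity` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two topic vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_similarity(page_vec: Sequence[float], doc_vecs: List[Sequence[float]]) -> float:
    """H: the largest similarity between Vec(d_s) and any Vec(d_m)."""
    return max(cosine(page_vec, doc_vec) for doc_vec in doc_vecs)


def sort_by_max_similarity(
    page_vecs: Dict[str, Sequence[float]], doc_vecs: List[Sequence[float]]
) -> List[str]:
    """Return page identifiers, pages with higher H first."""
    return sorted(
        page_vecs,
        key=lambda page_id: max_similarity(page_vecs[page_id], doc_vecs),
        reverse=True,
    )
```

A page whose topic mix closely matches some document of the training corpus obtains an H close to 1 and is therefore ranked nearer the front.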

Referring to Fig. 12, Fig. 12 is a schematic block diagram of a fourth embodiment of the sorting unit 805 of the web page sorting system in cross-language search shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises an out-of-vocabulary word statistics module 1201 and a sorting module 1202.

The out-of-vocabulary word statistics module 1201 is configured to count the number of out-of-vocabulary (OOV) words that the second-language web page contains in the translation process. An out-of-vocabulary word is a word not included in the source-language corpus, such as a proper noun (personal name, place name, organization name, etc.), an abbreviation or a newly coined word. In machine translation, the more out-of-vocabulary words there are, the poorer the translation quality and the lower the translation confidence.

The sorting module 1202 is configured to sort the plurality of first-language web pages according to the number of out-of-vocabulary words. After sorting by the sorting module 1202, a first-language web page with more out-of-vocabulary words is placed further back.
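Counting words that are absent from the source-language corpus can be sketched in Python as follows. Whitespace tokenization and lowercasing are simplifying assumptions made for the example, and the function names are hypothetical; a real system would use the tokenizer of its translation engine.

```python
from typing import Dict, Iterable, List, Set


def build_vocabulary(corpus_sentences: Iterable[str]) -> Set[str]:
    """Collect every word that occurs in the source-language corpus."""
    return {word.lower() for sentence in corpus_sentences for word in sentence.split()}


def oov_count(page_text: str, vocabulary: Set[str]) -> int:
    """Number of word occurrences in the page that the corpus does not cover."""
    return sum(1 for word in page_text.split() if word.lower() not in vocabulary)


def sort_by_oov(pages: Dict[str, str], vocabulary: Set[str]) -> List[str]:
    """Return page identifiers, pages with fewer out-of-vocabulary words first."""
    return sorted(pages, key=lambda page_id: oov_count(pages[page_id], vocabulary))
```

Because the count is not normalized by page length, long pages tend to be penalized; dividing by the number of words on the page would be one possible variation, although the description only mentions the raw count.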

Referring to Fig. 13, Fig. 13 is a schematic block diagram of a fifth embodiment of the sorting unit 805 of the web page sorting system in cross-language search shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises a translation score calculation module 1301 and a sorting module 1302.

The translation score calculation module 1301 is configured to calculate the average translation score of the second-language web page in the translation process. Specifically, the translation score calculation module 1301 calculates the average translation score of the second-language web page according to the following formula:

A = \frac{1}{K} \sum_{k=1}^{K} \mathrm{score}_k

where A is the average translation score of the second-language web page, score_k is the translation score of the k-th sentence in the second-language web page, 1 ≤ k ≤ K, and K is the number of sentences in the second-language web page. The translation score calculation module 1301 may determine the translation score of each sentence by a translation evaluation method known in the art, for example an automatic evaluation method such as normalized sentence translation probability. The higher the average translation score, the higher the translation quality and hence the higher the translation confidence.

The sorting module 1302 is configured to sort the plurality of first-language web pages according to the average translation score. After sorting by the sorting module 1302, a first-language web page with a higher average translation score is placed nearer the front.
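The average translation score A = (1/K) Σ score_k can be sketched in Python as follows. The description mentions normalized sentence translation probability as one possible sentence score without defining the normalization; the sketch assumes normalization by sentence length (a per-word geometric mean), which is only one plausible reading. All function names are hypothetical.

```python
import math
from typing import Dict, List


def normalized_sentence_score(log_prob: float, length: int) -> float:
    """Length-normalized translation probability: exp(log p / number of words)."""
    return math.exp(log_prob / max(length, 1))


def average_translation_score(scores: List[float]) -> float:
    """A = (1 / K) * sum of the K sentence scores; 0.0 for a page without sentences."""
    return sum(scores) / len(scores) if scores else 0.0


def sort_by_average_score(page_scores: Dict[str, List[float]]) -> List[str]:
    """Return page identifiers, pages with a higher average score first."""
    return sorted(
        page_scores,
        key=lambda page_id: average_translation_score(page_scores[page_id]),
        reverse=True,
    )
```

Here `page_scores` maps each page to the list of its sentence scores, which in practice would come from the translation engine that produced the first-language pages.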

Referring to Fig. 14, Fig. 14 is a schematic block diagram of a sixth embodiment of the sorting unit 805 of the web page sorting system in cross-language search shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises a rule usage count statistics module 1401 and a sorting module 1402.

The rule usage count statistics module 1401 is configured to count the number of times translation rules are used for the second-language web page in the translation process. In the machine translation field, certain translation rules are often formulated, for example translation rules for particular phrases. In machine translation, the more times rules are used, the poorer the translation quality and the lower the translation confidence.

The sorting module 1402 is configured to sort the plurality of first-language web pages according to the rule usage count. After sorting by the sorting module 1402, a first-language web page with a higher rule usage count is placed further back.

The above embodiments describe the present invention by way of example only; after reading this patent application, those skilled in the art may make various modifications to the present invention without departing from its spirit and scope.

Claims

1. A web page sorting method in cross-language search, characterized in that the web page sorting method in cross-language search comprises:

a. obtaining a first-language search request;

b. translating said first-language search request into a second-language search request;

c. searching for a plurality of second-language web pages using said second-language search request;

d. translating said plurality of second-language web pages into a plurality of first-language web pages;

e. sorting said plurality of first-language web pages according to the translation confidence of said plurality of second-language web pages;

wherein said step e comprises:

e11. obtaining the source-language corpus of the bilingual corpus used in translating said second-language web pages;

e12. generating a language model from said source-language corpus;

e13. calculating the translation perplexity of said second-language web pages using said language model;

e14. sorting said plurality of first-language web pages according to said translation perplexity;

wherein said translation perplexity is calculated by the following formula:

P = 2^{-\sum_{i=1}^{I} p(x_i) \log p(x_i)}

where P is said translation perplexity, x_i is the i-th sentence in said second-language web page, 1 ≤ i ≤ I, I is the number of sentences in said second-language web page, and p(x_i) is the probability of occurrence of x_i calculated by said language model;

or, said step e comprises:

e21. obtaining the source-language corpus of the bilingual corpus used in translating said second-language web pages;

e22. clustering said source-language corpus into a plurality of documents;

e23. calculating the maximum similarity between said second-language web page and said plurality of documents;

e24. sorting said plurality of first-language web pages according to said maximum similarity;

wherein, in step e22, a plurality of topics are obtained from said plurality of documents, and the probability that each said document belongs to each said topic is calculated to form a plurality of first vectors; and in step e23, the probability that said second-language web page belongs to each said topic is calculated to form a second vector, the similarities between said plurality of first vectors and said second vector are calculated, and the largest of said similarities is selected as said maximum similarity.

2. The web page sorting method in cross-language search according to claim 1, characterized in that said language model is an n-gram language model.

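3. The web page sorting method in cross-language search according to claim 1, characterized in that, in step e23, said maximum similarity is calculated according to the following formula:

H = \max_{m=1}^{M} \frac{\sum_{n=1}^{N} p(t_n \mid d_s) \times p(t_n \mid d_m)}{\sqrt{\sum_{n=1}^{N} (p(t_n \mid d_s))^2} \sqrt{\sum_{n=1}^{N} (p(t_n \mid d_m))^2}}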

4. A web page sorting system in cross-language search, characterized in that the web page sorting system in cross-language search comprises:

a search request acquiring unit, configured to obtain a first-language search request;

a first translation unit, configured to translate said first-language search request into a second-language search request;

a search unit, configured to search for a plurality of second-language web pages using said second-language search request;

a second translation unit, configured to translate said plurality of second-language web pages into a plurality of first-language web pages;

a sorting unit, configured to sort said plurality of first-language web pages according to the translation confidence of said plurality of second-language web pages;

wherein said sorting unit comprises:

a source-language corpus acquisition module, configured to obtain the source-language corpus of the bilingual corpus used in translating said second-language web pages;

a language model generation module, configured to generate a language model from said source-language corpus;

a perplexity calculation module, configured to calculate the translation perplexity of said second-language web pages using said language model;

a sorting module, configured to sort said plurality of first-language web pages according to said translation perplexity;

wherein said perplexity calculation module calculates said translation perplexity by the following formula:

P = 2^{-\sum_{i=1}^{I} p(x_i) \log p(x_i)}

where P is said translation perplexity, x_i is the i-th sentence in said second-language web page, 1 ≤ i ≤ I, I is the number of sentences in said second-language web page, and p(x_i) is the probability of occurrence of x_i calculated by said language model;

or, said sorting unit comprises:

a clustering module, configured to cluster said source-language corpus into a plurality of documents;

a similarity calculation module, configured to calculate the maximum similarity between said second-language web page and said plurality of documents;

a sorting module, configured to sort said plurality of first-language web pages according to said maximum similarity;

wherein said clustering module obtains a plurality of topics from said plurality of documents and calculates the probability that each said document belongs to each said topic, to form a plurality of first vectors; and said similarity calculation module calculates the probability that said second-language web page belongs to each said topic, to form a second vector, calculates the similarities between said plurality of first vectors and said second vector, and selects the largest of said similarities as said maximum similarity.

5. The web page sorting system in cross-language search according to claim 4, characterized in that said language model is an n-gram language model.

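6. The web page sorting system in cross-language search according to claim 4, characterized in that said similarity calculation module calculates said maximum similarity according to the following formula:

H = \max_{m=1}^{M} \frac{\sum_{n=1}^{N} p(t_n \mid d_s) \times p(t_n \mid d_m)}{\sqrt{\sum_{n=1}^{N} (p(t_n \mid d_s))^2} \sqrt{\sum_{n=1}^{N} (p(t_n \mid d_m))^2}}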