CN102654867B - Webpage sorting method and system in cross-language search - Google Patents

Webpage sorting method and system in cross-language search

Publication number
CN102654867B
Authority
CN
China
Prior art keywords
language
translation
webpage
webpages
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2011100498831A
Other languages
Chinese (zh)
Other versions
CN102654867A (en)
Inventor
吴华
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2011100498831A
Priority to PCT/CN2011/083411 (WO2012116561A1)
Publication of CN102654867A
Application granted
Publication of CN102654867B
Legal status: Active
Anticipated expiration


Abstract

Translated from Chinese

The invention provides a webpage sorting method and system in cross-language search. The webpage sorting method in cross-language search includes: obtaining a first-language search request; translating the first-language search request into a second-language search request; searching for a plurality of second-language webpages using the second-language search request; translating the plurality of second-language webpages into a plurality of first-language webpages; and sorting the plurality of first-language webpages according to the translation confidence of the plurality of second-language webpages. In this way, the translated search results are sorted according to translation confidence, thereby improving the user experience.

Description

Webpage sorting method and system in cross-language search
Technical field
The present invention relates to the Internet field, and in particular to a webpage sorting method and system in cross-language search.
Background art
With the development of web search technology, cross-language search has emerged to help users overcome language barriers. In a cross-language search (for example, searching English webpages with a Chinese query), a Chinese search request is first entered and translated into an English search request, and the English search request is then used to search English webpages. The content of the retrieved English webpages is then translated into Chinese and presented to the reader. The search results generally need to be sorted before they are presented. Existing cross-language search technology sorts mainly by the relevance between the English search request and the English webpages. However, because cross-language search involves a translation step, results with poor translation quality may be ranked near the top, which degrades the user experience.
Summary of the invention
The technical problem to be solved by the present invention is to provide a webpage sorting method and system in cross-language search that improve the user experience.
To solve this technical problem, the present invention provides a webpage sorting method in cross-language search, comprising: a. obtaining a first-language search request; b. translating the first-language search request into a second-language search request; c. searching for a plurality of second-language webpages using the second-language search request; d. translating the plurality of second-language webpages into a plurality of first-language webpages; e. sorting the plurality of first-language webpages according to the translation confidence of the plurality of second-language webpages.
According to a preferred embodiment of the present invention, in step e, the higher the translation confidence of a first-language webpage, the earlier its position among the sorted first-language webpages.
According to a preferred embodiment of the present invention, step e comprises: e1. obtaining the source-language corpus of the bilingual corpus used when translating the second-language webpages; e2. generating a language model from the source-language corpus; e3. calculating the translation perplexity of each second-language webpage using the language model; e4. sorting the plurality of first-language webpages according to the translation perplexity.
According to a preferred embodiment of the present invention, in step e4, the translation perplexity is calculated by the following formula:
$P = 2^{-\sum_{i=1}^{I} p(x_i)\log p(x_i)}$
where P is the translation perplexity, $x_i$ is the i-th sentence in the second-language webpage, 1≤i≤I, I is the number of sentences in the second-language webpage, and $p(x_i)$ is the probability of occurrence of $x_i$ calculated by the language model.
According to a preferred embodiment of the present invention, the language model is an n-gram language model.
According to a preferred embodiment of the present invention, step e comprises: e1. counting the number of reordering operations applied to each second-language webpage during translation; e2. sorting the plurality of first-language webpages according to the reordering count.
According to a preferred embodiment of the present invention, step e comprises: e1. obtaining the source-language corpus of the bilingual corpus used when translating the second-language webpages; e2. clustering the source-language corpus into a plurality of documents; e3. calculating the maximum similarity between each second-language webpage and the plurality of documents; e4. sorting the plurality of first-language webpages according to the maximum similarity.
According to a preferred embodiment of the present invention, in step e2, a plurality of topics are obtained from the plurality of documents, and the probability that each document belongs to each topic is calculated to form a plurality of first vectors; in step e3, the probability that the second-language webpage belongs to each topic is calculated to form a second vector, the similarity between each of the first vectors and the second vector is calculated, and the largest of these similarities is selected as the maximum similarity.
According to a preferred embodiment of the present invention, in step e3, the maximum similarity is calculated according to the following formula:
$H = \max_{m=1}^{M} \dfrac{\sum_{n=1}^{N} p(t_n|d_s)\times p(t_n|d_m)}{\sqrt{\sum_{n=1}^{N}\left(p(t_n|d_s)\right)^2}\,\sqrt{\sum_{n=1}^{N}\left(p(t_n|d_m)\right)^2}}$
where H is the maximum similarity, $t_n$ is the n-th topic, 1≤n≤N, N is the number of topics, $d_m$ is the m-th document, 1≤m≤M, M is the number of documents, $p(t_n|d_m)$ is the probability that $d_m$ belongs to $t_n$, $d_s$ is the second-language webpage, and $p(t_n|d_s)$ is the probability that $d_s$ belongs to $t_n$.
According to a preferred embodiment of the present invention, step e comprises: e1. counting the number of out-of-vocabulary (unregistered) words that each second-language webpage contains during translation; e2. sorting the plurality of first-language webpages according to the number of out-of-vocabulary words.
According to a preferred embodiment of the present invention, step e comprises: e1. calculating the average translation score of each second-language webpage during translation; e2. sorting the plurality of first-language webpages according to the average translation score.
According to a preferred embodiment of the present invention, in step e1, the average translation score is calculated according to the following formula:
$A = \sum_{k=1}^{K} score_k / K$
where A is the average translation score, $score_k$ is the translation score of the k-th sentence in the second-language webpage, 1≤k≤K, and K is the number of sentences in the second-language webpage.
According to a preferred embodiment of the present invention, step e comprises: e1. counting the number of times translation rules are used for each second-language webpage during translation; e2. sorting the plurality of first-language webpages according to the rule usage count.
To solve the technical problem, the present invention further provides a webpage sorting system in cross-language search, comprising: a search request acquiring unit for obtaining a first-language search request; a first translation unit for translating the first-language search request into a second-language search request; a search unit for searching for a plurality of second-language webpages using the second-language search request; a second translation unit for translating the plurality of second-language webpages into a plurality of first-language webpages; and a sorting unit for sorting the plurality of first-language webpages according to the translation confidence of the plurality of second-language webpages.
According to a preferred embodiment of the present invention, among the first-language webpages sorted by the sorting unit, the higher the translation confidence of a first-language webpage, the earlier its position.
According to a preferred embodiment of the present invention, the sorting unit comprises: a source-language corpus acquisition module for obtaining the source-language corpus of the bilingual corpus used when translating the second-language webpages; a language model generation module for generating a language model from the source-language corpus; a perplexity calculation module for calculating the translation perplexity of each second-language webpage using the language model; and a sorting module for sorting the plurality of first-language webpages according to the translation perplexity.
According to a preferred embodiment of the present invention, the perplexity calculation module calculates the translation perplexity by the following formula:
$P = 2^{-\sum_{i=1}^{I} p(x_i)\log p(x_i)}$
where P is the translation perplexity, $x_i$ is the i-th sentence in the second-language webpage, 1≤i≤I, I is the number of sentences in the second-language webpage, and $p(x_i)$ is the probability of occurrence of $x_i$ calculated by the language model.
According to a preferred embodiment of the present invention, the language model is an n-gram language model.
According to a preferred embodiment of the present invention, the sorting unit comprises: a reordering count module for counting the number of reordering operations applied to each second-language webpage during translation; and a sorting module for sorting the plurality of first-language webpages according to the reordering count.
According to a preferred embodiment of the present invention, the sorting unit comprises: a source-language corpus acquisition module for obtaining the source-language corpus of the bilingual corpus used when translating the second-language webpages; a clustering module for clustering the source-language corpus into a plurality of documents; a similarity calculation module for calculating the maximum similarity between each second-language webpage and the plurality of documents; and a sorting module for sorting the plurality of first-language webpages according to the maximum similarity.
According to a preferred embodiment of the present invention, the clustering module obtains a plurality of topics from the plurality of documents and calculates the probability that each document belongs to each topic to form a plurality of first vectors; the similarity calculation module calculates the probability that the second-language webpage belongs to each topic to form a second vector, calculates the similarity between each of the first vectors and the second vector, and selects the largest of these similarities as the maximum similarity.
According to a preferred embodiment of the present invention, the similarity calculation module calculates the maximum similarity according to the following formula:
$H = \max_{m=1}^{M} \dfrac{\sum_{n=1}^{N} p(t_n|d_s)\times p(t_n|d_m)}{\sqrt{\sum_{n=1}^{N}\left(p(t_n|d_s)\right)^2}\,\sqrt{\sum_{n=1}^{N}\left(p(t_n|d_m)\right)^2}}$
where H is the maximum similarity, $t_n$ is the n-th topic, 1≤n≤N, N is the number of topics, $d_m$ is the m-th document, 1≤m≤M, M is the number of documents, $p(t_n|d_m)$ is the probability that $d_m$ belongs to $t_n$, $d_s$ is the second-language webpage, and $p(t_n|d_s)$ is the probability that $d_s$ belongs to $t_n$.
According to a preferred embodiment of the present invention, the sorting unit comprises: an out-of-vocabulary word count module for counting the number of out-of-vocabulary (unregistered) words that each second-language webpage contains during translation; and a sorting module for sorting the plurality of first-language webpages according to the number of out-of-vocabulary words.
According to a preferred embodiment of the present invention, the sorting unit comprises: a translation score calculation module for calculating the average translation score of each second-language webpage during translation; and a sorting module for sorting the plurality of first-language webpages according to the average translation score.
According to a preferred embodiment of the present invention, the translation score calculation module calculates the average translation score according to the following formula:
$A = \sum_{k=1}^{K} score_k / K$
where A is the average translation score, $score_k$ is the translation score of the k-th sentence in the second-language webpage, 1≤k≤K, and K is the number of sentences in the second-language webpage.
According to a preferred embodiment of the present invention, the sorting unit comprises: a rule usage count module for counting the number of times translation rules are used for each second-language webpage during translation; and a sorting module for sorting the plurality of first-language webpages according to the rule usage count.
As can be seen from the above technical solutions, the webpage sorting method and system in cross-language search provided by the present invention sort the translated search results according to translation confidence, thereby improving the user experience.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the webpage sorting method in cross-language search according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a first embodiment of the sorting process of the webpage sorting method shown in Fig. 1;
Fig. 3 is a schematic flowchart of a second embodiment of the sorting process of the webpage sorting method shown in Fig. 1;
Fig. 4 is a schematic flowchart of a third embodiment of the sorting process of the webpage sorting method shown in Fig. 1;
Fig. 5 is a schematic flowchart of a fourth embodiment of the sorting process of the webpage sorting method shown in Fig. 1;
Fig. 6 is a schematic flowchart of a fifth embodiment of the sorting process of the webpage sorting method shown in Fig. 1;
Fig. 7 is a schematic flowchart of a sixth embodiment of the sorting process of the webpage sorting method shown in Fig. 1;
Fig. 8 is a schematic block diagram of the webpage sorting system in cross-language search according to an embodiment of the present invention;
Fig. 9 is a schematic block diagram of a first embodiment of the sorting unit of the webpage sorting system shown in Fig. 8;
Fig. 10 is a schematic block diagram of a second embodiment of the sorting unit of the webpage sorting system shown in Fig. 8;
Fig. 11 is a schematic block diagram of a third embodiment of the sorting unit of the webpage sorting system shown in Fig. 8;
Fig. 12 is a schematic block diagram of a fourth embodiment of the sorting unit of the webpage sorting system shown in Fig. 8;
Fig. 13 is a schematic block diagram of a fifth embodiment of the sorting unit of the webpage sorting system shown in Fig. 8;
Fig. 14 is a schematic block diagram of a sixth embodiment of the sorting unit of the webpage sorting system shown in Fig. 8.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and embodiments.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the webpage sorting method in cross-language search according to an embodiment of the present invention. In this embodiment, the webpage sorting method in cross-language search mainly comprises the following steps:
In step S101, a first-language search request is obtained. In this step, the user may enter the first-language search request (query) to be searched, for example a Chinese search request, in the search box of a browser and click the search button. The first-language search request is transmitted over the Internet to the search engine and obtained by the search engine.
In step S102, the first-language search request is translated into a second-language search request. In this step, the first-language search request may be translated into the second-language search request by any machine translation means known in the art; for example, when searching English webpages in Chinese, the Chinese search request is translated into an English search request. Specific machine translation means may include word-based, phrase-based, or syntax-based statistical machine translation.
In step S103, a plurality of second-language webpages are searched for using the second-language search request. In this step, the search engine searches for a plurality of second-language webpages, for example English webpages, that are relevant to the second-language search request.
In step S104, the plurality of second-language webpages are translated into a plurality of first-language webpages. In this step, the content of the second-language webpages may be translated into the first language, for example Chinese, by the machine translation means mentioned above, thereby realizing cross-language search.
In step S105, the plurality of first-language webpages are sorted according to the translation confidence of the plurality of second-language webpages. In this step, the higher the translation confidence of a first-language webpage, the earlier its position among the sorted first-language webpages, so that web results with good translation quality are offered to the user first, thereby improving the user experience. Several embodiments for obtaining the translation confidence of the second-language webpages are described in detail below; those skilled in the art may equally apply other translation confidence acquisition methods known in the art to step S105.
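Steps S101 to S105 can be sketched as a single function. The following is a minimal illustration only: the names `TranslatedPage`, `cross_language_search`, `translate_query`, `search`, `translate_page`, and `confidence` are hypothetical and do not come from the patent, and the translation, search, and confidence components are passed in as callables because the patent leaves their implementation open.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TranslatedPage:
    source_text: str      # second-language page content
    translated_text: str  # first-language translation of that content
    confidence: float     # translation confidence of the second-language page


def cross_language_search(
    query: str,
    translate_query: Callable[[str], str],
    search: Callable[[str], List[str]],
    translate_page: Callable[[str], str],
    confidence: Callable[[str], float],
) -> List[TranslatedPage]:
    """Hypothetical pipeline mirroring steps S101-S105."""
    second_language_query = translate_query(query)      # S102
    pages = search(second_language_query)               # S103
    results = [
        TranslatedPage(p, translate_page(p), confidence(p))  # S104
        for p in pages
    ]
    # S105: pages with higher translation confidence come first.
    return sorted(results, key=lambda r: r.confidence, reverse=True)
```

Any of the confidence measures described in the embodiments that follow could be supplied as the `confidence` argument, provided it is oriented so that larger values mean higher confidence.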
Referring to Fig. 2, Fig. 2 is a schematic flowchart of a first embodiment of the sorting process of the webpage sorting method shown in Fig. 1. This embodiment mainly comprises the following steps:
In step S201, the source-language corpus of the bilingual corpus used when translating the second-language webpages is obtained. In machine translation, a bilingual corpus is generally used to train the translation model. The bilingual corpus comprises a plurality of bilingual example sentence pairs, each consisting of a source-language example sentence and the corresponding target-language example sentence. When translating the second-language webpages, the source language is the second language and the target language is the first language. Bilingual corpora are commonly used in the machine translation field and can be obtained in various ways, which are not described again here.
In step S202, a language model, for example an n-gram language model, is generated from the source-language corpus.
In step S203, the translation perplexity of the second-language webpage is calculated using the language model. Specifically, for a sentence $x_i$ consisting of L words $w_1, w_2, ..., w_L$, the probability of occurrence of the sentence can be calculated by the language model:
$p(x_i) = p(w_1, w_2, ..., w_L) = \prod_{l=1}^{L} p(w_l \mid w_{l-n}, ..., w_{l-1})$
where $p(w_l \mid w_{l-n}, ..., w_{l-1})$ denotes the probability that the word $w_l$ occurs together with the n preceding words $w_{l-n}, ..., w_{l-1}$, and n is a positive integer. For example, n=2 in a 2-gram language model and n=3 in a 3-gram language model.
For a second-language webpage containing I sentences, the translation perplexity of the second-language webpage can be calculated by the following formula:
$P = 2^{-\sum_{i=1}^{I} p(x_i)\log p(x_i)}$
where P is the translation perplexity of the second-language webpage, $x_i$ is the i-th sentence in the second-language webpage, 1≤i≤I, I is the number of sentences in the second-language webpage, and $p(x_i)$ is the probability of occurrence of sentence $x_i$ calculated by the above language model. A higher translation perplexity indicates a higher translation complexity and therefore a lower translation confidence.
In step S204, the plurality of first-language webpages are sorted according to the translation perplexity. The higher the translation perplexity of a first-language webpage, the later its position among the sorted first-language webpages.
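The calculation in steps S202 and S203 can be illustrated with a small sketch. The assumptions here are not from the patent: the language model is a bigram model that conditions each word on one preceding word, add-one smoothing is used so that unseen word pairs keep a non-zero probability, tokens are split on whitespace, and the logarithm is taken to base 2. The function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Build an add-one-smoothed bigram model from source-language sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def prob(prev, word):
        # Smoothed estimate of p(word | prev).
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

    return prob


def sentence_probability(prob, sentence):
    """p(x_i): product of the conditional word probabilities."""
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= prob(prev, word)
    return p


def page_perplexity(prob, sentences):
    """P = 2 ** (-sum_i p(x_i) * log2 p(x_i)) over the sentences of a page."""
    total = 0.0
    for s in sentences:
        p = sentence_probability(prob, s)
        if p > 0.0:  # skip sentences whose probability underflowed to zero
            total += p * math.log2(p)
    return 2.0 ** (-total)
```

Under step S204, pages would then be ordered by ascending value, for example with `sorted(pages, key=lambda sents: page_perplexity(prob, sents))`, where each element of the hypothetical `pages` list is a list of sentences.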
Referring to Fig. 3, Fig. 3 is a schematic flowchart of a second embodiment of the sorting process of the webpage sorting method shown in Fig. 1. This embodiment mainly comprises the following steps:
In step S301, the number of reordering operations applied to the second-language webpage during translation is counted. During translation, the order of the translated words or phrases of a source-language sentence often has to be adjusted; each such adjustment is a reordering. A larger reordering count indicates a higher translation complexity and therefore a lower translation confidence.
In step S302, the plurality of first-language webpages are sorted according to the reordering count. The larger the reordering count of a first-language webpage, the later its position among the sorted first-language webpages.
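The patent does not specify how reordering operations are counted. As one possible illustration, assume the translation system exposes, for every sentence, the target position of each source word in source order; the number of source-word pairs whose relative order is swapped in the translation can then serve as a proxy for the reordering count. The alignment format and the function names below are hypothetical.

```python
def count_reorderings(target_positions):
    """Count source-word pairs whose order is inverted in the translation."""
    count = 0
    for i in range(len(target_positions)):
        for j in range(i + 1, len(target_positions)):
            if target_positions[i] > target_positions[j]:
                count += 1
    return count


def rank_by_reordering(pages):
    """pages: list of (page_id, [alignment per sentence]) pairs.

    Returns page ids ordered from fewest to most reorderings (step S302).
    """
    scored = [
        (sum(count_reorderings(a) for a in alignments), page_id)
        for page_id, alignments in pages
    ]
    return [page_id for _, page_id in sorted(scored)]
```

A page whose sentences translate monotonically therefore scores 0 and is placed first.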
Referring to Fig. 4, Fig. 4 is a schematic flowchart of a third embodiment of the sorting process of the webpage sorting method shown in Fig. 1. This embodiment mainly comprises the following steps:
In step S401, the source-language corpus of the bilingual corpus used when translating the second-language webpages is obtained.
In step S402, the source-language corpus is clustered into a plurality of documents. Specifically, a clustering algorithm is used to cluster the sentences in the source-language corpus, and the sentences of each cluster are gathered into one document, thereby forming a plurality of documents. Subsequently, Probabilistic Latent Semantic Analysis (PLSA) or another algorithm is used to obtain a plurality of topics from the plurality of documents, and the probability that each document belongs to each topic is calculated to form a plurality of first vectors:
$Vec(d_m) = (p(t_1|d_m), p(t_2|d_m), ..., p(t_n|d_m), ..., p(t_N|d_m))$
where $t_n$ is the n-th topic, 1≤n≤N, N is the number of topics, $d_m$ is the m-th document, 1≤m≤M, M is the number of documents, and $p(t_n|d_m)$ is the probability that document $d_m$ belongs to topic $t_n$.
In step S403, the maximum similarity between the second-language webpage and the plurality of documents is calculated. Specifically, the probability that the second-language webpage belongs to each topic is calculated to form a second vector:
$Vec(d_s) = (p(t_1|d_s), p(t_2|d_s), ..., p(t_n|d_s), ..., p(t_N|d_s))$
where $d_s$ is the second-language webpage and $p(t_n|d_s)$ is the probability that the second-language webpage $d_s$ belongs to topic $t_n$.
Subsequently, the similarity between each of the first vectors and the second vector is calculated, and the largest of these similarities is selected as the maximum similarity. The similarity may be calculated as:
$H = \max_{m=1}^{M} \dfrac{\sum_{n=1}^{N} p(t_n|d_s)\times p(t_n|d_m)}{\sqrt{\sum_{n=1}^{N}\left(p(t_n|d_s)\right)^2}\,\sqrt{\sum_{n=1}^{N}\left(p(t_n|d_m)\right)^2}}$
where H is the maximum similarity. A higher maximum similarity indicates a higher translation quality and therefore a higher translation confidence.
In step S404, the plurality of first-language webpages are sorted according to the maximum similarity H. The higher the maximum similarity of a first-language webpage, the earlier its position among the sorted first-language webpages.
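The similarity computation of step S403 can be sketched as follows, reading the similarity as the cosine of the angle between topic-probability vectors, which is an assumption about the formula. The topic vectors are assumed to be already available (for example from PLSA); the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic-probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_similarity(page_topics, document_topics):
    """H: the largest similarity between the page vector Vec(d_s)
    and any corpus document vector Vec(d_m)."""
    return max(cosine(page_topics, d) for d in document_topics)
```

For instance, a page with topic vector `[0.5, 0.5]` compared against corpus documents `[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]` matches the third document exactly, so H is 1 up to floating-point rounding.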
Referring to Fig. 5, Fig. 5 is a schematic flowchart of a fourth embodiment of the sorting process of the webpage sorting method shown in Fig. 1. This embodiment mainly comprises the following steps:
In step S501, the number of out-of-vocabulary (unregistered) words that the second-language webpage contains during translation is counted. An out-of-vocabulary word is a word that does not appear in the source-language corpus, including proper nouns of all kinds (personal names, place names, organization names, and so on), abbreviations, and newly coined words. In machine translation, more out-of-vocabulary words indicate poorer translation quality and therefore a lower translation confidence.
In step S502, the plurality of first-language webpages are sorted according to the number of out-of-vocabulary words. The more out-of-vocabulary words a first-language webpage has, the later its position among the sorted first-language webpages.
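A minimal sketch of steps S501 and S502 is given below. It assumes whitespace tokenization and case-insensitive matching, neither of which is specified in the patent, and the function names are hypothetical.

```python
def build_vocabulary(corpus_sentences):
    """Collect the set of words seen in the source-language corpus."""
    return {w.lower() for s in corpus_sentences for w in s.split()}


def count_oov(page_text, vocabulary):
    """Number of word tokens on the page that are absent from the corpus."""
    return sum(1 for w in page_text.split() if w.lower() not in vocabulary)


def rank_by_oov(pages, vocabulary):
    """Order page texts from fewest to most out-of-vocabulary words."""
    return sorted(pages, key=lambda text: count_oov(text, vocabulary))
```

A production system would more likely count the words for which the translation model itself has no entry; the corpus vocabulary is used here only as a stand-in.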
Referring to Fig. 6, Fig. 6 is a schematic flowchart of a fifth embodiment of the sorting process of the webpage sorting method shown in Fig. 1. This embodiment mainly comprises the following steps:
In step S601, the average translation score of the second-language webpage during translation is calculated. Specifically, the average translation score of the second-language webpage is calculated according to the following formula:
$A = \sum_{k=1}^{K} score_k / K$
where A is the average translation score of the second-language webpage, $score_k$ is the translation score of the k-th sentence in the second-language webpage, 1≤k≤K, and K is the number of sentences in the second-language webpage. In this step, the translation score of each sentence can be determined by a translation evaluation method known in the art, for example an automatic evaluation method such as the normalized sentence translation probability. A higher average translation score indicates a higher translation quality and therefore a higher translation confidence.
In step S602, the plurality of first-language webpages are sorted according to the average translation score. The higher the average translation score of a first-language webpage, the earlier its position among the sorted first-language webpages.
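The averaging in step S601 and the ordering in step S602 can be sketched as follows. The per-sentence scores are assumed to be supplied by the translation system; the function names and the choice to score an empty page as 0.0 are illustrative assumptions.

```python
def average_translation_score(sentence_scores):
    """A = (sum of score_k for k = 1..K) / K; an empty page scores 0.0."""
    if not sentence_scores:
        return 0.0
    return sum(sentence_scores) / len(sentence_scores)


def rank_by_average_score(pages):
    """pages: dict mapping page id to its list of sentence scores.

    Returns page ids ordered from highest to lowest average score.
    """
    return sorted(
        pages,
        key=lambda page_id: average_translation_score(pages[page_id]),
        reverse=True,
    )
```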
Referring to Fig. 7, Fig. 7 is a schematic flowchart of a sixth embodiment of the sorting process of the webpage sorting method shown in Fig. 1. This embodiment mainly comprises the following steps:
In step S701, the number of times translation rules are used for the second-language webpage during translation is counted. In the machine translation field, certain translation rules are often formulated, for example translation rules for particular phrases. In machine translation, the more often rules are used, the poorer the translation quality and therefore the lower the translation confidence.
In step S702, the plurality of first-language webpages are sorted according to the rule usage count. The larger the rule usage count of a first-language webpage, the later its position among the sorted first-language webpages.
The first to fourth embodiments above obtain features representing translation confidence from the source-language side of the second-language webpage, while the fifth and sixth embodiments obtain such features from the translation model or the translation result of the second-language webpage. Those skilled in the art can of course obtain other features representing translation confidence by other means.
Further, after reading the foregoing, those skilled in the art can readily combine the various features representing translation confidence described above, for example by using a regression learning method to map a feature vector comprising the above features to a real number, thereby forming a translation confidence that integrates these features. This process can be implemented with known tools, for example the SVM-light tool.
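The mapping from a feature vector to a single real-valued confidence can be illustrated with a linear model. This sketch only applies weights that are assumed to have been learned beforehand by some regression method; the feature names, the weights in the usage example, and the linear form itself are hypothetical and are not taken from the patent or from SVM-light.

```python
from typing import Dict

# Hypothetical names for the six features described in the embodiments.
FEATURES = (
    "perplexity",
    "reorder_count",
    "max_similarity",
    "oov_count",
    "avg_score",
    "rule_count",
)


def combined_confidence(
    features: Dict[str, float],
    weights: Dict[str, float],
    bias: float = 0.0,
) -> float:
    """Map the feature vector to one real number: bias + sum(w_f * x_f)."""
    return bias + sum(weights[name] * features[name] for name in FEATURES)
```

With illustrative weights of -1 for perplexity, reordering count, out-of-vocabulary count, and rule usage count, and +2 for maximum similarity and average score, a page with features (1.0, 2.0, 0.5, 0.0, 0.5, 1.0) in the order of `FEATURES` receives a combined confidence of -2.0. The signs reflect the direction stated in each embodiment; real weights would come from training data.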
In addition, once the translation confidence has been obtained, it can also be used as one feature in combination with other sorting methods known in the art, for example learning to rank or the PageRank method.
Referring to Fig. 8, Fig. 8 is a schematic block diagram of the webpage sorting system in cross-language search according to an embodiment of the present invention. In this embodiment, the webpage sorting system mainly comprises a search request acquisition unit 801, a first translation unit 802, a search unit 803, a second translation unit 804 and a sorting unit 805.
The search request acquisition unit 801 is configured to obtain a first-language search request. The user can enter the first-language search request (query) to be searched, for example a Chinese search request, in the search box of a browser and click the search button. The first-language search request is transmitted over the Internet to the search request acquisition unit 801, which obtains it.
The first translation unit 802 is configured to translate the first-language search request into a second-language search request. It can do so by various machine translation techniques well known in the art; for example, when Chinese is used to search English webpages, the Chinese search request is translated into an English search request. Specific techniques may include word-based, phrase-based or syntax-based statistical machine translation.
The search unit 803 is configured to search for a plurality of second-language webpages using the second-language search request. It uses various search engine techniques well known in the art to find a plurality of second-language webpages, for example English webpages, that are relevant to the second-language search request.
The second translation unit 804 is configured to translate the plurality of second-language webpages into a plurality of first-language webpages. It can translate the content of each second-language webpage into the first language, for example Chinese, by the machine translation techniques mentioned above, thereby realizing cross-language search. In this embodiment, the first translation unit 802 and the second translation unit 804 can be implemented by the same translation unit or by different translation units.
The sorting unit 805 is configured to sort the plurality of first-language webpages according to the translation confidence of the plurality of second-language webpages. In the sorted result, a first-language webpage with higher translation confidence is placed nearer the front, so that web results with good translation quality are offered to the user first, which improves the user experience. Several embodiments for obtaining the translation confidence of a second-language webpage are described in detail below; those skilled in the art can also apply other translation confidence acquisition methods well known in the art to the sorting unit 805.
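The data flow through units 801 to 805 can be sketched as a single function. This is only an illustration under stated assumptions: the four callables passed in (query translator, search engine, page translator and confidence estimator) are hypothetical placeholders for whatever components an implementation actually uses, and none of the names below come from the patent.

```python
from typing import Callable, List, Tuple


def cross_language_search(
    first_language_query: str,
    translate_query: Callable[[str], str],
    search: Callable[[str], List[str]],
    translate_page: Callable[[str], str],
    confidence: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Return translated pages ordered by descending translation confidence."""
    second_language_query = translate_query(first_language_query)  # unit 802
    second_language_pages = search(second_language_query)          # unit 803
    scored = []
    for page in second_language_pages:
        translated = translate_page(page)                          # unit 804
        # The confidence is estimated for the second-language (source) page.
        scored.append((translated, confidence(page)))
    # Unit 805: higher confidence first; sorted() is stable, so pages with
    # equal confidence keep the order returned by the search step.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy stand-ins so the sketch can be run end to end.
toy_index = {"q-en": ["a much longer page here", "short page"]}
results = cross_language_search(
    "q",
    translate_query=lambda q: q + "-en",
    search=lambda q: toy_index[q],
    translate_page=lambda p: p.upper(),
    confidence=lambda p: 1.0 / len(p.split()),
)
```

Passing the components in as callables keeps the sorting step independent of any particular translation or search technique, which matches the statement that units 802 and 804 may share one translation unit or use different ones.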
Referring to Fig. 9, Fig. 9 is a schematic block diagram of a first embodiment of the sorting unit 805 of the webpage sorting system shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises a source-language corpus acquisition module 901, a language model generation module 902, a perplexity calculation module 903 and a sorting module 904.
The source-language corpus acquisition module 901 is configured to obtain the source-language corpus of the bilingual corpus used when translating the second-language webpage. In machine translation, a bilingual corpus is generally used to train the translation model. The bilingual corpus comprises a plurality of bilingual example sentence pairs, each of which comprises a source-language example sentence and the corresponding target-language example sentence. When a second-language webpage is translated, the source language is the second language and the target language is the first language. Bilingual corpora are commonly used in the machine translation field and can be obtained in various ways, which are not described further here.
The language model generation module 902 is configured to generate a language model, for example an n-gram language model, from the source-language corpus.
The perplexity calculation module 903 is configured to calculate the translation perplexity of the second-language webpage using the language model. Specifically, for a sentence $x_i$ consisting of $L$ words $w_1, w_2, \ldots, w_L$, the probability of occurrence of the sentence can be calculated with the language model:
$$p(x_i) = p(w_1, w_2, \ldots, w_L) = \prod_{l=1}^{L} p(w_l \mid w_{l-n}, \ldots, w_{l-1})$$
where $p(w_l \mid w_{l-n}, \ldots, w_{l-1})$ is the probability that the word $w_l$ occurs together with the $n$ preceding words $w_{l-n}, \ldots, w_{l-1}$, and $n$ is a positive integer. For example, $n = 2$ in a 2-gram language model and $n = 3$ in a 3-gram language model.
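The sentence probability above can be sketched with a small count-based n-gram model. This is an illustrative assumption rather than the model the text prescribes: `NGramModel`, its `context_size` parameter (the number of preceding words conditioned on), the `<s>` padding token and the add-one smoothing are all choices made for this sketch.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


class NGramModel:
    """Count-based n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, sentences: Iterable[Sequence[str]], context_size: int = 1):
        self.context_size = context_size
        self.context_counts: Counter = Counter()
        self.ngram_counts: Counter = Counter()
        self.vocabulary = set()
        for sentence in sentences:
            self.vocabulary.update(sentence)
            padded = self._pad(sentence)
            for i in range(context_size, len(padded)):
                context = tuple(padded[i - context_size:i])
                self.context_counts[context] += 1
                self.ngram_counts[context + (padded[i],)] += 1

    def _pad(self, sentence: Sequence[str]) -> List[str]:
        # Pad the start so the first words also have a full context.
        return ["<s>"] * self.context_size + list(sentence)

    def word_prob(self, context: Sequence[str], word: str) -> float:
        # Add-one smoothing; the extra 1 reserves probability mass for unseen words.
        vocab_size = len(self.vocabulary) + 1
        context = tuple(context)
        return (self.ngram_counts[context + (word,)] + 1) / (
            self.context_counts[context] + vocab_size
        )

    def sentence_log2_prob(self, sentence: Sequence[str]) -> float:
        """log2 of the product of conditional word probabilities."""
        padded = self._pad(sentence)
        n = self.context_size
        return sum(
            math.log2(self.word_prob(padded[i - n:i], padded[i]))
            for i in range(n, len(padded))
        )


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
lm = NGramModel(corpus, context_size=1)
in_domain = lm.sentence_log2_prob(["the", "cat", "sat"])
scrambled = lm.sentence_log2_prob(["sat", "the", "cat"])
```

Working in log space avoids numeric underflow on long sentences; the plain probability $p(x_i)$ is recovered as `2 ** log2_prob` when needed.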
For a second-language webpage containing $I$ sentences, the translation perplexity of the webpage can be calculated by the following formula:
$$P = 2^{-\sum_{i=1}^{I} p(x_i) \log p(x_i)}$$
where $P$ is the translation perplexity of the second-language webpage, $x_i$ is the $i$-th sentence in the second-language webpage, $1 \le i \le I$, $I$ is the number of sentences in the second-language webpage, and $p(x_i)$ is the probability of occurrence of the sentence $x_i$ calculated with the language model above. A higher translation perplexity means that the webpage is harder to translate, so its translation confidence is lower.
The sorting module 904 is configured to sort the plurality of first-language webpages according to the translation perplexity. In the result sorted by the sorting module 904, a first-language webpage with a higher translation perplexity is placed further back.
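A minimal sketch of the perplexity formula and the resulting ordering is given below. It assumes the logarithm is taken to base 2 (the text does not state the base) and that the sentence probabilities have already been computed by a language model; the function names and the page identifiers are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def translation_perplexity(sentence_probs: Sequence[float]) -> float:
    """P = 2 ** (-sum_i p(x_i) * log2 p(x_i)), log base 2 assumed."""
    exponent = -sum(p * math.log2(p) for p in sentence_probs if p > 0.0)
    return 2.0 ** exponent


def rank_by_perplexity(pages: Dict[str, Sequence[float]]) -> List[str]:
    """Order page ids so that higher-perplexity pages come later."""
    return sorted(pages, key=lambda page_id: translation_perplexity(pages[page_id]))


# Hypothetical per-sentence probabilities for two pages.
page_probs = {"page_a": [0.5, 0.5], "page_b": [1.0]}
ordering = rank_by_perplexity(page_probs)
```

Sentences with zero probability are skipped because their term $p \log p$ tends to zero and `math.log2(0)` would raise an error.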
Referring to Fig. 10, Fig. 10 is a schematic block diagram of a second embodiment of the sorting unit 805 of the webpage sorting system shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises a reordering count module 1001 and a sorting module 1002.
The reordering count module 1001 is configured to count the number of reorderings performed while translating the second-language webpage. During translation, the order of the translated words or phrases of a source-language sentence often has to be adjusted; each such adjustment is a reordering. A larger reordering count means that the webpage is harder to translate, so its translation confidence is lower.
The sorting module 1002 is configured to sort the plurality of first-language webpages according to the reordering count. In the result sorted by the sorting module 1002, a first-language webpage with a larger reordering count is placed further back.
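One way to obtain a reordering count is sketched below. The text does not specify how reorderings are counted, so this sketch assumes that each translated sentence comes with the target-side position of every source unit and counts adjacent source units whose translations appear in swapped order; the function names and data layout are hypothetical.

```python
from typing import Dict, List, Sequence


def count_reorderings(target_positions: Sequence[int]) -> int:
    """Count adjacent source units whose translations swap order in the target."""
    return sum(
        1
        for current, following in zip(target_positions, target_positions[1:])
        if following < current
    )


def page_reordering_count(sentences: Sequence[Sequence[int]]) -> int:
    """Total reorderings over all sentences of one page."""
    return sum(count_reorderings(positions) for positions in sentences)


def rank_by_reordering(pages: Dict[str, Sequence[Sequence[int]]]) -> List[str]:
    """Order page ids so that pages with more reorderings come later."""
    return sorted(pages, key=lambda page_id: page_reordering_count(pages[page_id]))


# Hypothetical alignments: one list of target positions per sentence.
alignments = {
    "page_a": [[1, 0, 3, 2], [2, 0, 1]],
    "page_b": [[0, 1, 2]],
}
ordering = rank_by_reordering(alignments)
```

A decoder that already reports how many reordering operations it applied could supply that number directly in place of `count_reorderings`.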
Referring to Fig. 11, Fig. 11 is a schematic block diagram of a third embodiment of the sorting unit 805 of the webpage sorting system shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises a source-language corpus acquisition module 1101, a clustering module 1102, a similarity calculation module 1103 and a sorting module 1104.
The source-language corpus acquisition module 1101 is configured to obtain the source-language corpus of the bilingual corpus used when translating the second-language webpage.
The clustering module 1102 is configured to cluster the source-language corpus into a plurality of documents. Specifically, the clustering module 1102 uses a clustering algorithm to cluster the sentences in the source-language corpus and gathers the sentences of each cluster into one document, thereby forming a plurality of documents. The clustering module 1102 then uses probabilistic latent semantic analysis (PLSA) or another algorithm to obtain a plurality of topics from the plurality of documents, and calculates the probability that each document belongs to each topic, to form a plurality of first vectors:
$$\mathrm{Vec}(d_m) = \big(p(t_1 \mid d_m),\, p(t_2 \mid d_m),\, \ldots,\, p(t_n \mid d_m),\, \ldots,\, p(t_N \mid d_m)\big)$$
where $t_n$ is the $n$-th topic, $1 \le n \le N$, $N$ is the number of topics, $d_m$ is the $m$-th document, $1 \le m \le M$, $M$ is the number of documents, and $p(t_n \mid d_m)$ is the probability that document $d_m$ belongs to topic $t_n$.
The similarity calculation module 1103 is configured to calculate the maximum similarity between the second-language webpage and the plurality of documents. Specifically, the similarity calculation module 1103 calculates the probability that the second-language webpage belongs to each topic, to form a second vector:
$$\mathrm{Vec}(d_s) = \big(p(t_1 \mid d_s),\, p(t_2 \mid d_s),\, \ldots,\, p(t_n \mid d_s),\, \ldots,\, p(t_N \mid d_s)\big)$$
where $d_s$ is the second-language webpage and $p(t_n \mid d_s)$ is the probability that the second-language webpage $d_s$ belongs to topic $t_n$.
The similarity calculation module 1103 then calculates the similarity between each of the first vectors and the second vector, and selects the largest of these similarities as the maximum similarity. A specific similarity formula can be:
$$H = \max_{m=1}^{M} \frac{\sum_{n=1}^{N} p(t_n \mid d_s) \times p(t_n \mid d_m)}{\sqrt{\sum_{n=1}^{N} \big(p(t_n \mid d_s)\big)^2}\,\sqrt{\sum_{n=1}^{N} \big(p(t_n \mid d_m)\big)^2}}$$
where $H$ is the maximum similarity. A higher maximum similarity indicates higher translation quality and therefore higher translation confidence.
The sorting module 1104 is configured to sort the plurality of first-language webpages according to the maximum similarity $H$. In the result sorted by the sorting module 1104, a first-language webpage with a higher maximum similarity is placed nearer the front.
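The maximum similarity and the resulting ordering can be sketched as follows, assuming the similarity for $H$ is the cosine similarity between topic-probability vectors and that those vectors have already been produced by PLSA or a comparable topic model. The function names and the example vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two topic-probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_similarity(page_vector: Sequence[float],
                   document_vectors: Sequence[Sequence[float]]) -> float:
    """H: the best match between the page and any training-corpus document."""
    return max(cosine_similarity(page_vector, doc) for doc in document_vectors)


def rank_by_max_similarity(page_vectors: Dict[str, Sequence[float]],
                           document_vectors: Sequence[Sequence[float]]) -> List[str]:
    """Order page ids so that pages closer to the training corpus come first."""
    return sorted(
        page_vectors,
        key=lambda page_id: max_similarity(page_vectors[page_id], document_vectors),
        reverse=True,
    )


# Hypothetical topic distributions over N = 2 topics.
corpus_documents = [[1.0, 0.0]]
pages = {"page_a": [0.5, 0.5], "page_b": [1.0, 0.0]}
ordering = rank_by_max_similarity(pages, corpus_documents)
```

The intuition is that a webpage whose topics resemble some part of the training corpus is likely to be translated better than one on a topic the translation model has never seen.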
Referring to Fig. 12, Fig. 12 is a schematic block diagram of a fourth embodiment of the sorting unit 805 of the webpage sorting system shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises an unregistered word count module 1201 and a sorting module 1202.
The unregistered word count module 1201 is configured to count the number of unregistered (out-of-vocabulary) words that the second-language webpage contains during translation. An unregistered word is a word that does not appear in the source-language corpus, including proper nouns of all kinds (person names, place names, organization names and so on), abbreviations and newly coined words. The more unregistered words a webpage contains, the poorer the translation quality and the lower the translation confidence.
The sorting module 1202 is configured to sort the plurality of first-language webpages according to the number of unregistered words. In the result sorted by the sorting module 1202, a first-language webpage with more unregistered words is placed further back.
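Counting unregistered words can be sketched as a vocabulary lookup. The sketch assumes the pages are already tokenized, that the vocabulary is the set of words seen in the source-language corpus, and that every occurrence of an unknown word is counted (the text does not say whether repeats count once or several times); the names are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def count_unregistered_words(tokens: Sequence[str], vocabulary: Set[str]) -> int:
    """Number of token occurrences that are absent from the corpus vocabulary."""
    return sum(1 for token in tokens if token not in vocabulary)


def rank_by_unregistered_words(pages: Dict[str, Sequence[str]],
                               vocabulary: Set[str]) -> List[str]:
    """Order page ids so that pages with more unregistered words come later."""
    return sorted(
        pages,
        key=lambda page_id: count_unregistered_words(pages[page_id], vocabulary),
    )


# Hypothetical vocabulary and tokenized pages.
corpus_vocabulary = {"the", "cat", "sat"}
tokenized_pages = {
    "page_a": ["the", "zyx", "cat", "zyx"],
    "page_b": ["the", "cat", "sat"],
}
ordering = rank_by_unregistered_words(tokenized_pages, corpus_vocabulary)
```

Dividing the count by the page length would give a rate instead of a raw count, which may be preferable when pages differ greatly in size; that variant is not described in the text.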
Referring to Fig. 13, Fig. 13 is a schematic block diagram of a fifth embodiment of the sorting unit 805 of the webpage sorting system shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises a translation score calculation module 1301 and a sorting module 1302.
The translation score calculation module 1301 is configured to calculate the average translation score of the second-language webpage during translation. Specifically, the translation score calculation module 1301 calculates the average translation score of the second-language webpage according to the following formula:
$$A = \frac{\sum_{k=1}^{K} \mathrm{score}_k}{K}$$
where $A$ is the average translation score of the second-language webpage, $\mathrm{score}_k$ is the translation score of the $k$-th sentence in the second-language webpage, $1 \le k \le K$, and $K$ is the number of sentences in the second-language webpage. The translation score calculation module 1301 can determine the translation score of each sentence by a translation evaluation method well known in the art, for example an automatic evaluation method such as the normalized sentence translation probability. A higher average translation score indicates higher translation quality and therefore higher translation confidence.
The sorting module 1302 is configured to sort the plurality of first-language webpages according to the average translation score. In the result sorted by the sorting module 1302, a first-language webpage with a higher average translation score is placed nearer the front.
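The average translation score can be sketched as below. The per-sentence scorer shown, a length-normalized translation probability computed as the geometric mean of the per-word probabilities, is only one possible reading of "normalized sentence translation probability" and is an assumption of this sketch, as are the function names and sample values.

```python
import math
from typing import Dict, List, Sequence


def length_normalized_score(sentence_log_prob: float, length: int) -> float:
    """Geometric-mean probability per word: exp(log p / length)."""
    return math.exp(sentence_log_prob / length)


def average_translation_score(sentence_scores: Sequence[float]) -> float:
    """A = (sum of score_k) / K; an empty page gets 0.0 by convention here."""
    if not sentence_scores:
        return 0.0
    return sum(sentence_scores) / len(sentence_scores)


def rank_by_average_score(pages: Dict[str, Sequence[float]]) -> List[str]:
    """Order page ids so that pages with higher average scores come first."""
    return sorted(
        pages,
        key=lambda page_id: average_translation_score(pages[page_id]),
        reverse=True,
    )


# Hypothetical per-sentence scores for two pages.
sentence_scores = {"page_a": [0.4, 0.2], "page_b": [0.8, 0.6]}
ordering = rank_by_average_score(sentence_scores)
```

Normalizing by sentence length keeps long sentences from being penalized merely for having more words, so that the average over a page reflects quality rather than length.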
Referring to Fig. 14, Fig. 14 is a schematic block diagram of a sixth embodiment of the sorting unit 805 of the webpage sorting system shown in Fig. 8. The sorting unit 805 of this embodiment mainly comprises a rule usage count module 1401 and a sorting module 1402.
The rule usage count module 1401 is configured to count the number of times translation rules are used while translating the second-language webpage. Machine translation systems often define specific translation rules, for example rules for translating particular phrases. The more often such rules are used during machine translation, the poorer the translation quality and the lower the translation confidence.
The sorting module 1402 is configured to sort the plurality of first-language webpages according to the rule usage count. In the result sorted by the sorting module 1402, a first-language webpage with a higher rule usage count is placed further back.
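Sorting by rule usage count can be sketched as follows, assuming the translation step records an identifier for each rule application on each page. The log format, rule identifiers and function name are hypothetical.

```python
from typing import Dict, List, Sequence


def rank_by_rule_usage(rule_log: Dict[str, Sequence[str]]) -> List[str]:
    """Order page ids so that pages translated with more rule applications come later."""
    return sorted(rule_log, key=lambda page_id: len(rule_log[page_id]))


# Hypothetical log: one rule identifier per rule application.
applied_rules = {
    "page_a": ["phrase_rule_1", "phrase_rule_2", "phrase_rule_1"],
    "page_b": ["phrase_rule_3"],
    "page_c": [],
}
ordering = rank_by_rule_usage(applied_rules)
```

Here every application counts, including repeated use of the same rule, which follows the wording "number of times rules are used".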
The first to fourth embodiments above obtain the features that indicate translation confidence from the source-language side of the second-language webpage, while the fifth and sixth embodiments obtain them from the translation model or the translation result of the second-language webpage. Those skilled in the art can of course obtain other features that indicate translation confidence by other means.
Further, after reading the foregoing, those skilled in the art can readily combine the various features described above. For example, a regression learning method can map a feature vector comprising these features to a real number, which then serves as a translation confidence that integrates all of them. This process can be implemented with known tools, for example SVM-light.
In addition, once the translation confidence has been obtained, it can be used as one feature and combined with other ranking methods well known in the art, for example learning to rank or PageRank.
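Combining translation confidence with another ranking signal can be sketched as a weighted sum. This is a deliberately simple stand-in for the learning-to-rank or PageRank combination mentioned above, not an implementation of either: the relevance score is assumed to come from an existing ranker and to be scaled to the same range as the confidence, and the `weight` parameter and all names are hypothetical.

```python
from typing import List, Tuple


def combined_score(relevance: float, confidence: float, weight: float = 0.5) -> float:
    """Linear interpolation between a relevance score and translation confidence."""
    return (1.0 - weight) * relevance + weight * confidence


def rank_combined(results: List[Tuple[str, float, float]], weight: float = 0.5) -> List[str]:
    """results holds (page_id, relevance, confidence); best combined score first."""
    ordered = sorted(
        results,
        key=lambda item: combined_score(item[1], item[2], weight),
        reverse=True,
    )
    return [page_id for page_id, _, _ in ordered]


# Hypothetical scores, both assumed to lie in [0, 1].
candidates = [("page_a", 0.9, 0.2), ("page_b", 0.8, 0.9)]
balanced = rank_combined(candidates, weight=0.5)
relevance_only = rank_combined(candidates, weight=0.0)
```

With `weight=0.0` the ordering reduces to the original relevance ranking, which makes it easy to compare results with and without the confidence signal.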
As can be seen from the above technical solutions, the webpage sorting method and system in cross-language search provided by the present invention sort the translated search results according to translation confidence, thereby improving the user experience.
The above embodiments describe the present invention by way of example only. After reading this patent application, those skilled in the art can make various modifications to the present invention without departing from its spirit and scope.

Claims (6)

$$P = 2^{-\sum_{i=1}^{I} p(x_i) \log p(x_i)}$$
wherein $P$ is the translation perplexity, $x_i$ is the $i$-th sentence in the second-language webpage, $1 \le i \le I$, $I$ is the number of sentences in the second-language webpage, and $p(x_i)$ is the probability of occurrence of $x_i$ calculated with the language model;
or, the sorting unit comprises:
a source-language corpus acquisition module, configured to obtain the source-language corpus of the bilingual corpus used when translating the second-language webpage;
a clustering module, configured to cluster the source-language corpus into a plurality of documents;
a similarity calculation module, configured to calculate the maximum similarity between the second-language webpage and the plurality of documents;
a sorting module, configured to sort the plurality of first-language webpages according to the maximum similarity;
wherein the clustering module obtains a plurality of topics from the plurality of documents and calculates the probability that each document belongs to each topic, to form a plurality of first vectors; and the similarity calculation module calculates the probability that the second-language webpage belongs to each topic, to form a second vector, calculates the similarity between each of the first vectors and the second vector, and selects the largest of these similarities as the maximum similarity.
5. The webpage sorting system in cross-language search as claimed in claim 4, characterized in that the language model is an n-gram language model.
6. The webpage sorting system in cross-language search as claimed in claim 4, characterized in that the similarity calculation module calculates the maximum similarity according to the following formula:
$$H = \max_{m=1}^{M} \frac{\sum_{n=1}^{N} p(t_n \mid d_s) \times p(t_n \mid d_m)}{\sqrt{\sum_{n=1}^{N} \big(p(t_n \mid d_s)\big)^2}\,\sqrt{\sum_{n=1}^{N} \big(p(t_n \mid d_m)\big)^2}}$$
wherein $H$ is the maximum similarity, $t_n$ is the $n$-th topic, $1 \le n \le N$, $N$ is the number of topics, $d_m$ is the $m$-th document, $1 \le m \le M$, $M$ is the number of documents, $p(t_n \mid d_m)$ is the probability that $d_m$ belongs to $t_n$, $d_s$ is the second-language webpage, and $p(t_n \mid d_s)$ is the probability that $d_s$ belongs to $t_n$.
Application CN2011100498831A | Priority date 2011-03-02 | Filing date 2011-03-02 | Webpage sorting method and system in cross-language search | Active | CN102654867B (en)

Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
CN2011100498831A (CN102654867B (en)) | 2011-03-02 | 2011-03-02 | Webpage sorting method and system in cross-language search
PCT/CN2011/083411 (WO2012116561A1 (en)) | 2011-03-02 | 2011-12-03 | Web page ranking method and system in cross-language search

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN2011100498831A (CN102654867B (en)) | 2011-03-02 | 2011-03-02 | Webpage sorting method and system in cross-language search

Publications (2)

Publication Number | Publication Date
CN102654867A (en) | 2012-09-05
CN102654867B (en) | 2013-12-11

Family

Family ID: 46730493

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN2011100498831A (Active, CN102654867B (en)) | Webpage sorting method and system in cross-language search | 2011-03-02 | 2011-03-02

Country Status (2)

Country | Link
CN (1) | CN102654867B (en)
WO (1) | WO2012116561A1 (en)

Also Published As

Publication Number | Publication Date
WO2012116561A1 (en) | 2012-09-07
CN102654867A (en) | 2012-09-05

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant
