A kind of result for retrieval syllogic ordered collection method and systemTechnical field
The invention belongs to retrieval technique field, be specifically related to a kind of result for retrieval syllogic ordered collection method and system, be particularly suitable for the ordered collection to full-text search result.
Background technology
Text retrieval system, after retrieving according to retrieval request, also needs result for retrieval to sort, and then returns to the Ordinal in user's specified scope.Sort by is the degree of correlation of result for retrieval and retrieval request normally, can be also other foundations, such as the significance level of result for retrieval, the order of certain field value of result for retrieval etc.The object that result for retrieval is sorted, the result that makes to meet user's needs appears at the foremost of result sequence.
Text retrieval system foreground is generally to result for retrieval paging displaying, and once an exposition result, constantly shows subsequent result by page turning.When page turning, or by retrieval again or obtain the content of this displaying by reading the buffer memory generating while retrieving last time.Therefore the result for retrieval that text retrieval system returns is generally the data volume of one page or multipage, and along with the carrying out of page turning, the result for retrieval scope of obtaining constantly moves.
The subsequent treatment band of the result for retrieval of text retrieval system performance issue: the increase of the data volume of processing along with text retrieval system, in the situation that hitting large result, the storage of result for retrieval and sequence can take a large amount of storage spaces and processing time, cause retrieval performance not good, height is processed result for retrieval concomitantly fast.
At present, the solution that the search engine on network adopts is: only support to return front some maximally related results, surpass and limit not returning of quantity, thereby avoided this problem.As Baidu, Google all only provide less than the browsing of 800 result for retrieval, page turning only provides prompting after surpassing this quantity, no longer continues to show later retrieval result.But in the application of enterprise-level text retrieval system, requirement can be returned to whole results that retrieval is hit, and when the continuous page turning of user, can browse to whole result for retrieval.
User, when browsing result for retrieval, generally only browses former page, but also constantly page turning sometimes, or checks the last few pages.Therefore need to be according to user's demand, returning part result for retrieval.These result for retrieval may sort the most forward, may be also after leaning on most, or the result in the middle of being positioned at.
In prior art, in order to reduce the pressure of result for retrieval storage and sequence, for user, only browse the situation of former pages, use Top-K method.The best result for retrieval of K item before described Top-K method refers to and only collects, abandons all the other results.The quantity K that the default needs of text retrieval system are collected returns to the part that user requires from this in K item result.If Ahmed Metwally etc. is described in " EfficientComputation of Frequent and Top-k Elements in Data Streams " (In Proceedingsof the 10th ICDT International Conference on Database Theory.398-412.): some results sorting before collecting, when new result is added in set, replace minimum item, and finally from set, return to Top-K item value.Needed memory headroom while having saved in this way processing large result, thus raise the efficiency.But the countermeasure when result for retrieval that does not propose user's request exceeds front K item.Therefore, the method is only applicable to only need the situation of the best result for retrieval of returning part.
When the result for retrieval of user's request exceeds front K item, in prior art, generally there are following two kinds of collection modes:
1. collect whole results and carry out partial orderedly, return to the desired part of user.
The result that unordered collection is all hit, the range of results of returning according to user's requirement is carried out partial ordered, obtains between fruiting area, and the result in this interval is carried out to complete sequence, thereby obtain required result and return to user.The defect that this mode exists is: need a large amount of storage spaces, and the sequence performance of large result is lower.
2. dynamic expansion K value, to the K ' that comprises user and require the result return, is collected the best result for retrieval of K ' item.
In the famous text retrieval system Lucene that increases income (http://lucene.apache.org/), used this mode.As Top-K method, the best result for retrieval of K ' item before only collecting, abandons all the other results, finally obtains required result and returns to user.This mode has good performance when K ' approaches K, but along with the expansion of K ' value, its performance also constantly declines.The result for retrieval scope requiring user, when whole result for retrieval last, just become and collects whole result for retrieval, and storage space is now the same with the 1st kind of mode, and owing to all sorting and causing performance lower.
Above two kinds of existing modes all need to preserve a large amount of result for retrieval, thereby have taken a large amount of memory headrooms, and the performance that sorts when result for retrieval is more is low, make the hydraulic performance decline of text retrieval system.
Summary of the invention
For the defect existing in prior art, fundamental purpose of the present invention is to provide that a kind of can to return to sequence according to user's needs the most forward, or after leaning on most, or the result for retrieval syllogic ordered collection method and system of some middle result for retrieval.On this basis, the present invention can also improve the sequence performance of result for retrieval, reduces the storage space of result for retrieval, reduces taking of memory headroom, improves on the whole the performance of searching system.
To achieve these goals, the technical solution used in the present invention is as follows:
A result for retrieval syllogic ordered collection method, first the method retrieves and obtains result for retrieval according to retrieval request; Then according to the capacity of predetermined head section, rear and interlude, sequence in result for retrieval is collected in head section at head, sequence is collected in rear in last part, and sequence is collected in interlude in middle part; Finally according to result for retrieval, return to request and from the beginning in section, rear and/or interlude, return to result for retrieval.
Result for retrieval syllogic ordered collection method as above, wherein, a head section capacity comprises the first capacity K and the second capacity K ', the second capacity K ' is between 1.5 times to 10 times of the first capacity K; Rear capacity M is between 0.5 times to 1 times of the first capacity K; The result for retrieval scope that interlude capacity N returns according to request and head section capacity are determined.Head section the first capacity K is generally the result for retrieval quantity that 5 to 10 result for retrieval displayed pages comprise, and is preferably the result for retrieval quantity that 8 result for retrieval displayed pages comprise; The second capacity K ' is preferably the result for retrieval quantity that 50 result for retrieval displayed pages comprise, and rear capacity M is preferably the result for retrieval quantity that 4 result for retrieval displayed pages comprise.
Result for retrieval syllogic ordered collection method as above, wherein, the result for retrieval descending sort in head section, the result for retrieval ascending order in rear is arranged, and the result for retrieval in interlude does not sort.
Result for retrieval syllogic ordered collection method as above, wherein, sequence in result for retrieval is collected in head section at head, and sequence is collected in rear in last part, and the process that sequence is collected in interlude in middle part comprises the following steps:
(1) determine head section the first capacity K and the second capacity K ', rear capacity M;
(2) according to result for retrieval, return to the result for retrieval scope that in request, requirement is returned and determine collection mode;
Described collection mode comprises only to be collected head section result for retrieval and collects head section, rear and interlude result for retrieval simultaneously; If collect head section, rear and interlude result for retrieval simultaneously, determine the distance D between interlude capacity N, interlude and head section;
(3) according to collection mode definite in step (2), result for retrieval is collected to section to the end, or head section, rear and interlude.
Result for retrieval syllogic ordered collection method as above, the result for retrieval scope of returning as requested described in step (2) determines that the process of collection mode is as follows:
The result for retrieval scope that request is returned is by reference position SI and return to amount R C and represent;
If (SI+RC) ∈ [0, K], head section capacity adopts the first capacity K, only collects head section result for retrieval;
If (SI+RC) ∈ (K, K '], a head section capacity adopts the second capacity K ', only collects a head section result for retrieval;
If (SI+RC) > K ' and SI < K ', head section capacity adopts the second capacity K ', N=SI+RC-K ', and D=0, collects head section, rear and interlude result for retrieval;
If (SI+RC) > K ' and SI >=K ', head section capacity adopts the second capacity K ', N=RC, and D=SI-K ', collects head section, rear and interlude result for retrieval.
Result for retrieval syllogic ordered collection method as above, described in step (3), result for retrieval is collected to section to the end, or the process of head section, rear and interlude comprises the following steps:
1. from result for retrieval set, take out a result for retrieval X;
2. whether judgement head section is full; As less than, result for retrieval X is added in head section, and in correct section, result for retrieval sorts, and goes to step 7.;
If 3. head section is full, result for retrieval X is compared with the minterm in head section; If be less than minterm, using result for retrieval X as the result excluding; Otherwise result for retrieval X is added in head section and sequence, using minterm as the result excluding;
4. judge that whether rear is full; As less than, the result excluding is added in rear, and result for retrieval in rear is sorted, go to step 7.;
If 5. rear is full, the maximal term in the result excluding and rear is compared; If be less than maximal term, the result excluding added in rear, and the result for retrieval in rear is sorted, using maximal term as the result excluding; Otherwise the result excluding is constant;
6. judge the quantity of result for retrieval in the value of D and interlude; If D > 0, successively decreases one by D, abandon the result excluding; If D=0, and in interlude, the quantity of result for retrieval is less than N, the result excluding is added in interlude, according to the order adding, preserves result for retrieval; If the quantity of result for retrieval equals N in interlude, abandon the result excluding;
7. judge whether result for retrieval set is empty; In this way, finish; As no, go to step 1..
Result for retrieval syllogic ordered collection method as above, wherein, according to result for retrieval, return and ask the process of returning to result for retrieval to comprise the following steps:
(a) the relatively quantity of result for retrieval and a capacity for head section in head section;
(b) if the quantity of result for retrieval is less than its capacity in head section, the RC bar result for retrieval a Duan Zhongcong SI being started is as returning results; If RC has surpassed a Duan Zhongcong SI, start to the result for retrieval quantity of ending, a Duan Zhongcong SI is started to the result for retrieval ending up as returning results; If SI is greater than the quantity of result for retrieval in head section, return results as sky;
(c) if the quantity of result for retrieval equals its capacity in head section, from SI, start [MAX ((K '-SI) minimum head section, 0)] in the whole result for retrieval in bar result for retrieval, interlude and rear, maximum [the whole result for retrieval quantity in RC-MAX ((K '-SI), 0)-interlude] bar record conduct returns results.
A kind of result for retrieval syllogic ordered collection system, comprise for retrieve and obtain the indexing unit of result for retrieval according to retrieval request, for result for retrieval sequence being collected in to the head section collection device in head section at head according to predetermined head section capacity; For result for retrieval sequence being collected in to the rear collection device in rear in last part according to predetermined rear capacity; For result for retrieval sequence being collected in to the interlude collection device in interlude in middle part according to predetermined interlude capacity; And the result for retrieval return mechanism that returns to result for retrieval for return to the result for retrieval scope of asking requirement to be returned according to result for retrieval.
The method of the invention and system, by the mode that adopts three sections to collect result for retrieval, not only realized that according to user's needs, to return to sequence the most forward, or after leaning on most, or some middle result for retrieval, and the effectively less storage space of result for retrieval, strengthened the sequence efficiency of result for retrieval, comprehensively improved the overall performance of searching system.
Accompanying drawing explanation
Fig. 1 is the graph of a relation between three sections in embodiment;
Fig. 2 is the structured flowchart of syllogic ordered collection system in embodiment;
Fig. 3 adopts system shown in Figure 2 to collect the process flow diagram of result for retrieval in embodiment;
Fig. 4 adds result for retrieval respectively the process flow diagram of section, rear and interlude to the end according to the order of sequence in embodiment.
Embodiment
Core concept of the present invention is: first according to retrieval request, retrieve and obtain result for retrieval; Then according to the capacity of predetermined head section (Top-K), rear (Bottom-M) and interlude (Middle-N), sequence in result for retrieval is collected in head section at head, sequence is collected in rear in last part, and sequence is collected in interlude in middle part; Finally according to result for retrieval, return to request and from the beginning in section, rear and/or interlude, return to result for retrieval.Wherein, the result for retrieval descending sort in head section, the result for retrieval ascending order in rear is arranged, and the result for retrieval in interlude does not sort.Relation between head section, interlude and rear as shown in Figure 1.In the time of in the front or the most last ranking results that the range of results that user's request is returned is being collected, return to ranking results accurately; When range of results that request is returned is not exclusively in the orderly range of results in front and back, return to front and rear part Ordinal and middle not ranking results; When range of results that request is returned is completely not in the orderly range of results in front and back, the not ranking results in the middle of returning.Below in conjunction with embodiment and accompanying drawing, describe the present invention.
Fig. 2 has shown the structure of the preferred implementation of result for retrieval syllogic ordered collection system of the present invention.This system comprisesindexing unit 11, headsection collection device 12,rear collection device 13,interlude collection device 14 and result forretrieval return mechanism 15.
Indexing unit 11 is for retrieving and obtain result for retrieval according to retrieval request.Headsection collection device 12 for according to predetermined head section capacity by result for retrieval sequence in some the most front collection in head section, the result for retrieval descending sort in head section.Rear collection device 13 for according to predetermined rear capacity by result for retrieval sequence in some last collection at rear, the result for retrieval ascending order in rear is arranged.Interlude collection device 14 for according to predetermined interlude capacity by result for retrieval sequence in some collection of centre in interlude, the result for retrieval of interlude does not sort.Result forretrieval return mechanism 15 requires the result for retrieval scope of returning to return to result for retrieval for return to request according to result for retrieval.
Fig. 3 adopts system shown in Figure 1 to collect the flow process of result for retrieval, comprises the following steps:
(1) determine head section the first capacity K and the second capacity K ', rear capacity M, and result for retrieval returns to the result for retrieval scope that in request, requirement is returned.Wherein, the value of K, K ' and M is according to user's service condition analysis is determined afterwards, K ' be generally K 1.5 times to 10 between, M is generally between 0.5 to 1 times of K.K is generally the result for retrieval quantity that 5-10 result for retrieval displayed page comprises, and the best is the result for retrieval quantity that 8 result for retrieval displayed pages comprise.K ' adjusts according to the value of K, and optimum value is the result for retrieval quantity that 50 result for retrieval displayed pages comprise.M generally remains on the result for retrieval quantity that 4 left and right result for retrieval displayed pages comprise.
(2) the result for retrieval scope of returning is as requested determined collection mode, and described collection mode comprises only to be collected head section result for retrieval and collect head section, rear and interlude result for retrieval simultaneously.If need to collect head section, rear and interlude result for retrieval simultaneously, determine the distance D (with reference to Fig. 1) of interlude capacity N (being the fruiting quantities that interlude need to be collected), interlude and head section.
The result for retrieval scope that requirement is returned is by reference position StartIndex (being abbreviated as SI) and return to amount R eturnCount (being abbreviated as RC) expression.When result for retrieval is multipage, along with user's page turning, SI constantly changes; RC value is exactly the result for retrieval quantity that one page is shown, is generally 10.
When result for retrieval scope that needs return is in [0, K] interval, during (SI+RC) ∈ [0, K], a head section capacity adopts K.When result for retrieval scope that needs return (K, K '] in interval, (SI+RC) ∈ (K, K '] time, a head section capacity adopts K '.Under both of these case, only need to collect head section result for retrieval, without collecting rear and interlude.
When result for retrieval scope that needs return not all in a segment limit (now a head section capacity is K '), i.e. (SI+RC) > K ' time, if existing, both scopes partly overlap, i.e. SI < K ', N=SI+RC-K ', D=0.If it is overlapping that both do not exist, i.e. SI >=K ', N=RC, D=SI-K '.Under both of these case, need to collect head section, rear and interlude simultaneously.
(3) according to collection mode definite in step (2), result for retrieval is collected to section to the end, or head section, rear and interlude, detailed process as shown in Figure 4, comprises the following steps:
1. from result for retrieval set, take out a result for retrieval X;
2. whether judgement head section is full; As less than, result for retrieval X is added in head section, and in correct section, result for retrieval carries out descending sort, goes to step 7.;
If 3. head section is full, result for retrieval X is compared with the minterm in head section; If be less than minterm, using result for retrieval X as the result excluding; Otherwise result for retrieval X is added in head section and descending sort, using minterm as the result excluding; Described minterm refers to the item that comes minimum position according to ordering rule;
4. judge that whether rear is full; As less than, the result excluding is added in rear, and result for retrieval in rear is carried out to ascending order arrangement, go to step 7.;
If 5. rear is full, the maximal term in the result excluding and rear is compared; If be less than maximal term, the result excluding added in rear, and the result for retrieval in rear is carried out to ascending order arrangement, using maximal term as the result excluding; Otherwise the result excluding is constant; Described maximal term refers to the item that comes maximum position according to ordering rule;
6. judge the quantity of result for retrieval in the value of D and interlude; If D > 0, successively decreases one by D, abandon the result excluding; If D=0, and in interlude, the quantity of result for retrieval is less than N, the result excluding is added in interlude, according to the order adding, preserves result for retrieval; If the quantity of result for retrieval equals N in interlude, abandon the result excluding;
7. judge whether result for retrieval set is empty; In this way, finish; As no, go to step 1..
(4) according to result for retrieval, return to request and require the result for retrieval scope of returning to return to result for retrieval, detailed process comprises the following steps:
(a) the relatively quantity of result for retrieval and a capacity for head section in head section.
(b) if the quantity of result for retrieval is less than its capacity in head section, illustrate that now interlude, rear are all empty, the RC bar result for retrieval that a Duan Zhongcong SI is started is as returning results.The result for retrieval quantity that Duan Zhongcong SI starts may not enough RC bar, also may ask the result for retrieval scope of returning to exceed the result for retrieval scope of collecting in head section.If the result for retrieval amount R C that request is returned has surpassed a Duan Zhongcong SI, start to the result for retrieval quantity of ending, a Duan Zhongcong SI is started to the result for retrieval ending up as returning results; If SI is greater than the quantity of result for retrieval in head section, return results as sky.
(c) if the quantity of result for retrieval equals its capacity in head section, from SI, by MAX ((K '-SI) minimum in head section, 0) in the whole result for retrieval in bar result for retrieval, interlude and rear, maximum [the whole result for retrieval quantity in RC-MAX ((K '-SI), 0)-interlude] bar record conduct returns results.If interlude is non-conterminous with head section, the result of returning in head section is for empty; If interlude and rear are non-conterminous, rear returns results as sky; If when request only returns results therein in a section, other sections returns results as sky.
Embodiment
The result that the collection of take is below retrieved press index storehouse is example, and said method is illustrated.In the retrieve application in press index storehouse, a result for retrieval displayed page is shown 10 records, 8 pages of results of primary retrieval buffer memory just can be satisfied the demand substantially, therefore K gets the fruiting quantities 80 that 8 result for retrieval displayed pages comprise, K ' gets 500, it is 40 that M gets 0.5 of K, and these parameters are that user is according to the feature configured in advance of searching system.
Inquiry " content: China " in the press index storehouse of approximately 1,000 ten thousand data volumes, and according to the order sequence from back to front of publications time, hit altogether 2,201,853 results.
When getting follow-up 10 results since the 10th, i.e. SI=10, RC=10.SI+RC < K, requires the result for retrieval returning in head section.Now, directly front K item is sorted to maximum collection in head section, all the other results are abandoned.Finally from the beginning in section, obtain needed result.
When getting follow-up 10 results since the 495th, i.e. SI=495, RC=10.SI+RC > K ' and SI < K ', require the result for retrieval returning in head section and interlude, N=SI+RC-K '=5, D=0.
For every result for retrieval, according to method as shown in Figure 4, first add to the end in section, if there is result to be excluded, join in rear again, if there is result to be excluded, according to the situation of interlude, add interlude again, all the other results are abandoned.
Finally from the beginning in section, read from the minimum result of [MAX ((K '-SI), 0)=5] bar, and from interlude, obtain all results (N=5 bar), as returning results, the result that rear returns is for empty.
When getting follow-up 10 results since the 510th, i.e. SI=510, RC=10.SI+RC > K ', requires the result for retrieval returning in interlude, N=RC=10, D=SI-K '=10.
For every result for retrieval, according to method as shown in Figure 4, first add to the end in section, if there is result to be excluded, join in rear again, if there is result to be excluded, according to the situation of interlude, add interlude again, all the other results are abandoned.
Finally from interlude, obtain all results (N=10 bar), as returning results, the returning results as sky of other sections.
When from the 2nd, 201,810 while starting to get follow-up 10 results, i.e. SI=2,201,810, RC=10.SI+RC > K ', requires the result for retrieval returning in interlude and rear, N=RC=10, D=SI-K '=2,201,310.
For every result for retrieval, according to method as shown in Figure 4, first add to the end in section, if there is result to be excluded, join in rear again, if there is result to be excluded, according to the situation of interlude, add interlude again, all the other results are abandoned.
Finally from interlude, obtain all results (less than, only have 3), bar (totally 7) the maximum result that sorts is as returning results from rear, to obtain [whole fruiting quantities in RC-MAX ((K '-SI), 0)-interlude], and head section returns results as sky.
When from the 2nd, 201,820 while starting to get follow-up 10 results, i.e. SI=2,201,820, RC=10.SI+RC > K ', requires the result for retrieval returning in rear, N=RC=10, D=SI-K '=2,201,320.
For every result for retrieval, according to method as shown in Figure 4, first add to the end in section, if there is result to be excluded, join in rear again, if there is result to be excluded, according to the situation of interlude, add interlude again, all the other results are abandoned.
Finally from rear, obtain and return results, other sections return results as sky.
Obviously, those skilled in the art can carry out various changes and modification and not depart from the spirit and scope of the present invention the present invention.Like this, if within of the present invention these are revised and modification belongs to the scope of the claims in the present invention and equivalent technology thereof, the present invention is also intended to comprise these changes and modification interior.