Summary of the invention
To capture users' dynamically changing search needs in time, to follow the interaction between the user and the search engine, and to update the retrieval model promptly, the present invention provides an adaptive personalized information retrieval system and method.
The adaptive personalized information retrieval system of the present invention comprises:
a data input subsystem for constructing a feature matrix from the current query information in combination with historical query information and historical click information, and for obtaining from the feature matrix the data used to train the parameter prediction model;
a parameter training and prediction subsystem for training and applying the parameter prediction model according to the feature matrix to obtain the predicted parameters;
a retrieval execution subsystem for organizing the current query, the historical queries, and the historical clicks with the predicted parameters, and for combining the user model with the query model to form a personalized query model;
a data output subsystem for finding, among the documents to be retrieved, the documents that match the personalized query model as preliminary retrieval results, for ranking the preliminary retrieval results by relevance, and for outputting the ranked results as the final retrieval results.
The data input subsystem comprises:
a module for generating user behavior features from the current query information; and
a module for constructing the feature matrix from all of the user's behavior features so obtained.
The parameter training and prediction subsystem comprises:
a data input module for receiving the data to be processed;
a module for computing the historical queries and historical clicks corresponding to each query and organizing them into the required data format;
a module for constructing the feature matrix;
a module for searching for the optimal parameter of the current query by traversal, the step size of the traversal being 0.1;
a module for using an SVM-logistic regression model to establish the mapping between the user features and the optimal parameter.
The adaptive personalized information retrieval method of the present invention comprises:
a step of constructing a feature matrix from the current query information in combination with historical query information and historical click information;
a step of obtaining from the feature matrix the data used to train the parameter prediction model;
a step of training and applying the parameter prediction model according to the feature matrix to obtain the predicted parameters;
a step of organizing the current query, the historical queries, and the historical clicks with the predicted parameters, and combining the user model with the query model to form a personalized query model;
a step of finding, among the documents to be retrieved, the documents that match the personalized query model as preliminary retrieval results, ranking the preliminary retrieval results by relevance, and outputting the ranked results as the final retrieval results.
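The final matching and ranking step can be illustrated with a minimal sketch. It assumes that the personalized query model is a word-probability dictionary and that relevance is scored by a smoothed expected log-likelihood; neither the scoring function nor the names `score_document`, `rank_documents`, and `mu` come from the text above, which only requires ranking by relevance.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def score_document(query_model: Dict[str, float], document: str,
                   vocabulary_size: int, mu: float = 1.0) -> float:
    """Expected log-probability of the query words under a smoothed document model.

    Additive smoothing (mu) is an illustrative assumption, not the patent's formula.
    """
    words = document.lower().split()
    counts = Counter(words)
    total = len(words)
    score = 0.0
    for word, prob in query_model.items():
        smoothed = (counts[word] + mu) / (total + mu * vocabulary_size)
        score += prob * math.log(smoothed)
    return score


def rank_documents(query_model: Dict[str, float],
                   documents: Dict[str, str]) -> List[Tuple[str, float]]:
    """Score every candidate document and return (doc_id, score), best first."""
    vocabulary = {w for text in documents.values() for w in text.lower().split()}
    vocabulary |= set(query_model)
    scored = [(doc_id, score_document(query_model, text, len(vocabulary)))
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Called with a query model that puts its mass on "foreign" and "company", a document containing both words is ranked ahead of an unrelated one.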
The step of constructing the feature matrix from the current query information in combination with the historical query information and historical click information comprises:
a step of generating user behavior features from the current query information; and
a step of constructing the feature matrix from all of the user's behavior features so obtained.
The step of training and applying the parameter prediction model according to the feature matrix to obtain the predicted parameters further comprises:
a step of receiving the data to be processed;
a step of computing the historical queries and historical clicks corresponding to each query and organizing them into the required data format;
a step of constructing the feature matrix;
a step of searching for the optimal parameter of the current query by traversal, the step size of the traversal being 0.1;
a step of using an SVM regression model to establish the mapping between the user features and the optimal parameter.
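The traversal search with a step size of 0.1 can be sketched as follows. The sketch assumes the parameter lies in the range [0, 1] and that a caller supplies a `score` function (for example, a retrieval quality measure evaluated for one query); both are assumptions for illustration, as are the names `candidate_values` and `best_parameter`.

```python
from typing import Callable, List


def candidate_values(step: float = 0.1) -> List[float]:
    """Return the traversal grid 0.0, 0.1, ..., 1.0 (the range [0, 1] is assumed)."""
    count = int(round(1.0 / step))
    # Build the grid from integer indices so rounding error does not accumulate.
    return [round(i * step, 10) for i in range(count + 1)]


def best_parameter(score: Callable[[float], float], step: float = 0.1) -> float:
    """Evaluate every candidate and keep the one with the highest score.

    On ties, the smallest candidate wins because max() returns the first maximum.
    """
    return max(candidate_values(step), key=score)
```

The best value found for each training query would then serve as the regression target when the mapping from user features to the optimal parameter is learned.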
In the technical solution of the present invention, the user behavior features comprise:
historical click features, which represent the web documents the user viewed within one query session, that is, the web documents the user viewed within a short period of time;
historical query features, which represent the queries the user submitted to the retrieval system within one query session, that is, the queries the user submitted within a short period of time;
current query features, which represent the current query;
features between the current query and the historical queries, which represent the relationship between the current query and the historical queries;
features between the current query and the historical clicks, which represent the relationship between the current query and the historical clicks.
The specific content of the above five feature categories is as follows:
The historical click features comprise: the total number of historical clicks; the total length of historical clicks; the mean historical click length (the mean of the total click length corresponding to each query); the average length per click; the total length of the clicks for the last historical query; the number of documents clicked for the last historical query; and the mean length of the documents clicked for the last historical query;
The historical query features comprise: the total length of the historical queries; the average length of the historical queries; and the total number of historical queries;
The current query features comprise: the current query length;
The features between the current query and the historical queries comprise: for the current query terms compared with the last historical query, the recurrence probability of the newly added words in the last historical click; the number of newly added words in the current query compared with the last query; for the current query terms compared with the last historical query, the percentage of the current query length accounted for by co-occurring words; the mean similarity between the current query and the historical queries; the maximum similarity between the current query and the historical queries; the similarity between the current query terms and the last historical query; for the current query compared with the last historical query, the recurrence probability of the newly added words in the current query; the number of newly added words; the total number of occurrences of the newly added words; for the current query terms compared with the last historical query, the recurrence probability of the deleted words in the last historical query; the number of deleted words in the last historical query; the total number of occurrences of the deleted words in the last historical query; for the current query compared with the last historical query, the recurrence probability of the co-occurring words in the last historical query; the number of co-occurring words in the last historical query; and the total number of occurrences of the co-occurring words in the last historical query;
The features between the current query and the historical clicks comprise: the mean similarity between the current query terms and all historical clicks; the maximum similarity between the current query terms and all historical clicks; the similarity between the current query terms and the last historical click; the number of newly added words in the current query relative to the last historical click; the total number of occurrences of the newly added words in the last historical click; for the current query terms compared with the last historical query, the recurrence probability of the deleted words in the last historical click; the number of deleted words; the number of deleted words in the last historical click; the total number of occurrences of the deleted words in the last historical click; for the current query terms compared with the last historical click, the recurrence probability of the co-occurring words in the last historical click; the number of co-occurring words; the number of co-occurring words in the last historical click; and the total number of occurrences of the co-occurring words in the last historical click.
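A few of the features between the current query and the last historical query can be computed as in the following sketch. It assumes whitespace tokenization and lowercase matching, and the function and key names (`query_pair_features`, `num_added_words`, and so on) are hypothetical labels chosen for illustration; the recurrence-probability features are omitted because their exact definition is not given here.

```python
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Split on whitespace and lowercase (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def query_pair_features(current: str, previous: str) -> Dict[str, float]:
    """Count added, deleted, and co-occurring words between two consecutive queries."""
    cur, prev = tokenize(current), tokenize(previous)
    prev_counts = Counter(prev)
    cur_set, prev_set = set(cur), set(prev)
    added = cur_set - prev_set      # words only in the current query
    deleted = prev_set - cur_set    # words only in the last historical query
    shared = cur_set & prev_set     # co-occurring words
    shared_positions = sum(1 for word in cur if word in shared)
    return {
        "current_query_length": len(cur),
        "num_added_words": len(added),
        "num_deleted_words": len(deleted),
        "num_co_occurring_words": len(shared),
        # Total occurrences of the co-occurring words in the last historical query.
        "co_occurring_count_in_previous": sum(prev_counts[w] for w in shared),
        # Share of the current query length covered by co-occurring words.
        "co_occurring_share_of_current": shared_positions / len(cur) if cur else 0.0,
    }
```

Each query in a session would contribute one such feature row, and the rows together form the feature matrix.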
Because the user behavior features corresponding to each query are not necessarily the same, the parameters in the corresponding query model are not necessarily the same either. Therefore, the method of the present invention, which assigns parameters dynamically according to the specific retrieval context of each query, conforms more closely to the objective patterns of users' retrieval behavior.
In actual information retrieval, the feature weights obtained in training are used to predict the optimal parameters for the retrieval model. The present invention uses five categories of features related to the retrieval information to jointly determine which of three parts, namely the current query, the historical queries, and the historical clicks, expresses the user's retrieval intent more accurately and how much each contributes to the current retrieval task. The weights of the three parts are thereby assigned dynamically, and the optimal parameters are obtained.
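One way to picture the dynamic weighting of the three parts is a weighted mixture of word distributions, sketched below. The linear interpolation, the unigram estimates, and the names `unigram_model` and `personalized_query_model` are assumptions made for illustration; the text above states only that the three parts receive dynamically assigned weights.

```python
from collections import Counter
from typing import Dict, List


def unigram_model(texts: List[str]) -> Dict[str, float]:
    """Maximum-likelihood word distribution over a list of texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def personalized_query_model(current: str, history_queries: List[str],
                             history_clicks: List[str], w_current: float,
                             w_queries: float, w_clicks: float) -> Dict[str, float]:
    """Mix the three parts with the given weights (assumed linear interpolation)."""
    parts = [
        (unigram_model([current]), w_current),
        (unigram_model(history_queries), w_queries),
        (unigram_model(history_clicks), w_clicks),
    ]
    # Drop empty parts and renormalize so the result is still a distribution.
    active = [(model, weight) for model, weight in parts if model and weight > 0]
    norm = sum(weight for _, weight in active)
    mixed: Dict[str, float] = {}
    for model, weight in active:
        for word, prob in model.items():
            mixed[word] = mixed.get(word, 0.0) + (weight / norm) * prob
    return mixed
```

In the system described here, the three weights would come from the parameter prediction model rather than being fixed by hand.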
In summary, the parameters of the adaptive personalized retrieval model of the present invention are all predicted for the current query model from each user's interaction behavior, and a machine learning algorithm is used in the prediction. Such a retrieval model handles the parameters in the model flexibly and therefore offers higher flexibility and retrieval accuracy.
The adaptive retrieval model of the present invention continually adjusts itself as the number of interactions between the user and the retrieval system increases. Historical information is weighted dynamically according to its time distance from the present, and the parameter that determines the decay amplitude is produced by the parameter prediction model. To compare the present invention with mainstream techniques, the data of (Shen et al., 2005) were used, and the experimental setup is consistent with that article. For the special case in which the importance of historical information changes with its time distance from the present, the dynamic retrieval model was also compared with a retrieval model with fixed coefficients. Overall, as the historical information becomes richer, the retrieval effectiveness of the personalized retrieval models improves and the gap between the models becomes more pronounced; see the following table for details:
Take the fourth query Q4 that the user submits in a query session as an example, with the first query Q1, the second query Q2, and the third query Q3 again used as historical information. Even when differences in the importance of the historical information are not considered, the method proposed here under this condition (the AdaptiveEW result) achieves a relative improvement of 38.18% in MAP and 17.74% in PR20 over the traditional model (BayesInt). When differences in importance among the historical information are considered, the proposed AdaptiveDW model achieves relative improvements of 27.54% in MAP and 15.94% in PR20 over the BatchUp model. The data show that the retrieval effectiveness of the adaptive personalized retrieval model proposed by the present invention (AdaptiveDW) exceeds that of the best personalized retrieval model among current mainstream methods (the BatchUp mode).
In summary, the adaptive personalized retrieval model of the present invention uses the parameter prediction model to produce the individual weights, which balances the flexibility and the rationality of the weight allocation. On the same data set, the adaptive dynamic personalized retrieval model outperforms the mainstream techniques in retrieval effectiveness, which confirms the validity of the technique proposed in this invention.
The specific effects of the invention are as follows:
First, the present invention is effective for both new and old queries submitted by the user.
An old query is a query that has already appeared in the user's retrieval history; a new query is a query that the user submits for the first time. For old queries, because historical information is available for reference, the weight of the historical information in a personalized retrieval model is increased and is usually set to a constant close to 1. For new queries, because no history is available for reference, the weight of the historical information in a personalized retrieval model is reduced and is usually set to a constant close to 0. Unlike the prior art, the adaptive retrieval model of the present invention does not need to classify a query first to decide whether it is new or old. Instead, it sets the parameters in the retrieval model flexibly and directly according to the user behavior features. The present invention is therefore applicable to all types of user behavior features.
Second, the present invention assigns weights dynamically according to the user's interaction behavior.
The prior art does not set the parameters in the retrieval model with reference to a rich set of user behavior features. In fact, the user's retrieval behavior itself provides important information about the user's interests, and using this information as the basis greatly increases the rationality of the assigned weights. For example, if the current query is short, it provides little information, and the weight of the historical information should be increased. Conversely, if the user has very little historical information, the weight of the current query should be increased. The parameter training and prediction subsystem of the present invention assigns weights dynamically on the basis of the interest information that user behavior itself provides, which greatly increases the rationality of the weight allocation.
Third, the present invention uses a machine learning algorithm to complete the prediction automatically.
For example, if the current query is short, the weight of the historical information should be increased; if the user has very little historical information, the weight of the current query should be increased. However, when the current query is short and the historical information is also scarce, deciding how to assign the weights becomes complicated. The adaptive personalized retrieval model uses a machine learning algorithm to solve the problem that the model parameters are difficult to determine, which ensures the accuracy of the predicted weights to a certain extent.
Fourth, the present invention takes the temporal relationship between queries into account.
The user's query history is ordered by time, and newer queries are more important than older ones, so the weights of the historical queries are decayed according to their time distance from the current query.
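The time-based decay of historical queries can be sketched as follows. The exponential form `decay ** distance`, the normalization to a sum of 1, and the name `decayed_history_weights` are assumptions for illustration; the text states only that the weight decays with the time distance from the current query and that the decay amplitude is predicted by the parameter prediction model.

```python
from typing import List


def decayed_history_weights(num_history: int, decay: float) -> List[float]:
    """Return normalized weights for historical queries, ordered oldest to newest.

    The most recent historical query has distance 0 and therefore the largest
    weight; with decay = 1.0 every historical query is weighted equally.
    """
    if num_history <= 0:
        return []
    raw = [decay ** (num_history - 1 - index) for index in range(num_history)]
    total = sum(raw)
    return [value / total for value in raw]
```

With three historical queries and a decay of 0.5, the most recent query receives four times the weight of the oldest one.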
Fifth, the present invention answers the question of how to organize the relationship among the current query, the historical queries, and the historical clicks in personalized retrieval modeling.
Sixth, the present invention strengthens the processing of personalized information and explores how to mine the current user's historical information and current query information to further improve the retrieval effectiveness of the current query.
Seventh, the present invention makes no assumption about the distribution of users. This avoids the situation in which the true user distribution is inconsistent with an assumed one and retrieval effectiveness suffers as a result.
Embodiment
Embodiment one: The adaptive personalized information retrieval system of this embodiment comprises:
a data input subsystem for constructing a feature matrix from the current query information in combination with historical query information and historical click information, and for obtaining from the feature matrix the data used to train the parameter prediction model;
a parameter training and prediction subsystem for training and applying the parameter prediction model according to the feature matrix to obtain the predicted parameters;
a retrieval execution subsystem for organizing the current query, the historical queries, and the historical clicks with the predicted parameters, and for combining the user model with the query model to form a personalized query model;
a data output subsystem for finding, among the documents to be retrieved, the documents that match the personalized query model as preliminary retrieval results, for ranking the preliminary retrieval results by relevance, and for outputting the ranked results as the final retrieval results.
Embodiment two: This embodiment further defines the data input subsystem of the adaptive personalized information retrieval system of embodiment one. In this embodiment, the data input subsystem comprises:
a module for generating user behavior features from the current query information; and
a module for constructing the feature matrix from all of the user's behavior features so obtained.
Embodiment three: This embodiment further defines the parameter training and prediction subsystem of the adaptive personalized information retrieval system of embodiment one. In this embodiment, the parameter training and prediction subsystem comprises:
a data input module for receiving the data to be processed;
a module for computing the historical queries and historical clicks corresponding to each query and organizing them into the required data format;
a module for constructing the feature matrix;
a module for searching for the optimal parameter of the current query by traversal, the step size of the traversal being 0.1;
a module for using an SVM regression model to establish the mapping between the user features and the optimal parameter.
Embodiment four: This embodiment further describes the user behavior features in the adaptive personalized information retrieval system of embodiment one. The user behavior features comprise:
historical click features, which represent the web documents the user viewed within one query session, that is, the historical clicks the user viewed within a short period of time;
historical query features, which represent the queries the user submitted to the retrieval system within one query session, that is, the historical queries the user submitted within a short period of time;
current query features, which represent the current query;
features between the current query and the historical queries, which represent the relationship between the current query and the historical queries;
features between the current query and the historical clicks, which represent the relationship between the current query and the historical clicks.
Embodiment five: This embodiment further describes the adaptive personalized information retrieval system of embodiment four.
The historical click features comprise: the total number of historical clicks; the total length of historical clicks (measured in single words/terms); the mean historical click length (the mean of the total click length corresponding to each query); the average length per click; the total length of the clicks for the last historical query; the number of documents clicked for the last historical query; and the mean length of the documents clicked for the last historical query;
The historical query features comprise: the total length of the historical queries; the average length of the historical queries; and the total number of historical queries;
The current query features comprise: the current query length;
The features between the current query and the historical queries comprise: for the current query terms compared with the last historical query, the recurrence probability of the newly added words in the last historical click; the number of newly added words in the current query compared with the last query; for the current query terms compared with the last historical query, the percentage of the current query length accounted for by co-occurring words; the mean similarity between the current query and the historical queries; the maximum similarity between the current query and the historical queries; the similarity between the current query terms and the last historical query; for the current query compared with the last historical query, the recurrence probability of the newly added words in the current query; the number of newly added words; the total number of occurrences of the newly added words; for the current query terms compared with the last historical query, the recurrence probability of the deleted words in the last historical query; the number of deleted words in the last historical query; the total number of occurrences of the deleted words in the last historical query; for the current query compared with the last historical query, the recurrence probability of the co-occurring words in the last historical query; the number of co-occurring words in the last historical query; and the total number of occurrences of the co-occurring words in the last historical query;
The features between the current query and the historical clicks comprise: the mean similarity between the current query terms and all historical clicks; the maximum similarity between the current query terms and all historical clicks; the similarity between the current query terms and the last historical click; the number of newly added words in the current query relative to the last historical click; the total number of occurrences of the newly added words in the last historical click; for the current query terms compared with the last historical query, the recurrence probability of the deleted words in the last historical click; the number of deleted words; the number of deleted words in the last historical click; the total number of occurrences of the deleted words in the last historical click; for the current query terms compared with the last historical click, the recurrence probability of the co-occurring words in the last historical click; the number of co-occurring words; the number of co-occurring words in the last historical click; and the total number of occurrences of the co-occurring words in the last historical click.
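The historical click and historical query features listed above can be computed from a session history as in the following sketch. It assumes the history is a time-ordered list of (query, clicked document texts) pairs, that length is counted in whitespace-separated terms, and that "last" refers to the most recent historical query; the function and key names are hypothetical.

```python
from typing import Dict, List, Tuple


def term_length(text: str) -> int:
    """Length in single words/terms, using whitespace tokenization (assumed)."""
    return len(text.split())


def mean(values: List[float]) -> float:
    """Arithmetic mean that returns 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def history_features(history: List[Tuple[str, List[str]]]) -> Dict[str, float]:
    """Compute the 7 historical click features and 3 historical query features."""
    query_lengths = [term_length(query) for query, _ in history]
    clicks_per_query = [[term_length(doc) for doc in docs] for _, docs in history]
    all_clicks = [n for clicks in clicks_per_query for n in clicks]
    last_clicks = clicks_per_query[-1] if clicks_per_query else []
    return {
        "total_clicks": len(all_clicks),
        "total_click_length": sum(all_clicks),
        "mean_click_length_per_query": mean([sum(c) for c in clicks_per_query]),
        "mean_length_per_click": mean(all_clicks),
        "last_total_click_length": sum(last_clicks),
        "last_clicked_documents": len(last_clicks),
        "last_mean_click_length": mean(last_clicks),
        "total_query_length": sum(query_lengths),
        "mean_query_length": mean(query_lengths),
        "num_history_queries": len(history),
    }
```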
Embodiment six: The adaptive personalized information retrieval method of this embodiment comprises:
a step of constructing a feature matrix from the current query information in combination with historical query information and historical click information;
a step of obtaining from the feature matrix the data used to train the parameter prediction model;
a step of training and applying the parameter prediction model according to the feature matrix to obtain the predicted parameters;
a step of organizing the current query, the historical queries, and the historical clicks with the predicted parameters, and combining the user model with the query model to form a personalized query model;
a step of finding, among the documents to be retrieved, the documents that match the personalized query model as preliminary retrieval results, ranking the preliminary retrieval results by relevance, and outputting the ranked results as the final retrieval results.
Embodiment seven: This embodiment further defines, in the adaptive personalized information retrieval method of embodiment six, the step of constructing the feature matrix from the current query information in combination with the historical query information and historical click information. This step further comprises:
a step of generating user behavior features from the current query information; and
a step of constructing the feature matrix from all of the user's behavior features so obtained.
Embodiment eight: This embodiment further defines, in the adaptive personalized information retrieval method of embodiment six, the step of training and applying the parameter prediction model according to the feature matrix to obtain the predicted parameters. This step further comprises:
a step of receiving the data to be processed;
a step of computing the historical queries and historical clicks corresponding to each query and organizing them into the required data format;
a step of constructing the feature matrix;
a step of searching for the optimal parameter of the current query by traversal, the step size of the traversal being 0.1;
a step of using an SVM regression model to establish the mapping between the user features and the optimal parameter.
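The way the traversal step feeds the regression step can be sketched as the construction of training pairs: each query's feature row is paired with the best parameter found for that query. The sketch assumes a parameter range of [0, 1] and one caller-supplied scoring function per query; `build_training_pairs` is a hypothetical name, and fitting the SVM regression model on the resulting pairs is left out.

```python
from typing import Callable, List, Sequence, Tuple


def build_training_pairs(
    feature_rows: Sequence[Sequence[float]],
    score_functions: Sequence[Callable[[float], float]],
    step: float = 0.1,
) -> List[Tuple[Sequence[float], float]]:
    """Pair each feature row with the grid value that maximizes its query's score.

    score_functions[i](p) is assumed to return the retrieval quality of query i
    when the model parameter is set to p.
    """
    count = int(round(1.0 / step))
    grid = [round(i * step, 10) for i in range(count + 1)]  # 0.0, 0.1, ..., 1.0
    return [(row, max(grid, key=score))
            for row, score in zip(feature_rows, score_functions)]
```

The feature rows form the inputs and the selected parameters form the targets that a regression model would then be trained on.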
Embodiment nine: This embodiment further defines the user behavior features in the adaptive personalized information retrieval method of embodiment six. The user behavior features comprise:
historical click features, which represent the web documents the user viewed within one query session, that is, the historical clicks the user viewed within a short period of time;
historical query features, which represent the historical queries the user submitted to the retrieval system within one query session, that is, the historical queries the user submitted within a short period of time;
current query features, which represent the current query;
features between the current query and the historical queries, which represent the relationship between the current query and the historical queries;
features between the current query and the historical clicks, which represent the relationship between the current query and the historical clicks.
Embodiment ten: This embodiment further describes the five categories of technical features of embodiment nine:
The historical click features comprise: the total number of historical clicks; the total length of historical clicks (measured in single words/terms); the mean historical click length (the mean of the total click length corresponding to each query); the average length per click; the total length of the clicks for the last historical query; the number of documents clicked for the last historical query; and the mean length of the documents clicked for the last historical query;
The historical query features comprise: the total length of the historical queries; the average length of the historical queries; and the total number of historical queries;
The current query features comprise: the current query length;
The features between the current query and the historical queries comprise: for the current query terms compared with the last historical query, the recurrence probability of the newly added words in the last historical click; the number of newly added words in the current query compared with the last query; for the current query terms compared with the last historical query, the percentage of the current query length accounted for by co-occurring words; the mean similarity between the current query and the historical queries; the maximum similarity between the current query and the historical queries; the similarity between the current query terms and the last historical query; for the current query compared with the last historical query, the recurrence probability of the newly added words in the current query; the number of newly added words; the total number of occurrences of the newly added words; for the current query terms compared with the last historical query, the recurrence probability of the deleted words in the last historical query; the number of deleted words in the last historical query; the total number of occurrences of the deleted words in the last historical query; for the current query compared with the last historical query, the recurrence probability of the co-occurring words in the last historical query; the number of co-occurring words in the last historical query; and the total number of occurrences of the co-occurring words in the last historical query;
The features between the current query and the historical clicks comprise: the mean similarity between the current query terms and all historical clicks; the maximum similarity between the current query terms and all historical clicks; the similarity between the current query terms and the last historical click; the number of newly added words in the current query relative to the last historical click; the total number of occurrences of the newly added words in the last historical click; for the current query terms compared with the last historical query, the recurrence probability of the deleted words in the last historical click; the number of deleted words; the number of deleted words in the last historical click; the total number of occurrences of the deleted words in the last historical click; for the current query terms compared with the last historical click, the recurrence probability of the co-occurring words in the last historical click; the number of co-occurring words; the number of co-occurring words in the last historical click; and the total number of occurrences of the co-occurring words in the last historical click.
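The mean, maximum, and last-item similarity features can be sketched as follows. The similarity measure is not specified above, so cosine similarity over term counts is used here purely as an assumption; `cosine_similarity` and `similarity_features` are hypothetical names, and `history` may hold either historical queries or the texts of historical clicks, ordered oldest to newest.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def similarity_features(current: str, history: List[str]) -> Dict[str, float]:
    """Mean and maximum similarity to the history, plus similarity to the last item."""
    sims = [cosine_similarity(current, item) for item in history]
    if not sims:
        return {"mean": 0.0, "max": 0.0, "last": 0.0}
    return {"mean": sum(sims) / len(sims), "max": max(sims), "last": sims[-1]}
```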
The input data of the present invention are the consecutive query behaviors, ordered by time of occurrence, that each user carries out to satisfy one search need. They comprise the queries each user submits to the retrieval system, the documents the retrieval system returns (including title and summary), and the numbers of the documents the user viewed.
Taking the file query_history.topic2 as an example, the data format is:
The retrieval results for the query string "acquisition u.s.foreign company" are recorded between <jian Suojieguo> and </Jian Suojieguo> (the retrieval-result tags). The order in which the document numbers appear reflects the ranking of the documents in the results returned by the retrieval system. The click set records the numbers of the documents the user clicked to view.
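The input described above can be held in a small in-memory structure such as the following sketch. The class and field names (`ReturnedDocument`, `QueryRecord`, `clicked_ids`) are hypothetical and do not reflect the on-disk layout of query_history.topic2, which is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReturnedDocument:
    """One document returned by the retrieval system."""
    doc_id: str
    title: str
    summary: str


@dataclass
class QueryRecord:
    """One query in a session: the query string, ranked results, and clicked IDs."""
    query: str
    results: List[ReturnedDocument] = field(default_factory=list)  # in returned rank order
    clicked_ids: List[str] = field(default_factory=list)           # the click set

    def clicked_documents(self) -> List[ReturnedDocument]:
        """Return the clicked documents, preserving the returned rank order."""
        clicked = set(self.clicked_ids)
        return [doc for doc in self.results if doc.doc_id in clicked]
```

A session is then a time-ordered list of `QueryRecord` objects, from which the historical queries and historical clicks of each query can be derived.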
The step of constructing the feature matrix from the current query information in combination with the historical query information and historical click information is as follows:
After the data are input, feature extraction is performed. The relationships between the current query and the historical queries, between the current query and the historical clicks, and between the historical queries and the historical clicks within a query session are analyzed, and five categories comprising 39 search behavior features are finally extracted for each user at the time each query is submitted, as follows:
| Historical click features, describing the web documents the user viewed within one query session, comprising: |
| Total number of historical clicks |
| Total length of the historical clicks |
| Mean length of the historical clicks (mean of the total click length corresponding to each query) |
| Mean length of each click |
| Total length of the previous historical click |
| Number of documents in the previous click |
| Mean document length of the previous click |
| Historical query features, describing the queries the user submitted to the retrieval system within one query session, comprising: |
| Total length of the historical queries |
| Mean length of the historical queries |
| Number of historical queries |
| Current query features, describing the current query, comprising: |
| Length of the current query |
| Features between the current query and the historical clicks, describing the relation between them, comprising: |
| Mean similarity between the current query terms and all historical clicks |
| Maximum similarity between the current query terms and all historical clicks |
| Similarity between the current query terms and the previous historical click |
| Number of added words between the current query and the previous historical click |
| Total number of occurrences of the added words in the previous historical click |
| Co-occurrence probability of the deleted words (current query terms compared with the previous historical query) and the previous historical click |
| Number of deleted words |
| Number of deleted words in the previous historical click |
| Total number of occurrences of the deleted words in the previous historical click |
| Co-occurrence probability of the co-occurring words (current query terms compared with the previous historical click) and the previous historical click |
| Number of co-occurring words |
| Number of co-occurring words in the previous historical click |
| Total number of occurrences of the co-occurring words in the previous historical click |
| Features between the current query and the historical queries, describing the relation between them, comprising: |
| Co-occurrence probability of the added words (current query terms compared with the previous historical query) and the previous historical click |
| Number of added words in the current query compared with the previous query |
| Co-occurring words as a percentage of the current query length (current query terms compared with the previous historical query) |
| Mean similarity between the current query and the historical queries |
| Maximum similarity between the current query and the historical queries |
| Similarity between the current query terms and the previous historical query |
| Co-occurrence probability of the added words (current query terms compared with the previous historical query) and the current query |
| Number of added words |
| Total number of occurrences of the added words |
| Co-occurrence probability of the deleted words (current query terms compared with the previous historical query) and the previous historical query |
| Number of deleted words in the previous historical query |
| Total number of occurrences of the deleted words in the previous historical query |
| Co-occurrence probability of the co-occurring words (current query terms compared with the previous historical query) and the previous historical query |
| Number of co-occurring words in the previous historical query |
| Total number of occurrences of the co-occurring words in the previous historical query |
In addition, the optimal weight value of each query is calculated. The 39 features and the optimal weight value together form the training data of the parameter prediction model. The beginning of the training data gives the file name of the training data, the name of each feature, and the attribute description of each feature. The part below DATA is the feature matrix (this format can be input directly into existing SVM regression toolkits).
Taking q2 as an example, the corresponding training data are:
RELATION q2.arff
ATTRIBUTE cqlenth numeric
......
ATTRIBUTE class numeric
DATA
3,2,20,20,10,20,0,2,0.0869565217391304,0.0869565217391304,0.0869565217391304,0,0,0,2,2,1,0.4,0.4,0.4,0.333333333333333,0,0.5,0.4
4,3,2,2,0.666666666666667,2,1,2,0,0,0,0,0,0,2,2,1,0.333333333333333,0.333333333333333,0.333333333333333,0.25,0,0.5,0.4
......
The first line of the above training data gives the file name "q2.arff", with the keyword "RELATION". The second line describes the first feature, "length of the current query", with the keyword "ATTRIBUTE". By analogy, there are 39 feature descriptions in total.
The next line describes the optimal parameter, whose type is a decimal between 0 and 1, with the keyword "ATTRIBUTE". After DATA comes the corresponding feature matrix. The feature matrix is the content of the training data file with the header removed; it consists of the 39 user behavior features mentioned above and the corresponding optimal parameter. Each row has 40 data items: the first 39 are feature values and the 40th is the optimal parameter. Each training example can be represented by one 40-dimensional row vector, which forms one row of the feature matrix. The number of training examples determines the number of rows of the feature matrix. Data items are separated by commas.
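The file layout described above can be read with a few lines of code. The following is a minimal sketch, not the toolkit's own loader: the function name `parse_training_data` is hypothetical, and it assumes that keywords may optionally carry a leading "@" and that a "?" in the last column marks a value still to be predicted.

```python
def parse_training_data(text):
    """Split training text into attribute names, feature rows and target values."""
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("..."):
            continue  # skip blank lines and elision markers such as "......"
        key = line.lstrip("@").upper()  # assumption: an optional "@" prefix is tolerated
        if in_data:
            # Data items are separated by commas; "?" means "to be predicted".
            rows.append([None if v.strip() == "?" else float(v)
                         for v in line.split(",")])
        elif key.startswith("ATTRIBUTE"):
            names.append(line.split()[1])  # second token is the attribute name
        elif key == "DATA":
            in_data = True
    features = [row[:-1] for row in rows]  # all columns except the last
    targets = [row[-1] for row in rows]    # last column: optimal parameter
    return names, features, targets
```

For a full file, each entry of `features` would hold 39 values and `names` would hold 40 entries, the last one being `class`.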
Based on the above training data, the machine learning method SVM regression (SVM-Regression) is used to train the parameter prediction model, which represents the functional relation between the optimal weight and the features;
The maximum MAP value of each query is the search target. The traversal step size is 0.1. Support Vector Regression (SVR) (Chang and Lin, 2001) is used for training to determine the functional relation between the optimal weight of each query and the 39 features, thereby obtaining the trained parameter prediction model.
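The search for the optimal weight of one query can be sketched as a simple traversal. This is an illustration only: `best_weight` and the callback `map_for_weight` are hypothetical names, the callback stands in for running the retrieval with a given weight and measuring MAP, and including the endpoints 0 and 1 in the traversal is an assumption.

```python
def best_weight(map_for_weight, step=0.1):
    """Return the candidate weight in [0, 1] that maximizes map_for_weight."""
    n = round(1 / step)                         # 10 intervals for a step of 0.1
    candidates = [i / n for i in range(n + 1)]  # 0.0, 0.1, ..., 1.0
    # max() keeps the first candidate on ties, i.e. the smallest such weight.
    return max(candidates, key=map_for_weight)
```

The weight found this way would become the 40th data item of the training row for that query.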
When the parameter prediction model is applied for prediction, the 39 feature values of each test query are input, and the parameter prediction model produces the corresponding weight value. The three parts, namely the current query, the historical queries and the historical clicks, are combined in this way. The test data and training data formats are basically the same; the difference is that the last column of each feature vector in the test data is "?", which denotes the value to be predicted. The test data format is:
The main task of the retrieval execution subsystem is to take the TREC AP88-90 documents as the documents to be retrieved, build an index with Lemur, and then complete the retrieval task within the conventional language model framework.
Based on the prediction results produced by applying the parameter prediction model in the previous step, the current query and the historical information are organized to form the personalized query model.
If the current query is the k-th query Q_k in the query session, then the user interest represented by the short-term historical queries is embodied in the mean of the term occurrence probabilities in the historical queries Q_i (1≤i≤k-1). Similarly, the user's short-term interest is also embodied in the mean of the term occurrences in the historical clicks C_i (1≤i≤k-1). The query history consists of the historical queries H_Q and the historical clicks H_C. A query word is denoted by ω.
A) Calculate the current query model
The meaning of each parameter in the formula is as follows: ω denotes a word, Q_i denotes a query, P denotes probability, and i denotes the i-th query. The current query model is determined by the number of occurrences of the current query words and the length of the current query. P(ω|Q_i) denotes the probability that each word ω occurs in the query Q_i. C(ω, Q_i) denotes the number of times the word ω occurs in the query Q_i. |Q_i| denotes the length of the query Q_i, that is, the number of words it consists of. The current query model means that the probability of a given word in the query string is the number of times the word occurs in the query divided by the total number of words in the query, that is, P(ω|Q_i) = C(ω, Q_i) / |Q_i|, which is formula (1).
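The count-over-length computation described above can be sketched as follows. The function name `query_model` is hypothetical, and splitting on whitespace after lowercasing is an assumed tokenization, not one stated in the text.

```python
from collections import Counter

def query_model(query):
    """Return P(w|Q) = C(w, Q) / |Q| for every word w in the query string."""
    words = query.lower().split()  # assumption: whitespace tokenization
    counts = Counter(words)        # C(w, Q): occurrences of each word
    total = len(words)             # |Q|: number of words in the query
    return {word: count / total for word, count in counts.items()}
```

For the three-word query "acquisition u.s.foreign company" under this tokenization, each word receives probability 1/3.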
B) Calculate the historical query model
The meaning of each parameter in the formula is as follows: ω denotes a word, Q_i denotes a query, P denotes probability, H_Q denotes all historical queries, and i denotes the i-th query. The historical query model P(ω|H_Q) is obtained by summing and averaging the single historical query models P(ω|Q_i). For the current query Q_k, the historical queries are Q_1, Q_2, ..., Q_{k-1}. The single historical query models P(ω|Q_i) are summed and then divided by the number of historical queries, k-1, that is, P(ω|H_Q) = (1/(k-1)) Σ_{i=1}^{k-1} P(ω|Q_i), which is formula (2). Each single historical query model P(ω|Q_i) is calculated according to formula (1). In other words, the probability of a single word ω in the whole query history H_Q is computed by first dividing the number of times the word occurs in each historical query by the total number of words that historical query contains, then summing the resulting k-1 probabilities, and finally dividing by k-1.
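The averaging step can be sketched as below. The names `unigram_model` and `history_model` are hypothetical, and whitespace tokenization is again an assumption. The same function applies to historical clicks if each clicked document is passed in as a text string.

```python
from collections import Counter

def unigram_model(text):
    """P(w|text) = count of w in text / number of words in text."""
    words = text.lower().split()  # assumption: whitespace tokenization
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def history_model(items):
    """Average the unigram models of the k-1 history items with equal weight."""
    models = [unigram_model(item) for item in items]
    vocabulary = set().union(*models)  # every word seen in any history item
    n = len(models)                    # k-1 in the notation of the text
    return {w: sum(m.get(w, 0.0) for m in models) / n for w in vocabulary}
```

A word absent from a given history item contributes probability 0 for that item, so the averaged values still sum to 1 over the vocabulary.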
C) Calculate the historical click model
The meaning of each parameter in the formula is as follows: ω denotes a word, C_i denotes a web document the user viewed, P denotes probability, H_C denotes all historical web documents the user has viewed, and i denotes the i-th click. Similar to the historical query model, the historical click model P(ω|H_C) is obtained by summing and averaging the single historical click models P(ω|C_i). For the current query Q_k, the historical clicks are C_1, C_2, ..., C_{k-1}. The single historical click models P(ω|C_i) are summed and then divided by the number of historical clicks, k-1, that is, P(ω|H_C) = (1/(k-1)) Σ_{i=1}^{k-1} P(ω|C_i), which is formula (3). Each single historical click model is calculated according to formula (1), with C_i in place of Q_i.
D) Extract the current query features
These mainly comprise the length of the current query.
E) Extract the historical query features
These mainly comprise the number, total length and average length of the historical queries.
F) Extract the features between the current query and the historical queries
These mainly comprise the similarity between the current query and the previous query, the similarity between the current query and all historical queries, the numbers of added and deleted words, and their proportions in the current query or the historical queries.
G) Extract the features between the current query and the historical clicks
These mainly comprise the similarity between the current query and all historical clicks and the previous historical click, the numbers of added and deleted words, and their proportions in the current query and the historical click set.
H) Obtain the parameters using the parameter prediction model
The user features serve as the input of the parameter prediction system, and the output is the optimal parameters for the current query.
I) Organize the current query model, the historical query model and the historical click model according to the predicted parameters
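A few of the word-change features between the current query and the previous query can be sketched as follows. The function name `term_changes` and the dictionary keys are hypothetical, whitespace tokenization is assumed, and the ratio divides the number of distinct co-occurring words by the number of words in the current query, which is one possible reading of the feature.

```python
def term_changes(current, previous):
    """Count added, deleted and co-occurring words between two queries."""
    cur_words = current.lower().split()  # assumption: whitespace tokenization
    cur, prev = set(cur_words), set(previous.lower().split())
    shared = cur & prev                  # words present in both queries
    return {
        "added": len(cur - prev),        # in the current query only
        "deleted": len(prev - cur),      # in the previous query only
        "shared": len(shared),
        # share of the current query length made up of co-occurring words
        "shared_ratio": len(shared) / len(cur_words) if cur_words else 0.0,
    }
```

The same comparison could be applied with a clicked document in place of the previous query to obtain the corresponding click features.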
The parameter β_k ∈ (0,1) determines the weight allocation between the historical queries and the historical clicks; the larger β_k is, the greater the importance of the historical clicks. When β_k = 1, the user interest model is embodied entirely by the historical clicks. Likewise, the larger α_k is, the greater the importance of the current query.
Two methods were tried for the adaptive personalized retrieval model. One is the retrieval model (AdaptiveEW) in which all history items are equally important, formalized in formula (4). The other is the retrieval model (AdaptiveDW) in which importance grows as the time distance between a history item and the current query shrinks, formalized in formula (5). Here Q_k denotes the current query, H_C denotes the historical clicks before the current query in the current query session, and H_Q denotes the historical queries in the current query session. The parameters α_k, β_k, m_k and n_k are weights, each taking any decimal value between 0 and 1.
The query model p(ω|θ_k) of the adaptive personalized retrieval model (AdaptiveEW) comprises two parts: the current query model p(ω|Q_k) and the history model. The weight of the current query model is α_k and the weight of the history model is 1-α_k. The current query model represents the probability that the current query word ω occurs and is calculated according to formula (1). The history information consists of the historical click model p(ω|H_C) and the historical query model p(ω|H_Q). The historical query model is calculated according to formula (2). The historical click model is calculated according to formula (3). All historical queries have equal weight. All historical clicks have equal weight. The weight of the historical query model is 1-β_k and the weight of the historical click model is β_k, as shown in formula (4).
p(ω|θ_k) = α_k p(ω|Q_k) + (1-α_k)[β_k p(ω|H_C) + (1-β_k) p(ω|H_Q)]    (4)
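The meaning of each parameter in the formula is as given above: ω denotes a word, Q_k the current query, H_C the historical clicks and H_Q the historical queries in the current query session, and α_k and β_k are weights between 0 and 1.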
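Formula (4) is a two-level linear interpolation and can be sketched directly. The function name `adaptive_ew` is hypothetical, and each model is assumed to be a dictionary mapping words to probabilities, with missing words treated as probability 0.

```python
def adaptive_ew(p_query, p_clicks, p_queries, alpha, beta):
    """Combine the three models as in formula (4):
    alpha * P(w|Q_k) + (1 - alpha) * [beta * P(w|H_C) + (1 - beta) * P(w|H_Q)].
    """
    vocabulary = set(p_query) | set(p_clicks) | set(p_queries)
    combined = {}
    for w in vocabulary:
        # History part: clicks weighted by beta, historical queries by 1 - beta.
        history = beta * p_clicks.get(w, 0.0) + (1 - beta) * p_queries.get(w, 0.0)
        combined[w] = alpha * p_query.get(w, 0.0) + (1 - alpha) * history
    return combined
```

If the three input models each sum to 1, the combined model also sums to 1, because the weights at each level sum to 1.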
The above is the adaptive retrieval model (AdaptiveEW), in which all history information has equal weight. The other adaptive retrieval model holds that the importance of history information is related to its time distance from the current query. The query model p(ω|ψ_k) of this adaptive retrieval model (AdaptiveDW) comprises two parts: the query model p(ω|θ_k) and the historical click model p(ω|H_C). The weight of the historical click model p(ω|H_C) is m_k, and the weight of the query model p(ω|θ_k) is 1-m_k. The query model p(ω|θ_k) in turn consists of the current query model p(ω|Q_k) and the query model of the previous moment p(ω|θ_{k-1}). The weight of the current query model p(ω|Q_k) is n_k, and the weight of the previous-moment query model p(ω|θ_{k-1}) is 1-n_k. The historical query model is calculated according to formula (2). The historical click model is calculated according to formula (3). In the adaptive retrieval model (AdaptiveDW), the weight of older query models decays over time, so that newer historical queries carry more weight than older ones; the formal representation is shown in formula (5).
p(ω|ψ_k) = m_k p(ω|H_C) + (1-m_k) p(ω|θ_k)
p(ω|θ_k) = n_k p(ω|Q_k) + (1-n_k) p(ω|θ_{k-1})    (5)
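The meaning of each parameter in the formula is as given above: ω denotes a word, Q_k the current query, θ_{k-1} the query model of the previous moment, H_C the historical clicks, and m_k and n_k are weights between 0 and 1.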
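The time-decayed combination used by AdaptiveDW can be sketched as a recursion over the session. The names `mix` and `adaptive_dw` are hypothetical. Starting the recursion from the model of the first query is an assumption, since the text does not state the initial case, and models are assumed to be word-to-probability dictionaries.

```python
def mix(p_a, p_b, weight_a):
    """Return weight_a * p_a + (1 - weight_a) * p_b over the joint vocabulary."""
    vocabulary = set(p_a) | set(p_b)
    return {w: weight_a * p_a.get(w, 0.0) + (1 - weight_a) * p_b.get(w, 0.0)
            for w in vocabulary}

def adaptive_dw(query_models, p_clicks, n_weights, m_k):
    """query_models: models of Q_1..Q_k in time order.
    n_weights: weights n_2..n_k, one per query after the first.
    """
    theta = query_models[0]  # assumption: theta_1 is the first query's model
    for p_query, n in zip(query_models[1:], n_weights):
        theta = mix(p_query, theta, n)  # older queries decay by (1 - n) each step
    return mix(p_clicks, theta, m_k)    # click model enters with weight m_k
```

Because each step multiplies the earlier model by 1-n, a query submitted j steps ago is discounted j times, which gives newer queries more weight than older ones.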
J) Start the retrieval process
Documents matching the personalized query are sought among the documents to be retrieved and sorted in descending order of relevance probability. Each query returns 1000 documents.
After the personalized query is submitted to the retrieval system, the retrieval system returns the retrieval results. The data format of the personalized retrieval results is:
The first column gives the query number, the second column the document number, the third column the rank, and the fourth column the language model score. This completes the implementation of the whole adaptive personalized retrieval model.
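The ordering and truncation step can be sketched as follows. The function name `rank_documents` is hypothetical, and the scores are assumed to be already computed by the retrieval system for each candidate document.

```python
def rank_documents(scores, limit=1000):
    """Sort documents by score, highest first, and keep at most `limit` of them.

    scores: dictionary mapping a document number to its relevance score.
    Returns a list of (rank, document number, score) tuples, ranks from 1.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rank, doc, score)
            for rank, (doc, score) in enumerate(ordered[:limit], start=1)]
```

Documents with equal scores keep their input order, since Python's sort is stable.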