A subject retrieval method based on a topic model
Technical field
The invention belongs to the fields of topic modeling and subject retrieval. In particular, it relates to a method that combines the two, using both people's prior knowledge and the statistical information of a topic model to jointly construct topics and perform subject retrieval.
Background art
In fields such as machine learning and natural language processing, a topic model is a statistical model for discovering abstract topics in a data set (a collection of documents). Given the number of topics n as input, a topic model automatically analyzes each document in the data set, counts the words in the documents, clusters the words of the data set, and finally obtains a word probability distribution for each topic.
Subject retrieval treats a predefined query statement as a topic: at retrieval time, searching for a topic name is equivalent to searching with its corresponding query statement. For example, a topic named "artificial intelligence" can be defined by the query statement "artificial intelligence^5 OR robot^3 OR automatic driving^4". A query statement consists of keywords and operators. In this example, "artificial intelligence", "robot", and "automatic driving" are keywords, while "OR", "^5", "^3", and "^4" are operators that express the relationship between the keywords (here, logical OR) and the weight of each keyword (5, 3, and 4). Logical OR means that a document matches as soon as any one keyword is hit. The weight represents the importance of a keyword: documents that hit high-weight keywords score higher and appear nearer the top of the results page. (Note: the operator syntax supported by different search engines differs slightly, such as a OR b, a <or> b, or (a, b), but all of these express logical OR and are semantically identical.)
The purpose of subject retrieval is to capture the experience of people (business personnel, operators, and domain experts) in a saved query statement: which keywords describe the topic best, what their weights are, and how they relate to each other. Users do not need to know how the query statement of a topic is defined; they only need to search by topic name to browse articles on that topic. Because a topic is a query statement, it is also easy to reuse and to modify.
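As a rough illustration of the mechanism described above, the following Python sketch parses a weighted OR query and ranks documents by it. It is not the scoring used by any particular search engine: the scoring rule (sum of the weights of the keywords that occur), the whole-word matching, and the names parse_query, score, search_by_topic, TOPICS, and DOCUMENTS are all assumptions made for this example.

```python
import re

def parse_query(query):
    """Split 'kw^w OR kw^w ...' into (keyword, weight) pairs."""
    pairs = []
    for clause in query.split(" OR "):
        keyword, sep, weight = clause.strip().rpartition("^")
        if not sep:  # no explicit weight given: assume a weight of 1
            keyword, weight = clause.strip(), "1"
        pairs.append((keyword.strip(), float(weight)))
    return pairs

def score(document, pairs):
    """Toy score (assumption): sum the weights of the keywords that occur."""
    return sum(
        weight
        for keyword, weight in pairs
        if re.search(r"\b" + re.escape(keyword) + r"\b", document, re.IGNORECASE)
    )

def search_by_topic(name, documents, topics):
    """Searching a topic name is equivalent to running its stored query."""
    pairs = parse_query(topics[name])
    scored = [(score(doc, pairs), doc) for doc in documents]
    hits = [item for item in scored if item[0] > 0]  # logical OR: any hit matches
    hits.sort(key=lambda item: item[0], reverse=True)  # higher score ranks first
    return [doc for _, doc in hits]

# Hypothetical topic definition and documents, for illustration only.
TOPICS = {
    "artificial intelligence":
        "artificial intelligence^5 OR robot^3 OR automatic driving^4",
}
DOCUMENTS = [
    "A robot vacuum review",
    "Advances in artificial intelligence",
    "Cooking pasta at home",
    "Automatic driving and robot taxis",
]

for doc in search_by_topic("artificial intelligence", DOCUMENTS, TOPICS):
    print(doc)
```

The user supplies only the topic name; the stored query statement, including its weights, is resolved internally, which is what makes the topic reusable and editable in one place.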
In the prior art, a topic is defined by writing the query statement directly from human experience. Taking the example above again, a topic named "artificial intelligence" is defined with the query statement "artificial intelligence^5 OR robot^3 OR automatic driving^4", and every part of that query statement comes from human experience. A person decides by experience which words best describe the topic "artificial intelligence", how large the weights of these words are, and how the words relate to each other. People do have a great advantage in capturing trending vocabulary in real time and in setting the relationships between words. They also bring subjective initiative, so that users of subject retrieval see the set of articles that the person who defined the topic expects them to see. However, relying entirely on people to define the query statement ignores the statistical information in the objective data set (the set of all articles indexed by the search engine) and brings several defects. These are illustrated one by one below, using the same example topic "artificial intelligence" with the query statement "artificial intelligence^5 OR robot^3 OR automatic driving^4".
1. Low recall. The data set may contain articles that are relevant to artificial intelligence but do not contain the words "artificial intelligence", "robot", or "automatic driving". Instead they contain many occurrences of terms such as "AI" or "machine learning". When a user searches the topic, many relevant articles are therefore missed, which lowers recall (low recall means that qualifying articles are not retrieved).
2. Inaccurate keywords. The data set may contain articles that are relevant to artificial intelligence but use "driverless car" rather than "automatic driving". Although the two mean the same thing, the chosen keyword does not fit the data set, because it is not the wording the data set actually uses. These articles still cannot be matched, which again lowers recall.
3. Improperly defined weights. Suppose that, among the articles in the data set that are relevant to artificial intelligence, the keyword "robot" appears very often while "automatic driving" appears only rarely. If the weight of "automatic driving" is mistakenly set higher than the weight of "robot" when the topic is defined, the top results of the user's subject retrieval will all be niche articles, while the mainstream articles that make up most of the data set sink toward the bottom.
4. Low efficiency. When the data set is very large, having people think through and write the query statement of every topic entirely by hand greatly reduces efficiency.
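The recall defect can be made concrete with a small Python sketch. Recall here is the fraction of relevant articles that the keyword list retrieves. The documents, the relevance labels, and the names matches and recall are hypothetical and chosen only to mirror the example in the text.

```python
import re

def matches(document, keywords):
    """Logical OR: the document matches if any keyword occurs as a whole word."""
    return any(
        re.search(r"\b" + re.escape(kw) + r"\b", document, re.IGNORECASE)
        for kw in keywords
    )

def recall(documents, relevant, keywords):
    """Fraction of the relevant documents (given by index) that are retrieved."""
    retrieved = {i for i, doc in enumerate(documents) if matches(doc, keywords)}
    return len(retrieved & relevant) / len(relevant)

# Hypothetical data set: documents 0-3 are relevant to artificial intelligence.
DOCUMENTS = [
    "Artificial intelligence beats human players",
    "Machine learning models for image recognition",
    "New AI chip released",
    "Driverless car trial begins",
    "Recipe for tomato soup",
]
RELEVANT = {0, 1, 2, 3}

# Keywords written purely from experience, as in the prior art example.
HAND_WRITTEN = ["artificial intelligence", "robot", "automatic driving"]
# Keywords that also reflect the wording actually used in the data set.
DATA_AWARE = HAND_WRITTEN + ["AI", "machine learning", "driverless car"]

print(recall(DOCUMENTS, RELEVANT, HAND_WRITTEN))
print(recall(DOCUMENTS, RELEVANT, DATA_AWARE))
```

On this toy data the hand-written keywords retrieve only one of the four relevant documents, while the keywords that follow the data set's own vocabulary retrieve all four.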
Summary of the invention
The purpose of the present invention is to solve the problem described in the background above: the defects caused by relying only on people's subjective experience to define the query statement of a topic while ignoring the information in the objective data set itself. The invention therefore proposes a technical solution that uses the probabilistic statistical model LDA (Latent Dirichlet Allocation) to mine statistical information from the data set, obtains the word distribution of each topic, and generates a query statement from it. People can then refine the generated query statement by hand, so that human experience and the statistical information of the data set jointly construct the topic.
The technical solution adopted by the present invention is a subject retrieval method based on a topic model, comprising the following steps:
Step 1: Take the word-segmented article collection as training data and set the LDA topic model parameters. These parameters are the number of topics K and the hyperparameters α and β, where α specifies the prior weight distribution of the topics before sampling and β specifies the prior distribution of each topic over words.
Step 2: Train the LDA model to obtain K word probability distributions, one for each topic.
Step 3: Choose an integer n.
Step 4: For a word probability distribution, take the n words with the highest probability as the keywords of the query statement, and map their probabilities proportionally to the keyword weights in the query statement.
Step 5: Insert the "OR" operator between the keywords to form the query statement.
Step 6: Repeat steps 4 and 5 until all K word probability distributions have been converted into query statements.
Step 7: Business personnel, operators, or experts inspect each generated query statement and give it a suitable topic name.
Step 8: Business personnel, operators, or experts further modify the query statement of each topic by adding or deleting keywords, modifying keywords, and adjusting weights.
Step 9: Save the finalized topics. When a user searches for a topic name, the search is equivalent to retrieval with the predefined query statement.
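Steps 3 to 6 can be sketched in Python as follows. The sketch assumes that the K word probability distributions from step 2 are already available as dictionaries mapping each word to its probability. The text only says that probabilities are mapped proportionally to weights, so the specific scaling used here (the most probable word receives max_weight, rounded to one decimal place) is an assumption, as are the names distribution_to_query and distributions_to_queries and the sample numbers.

```python
def distribution_to_query(word_probs, n, max_weight=10.0):
    """Steps 4-5: top-n words become keywords joined by OR, weights scale with probability."""
    top = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)[:n]
    if not top:
        return ""
    # Assumed proportional mapping: the most probable word gets max_weight.
    scale = max_weight / top[0][1]
    return " OR ".join(f"{word}^{round(p * scale, 1):g}" for word, p in top)

def distributions_to_queries(distributions, n):
    """Step 6: convert all K word probability distributions into query statements."""
    return [distribution_to_query(dist, n) for dist in distributions]

# Hypothetical output of a trained LDA model with K = 2 topics.
DISTRIBUTIONS = [
    {"robot": 0.20, "ai": 0.16, "learning": 0.10, "car": 0.04},
    {"market": 0.30, "stock": 0.15, "bank": 0.06},
]

for query in distributions_to_queries(DISTRIBUTIONS, n=3):
    print(query)
```

Because the weights are derived from the topic's word probabilities, frequent words in the data set receive higher weights automatically, and the result is only a starting point that people rename and adjust in steps 7 and 8.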
Compared with the prior art, the present invention has the following technical effects.
It overcomes the defects caused by relying entirely on people to define topic query statements, improves the efficiency of creating topics, and improves the accuracy and recall of subject retrieval. The result combines the statistics of the objective data set with people's subjective experience, so that each compensates for the weaknesses of the other.
Brief description of the drawings
Fig. 1 is the flow chart of this method.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and embodiments.
This method includes the following steps:
Step 1: Take the word-segmented article collection as training data and set the LDA topic model parameters. These parameters are the number of topics K and the hyperparameters α and β, where α specifies the prior weight distribution of the topics before sampling and β specifies the prior distribution of each topic over words.
Step 2: Train the LDA model to obtain K word probability distributions, one for each topic.
Step 3: Choose an integer n.
Step 4: For a word probability distribution, take the n words with the highest probability as the keywords of the query statement, and map their probabilities proportionally to the keyword weights in the query statement.
Step 5: Insert the "OR" operator between the keywords to form the query statement.
Step 6: Repeat steps 4 and 5 until all K word probability distributions have been converted into query statements.
Step 7: Business personnel, operators, or experts inspect each generated query statement and give it a suitable topic name.
Step 8: Business personnel, operators, or experts can further modify the query statement of each topic by adding or deleting keywords, modifying keywords, and adjusting weights.
Step 9: Save the finalized topics. When a user searches for a topic name, the search is equivalent to retrieval with the predefined query statement.
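Steps 7 to 9 (naming, manual refinement, saving, and lookup by name) might be organized as in the following Python sketch. The TopicStore class, its method names, and the JSON storage format are assumptions made for illustration; the method does not prescribe how topics are stored.

```python
import json

class TopicStore:
    """Holds named topics as lists of [keyword, weight] pairs."""

    def __init__(self):
        self.topics = {}

    def name_topic(self, name, keywords):
        """Step 7: attach a human-chosen name to a generated keyword list."""
        self.topics[name] = [[kw, float(w)] for kw, w in keywords]

    def upsert_keyword(self, name, keyword, weight):
        """Step 8: add a keyword, or adjust the weight of an existing one."""
        for pair in self.topics[name]:
            if pair[0] == keyword:
                pair[1] = float(weight)
                return
        self.topics[name].append([keyword, float(weight)])

    def delete_keyword(self, name, keyword):
        """Step 8: remove a keyword from the topic."""
        self.topics[name] = [p for p in self.topics[name] if p[0] != keyword]

    def query_for(self, name):
        """Step 9: a topic name resolves to its predefined query statement."""
        return " OR ".join(f"{kw}^{w:g}" for kw, w in self.topics[name])

    def dumps(self):
        """Step 9: serialize the finalized topics for saving."""
        return json.dumps(self.topics, ensure_ascii=False)

    @classmethod
    def loads(cls, text):
        store = cls()
        store.topics = json.loads(text)
        return store

# Hypothetical generated keywords for one topic, then refined by an expert.
store = TopicStore()
store.name_topic("artificial intelligence", [("robot", 10), ("ai", 8), ("learning", 5)])
store.upsert_keyword("artificial intelligence", "machine learning", 6)
store.delete_keyword("artificial intelligence", "learning")
store.upsert_keyword("artificial intelligence", "robot", 7)
print(store.query_for("artificial intelligence"))
```

Keeping the keywords and weights as structured data rather than as a raw query string makes the manual edits of step 8 simple, and the query statement is rebuilt only when a user searches by topic name.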