CN109918476A - A Topic Retrieval Method Based on Topic Model - Google Patents

A Topic Retrieval Method Based on Topic Model

Info

Publication number: CN109918476A
Authority: CN (China)
Prior art keywords: query statement, theme, keyword, word, probability
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201910076645.6A
Other languages: Chinese (zh)
Inventors: 徐晨, 段娟, 肖创柏
Current Assignee: Beijing University of Technology (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Beijing University of Technology
Priority date: 2019-01-26 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2019-01-26
Publication date: 2019-06-21
Application filed by Beijing University of Technology
Priority to CN201910076645.6A
Publication of CN109918476A

Abstract

The invention discloses a topic retrieval method based on a topic model. A word-segmented article collection is used as training data, and the LDA topic model parameters are set. The LDA model is trained; for each word probability distribution, the n words with the highest probability become the keywords of a query statement, and each probability is mapped proportionally to the weight of that keyword in the query statement. An "OR" operator is inserted between the keywords to form the query statement. Business staff, operations staff, or experts inspect each generated query statement and give it a suitable topic name. They can then modify the query statement of each topic further: add or delete keywords, modify keywords, and adjust weights. The finalized topics are saved; when a user searches for a topic name, the search is equivalent to retrieving with the predefined query statement. The invention addresses the defects caused by relying entirely on people to define topic query statements, improves the efficiency of creating topics, and improves the precision and recall of topic retrieval.

Description

A topic retrieval method based on a topic model
Technical field
The invention belongs to the fields of topic modeling and topic retrieval. Specifically, it combines the two: human prior knowledge and the statistical information obtained by topic modeling are used together to construct topics and to perform topic retrieval.
Background
In fields such as machine learning and natural language processing, a topic model is a statistical model for discovering abstract topics in a data set (a collection of documents). Given the number of topics as input, a topic model automatically analyzes each document in the data set, counts the words in the documents, clusters the words of the data set, and finally produces a word probability distribution for each topic.
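As a rough illustration of how such per-topic word distributions can be estimated, the sketch below trains LDA with collapsed Gibbs sampling in plain Python. The patent does not say which inference algorithm or library is used, so the sampler, the function name `train_lda`, the default hyperparameter values, and the toy corpus are all assumptions made for illustration.

```python
import random


def train_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Estimate topic-word distributions with collapsed Gibbs sampling.

    docs is a list of tokenized documents (lists of words). Returns one
    {word: probability} dict per topic. Illustrative only, not optimized.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                        # words assigned to each topic
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + v * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of p(word | topic) for every topic.
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + v * beta) for w in vocab}
        for k in range(num_topics)
    ]


# Toy, already-segmented corpus (hypothetical data).
corpus = [
    ["robot", "AI", "robot", "learning"],
    ["AI", "learning", "robot"],
    ["stock", "market", "bank"],
    ["market", "stock", "stock", "bank"],
]
topic_word_probs = train_lda(corpus, num_topics=2)
for k, dist in enumerate(topic_word_probs):
    print(k, sorted(dist, key=dist.get, reverse=True)[:3])
```

Each returned dictionary sums to 1 over the vocabulary, so it can be read directly as the word probability distribution of one topic. A production system would more likely rely on an established LDA library than on a hand-written sampler.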
Topic retrieval treats a predefined query statement as a topic: at retrieval time, searching for a topic name is equivalent to retrieving with its corresponding query statement. For example, define a topic named "artificial intelligence" as: "artificial intelligence^5 OR robot^3 OR autonomous driving^4". A query statement consists of keywords and operators. In this example, "artificial intelligence", "robot", and "autonomous driving" are the keywords. "OR" and "^5", "^3", "^4" are operators: they express the relationship between the keywords (here the logical relationship OR) and the weight of each keyword (5, 3, 4). Logical OR means that a document counts as a hit as soon as any one keyword matches. A weight expresses the importance of a keyword: documents that match highly weighted keywords score higher and appear nearer the top of the results page. (Note: the operator syntax supported by different search engines differs slightly, for example a OR b, a <or> b, or (a, b), but all of these express logical OR with the same semantics.) The purpose of topic retrieval is to capture the experiential knowledge of people (business staff, operations staff, and domain experts), namely which keywords describe the topic best, what their weights are, and how they relate to each other, and to preserve it in a query statement. Users do not need to know how the query statement of a topic is defined; they only search by topic name to browse articles on that topic. Because a topic is a query statement, it is also easy to reuse and to modify.
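To make the weighted OR semantics concrete, here is a minimal sketch in Python. It does not reproduce how any particular search engine scores documents: real engines combine keyword weights with their own relevance formulas, whereas this toy simply sums the weights of the keywords found in a document. The function names `parse_query` and `score`, the default weight of 1, and the sample documents are hypothetical; the example query is rendered in English as "artificial intelligence^5 OR robot^3 OR autonomous driving^4".

```python
def parse_query(query):
    """Split 'kw^weight OR kw^weight ...' into (keyword, weight) pairs."""
    terms = []
    for part in query.split(" OR "):
        keyword, _, weight = part.strip().partition("^")
        # A keyword without an explicit weight defaults to 1 (assumption).
        terms.append((keyword, float(weight) if weight else 1.0))
    return terms


def score(document, terms):
    """Toy score: sum of the weights of all keywords present in the text.

    A score of 0 means no keyword matched, so under OR the document is not a hit.
    """
    return sum(weight for keyword, weight in terms if keyword in document)


query = "artificial intelligence^5 OR robot^3 OR autonomous driving^4"
terms = parse_query(query)
docs = [
    "artificial intelligence in medicine",
    "a robot that learns autonomous driving",
    "weather forecast for the weekend",
]
# Keep only hits and rank them by descending score.
ranked = sorted((d for d in docs if score(d, terms) > 0),
                key=lambda d: score(d, terms), reverse=True)
print(ranked)
```

With these toy documents, the second document matches two keywords (3 + 4 = 7) and therefore ranks above the first (5), while the third matches nothing and is dropped.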
In the prior art, query statements are written directly from human experience when a topic is defined. Take the example above again: a topic named "artificial intelligence" with the query statement "artificial intelligence^5 OR robot^3 OR autonomous driving^4". Every part of this query statement comes from human experience: a person decides which words best describe the topic "artificial intelligence", how large their weights are, and how the words relate to each other. People do have clear advantages in catching trending vocabulary in real time and in setting the relationships between words, and their subjective initiative lets users of topic retrieval see the article set that the definer of the topic intended. However, ignoring the statistical information in the objective data set (the set of all articles indexed by the search engine) and relying entirely on people to define query statements causes several defects. They are illustrated one by one below with the same example.
1. Low recall. The data set may contain articles that are relevant to artificial intelligence but never use the words "artificial intelligence", "robot", or "autonomous driving"; instead they make heavy use of terms such as "AI" and "machine learning". Many articles relevant to the topic are then missed when a user searches for it, so recall is low (low recall means that qualifying articles are not retrieved).
2. Inaccurate keywords. The data set may contain articles relevant to artificial intelligence that use "driverless car" rather than "autonomous driving". The meaning is the same, but the keyword does not fit the data set: it is not the wording the data set actually uses. These articles are not matched, which again lowers recall.
3. Improper weight proportions. Suppose that in the data set's articles on artificial intelligence the keyword "robot" appears frequently while "autonomous driving" appears rarely. If the weight of "autonomous driving" is mistakenly set higher than that of "robot" when the topic is defined, the top results of a topic search are all niche articles, while the many mainstream articles of the data set sink lower in the ranking.
4. Low efficiency. When the data set is large, having people think up and write the query statement of every topic greatly reduces efficiency.
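The recall problem described in defect 1 can be shown with a few lines of Python. The four article snippets and both keyword lists below are invented for illustration, and matching is plain case-sensitive substring search, which is a deliberate simplification of what a search engine does.

```python
def recall(keywords, relevant_docs):
    """Fraction of relevant documents matched by at least one keyword (OR)."""
    hits = [d for d in relevant_docs if any(k in d for k in keywords)]
    return len(hits) / len(relevant_docs)


# Articles a reader would consider relevant to the topic (hypothetical).
relevant_docs = [
    "artificial intelligence beats human players",
    "new robot assembles cars",
    "AI chips speed up machine learning",
    "machine learning predicts protein structure",
]

manual_keywords = ["artificial intelligence", "robot", "autonomous driving"]
data_driven_keywords = manual_keywords + ["AI", "machine learning"]

print(recall(manual_keywords, relevant_docs))       # 0.5
print(recall(data_driven_keywords, relevant_docs))  # 1.0
```

The hand-written keyword list misses the two articles that only say "AI" or "machine learning"; adding the vocabulary the data set actually uses recovers them.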
Summary of the invention
The purpose of the invention is to solve the problem described in the background above: the query statements of topics are defined only from people's subjective experience, and the information contained in the objective data set is ignored. The invention therefore proposes a technical solution that uses the probabilistic statistical model LDA to mine statistical information from the data set, obtains the word distribution of each topic, and generates query statements from it. People can still edit the generated query statements afterwards, so that human experience and the statistical information of the data set construct the topics jointly.
The technical solution adopted by the invention is a topic retrieval method based on a topic model, comprising the following steps:
Step 1: Use the word-segmented article collection as training data and set the LDA topic model parameters. The parameters include the number of topics K and the hyperparameters α and β; the value of α represents the weight distribution of the topics before sampling, and the value of β represents the prior distribution of each topic over words.
Step 2: Train the LDA model to obtain K word probability distributions, one for each topic.
Step 3: Choose an integer n.
Step 4: For each word probability distribution, take the n words with the highest probability as the keywords of a query statement, and map each probability proportionally to the weight of that keyword in the query statement.
Step 5: Insert the "OR" operator between the keywords to form the query statement.
Step 6: Repeat steps 4-5 until all K word probability distributions have been converted into query statements.
Step 7: Business staff, operations staff, or experts inspect each generated query statement and give it a suitable topic name.
Step 8: Business staff, operations staff, or experts modify the query statement of each topic further: add or delete keywords, modify keywords, and adjust weights.
Step 9: Save the finalized topics. When a user searches for a topic name, the search is equivalent to retrieving with the predefined query statement.
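Steps 3 to 6 can be sketched as follows. The text only states that probabilities are mapped proportionally to weights; scaling so that the most probable word receives `max_weight`, and printing weights with one decimal place, are assumptions of this sketch, as are the function names and the sample distributions.

```python
def distribution_to_query(word_probs, n, max_weight=10.0):
    """Turn one topic's {word: probability} dict into a query statement."""
    # Step 4: keep the n most probable words as keywords.
    top = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)[:n]
    # Proportional mapping: the most probable word gets max_weight (assumption).
    scale = max_weight / top[0][1]
    # Step 5: join the weighted keywords with the OR operator.
    return " OR ".join(f"{word}^{prob * scale:.1f}" for word, prob in top)


def distributions_to_queries(topic_word_probs, n):
    """Step 6: convert all K word probability distributions."""
    return [distribution_to_query(dist, n) for dist in topic_word_probs]


# Two hypothetical distributions, as LDA training might produce for K = 2.
topic_word_probs = [
    {"robot": 0.20, "AI": 0.10, "machine learning": 0.05, "sensor": 0.01},
    {"stock": 0.30, "market": 0.15, "bank": 0.03},
]
queries = distributions_to_queries(topic_word_probs, n=3)
for q in queries:
    print(q)
# robot^10.0 OR AI^5.0 OR machine learning^2.5
# stock^10.0 OR market^5.0 OR bank^1.0
```

Because the mapping is a single multiplication, the ratios between the weights equal the ratios between the probabilities, which is one reading of "proportional". Other mappings, such as rounding to integers, would satisfy the same description.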
Compared with the prior art, the invention has the following technical effects.
It addresses the defects caused by relying entirely on people to define topic query statements, improves the efficiency of creating topics, and improves the precision and recall of topic retrieval. The result combines the statistics of the objective data set with people's subjective experience, each compensating for the weaknesses of the other.
Brief description of the drawings
Fig. 1 is the flow chart of this method.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawing and an embodiment.
The method includes the following steps:
Step 1: Use the word-segmented article collection as training data and set the LDA topic model parameters. The parameters include the number of topics K and the hyperparameters α and β; the value of α represents the weight distribution of the topics before sampling, and the value of β represents the prior distribution of each topic over words.
Step 2: Train the LDA model to obtain K word probability distributions, one for each topic.
Step 3: Choose an integer n.
Step 4: For each word probability distribution, take the n words with the highest probability as the keywords of a query statement, and map each probability proportionally to the weight of that keyword in the query statement.
Step 5: Insert the "OR" operator between the keywords to form the query statement.
Step 6: Repeat steps 4-5 until all K word probability distributions have been converted into query statements.
Step 7: Business staff, operations staff, or experts inspect each generated query statement and give it a suitable topic name.
Step 8: Business staff, operations staff, or experts can modify the query statement of each topic further: add or delete keywords, modify keywords, and adjust weights.
Step 9: Save the finalized topics. When a user searches for a topic name, the search is equivalent to retrieving with the predefined query statement.
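Steps 7 to 9 amount to naming, editing, and storing query statements, then substituting the stored statement when a user searches for a topic name. A minimal sketch under stated assumptions: topics are held in an in-memory dictionary, a search term that is not a known topic name is passed through unchanged, and the names `render`, `topics`, and `resolve` as well as the keyword weights are hypothetical.

```python
def render(terms):
    """Render a {keyword: weight} dict as an OR query statement."""
    return " OR ".join(f"{kw}^{w:g}" for kw, w in terms.items())


# Keywords and weights generated in steps 1-6 for one topic (hypothetical).
generated = {"robot": 10, "AI": 5, "sensor": 1}

# Step 8: an expert edits the generated statement.
edited = dict(generated)
del edited["sensor"]             # delete a keyword
edited["machine learning"] = 4   # add a keyword
edited["AI"] = 8                 # adjust a weight

# Steps 7 and 9: name the topic and save it.
topics = {"artificial intelligence": render(edited)}


def resolve(user_input):
    """Return the stored query statement if the input is a topic name."""
    return topics.get(user_input, user_input)


print(resolve("artificial intelligence"))
# robot^10 OR AI^8 OR machine learning^4
```

The string returned by `resolve` is what would be sent to the search engine, so users never need to see or write the query statement themselves.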

Claims (1)

CN201910076645.6A | 2019-01-26 | 2019-01-26 | A Topic Retrieval Method Based on Topic Model | Pending | CN109918476A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910076645.6A / CN109918476A (en) | 2019-01-26 | 2019-01-26 | A Topic Retrieval Method Based on Topic Model

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910076645.6A / CN109918476A (en) | 2019-01-26 | 2019-01-26 | A Topic Retrieval Method Based on Topic Model

Publications (1)

Publication Number | Publication Date
CN109918476A (en) | 2019-06-21

Family

ID=66960764

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201910076645.6A (Pending) / CN109918476A (en) | A Topic Retrieval Method Based on Topic Model | 2019-01-26 | 2019-01-26

Country Status (1)

Country | Link
CN (1) | CN109918476A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110377706A (en) * | 2019-07-25 | 2019-10-25 | 腾讯科技(深圳)有限公司 | Search statement method for digging and equipment based on deep learning
CN113159105A (en) * | 2021-02-26 | 2021-07-23 | 北京科技大学 | Unsupervised driving behavior pattern recognition method and data acquisition monitoring system
CN114116977A (en) * | 2021-11-24 | 2022-03-01 | 杭州逗酷软件科技有限公司 | Method for recommending similar defects by fusing multiple features and related device
CN117112810A (en) * | 2023-07-12 | 2023-11-24 | 南京理工大学紫金学院 | A full retrieval method based on LDA iterative retrieval of literature data sets

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN103942274A (en) * | 2014-03-27 | 2014-07-23 | 东莞中山大学研究院 | Labeling system and method for biological medical treatment image on basis of LDA
US20150154305A1 (en) * | 2013-12-02 | 2015-06-04 | Qbase, LLC | Method of automated discovery of topics relatedness
CN105205135A (en) * | 2015-09-15 | 2015-12-30 | 天津大学 | 3D (three-dimensional) model retrieving method based on topic model and retrieving device thereof
CN106777043A (en) * | 2016-12-09 | 2017-05-31 | 宁波大学 | A kind of academic resources acquisition methods based on LDA

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20150154305A1 (en) * | 2013-12-02 | 2015-06-04 | Qbase, LLC | Method of automated discovery of topics relatedness
CN103942274A (en) * | 2014-03-27 | 2014-07-23 | 东莞中山大学研究院 | Labeling system and method for biological medical treatment image on basis of LDA
CN105205135A (en) * | 2015-09-15 | 2015-12-30 | 天津大学 | 3D (three-dimensional) model retrieving method based on topic model and retrieving device thereof
CN106777043A (en) * | 2016-12-09 | 2017-05-31 | 宁波大学 | A kind of academic resources acquisition methods based on LDA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐芸芝: "基于音乐主题的音乐检索系统", 《中国优秀硕士学位论文全文数据库 信息科技辑》*
徐芸芝等: "基于MT-LDA 的音乐标签主题检索", 《计算机技术与发展》*

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110377706A (en) * | 2019-07-25 | 2019-10-25 | 腾讯科技(深圳)有限公司 | Search statement method for digging and equipment based on deep learning
CN110377706B (en) * | 2019-07-25 | 2022-10-14 | 腾讯科技(深圳)有限公司 | Search sentence mining method and device based on deep learning
CN113159105A (en) * | 2021-02-26 | 2021-07-23 | 北京科技大学 | Unsupervised driving behavior pattern recognition method and data acquisition monitoring system
CN113159105B (en) * | 2021-02-26 | 2023-08-08 | 北京科技大学 | Driving behavior unsupervised mode identification method and data acquisition monitoring system
CN114116977A (en) * | 2021-11-24 | 2022-03-01 | 杭州逗酷软件科技有限公司 | Method for recommending similar defects by fusing multiple features and related device
CN117112810A (en) * | 2023-07-12 | 2023-11-24 | 南京理工大学紫金学院 | A full retrieval method based on LDA iterative retrieval of literature data sets

Similar Documents

Publication | Publication Date | Title
CN109918476A (en) | | A Topic Retrieval Method Based on Topic Model
CN110717339B (en) | | Method, device, electronic device and storage medium for processing semantic representation model
CN107729468B (en) | | Answer extraction method and system based on deep learning
CN102567304B (en) | | Method and device for filtering bad network information
CN109960799B (en) | | An optimized classification method for short texts
CN102262634B (en) | | An automatic question answering method and system
CN106257455B (en) | | A kind of Bootstrapping method extracting viewpoint evaluation object based on dependence template
CN108052593A (en) | | A kind of subject key words extracting method based on descriptor vector sum network structure
WO2013170587A1 (en) | | Multimedia question and answer system and method
CN109271459B (en) | | Chat robot based on Lucene and grammar network and implementation method thereof
CN103729456B (en) | | Microblog multi-modal sentiment analysis method based on microblog group environment
CN105608232B (en) | | A kind of bug knowledge modeling method based on graphic data base
CN101131705A (en) | | New word discovering method and system thereof
CN101211339A (en) | | Intelligent web page classifier based on user behaviors
CN113032557A (en) | | Microblog hot topic discovery method based on frequent word set and BERT semantics
CN110728144B (en) | | Extraction type document automatic summarization method based on context semantic perception
CN112100317B (en) | | A Feature Keyword Extraction Method Based on Topic Semantic Awareness
CN106202294A (en) | | The related news computational methods merged based on key word and topic model and device
CN105354216A (en) | | Chinese microblog topic information processing method
CN108399238A (en) | | A kind of viewpoint searching system and method for fusing text generalities and network representation
CN115033753A (en) | | Training corpus construction method, text processing method and device
CN109522396B (en) | | Knowledge processing method and system for national defense science and technology field
CN110929509B (en) | | A domain event trigger word clustering method based on louvain community discovery algorithm
CN108038204A (en) | | For the viewpoint searching system and method for social media
CN108763355A (en) | | A kind of intelligent robot interaction data processing system and method based on user

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2019-06-21

