Background technology
LDA (Latent Dirichlet Allocation) topic models (Topic Model) exist from David Blei etc.(D.M.Blei, A.Y.Ng, and M.I.Jordan.Latent Dirichlet have been referred to since proposing within 2003Allocation.Journal of Machine Learning Research, 3,993-1022,2003), text mining,The field that information retrieval, calculating advertisement, commending system, question answering system, knowledge mapping etc. are related to text semantic analysis has obtained extensivelyGeneral application.LDA models are that a kind of generative probabilistic model (refers to Zhao Xin, user's topic interest modeling is ground with excavation in social mediaStudy carefully, the outstanding doctoral thesis of Peking University, 2014), document no longer as traditional vector space model, is only regarded as dictionary by itExpression spatially, and it is the introduction of the concept in theme space, to realize the expression of text spatially in theme.By rightThe introducing of Subject Concept, the model bring two benefits:(1) low-dimensional for realizing text indicates that this is very beneficial for subsequentlyThe calculating of text classification or the like, avoiding the occurrence of " dimension disaster " problem, (text is indicated by the vector of theme spatially, vectorialThe dimension in dimension, that is, theme space, is determined by theme number;Relatively conventional text vector spatial model, this dimension are usually wantedIt is much lower.In vector space model, the dimension size in the dictionary space that text vector dimension is obtained by text collection determines, leads toOften it is much larger than theme number);(2) semantic information that text collection implies behind, i.e. theme have been excavated, has been text semantic modelingOne strong tool.
Since LDA has solid Fundamentals of Mathematics and a good autgmentability, exploration to the model itself and and otherThe researchs such as the combination of method are always one of hot research problems in fields such as natural language processing subject, machine learning.WhereinIt is exactly a specific Research Challenges about the determination method of LDA model parameter theme Optimal units.
By literature search, determines method about the optimal theme number of LDA models, be mainly the following:
(1) experience is set, and in text semantic analysis task, researcher is often through the number for repeatedly debugging themeCarry out the quality of observation experiment effect, for example the quality of the theme vocabulary of observation high probability, semantic whether consistent etc. (refers to ZhaoIt is prosperous, user's topic interest modeling and Research on Mining in social media, the outstanding doctoral thesis of Peking University, 2014).Empirical settingPeople is needed to participate in, artificial experience is judged, as a result not necessarily very accurate;Different people judgment criteria is also variant;If collection of document is huge,Including it is theme number tens, up to a hundred, micro-judgment manually can not be almost carried out one by one, while this is not a kind of determination of automationMethod.
(2) based on the determination method of Perplexity.For a collection of document, after being trained by LDA text modelings, baseIn result of calculation, Perplexity values are calculated.One lower Perplexity value corresponds to a good LDA model, but mostThe automation of low Perplexity values determines method, and there is presently no people's propositions.In practical application, typically everybody according toPerplexity values -- the change curve of theme number artificially determines minimum point, to obtain optimal theme number.
(3) the deformation extension based on nonparametric Bayes method.More representational work is HierarchicalDirichlet Processes models, it solves the problems, such as to automatically determine number of topics purpose in topic model to a certain extent,But due to model complexity, actual use gets up to run complexity higher, and cost is too big (to refer to Teh, Y.W.;Jordan,M.I.;Beal,M.J.;Blei,D.M.(2006).Hierarchical Dirichlet Processes.Journal ofthe American Statistical Association.101:pp.1566–1581)。
(4) Cao Juan etc. proposes a kind of optimal L DA model selection methods based on minimum average B configuration similarity principle between theme,Which demonstrate the conclusions of " when average similarity minimum topic model is just optimal between theme ", simultaneously, it is proposed that one kind is based on closeThe optimal theme number selection algorithm of degree, this algorithm are that analogy density clustering algorithm DBSCAN thoughts propose, be it is a kind of relativelyGood automation determines method.But since algorithm idea hypothesis, the condition of convergence, method of determination of material calculation etc. are in practical applicationIn have a deviation, result of calculation is not necessarily accurate, it is reliable (refer to Cao Juan, Zhang Yongdong, Li Jintao, Tang Sheng, it is a kind of based on densityAdaptive optimal LDA model selection methods).
Invention content
For the technical problems in the prior art, the purpose of the present invention is to provide one kind automatically determining optimal themeA counting method, compares Name-based Routing, and method proposed by the present invention is more intuitive, reliable in practical applications.
The technical scheme is that:
A kind of optimal L DA Automatic Model Selection methods based on minimum average B configuration similarity between theme, step include:
1) in initial setting Qu Jian [K0,KMAX]Interior change K values, for every K values of selection:By destination document setTheme number initial value is set as current K values, is trained to the destination document set using LDA models, obtains K theme-wordProbabilityDistribution Vector;Calculate the average similarity AC between the ProbabilityDistribution Vector of this K theme-wordKAnd it is stored toGlobal average similarity array AC_Array;Global average similarity array AC_Array is one-dimension array;
2) average similarity minimum value is chosen from overall situation average similarity array AC_Array as interim minimum flatEqual similarity;The corresponding theme number of the interim minimum average B configuration similarity is TEMP_Kbest;
3) in overall situation average similarity array AC_Array, it is with current interim minimum average B configuration similarity positionThe array element total number of the right of center is denoted as NUM_R_TEMP_K by centerbest, the array element of the left of center is totalNumber is denoted as NUM_L_TEMP_Kbest;
If 4) NUM_R_TEMP_KbestMore than N × NUM_L_TEMP_Kbest, then theme number K is exportedbest=TEMP_Kbest;If NUM_L_TEMP_KbestMore than NUM_R_TEMP_Kbest, then K is enabledMAX=KMAX+ m, K0=Km, r=r0, repeat step1)~4);If NUM_L_TEMP_KbestLess than NUM_R_TEMP_KbestAnd NUM_R_TEMP_KbestLess than N × NUM_TEMP_L_Kbest, then KMAX=N × NUM_L_TEMP_Kbest, K0=Km, r=r1, repeat step 1)~4), r1More than r0;
5) by theme number KbestAs the best theme number of the destination document set, theme number KbestIt is correspondingLDA models are the optimal L DA models of the destination document set.
Further, according to K=K0+ r*n changes K values, and r is the theme number incremental spacing number, and n is positive integer.
Further, average similarity AC is calculatedKMethod be:Calculate first the probability distribution of this K theme-word toThe similarity of amount between any two, then calculates being averaged for each similarity and is worth to average similarity ACK。
Further, the N values are 4.
Further, r0The initial value of the number that is the theme incremental spacing;r0=3, KMAXInitial value be set as 30, r0=10.
Further, the m=30.
The present invention will automate the proposition to search problem, resolving ideas, specific algorithm according to the optimal theme number of LDA modelsIt is illustrated etc. several parts.
First part, LDA models optimize the proposition that theme number automatically determines problem.
In use, theme number needs specified in advance LDA models.Specified different theme number, LDA are trainedThe data arrived are also different.For Cao Juan etc. it has been proved that when average similarity minimum between theme, corresponding LDA models are optimal,At this moment optimal theme number is just obtained.Based on this conclusion, how algorithm for design, minimum similarity degree is between being automatically found themeA research topic of great practical value.
It is exemplified below, the common method for manually determining optimal theme number in each document.For specific documentSet, calculates and draws the tendency chart that average similarity between theme change with theme number, and artificial foundation tendency chart can be sentencedMake optimal theme number.Detailed process is as follows:First, theme number computer capacity is set, the meter of setting theme number is includedCalculate section and incremental spacing;Then, under the theme number of each setting, corresponding LDA model trainings are carried out, and calculateAverage similarity under each theme number between theme.After the completion of all calculating, so that it may average similar between theme to drawThe tendency chart that degree (ordinate) changes with theme number (abscissa), (this is in several different scales as shown in Figure 1 to Figure 3The experiment done on data set).According to trend chart, the position of global average similarity minimum value manually may determine that, withDetermine optimal theme number.
An algorithm is designed, the above process is automated, is the problem to be solved in the present invention.
Second part, the optimal L DA Automatic Model Selection algorithm resolving ideas based on minimum average B configuration similarity between theme.
Fig. 1 to Fig. 3 is certain embodiments.On different data sets, trend that average similarity changes with theme numberFigure estimation is various.In face of a specific collection of document, if expecting the corresponding theme of minimum average B configuration similarityHow number, be obtained by calculation reliable average similarity that one can refer to and become with theme number variation diagram and solve the problems, such asIt is crucial.And to solve the problems, such as this, it is crucial that determining that (theme number investigates section, starting point to abscissa range in variation diagramAnd terminal) and interval (theme number incremental spacing).In the present invention, specific design is as follows:
In the present invention, the starting point (minimum value) in theme number investigation section is set as 2.Starting point setting is excessive, it is possible toMiss minimum average B configuration similarity.
Consider the terminal of theme number computation interval.LDA models, the assignable maximum value of theme number, theoretically, noMore than the dictionary maximum value obtained from this document set, (extreme case, a word indicate a theme, and a document has moreLack different words, both how many different theme).But it is shown from the test data of current all kinds of documents, optimal theme numberFar from reaching this value (because of typically tens, a word up to a hundred, indicating a theme).On the contrary, most relative to above-mentioned theoryBig value, optimal theme number, be usually all close to starting point (minimum value) end on one side, so during usually everybody tests, themeNumber, which investigates mode, to be calculated since starting point end (minimum value), is then incremented by according to certain intervals, is calculated to some estimationMaximum value terminate, this mode, which can guarantee, finds and full out finds best theme number.Simultaneously as finding best themeDuring number, LDA models are trained according to theme number, so to consider to train cost problem.Theme number specifies numerical valueBigger, corresponding LDA model trainings calculation amount is bigger, and calculating duration may be at several times, the growth of decades of times.So comprehensiveIt closes and considers that factors above, ideal case are, size is arranged in theme number interval terminal, should ensure that section includes minimum average B configuration phaseIt is suitably more again like the corresponding point of degree, to can determine that out follow-up variation tendency, ensure the point found be global minima orPerson's approximation global minima.In the present invention, theme number computation interval terminal (maximum value), using when judging increased mode intoRow, terminal maximum value determine that centered on the theme number where minimum average B configuration similarity, maximum value numerical value is by following methodCenter " the right " theme number 5 times (the corresponding average similarity of center " the right " theme number it is more corresponding than center place mostSmall average similarity is big), it is presently believed that in this way, can largely portray average similarity between themeWith the trend that theme number changes, it can guarantee that the optimal theme number of acquisition is global optimum or approximate global optimum, it can be to closeThe cost of reason farthest meets the needs of practical business application.
The setting of theme number incremental spacing number.In order not to miss global minima because the setting of space-number is excessive and be averaged phaseLike degree, while it is further contemplated that initial incremental spacing is set as 3 by calculation amount, the present invention.
Part III, specific algorithm of the present invention illustrates, including algorithm steps, parameter etc..
The optimal theme number algorithm of LDA models that the present invention designs is as follows:
Algorithm inputs:Destination document set, theme number investigate section threshold value K0, end point values KMAX, theme number be incremented bySpace-number r.
Algorithm exports:Optimal theme number Kbest, the output data of optimal L DA models acquisition.
Step:
1, K is enabled0=2, meanwhile, theme number incremental spacing r0Initial value is set as 3, KMAXInitial value be set as 30;
2, designated key number K, K=K in the following manner0+ r*n, n=(0,1,2 ... ...), and K [ in section;K0,KMAX], space-number r=r0;Then, cycle carries out following calculating step (2.1-2.3), until traversing all K (notes:It is incremented byThe theme number maximum value of acquisition is denoted as Km):
2.1 are directed to destination document set, set the theme number initial value of the destination document set as K, utilize LDA modelsThe destination document set is trained, every time after the completion of training, obtains the ProbabilityDistribution Vector of K theme-word, each vectorUse ziIt indicates, and i=(1,2 ... .., K), zi={ βI, 1, βI, 2,..., βI, V,, wherein βI, 1Indicate every in vectorial collection of documentProbability value of a word at this theme i, V are that (LDA model trainings, please refer to for the size of the dictionary obtained by collection of documentHeinrich Gregor, Parameter estimation for text analysis, 2009.).
After the completion of 2.2 above-mentioned training, the average similarity between the theme obtained according to following formula, computation model, noteFor ACK, subscript K expression theme numbers are K, are appended to successively in a global average similarity array AC_Array, AC_Array arrays are one-dimension array.
First, similarity between two themes of calculating, calculation formula are as follows:
Then, the average similarity between theme is calculated, calculation formula is as follows:
3, interim minimum average B configuration similarity is calculated.To the average similarity array AC_Array[ in step 2.2;2,KMAX]It asksMinimum value obtains interim minimum average B configuration similarity and at this time corresponding theme number TEMP_Kbest。
4, after having carried out step 3, investigation algorithm continues to run with or algorithm terminates.
Investigate interim minimum average B configuration similarity (corresponding theme number TEMP_Kbest) in average similarity array AC_Position distribution situation in Array:
In array AC_Array, using interim minimum average B configuration similarity position as "center", the number on the right of "center"Group element total number is denoted as NUM_R_TEMP_Kbest, the array element total number on the "center" left side is denoted as NUM_L_TEMP_Kbest,
(1) if NUM_L_TEMP_KbestMore than NUM_R_TEMP_Kbest, then K is enabledMAX=KMAX+ m (m=30 in the present invention),K0=Km, r=r0It is transferred to step 2;
(2) if NUM_L_TEMP_KbestLess than NUM_R_TEMP_KbestAnd NUM_R_TEMP_KbestLess than 4 × NUM_TEMP_L_Kbest, then KMAX=4 × NUM_L_TEMP_Kbest,K0=Km, r=10, is transferred to step 2.
(3) if NUM_R_TEMP_KbestMore than 4 × NUM_L_TEMP_Kbest, terminate and calculate, and enable Kbest=TEMP_Kbest。
5, it is directed to destination document set, exports best theme number Kbest;Assert this theme number (Kbest) under it is correspondingLDA models are optimal L DA models;According to the calculating of above-mentioned steps 2, this LDA model trained mistake, you can export this LDA mouldThe related data that type training obtains, including the ProbabilityDistribution Vector of theme-word, document-theme ProbabilityDistribution Vector (these dataIt is the implicit semantic information that LDA models are excavated from destination document set), other calculating can be used for.
Beneficial effects of the present invention
In order to examine the actual effect of algorithm proposed by the present invention, this algorithm to be counted on 3 different data setsAccording to the experiment.Experiment shows and (refers to attached fig. 4 to fig. 6), in the case of no human interference, can accurately automatically determine out mostExcellent theme number.Meanwhile between the main body drawn average similarity with theme number variation diagram, it is shown that variation tendency showsWhat is found is strictly Optimal units.
Specific implementation mode
Features described above and advantage to enable the present invention are clearer and more comprehensible, special embodiment below, and institute's attached drawing is coordinated to makeDetailed description are as follows.
Present invention implementation is very simple, as long as according to algorithm steps, writes program implementation, method flow of the inventionAs shown in fig. 7, its detailed algorithm is as follows:
Algorithm inputs:Destination document set, theme number investigate section threshold value K0, end point values KMAX, theme number be incremented bySpace-number r.
Algorithm exports:Optimal theme number Kbest, the output of the LDA models obtained at this time.
Step:
1, K is enabled0=2, meanwhile, theme number incremental spacing r0Initial value is set as 3, KMAXInitial value be set as 30;
2, designated key number K, K=K in the following manner0+ r*n, n=(0,1,2 ... ...), and K [ in section;K0,KMAX], space-number r=r0;Then, cycle carries out following calculating step (2.1-2.3), until traversing all K (notes:It is incremented byThe theme number maximum value of acquisition is denoted as Km):
2.1 are directed to destination document set, set the theme number initial value of the destination document set as K, utilize LDA modelsThe destination document set is trained, every time after the completion of training, obtains the ProbabilityDistribution Vector of K theme-word, each vectorUse ziIt indicates, and i=(1,2 ... .., K), zi={ βI, 1, βI, 2,..., βI, V,, wherein βI, 1Indicate every in vectorial collection of documentProbability value of a word at this theme i, V are that (LDA model trainings, please refer to for the size of the dictionary obtained by collection of documentHeinrich Gregor, Parameter estimation for text analysis, 2009.).
After the completion of 2.2 above-mentioned training, the average similarity between the theme obtained according to following formula, computation model, noteFor ACK, subscript K expression theme numbers are K, are appended to successively in a global average similarity array AC_Array, AC_Array arrays are one-dimension array.
First, similarity between two themes of calculating, calculation formula are as follows:
Then, the average similarity between theme is calculated, calculation formula is as follows:
3, interim minimum average B configuration similarity is calculated.To the average similarity array AC_Array[ in step 2.2;2,KMAX]It asksMinimum value obtains interim minimum average B configuration similarity and at this time corresponding theme number TEMP_Kbest。
4, after having carried out step 3, investigation algorithm continues to run with or algorithm terminates.
Investigate interim minimum average B configuration similarity (corresponding theme number TEMP_Kbest) in average similarity array AC_Position distribution situation in Array:
In array AC_Array, using interim minimum average B configuration similarity position as "center", the number on the right of "center"Group element total number is denoted as NUM_R_TEMP_Kbest, the array element total number on the "center" left side is denoted as NUM_L_TEMP_Kbest,
(1) if NUM_L_TEMP_KbestMore than NUM_R_TEMP_Kbest, then K is enabledMAX=KMAX+ m (m=30 in the present invention),K0=Km, r=r0It is transferred to step 2;
(2) if NUM_L_TEMP_KbestLess than NUM_R_TEMP_KbestAnd NUM_R_TEMP_KbestLess than 4 × NUM_TEMP_L_Kbest, then KMAX=4 × NUM_L_TEMP_Kbest,K0=Km, r=10, is transferred to step 2.
(3) if NUM_R_TEMP_KbestMore than 4 × NUM_L_TEMP_Kbest, terminate and calculate, and enable Kbest=TEMP_Kbest。
5, it is directed to destination document set, exports best theme number Kbest;Assert this theme number (Kbest) under it is correspondingLDA models are optimal L DA models;According to the calculating of above-mentioned steps 2, this LDA model trained mistake, you can export this LDA mouldThe related data that type training obtains, including the ProbabilityDistribution Vector of theme-word, document-theme ProbabilityDistribution Vector (these dataIt is the implicit semantic information that LDA models are excavated from destination document set), other calculating can be used for.
It is above to implement to be merely illustrative of the technical solution of the present invention rather than be limited, the ordinary skill people of this fieldMember can be modified or replaced equivalently technical scheme of the present invention, without departing from the spirit and scope of the present invention, this hairBright protection domain should be subject to described in claims.