Transactions of the Association for Computational Linguistics
July 01 2020

    Topic Modeling in Embedding Spaces

Adji B. Dieng, Columbia University, New York, NY, USA. [email protected]
Francisco J. R. Ruiz, DeepMind, London, UK. [email protected]*
David M. Blei, Columbia University, New York, NY, USA. [email protected]

* Work done while at Columbia University and the University of Cambridge.

Received: February 01 2019
Revision Received: May 01 2020
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode
Transactions of the Association for Computational Linguistics (2020) 8: 439–453.

Citation: Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei; Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics 2020; 8: 439–453. doi: https://doi.org/10.1162/tacl_a_00325
      Abstract

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (etm), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the etm models each word with a categorical distribution whose natural parameter is the inner product between the word's embedding and an embedding of its assigned topic. To fit the etm, we develop an efficient amortized variational inference algorithm. The etm discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.

      1 Introduction

Topic models are statistical tools for discovering the hidden semantic structure in a collection of documents (Blei et al., 2003; Blei, 2012). Topic models and their extensions have been applied to many fields, such as marketing, sociology, political science, and the digital humanities. Boyd-Graber et al. (2017) provide a review.

Most topic models build on latent Dirichlet allocation (lda) (Blei et al., 2003). lda is a hierarchical probabilistic model that represents each topic as a distribution over terms and represents each document as a mixture of the topics. When fit to a collection of documents, the topics summarize their contents, and the topic proportions provide a low-dimensional representation of each document. lda can be fit to large datasets of text by using variational inference and stochastic optimization (Hoffman et al., 2010, 2013).

      lda is a powerful model and it is widely used. However, it suffers from a pervasive technical problem—it fails in the face of large vocabularies. Practitioners must severely prune their vocabularies in order to fit good topic models—namely, those that are both predictive and interpretable. This is typically done by removing the most and least frequent words. On large collections, this pruning may remove important terms and limit the scope of the models. The problem of topic modeling with large vocabularies has yet to be addressed in the research literature.

In parallel with topic modeling came the idea of word embeddings. Research in word embeddings begins with the neural language model of Bengio et al. (2003), published in the same year and journal as Blei et al. (2003). Word embeddings eschew the "one-hot" representation of words, a vocabulary-length vector of zeros with a single one, to learn a distributed representation, one where words with similar meanings are close in a lower-dimensional vector space (Rumelhart and Abrahamson, 1973; Bengio et al., 2006). As for topic models, researchers scaled up embedding methods to large datasets (Mikolov et al., 2013a, b; Pennington et al., 2014; Levy and Goldberg, 2014; Mnih and Kavukcuoglu, 2013). Word embeddings have been extended and developed in many ways. They have become crucial in many applications of natural language processing (Maas et al., 2011; Li and Yang, 2018), and they have also been extended to datasets beyond text (Rudolph et al., 2016).

In this paper, we develop the embedded topic model (etm), a document model that marries lda and word embeddings. The etm enjoys the good properties of topic models and the good properties of word embeddings. As a topic model, it discovers an interpretable latent semantic structure of the documents; as a word embedding model, it provides a low-dimensional representation of the meaning of words. The etm robustly accommodates large vocabularies and the long tail of language data.

Figure 1 illustrates the advantages. This figure shows the ratio between the perplexity on held-out documents (a measure of predictive performance) and the topic coherence (a measure of the quality of the topics), as a function of the size of the vocabulary. (The perplexity has been normalized by the vocabulary size.) This is for a corpus of 11.2K articles from the 20NewsGroup corpus and for 100 topics. The red line is lda; its performance deteriorates as the vocabulary size increases, as both the predictive performance and the quality of the topics get worse. The blue line is the etm; it maintains good performance, even as the vocabulary size becomes large.

Figure 1: Ratio of the held-out perplexity on a document completion task and the topic coherence as a function of the vocabulary size for the etm and lda on the 20NewsGroup corpus. The perplexity is normalized by the size of the vocabulary. While the performance of lda deteriorates for large vocabularies, the etm maintains good performance.

Like lda, the etm is a generative probabilistic model: Each document is a mixture of topics and each observed word is assigned to a particular topic. In contrast to lda, the per-topic conditional probability of a term has a log-linear form that involves a low-dimensional representation of the vocabulary. Each term is represented by an embedding and each topic is a point in that embedding space. The topic's distribution over terms is proportional to the exponentiated inner product of the topic's embedding and each term's embedding. Figures 2 and 3 show topics from a 300-topic etm of The New York Times. The figures show each topic's embedding and its closest words; these topics are about Christianity and sports.

Figure 2: A topic about Christianity found by the etm on The New York Times. The topic is a point in the word embedding space.

Figure 3: Topics about sports found by the etm on The New York Times. Each topic is a point in the word embedding space.

Representing topics as points in the embedding space allows the etm to be robust to the presence of stop words, unlike most topic models. When stop words are included in the vocabulary, the etm assigns topics to the corresponding area of the embedding space (we demonstrate this in Section 6).

As for most topic models, the posterior of the topic proportions is intractable to compute. We derive an efficient algorithm for approximating the posterior with variational inference (Jordan et al., 1999; Hoffman et al., 2013; Blei et al., 2017) and additionally use amortized inference to efficiently approximate the topic proportions (Kingma and Welling, 2014; Rezende et al., 2014). The resulting algorithm fits the etm to large corpora with large vocabularies. This algorithm can either use previously fitted word embeddings, or fit them jointly with the rest of the parameters. (In particular, Figures 1 to 3 were made using the version of the etm that uses pre-fitted skip-gram word embeddings.)

We compared the performance of the etm to lda, the neural variational document model (nvdm) (Miao et al., 2016), and prodlda (Srivastava and Sutton, 2017).¹ The nvdm is a form of multinomial matrix factorization and prodlda is a modern version of lda that uses a product of experts to model the distribution over words. We also compare to a document model that combines prodlda with pre-fitted word embeddings. The etm yields better predictive performance, as measured by held-out log-likelihood on a document completion task (Wallach et al., 2009b). It also discovers more meaningful topics, as measured by topic coherence (Mimno et al., 2011) and topic diversity. The latter is a metric we introduce in this paper that, together with topic coherence, gives a better indication of the quality of the topics. The etm is especially robust to large vocabularies.

      2 Related Work

This work develops a new topic model that extends lda. lda has been extended in many ways, and topic modeling has become a subfield of its own. For a review, see Blei (2012) and Boyd-Graber et al. (2017).

A broader set of related works are neural topic models. These mainly focus on improving topic modeling inference through deep neural networks (Srivastava and Sutton, 2017; Card et al., 2017; Cong et al., 2017; Zhang et al., 2018). Specifically, these methods reduce the dimension of the text data through amortized inference and the variational auto-encoder (Kingma and Welling, 2014; Rezende et al., 2014). To perform inference in the etm, we also avail ourselves of amortized inference methods (Gershman and Goodman, 2014).

As a document model, the etm also relates to works that learn per-document representations as part of an embedding model (Le and Mikolov, 2014; Moody, 2016; Miao et al., 2016; Li et al., 2016). In contrast to these works, the document variables in the etm are part of a larger probabilistic topic model.

One of the goals in developing the etm is to incorporate word similarity into the topic model, and there is previous research that shares this goal. These methods either modify the topic priors (Petterson et al., 2010; Zhao et al., 2017b; Shi et al., 2017; Zhao et al., 2017a) or the topic assignment priors (Xie et al., 2015). For example, Petterson et al. (2010) use a word similarity graph (as given by a thesaurus) to bias lda towards assigning similar words to similar topics. As another example, Xie et al. (2015) model the per-word topic assignments of lda using a Markov random field to account for both the topic proportions and the topic assignments of similar words. These methods use word similarity as a type of "side information" about language; in contrast, the etm directly models the similarity (via embeddings) in its generative process of words.

However, a more closely related set of works directly combine topic modeling and word embeddings. One common strategy is to convert the discrete text into continuous observations of embeddings, and then adapt lda to generate real-valued data (Das et al., 2015; Xun et al., 2016; Batmanghelich et al., 2016; Xun et al., 2017). With this strategy, topics are Gaussian distributions with latent means and covariances, and the likelihood over the embeddings is modeled with a Gaussian (Das et al., 2015) or a von Mises-Fisher distribution (Batmanghelich et al., 2016). The etm differs from these approaches in that it is a model of categorical data, one that goes through the embeddings matrix. Thus it does not require pre-fitted embeddings and, indeed, can learn embeddings as part of its inference process. The etm also differs from these approaches in that it is amenable to large datasets with large vocabularies.

There are a few other ways of combining lda and embeddings. Nguyen et al. (2015) mix the likelihood defined by lda with a log-linear model that uses pre-fitted word embeddings; Bunk and Krestel (2018) randomly replace words drawn from a topic with their embeddings drawn from a Gaussian; Xu et al. (2018) adopt a geometric perspective, using Wasserstein distances to learn topics and word embeddings jointly; and Keya et al. (2019) propose the neural embedding allocation (nea), which has a similar generative process to the etm but is fit using a pre-fitted lda model as a target distribution. Because it requires lda, the nea suffers from the same limitation as lda. These models often lack scalability with respect to the vocabulary size and are fit using Gibbs sampling, limiting their applicability to large corpora.

      3 Background

The etm builds on two main ideas, lda and word embeddings. Consider a corpus of D documents, where the vocabulary contains V distinct terms. Let w_dn ∈ {1, …, V} denote the nth word in the dth document.

      Latent Dirichlet Allocation.

lda is a probabilistic generative model of documents (Blei et al., 2003). It posits K topics β_1:K, each of which is a distribution over the vocabulary. lda assumes each document comes from a mixture of topics, where the topics are shared across the corpus and the mixture proportions are unique for each document. The generative process for each document is the following:

1. Draw topic proportions θ_d ∼ Dirichlet(α_θ).

2. For each word n in the document:

  • Draw topic assignment z_dn ∼ Cat(θ_d).

  • Draw word w_dn ∼ Cat(β_{z_dn}).

Here, Cat(·) denotes the categorical distribution. lda places a Dirichlet prior on the topics,
\beta_k \sim \mathrm{Dirichlet}(\alpha_\beta) \quad \text{for } k = 1, \ldots, K.
The concentration parameters α_β and α_θ of the Dirichlet distributions are fixed model hyperparameters.
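As a concrete illustration, the generative process above can be simulated directly. The sketch below is illustrative only; the vocabulary size, number of topics, document length, and concentration values are made-up numbers, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, N_d = 1000, 10, 50            # hypothetical vocabulary size, topics, words per document
alpha_beta, alpha_theta = 0.1, 0.5  # hypothetical Dirichlet concentration hyperparameters

# Topics: K distributions over the vocabulary, beta_k ~ Dirichlet(alpha_beta)
beta = rng.dirichlet(np.full(V, alpha_beta), size=K)  # K x V

def generate_document():
    # 1. Draw topic proportions theta_d ~ Dirichlet(alpha_theta)
    theta_d = rng.dirichlet(np.full(K, alpha_theta))
    words = []
    for _ in range(N_d):
        # 2a. Draw topic assignment z_dn ~ Cat(theta_d)
        z_dn = rng.choice(K, p=theta_d)
        # 2b. Draw word w_dn ~ Cat(beta_{z_dn})
        words.append(rng.choice(V, p=beta[z_dn]))
    return words

print(generate_document()[:10])
```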

      Word Embeddings.

Word embeddings provide models of language that use vector representations of words (Rumelhart and Abrahamson, 1973; Bengio et al., 2003). The word representations are fitted to relate to meaning, in that words with similar meanings will have representations that are close. (In embeddings, the "meaning" of a word comes from the contexts in which it is used [Harris, 1954].)

We focus on the continuous bag-of-words (cbow) variant of word embeddings (Mikolov et al., 2013b). In cbow, the likelihood of each word w_dn is
w_{dn} \sim \mathrm{softmax}(\rho^\top \alpha_{dn}).
(1)
The embedding matrix ρ is an L × V matrix whose columns contain the embedding representations of the vocabulary, ρ_v ∈ ℝ^L. The vector α_dn is the context embedding. The context embedding is the sum of the context embedding vectors (α_v for each word v) of the words surrounding w_dn.
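To make Eq. 1 concrete, here is a minimal sketch of the cbow likelihood for a single position; the embedding matrices and context window below are toy values introduced for illustration, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L, V = 8, 20                         # hypothetical embedding dimension and vocabulary size
rho = rng.normal(size=(L, V))        # word embeddings (columns rho_v)
alpha_ctx = rng.normal(size=(L, V))  # context embeddings (columns alpha_v)

def cbow_probs(context_word_ids):
    # Context vector alpha_dn: sum of context embeddings of the surrounding words
    alpha_dn = alpha_ctx[:, context_word_ids].sum(axis=1)
    logits = rho.T @ alpha_dn        # natural parameters rho_v^T alpha_dn
    p = np.exp(logits - logits.max())
    return p / p.sum()               # softmax over the vocabulary

probs = cbow_probs([2, 5, 7, 11])    # words surrounding position n
print(probs.argmax(), probs.sum())   # most likely center word; probabilities sum to 1
```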

      4 The Embedded Topic Model

The etm is a topic model that uses embedding representations of both words and topics. It contains two notions of latent dimension. First, it embeds the vocabulary in an L-dimensional space. These embeddings are similar in spirit to classical word embeddings. Second, it represents each document in terms of K latent topics.

In traditional topic modeling, each topic is a full distribution over the vocabulary. In the etm, however, the kth topic is a vector α_k ∈ ℝ^L in the embedding space. We call α_k a topic embedding: it is a distributed representation of the kth topic in the semantic space of words.

In its generative process, the etm uses the topic embedding to form a per-topic distribution over the vocabulary. Specifically, the etm uses a log-linear model that takes the inner product of the word embedding matrix and the topic embedding. With this form, the etm assigns high probability to a word v in topic k by measuring the agreement between the word's embedding and the topic's embedding.

Denote the L × V word embedding matrix by ρ; the column ρ_v is the embedding of term v. Under the etm, the generative process of the dth document is the following:

1. Draw topic proportions θ_d ∼ LN(0, I).

2. For each word n in the document:

  • Draw topic assignment z_dn ∼ Cat(θ_d).

  • Draw the word w_dn ∼ softmax(ρ^⊤ α_{z_dn}).

In Step 1, LN(·) denotes the logistic-normal distribution (Aitchison and Shen, 1980; Blei and Lafferty, 2007); it transforms a standard Gaussian random variable to the simplex. A draw θ_d from this distribution is obtained as
\delta_d \sim \mathcal{N}(0, I); \quad \theta_d = \mathrm{softmax}(\delta_d).
(2)
(We replaced the Dirichlet with the logistic normal to easily use reparameterization in the inference algorithm; see Section 5.)

Steps 1 and 2a are standard for topic modeling: They represent documents as distributions over topics and draw a topic assignment for each observed word. Step 2b is different; it uses the embeddings of the vocabulary ρ and the assigned topic embedding α_{z_dn} to draw the observed word from the assigned topic, as given by z_dn.
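The following sketch mirrors this generative process, including how each topic embedding α_k induces a distribution over the vocabulary through softmax(ρ^⊤ α_k); all sizes are illustrative placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
L, V, K, N_d = 16, 500, 5, 40    # hypothetical embedding dim, vocab size, topics, doc length

rho = rng.normal(size=(L, V))    # word embedding matrix (columns rho_v)
alpha = rng.normal(size=(L, K))  # topic embeddings alpha_k

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Each topic k induces a distribution over the vocabulary: beta_k = softmax(rho^T alpha_k)
beta = np.stack([softmax(rho.T @ alpha[:, k]) for k in range(K)])  # K x V

def generate_document():
    # Step 1: theta_d ~ LN(0, I), i.e., delta_d ~ N(0, I) and theta_d = softmax(delta_d)
    delta_d = rng.normal(size=K)
    theta_d = softmax(delta_d)
    words = []
    for _ in range(N_d):
        z_dn = rng.choice(K, p=theta_d)            # Step 2a: topic assignment
        words.append(rng.choice(V, p=beta[z_dn]))  # Step 2b: word from softmax(rho^T alpha_{z_dn})
    return words

print(generate_document()[:10])
```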

The topic distribution in Step 2b mirrors the cbow likelihood in Eq. 1. Recall that cbow uses the surrounding words to form the context vector α_dn. In contrast, the etm uses the topic embedding α_{z_dn} as the context vector, where the assigned topic z_dn is drawn from the per-document variable θ_d. The etm draws its words from a document context, rather than from a window of surrounding words.

The etm likelihood uses a matrix of word embeddings ρ, a representation of the vocabulary in a lower dimensional space. In practice, it can either rely on previously fitted embeddings or learn them as part of its overall fitting procedure. When the etm learns the embeddings as part of the fitting procedure, it simultaneously finds topics and an embedding space.

When the etm uses previously fitted embeddings, it learns the topics of a corpus in a particular embedding space. This strategy is particularly useful when there are words in the embedding that are not used in the corpus. The etm can hypothesize how those words fit into the topics because it can calculate ρ_v^⊤ α_k even for words v that do not appear in the corpus.

      5 Inference and Estimation

We are given a corpus of documents {w_1, …, w_D}, where the dth document w_d is a collection of N_d words. How do we fit the etm to this corpus?

      The Marginal Likelihood.

The parameters of the etm are the word embeddings ρ_1:V and the topic embeddings α_1:K; each α_k is a point in the word embedding space. We maximize the log marginal likelihood of the documents,
\mathcal{L}(\alpha, \rho) = \sum_{d=1}^{D} \log p(w_d \mid \alpha, \rho).
(3)
The problem is that the marginal likelihood of each document, p(w_d | α, ρ), is intractable to compute. It involves a difficult integral over the topic proportions, which we write in terms of the untransformed proportions δ_d in Eq. 2,
p(w_d \mid \alpha, \rho) = \int p(\delta_d) \prod_{n=1}^{N_d} p(w_{dn} \mid \delta_d, \alpha, \rho) \, d\delta_d.
(4)
The conditional distribution p(w_dn | δ_d, α, ρ) of each word marginalizes out the topic assignment z_dn,
p(w_{dn} \mid \delta_d, \alpha, \rho) = \sum_{k=1}^{K} \theta_{dk} \, \beta_{k, w_{dn}}.
(5)
Here, θ_dk denotes the (transformed) topic proportions (Eq. 2) and β_k,v denotes a traditional "topic," that is, a distribution over words, induced by the word embeddings ρ and the topic embedding α_k,
\beta_{kv} = \mathrm{softmax}(\rho^\top \alpha_k)\big|_v.
(6)
Eqs. 4, 5, and 6 flesh out the likelihood in Eq. 3.

      Variational Inference.

We sidestep the intractable integral in Eq. 4 with variational inference (Jordan et al., 1999; Blei et al., 2017). Variational inference optimizes a sum of per-document bounds on the log of the marginal likelihood of Eq. 4.

To begin, posit a family of distributions of the untransformed topic proportions q(δ_d; w_d, ν). This family of distributions is parameterized by ν. We use amortized inference, where q(δ_d; w_d, ν) (called a variational distribution) depends on both the document w_d and the shared parameters ν. In particular, q(δ_d; w_d, ν) is a Gaussian whose mean and variance come from an "inference network," a neural network parameterized by ν (Kingma and Welling, 2014). The inference network ingests a bag-of-words representation of the document w_d and outputs the mean and covariance of δ_d. (To accommodate documents of varying length, we form the input of the inference network by normalizing the bag-of-words representation of the document by the number of words N_d.)
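A minimal PyTorch-style sketch of such an inference network is shown below; the layer sizes, activations, and names are assumptions for illustration and do not reproduce the authors' released implementation.

```python
import torch
import torch.nn as nn

class InferenceNetwork(nn.Module):
    """Amortized q(delta_d; w_d, nu): maps a normalized bag-of-words to a Gaussian."""

    def __init__(self, vocab_size, num_topics, hidden=300):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
        )
        self.mu = nn.Linear(hidden, num_topics)      # mean of delta_d
        self.logvar = nn.Linear(hidden, num_topics)  # log of the diagonal covariance

    def forward(self, bow):
        # Normalize the bag-of-words by document length to handle varying lengths
        x = bow / bow.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = self.body(x)
        return self.mu(h), self.logvar(h)

# Usage with made-up sizes: a batch of 2 documents over a 1,000-word vocabulary
net = InferenceNetwork(vocab_size=1000, num_topics=50)
bow = torch.randint(0, 3, (2, 1000)).float()
mu_d, logvar_d = net(bow)
print(mu_d.shape, logvar_d.shape)
```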

We use this family of distributions to bound the log of the marginal likelihood in Eq. 4. The bound is called the evidence lower bound (elbo) and is a function of the model parameters and the variational parameters,
\mathcal{L}(\alpha, \rho, \nu) = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \mathbb{E}_q[\log p(w_{dn} \mid \delta_d, \rho, \alpha)] - \sum_{d=1}^{D} \mathrm{KL}(q(\delta_d; w_d, \nu) \,\|\, p(\delta_d)).
(7)
The first term of the elbo (Eq. 7) encourages variational distributions q(δ_d; w_d, ν) that place mass on topic proportions δ_d that explain the observed words, and the second term encourages q(δ_d; w_d, ν) to be close to the prior p(δ_d). Maximizing the elbo with respect to the model parameters (α, ρ) is equivalent to maximizing the expected complete log-likelihood, Σ_d log p(δ_d, w_d | α, ρ).
The elbo in Eq. 7 is intractable because the expectation is intractable. However, we can form a Monte Carlo approximation of the elbo,
\tilde{\mathcal{L}}(\alpha, \rho, \nu) = \frac{1}{S} \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{s=1}^{S} \log p(w_{dn} \mid \delta_d^{(s)}, \rho, \alpha) - \sum_{d=1}^{D} \mathrm{KL}(q(\delta_d; w_d, \nu) \,\|\, p(\delta_d)),
(8)
where δ_d^{(s)} ∼ q(δ_d; w_d, ν) for s = 1, …, S. To form an unbiased estimator of the elbo and its gradients, we use the reparameterization trick when sampling the unnormalized proportions δ_d^{(1)}, …, δ_d^{(S)} (Kingma and Welling, 2014; Titsias and Lázaro-Gredilla, 2014; Rezende et al., 2014). That is, we sample δ_d^{(s)} from q(δ_d; w_d, ν) as
\varepsilon_d^{(s)} \sim \mathcal{N}(0, I) \quad \text{and} \quad \delta_d^{(s)} = \mu_d + \Sigma_d^{1/2} \varepsilon_d^{(s)},
(9)
where μ_d and Σ_d are the mean and covariance of q(δ_d; w_d, ν) respectively, which depend implicitly on ν and w_d via the inference network. We use a diagonal covariance matrix Σ_d.
We also use data subsampling to handle large collections of documents (Hoffman et al., 2013). Denote by ℬ a minibatch of documents. Then the approximation of the elbo using data subsampling is
\tilde{\mathcal{L}}(\alpha, \rho, \nu) = \frac{D}{|\mathcal{B}|} \sum_{d \in \mathcal{B}} \sum_{n=1}^{N_d} \sum_{s=1}^{S} \log p(w_{dn} \mid \delta_d^{(s)}, \rho, \alpha) - \frac{D}{|\mathcal{B}|} \sum_{d \in \mathcal{B}} \mathrm{KL}(q(\delta_d; w_d, \nu) \,\|\, p(\delta_d)).
(10)
Given that the prior p(δ_d) and q(δ_d; w_d, ν) are both Gaussians, the KL admits a closed-form expression,
\mathrm{KL}(q(\delta_d; w_d, \nu) \,\|\, p(\delta_d)) = \frac{1}{2}\left(\mathrm{tr}(\Sigma_d) + \mu_d^\top \mu_d - \log \det(\Sigma_d) - K\right).
(11)
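As a small worked example of Eqs. 9 and 11, the snippet below draws one reparameterized sample and evaluates the closed-form KL under the diagonal-covariance assumption; the dimensions and values are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                    # number of topics (toy value)
mu_d = rng.normal(size=K)                # mean of q(delta_d; w_d, nu)
log_sigma2_d = rng.normal(size=K) * 0.1  # log of the diagonal of Sigma_d

# Eq. 9: reparameterized sample delta_d = mu_d + Sigma_d^{1/2} * eps
eps = rng.normal(size=K)
delta_d = mu_d + np.exp(0.5 * log_sigma2_d) * eps

# Eq. 11: closed-form KL between q(delta_d; w_d, nu) and the standard Gaussian prior
sigma2_d = np.exp(log_sigma2_d)
kl = 0.5 * (sigma2_d.sum() + mu_d @ mu_d - log_sigma2_d.sum() - K)
print(delta_d, kl)
```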
We optimize the stochastic elbo in Eq. 10 with respect to both the model parameters (α, ρ) and the variational parameters ν. We set the learning rate with Adam (Kingma and Ba, 2015). The procedure is shown in Algorithm 1, where we set the number of Monte Carlo samples S = 1 and the notation NN(x; ν) represents a neural network with input x and parameters ν.

[Algorithm 1 appears as a graphic in the original article.]
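Since Algorithm 1 is only reproduced as an image above, here is a hedged PyTorch-style sketch of one possible training step following the description in this section (minibatching, reparameterization with S = 1, closed-form KL, and Adam). The class name ETM, the layer sizes, and the toy minibatch are assumptions; the authors' released code at https://github.com/adjidieng/ETM is the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ETM(nn.Module):
    """Sketch of the etm: word embeddings rho, topic embeddings alpha, amortized q(delta_d)."""

    def __init__(self, vocab_size, num_topics, embed_dim=300, hidden=300):
        super().__init__()
        self.rho = nn.Parameter(torch.randn(embed_dim, vocab_size))    # L x V (could be pre-fitted)
        self.alpha = nn.Parameter(torch.randn(embed_dim, num_topics))  # L x K topic embeddings
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
        )
        self.mu = nn.Linear(hidden, num_topics)
        self.logvar = nn.Linear(hidden, num_topics)

    def forward(self, bow):
        # Amortized variational distribution q(delta_d; w_d, nu)
        x = bow / bow.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization (Eq. 9) with a single Monte Carlo sample
        delta = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        theta = F.softmax(delta, dim=1)                         # topic proportions
        beta = F.softmax(self.rho.t() @ self.alpha, dim=0).t()  # K x V, Eq. 6
        log_word_probs = torch.log(theta @ beta + 1e-10)        # mixture over topics, Eq. 5
        recon = (bow * log_word_probs).sum(dim=1)               # expected log-likelihood term
        kl = 0.5 * (logvar.exp().sum(1) + (mu ** 2).sum(1) - logvar.sum(1) - mu.size(1))  # Eq. 11
        return recon, kl

# One optimization step on a toy minibatch (sizes are made up for illustration)
model = ETM(vocab_size=1000, num_topics=50)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3, weight_decay=1.2e-6)
bow = torch.randint(0, 3, (32, 1000)).float()
recon, kl = model(bow)
loss = -(recon - kl).mean()   # negative minibatch elbo (up to the D/|B| scaling)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```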

      6 Empirical Study

We study the performance of the etm and compare it to other unsupervised document models. A good document model should provide both coherent patterns of language and an accurate distribution of words, so we measure performance in terms of both predictive accuracy and topic interpretability. We measure accuracy with log-likelihood on a document completion task (Rosen-Zvi et al., 2004; Wallach et al., 2009b); we measure topic interpretability as a blend of topic coherence and diversity. We find that, of the interpretable models, the etm is the one that provides the best predictions and topics.

In a separate analysis (Section 6.1), we study the robustness of each method in the presence of stop words. Standard topic models fail in this regime: because stop words appear in many documents, every learned topic includes some stop words, leading to poor topic interpretability. In contrast, the etm is able to use the information from the word embeddings to provide interpretable topics.

      Corpora.

We study the 20Newsgroups corpus and the New York Times corpus; the statistics of both corpora are summarized in Table 1.

Table 1: Statistics of the different corpora studied. df denotes document frequency, K denotes a thousand, and M denotes a million.

Dataset          Minimum df   #Tokens Train   #Tokens Valid   #Tokens Test   Vocabulary
20Newsgroups     100          604.9 K         5,998           399.6 K        3,102
                 30           778.0 K         7,231           512.5 K        8,496
                 10           880.3 K         6,769           578.8 K        18,625
                 5            922.3 K         8,494           605.9 K        29,461
                 2            966.3 K         8,600           622.9 K        52,258
New York Times   5,000        226.9 M         13.4 M          26.8 M         9,842
                 200          270.1 M         15.9 M          31.8 M         55,627
                 100          272.3 M         16.0 M          32.1 M         74,095
                 30           274.8 M         16.1 M          32.3 M         124,725
                 10           276.0 M         16.1 M          32.5 M         212,237

The 20Newsgroup corpus is a collection of newsgroup posts. We preprocess the corpus by filtering stop words, words with document frequency above 70%, and tokenizing. To form the vocabulary, we keep all words that appear in more than a certain number of documents, and we vary the threshold from 100 (a smaller vocabulary, where V = 3,102) to 2 (a larger vocabulary, where V = 52,258). After preprocessing, we further remove one-word documents from the validation and test sets. We split the corpus into a training set of 11,260 documents, a test set of 7,532 documents, and a validation set of 100 documents.

The New York Times corpus is a larger collection of news articles. It contains more than 1.8 million articles, spanning the years 1987–2007. We follow the same preprocessing steps as for 20Newsgroups. We form versions of this corpus with vocabularies ranging from V = 9,842 to V = 212,237. After preprocessing, we use 85% of the documents for training, 10% for testing, and 5% for validation.

      Models.

      We compare the performance of theetm against several document models. We briefly describe each below.

We consider latent Dirichlet allocation (lda) (Blei et al., 2003), a standard topic model that posits Dirichlet priors for the topics β_k and topic proportions θ_d. (We set the prior hyperparameters to 1.) It is a conditionally conjugate model, amenable to variational inference with coordinate ascent. We consider lda because it is the most commonly used topic model, and its generative process is similar to that of the etm.

We also consider the neural variational document model (nvdm) (Miao et al., 2016). The nvdm is a multinomial factor model of documents; it posits the likelihood w_dn ∼ softmax(β^⊤ θ_d), where the K-dimensional vector θ_d ∼ N(0, I_K) is a per-document variable, and β is a real-valued matrix of size K × V. The nvdm uses a per-document real-valued latent vector θ_d to average over the embedding matrix β in the logit space. Like the etm, the nvdm uses amortized variational inference to jointly learn the approximate posterior over the document representation θ_d and the model parameter β.

The nvdm is not interpretable as a topic model; its latent variables are unconstrained. We study a more interpretable variant of the nvdm which constrains θ_d to lie in the simplex, replacing its Gaussian prior with a logistic normal (Aitchison and Shen, 1980). (This can be thought of as a semi-nonnegative matrix factorization.) We call this document model Δ-nvdm.

We also consider prodlda (Srivastava and Sutton, 2017). It posits the likelihood w_dn ∼ softmax(β^⊤ θ_d) where the topic proportions θ_d are from the simplex. In contrast to lda, the topic matrix β is unconstrained.

prodlda shares its generative model with Δ-nvdm but is fit differently. prodlda uses amortized variational inference with batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014).

Finally, we consider a document model that combines prodlda with pre-fitted word embeddings ρ, by using the likelihood w_dn ∼ softmax(ρ θ_d). We call this document model prodlda-PWE, where PWE stands for Pre-fitted Word Embeddings.

We study two variants of the etm, one where the word embeddings are pre-fitted and one where they are learned jointly with the rest of the parameters. The variant with pre-fitted embeddings is called the etm-PWE.

For prodlda-PWE and the etm-PWE, we first obtain the word embeddings (Mikolov et al., 2013b) by training skip-gram on each corpus. (We reuse the same embeddings across the experiments with varying vocabulary sizes.)

      Algorithm Settings.

Given a corpus, each model comes with an approximate posterior inference problem. We use variational inference for all of the models and employ svi (Hoffman et al., 2013) to speed up the optimization. The minibatch size is 1,000 documents. For lda, we set the learning rate as suggested by Hoffman et al. (2013): the delay is 10 and the forgetting factor is 0.85.

Within svi, lda enjoys coordinate ascent variational updates; we use five inner steps to optimize the local variables. For the other models, we use amortized inference over the local variables θ_d. We use 3-layer inference networks and we set the local learning rate to 0.002. We use ℓ2 regularization on the variational parameters (the weight decay parameter is 1.2 × 10^−6).

      Qualitative Results.

We first examine the embeddings. The etm, nvdm, Δ-nvdm, and prodlda all learn word embeddings. We illustrate them by fixing a set of terms and showing the closest words in the embedding space (as measured by cosine distance). For comparison, we also illustrate word embeddings learned by the skip-gram model.

Table 2 illustrates the embeddings of the different models. All the methods provide interpretable embeddings: words with related meanings are close to each other. The etm, the nvdm, and prodlda learn embeddings that are similar to those from the skip-gram. The embeddings of Δ-nvdm are different; the simplex constraint on the local variable and the inference procedure change the nature of the embeddings.
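The neighborhoods in Table 2 can be computed with a routine like the following; it is a sketch of the cosine-similarity lookup described above, with a hypothetical embeddings matrix and vocabulary list standing in for any of the fitted models.

```python
import numpy as np

def nearest_neighbors(query, vocab, embeddings, topn=5):
    """Return the topn words closest to `query` under cosine similarity.

    vocab: list of V words; embeddings: V x L array of word vectors (rows).
    """
    idx = vocab.index(query)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[idx]
    order = np.argsort(-sims)
    return [vocab[i] for i in order if i != idx][:topn]

# Toy usage with random vectors; with a fitted model this reproduces columns of Table 2
rng = np.random.default_rng(0)
vocab = ["love", "family", "woman", "politics", "passion", "mother"]
embeddings = rng.normal(size=(len(vocab), 8))
print(nearest_neighbors("love", vocab, embeddings))
```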

Table 2: Word embeddings learned by all document models (and skip-gram) on the New York Times with vocabulary size 118,363.

Skip-gram embeddings:
love       family        woman     politics
loved      families      man       political
passion    grandparents  girl      religion
loves      mother        boy       politicking
affection  friends       teenager  ideology
adore      relatives     person    partisanship

etm embeddings:
love       family    woman     politics
joy        children  girl      political
loves      son       boy       politician
loved      mother    mother    ideology
passion    father    daughter  speeches
wonderful  wife      pregnant  ideological

nvdm embeddings:
love       family   woman      politics
loves      sons     girl       political
passion    life     women      politician
wonderful  brother  man        politicians
joy        son      pregnant   politically
beautiful  lived    boyfriend  democratic

Δ-nvdm embeddings:
love     family  woman     politics
miss     home    life      political
young    father  marriage  faith
born     son     women     marriage
dream    day     read      politicians
younger  mrs     young     election

prodlda embeddings:
love         family     woman      politics
loves        husband    girl       political
affection    wife       boyfriend  politician
sentimental  daughters  boy        liberal
dreams       sister     teenager   politicians
laugh        friends    ager       ideological

We next look at the learned topics. Table 3 displays the seven most used topics for all methods, as given by the average of the topic proportions θ_d. lda and both variants of the etm provide interpretable topics. The rest of the models do not provide interpretable topics; their matrices β are unconstrained and thus are not interpretable as distributions over the vocabulary that mix to form documents. Δ-nvdm also suffers from this effect, although it is less apparent (see, e.g., the fifth listed topic for Δ-nvdm).

Table 3: Top five words of the seven most used topics from different document models on 1.8M documents of the New York Times corpus with vocabulary size 212,237 and K = 300 topics.

LDA:
time  year     officials   mr         city      percent  state
day   million  public      president  building  million  republican
back  money    department  bush       street    company  party
good  pay      report      white      park      year     bill
long  tax      state       clinton    house     billion  mr

nvdm:
scholars      japan    gansler        spratt    assn   ridership     pryce
gingrich      tokyo    wellstone      tabitha   assoc  mtv           mickens
funds         pacific  mccain         mccorkle  qtr    straphangers  mckechnie
institutions  europe   shalikashvili  cheetos   yr     freierman     mfume
endowment     zealand  coached        vols      nyse   riders        filkins

Δ-nvdm:
concerto  servings     nato       innings    treas     patients    democrats
solos     tablespoons  soviet     scored     yr        doctors     republicans
sonata    tablespoon   iraqi      inning     qtr       medicare    republican
melodies  preheat      gorbachev  shutout    outst     dr          senate
soloist   minced       arab       scoreless  telerate  physicians  dole

prodlda:
temptation  grasp   electron  played  amato     briefly   giant
repressed   unruly  nuclei    lou     model     precious  boarding
drowsy      choke   macal     greg    delaware  serving   bundle
addiction   drowsy  trained   bobby   morita    set       distance
conquering  drift   mediaone  steve   dual      virgin    foray

prodlda-PWE:
mercies  cheesecloth  scoreless  chapels    distinguishable  floured       gillers
lockbox  overcook     floured    magnolias  cocktails        impartiality  lacerated
pharm    strainer     hitless    asea       punishable       knead         polshek
shims    kirberger    asterisk   bogeyed    checkpoints      refrigerate   decimated
cp       browned      knead      birdie     disobeying       tablespoons   inhuman

etm-PWE:
music    republican  yankees   game    wine         court    company
dance    bush        game      points  restaurant   judge    million
songs    campaign    baseball  season  food         case     stock
opera    senator     season    team    dishes       justice  shares
concert  democrats   mets      play    restaurants  trial    billion

etm:
game    music  united      wine        company    yankees   art
team    mr     israel      food        stock      game      museum
season  dance  government  sauce       million    baseball  show
coach   opera  israeli     minutes     companies  mets      work
play    band   mr          restaurant  billion    season    artist

      Quantitative Results.

We next study the models quantitatively. We measure the quality of the topics and the predictive performance of the model. We found that, among the models with interpretable topics, the etm provides the best predictions.

We measure topic quality by blending two metrics: topic coherence and topic diversity. Topic coherence is a quantitative measure of the interpretability of a topic (Mimno et al., 2011). It is the average pointwise mutual information of two words drawn randomly from the same document,
\mathrm{TC} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{45} \sum_{i=1}^{10} \sum_{j=i+1}^{10} f\big(w_i^{(k)}, w_j^{(k)}\big),
where \{w_1^{(k)}, \ldots, w_{10}^{(k)}\} denotes the top-10 most likely words in topic k. We choose f(·, ·) as the normalized pointwise mutual information (Bouma, 2009; Lau et al., 2014),
f(w_i, w_j) = \frac{\log \frac{P(w_i, w_j)}{P(w_i)\, P(w_j)}}{-\log P(w_i, w_j)}.
Here, P(w_i, w_j) is the probability of words w_i and w_j co-occurring in a document and P(w_i) is the marginal probability of word w_i. We approximate these probabilities with empirical counts.
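A sketch of this coherence computation over a tokenized corpus is shown below; the document list and the single top-word list are toy placeholders rather than data from the paper.

```python
import itertools
import numpy as np

def topic_coherence(topics, documents):
    """Average NPMI over all pairs of each topic's top words.

    topics: list of K lists of top-10 words; documents: list of sets of words.
    """
    D = len(documents)

    def doc_prob(*words):
        return sum(all(w in doc for w in words) for doc in documents) / D

    scores = []
    for top_words in topics:
        pair_scores = []
        for w_i, w_j in itertools.combinations(top_words, 2):
            p_ij = doc_prob(w_i, w_j)
            if p_ij == 0.0:
                pair_scores.append(0.0)  # convention here for pairs that never co-occur
                continue
            p_i, p_j = doc_prob(w_i), doc_prob(w_j)
            pair_scores.append(np.log(p_ij / (p_i * p_j)) / -np.log(p_ij))
        scores.append(np.mean(pair_scores))  # 45 pairs when there are 10 top words
    return float(np.mean(scores))

docs = [{"game", "team", "season"}, {"wine", "food", "restaurant"}, {"game", "season", "play"}]
print(topic_coherence([["game", "team", "season"]], docs))
```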

      The idea behind topic coherence is that a coherent topic will display words that tend to occur in the same documents. In other words, the most likely words in a coherent topic should have high mutual information. Document models with higher topic coherence are more interpretable topic models.

      We combine coherence with a second metric, topic diversity. We define topic diversity to be the percentage of unique words in the top 25 words of all topics. Diversity close to 0 indicates redundant topics; diversity close to 1 indicates more varied topics.

      We define the overall quality of a model’s topics as the product of its topic diversity and topic coherence.
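Topic diversity and the combined quality score take only a few lines; this sketch assumes `topics` is a list of per-topic ranked word lists.

```python
def topic_diversity(topics, topn=25):
    """Fraction of unique words among the top `topn` words of all topics."""
    top_words = [w for topic in topics for w in topic[:topn]]
    return len(set(top_words)) / len(top_words)

def topic_quality(coherence, topics, topn=25):
    """Overall quality: topic coherence times topic diversity."""
    return coherence * topic_diversity(topics, topn)

# Toy usage: two topics that share one word in their top-3 lists
topics = [["game", "team", "season"], ["wine", "food", "game"]]
print(topic_diversity(topics, topn=3))  # 5 unique words / 6 total ≈ 0.83
```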

A good topic model also provides a good distribution of language. To measure predictive power, we calculate log-likelihood on a document completion task (Rosen-Zvi et al., 2004; Wallach et al., 2009b). We divide each test document into two sets of words. The first half is observed: it induces a distribution over topics which, in turn, induces a distribution over the next words in the document. We then evaluate the second half under this distribution. A good document model should provide high log-likelihood on the second half. (For all methods, we approximate the likelihood by setting θ_d to the variational mean.)
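The document-completion evaluation can be sketched as follows, assuming a fitted topic matrix beta (K x V) and topic proportions inferred from the observed half; both quantities below are random placeholders, not a fitted model.

```python
import numpy as np

def completion_log_likelihood(theta_d, beta, second_half_word_ids):
    """Log-likelihood of the held-out half of a document.

    theta_d: topic proportions inferred from the first half (K,);
    beta: topic-word matrix (K x V); second_half_word_ids: list of word indices.
    """
    word_probs = theta_d @ beta  # predictive distribution over the vocabulary
    return float(np.sum(np.log(word_probs[second_half_word_ids] + 1e-12)))

# Toy usage with random quantities standing in for a fitted model
rng = np.random.default_rng(0)
K, V = 3, 10
beta = rng.dirichlet(np.ones(V), size=K)  # K x V
theta_d = rng.dirichlet(np.ones(K))       # e.g., the variational mean for the first half
print(completion_log_likelihood(theta_d, beta, [1, 4, 4, 7]))
```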

We study both corpora with different vocabularies. Figures 4 and 5 show interpretability of the topics as a function of predictive power. (To ease visualization, we exponentiate the topic quality and normalize all metrics by subtracting the mean and dividing by the standard deviation across methods.) The best models are on the upper right corner.

Figure 4: Interpretability as measured by the exponentiated topic quality (the higher the better) vs. predictive performance as measured by log-likelihood on document completion (the higher the better) on the 20NewsGroup dataset. Both interpretability and predictive power metrics are normalized by subtracting the mean and dividing by the standard deviation across models. Better models are on the top right corner. Overall, the etm is a better topic model.

Figure 5: Interpretability as measured by the exponentiated topic quality (the higher the better) vs. predictive performance as measured by log-likelihood on document completion (the higher the better) on the New York Times dataset. Both interpretability and predictive power metrics are normalized by subtracting the mean and dividing by the standard deviation across models. Better models are on the top right corner. Overall, the etm is a better topic model.

lda predicts worst in almost all settings. On the 20NewsGroups, the nvdm's predictions are in general better than lda's but worse than the other methods'; on the New York Times, the nvdm gives the best predictions. However, topic quality for the nvdm is far below the other methods. (It does not provide "topics", so we assess the interpretability of its β matrix.) In prediction, both versions of the etm are at least as good as the simplex-constrained Δ-nvdm. More importantly, both versions of the etm outperform the prodlda-PWE, signaling that the etm provides a better way of integrating word embeddings into a topic model.

These figures show that, of the interpretable models, the etm provides the best predictive performance while keeping interpretable topics. It is robust to large vocabularies.

      6.1 Stop Words

We now study a version of the New York Times corpus that includes all stop words. We remove infrequent words to form a vocabulary of size 10,283. Our goal is to show that the etm-PWE provides interpretable topics even in the presence of stop words, another regime where topic models typically fail. In particular, given that stop words appear in many documents, traditional topic models learn topics that contain stop words, regardless of the actual semantics of the topic. This leads to poor topic interpretability. There are extensions of topic models specifically designed to cope with stop words (Griffiths et al., 2004; Chemudugunta et al., 2006; Wallach et al., 2009a); our goal here is not to establish comparisons with these methods but to show the performance of the etm-PWE in the presence of stop words.

We fit lda, the Δ-nvdm, the prodlda-PWE, and the etm-PWE with K = 300 topics. (We do not report the nvdm because it does not provide interpretable topics.) Table 4 shows the topic quality (the product of topic coherence and topic diversity). Overall, the etm-PWE gives the best performance in terms of topic quality.

Table 4: Topic quality on the New York Times data in the presence of stop words. Topic quality here is given by the product of topic coherence (tc) and topic diversity (td); higher is better. The etm-PWE is robust to stop words; it achieves a topic coherence similar to when there are no stop words.

Model         tc    td    Quality
lda           0.13  0.14  0.0182
Δ-nvdm        0.17  0.11  0.0187
prodlda-PWE   0.03  0.53  0.0159
etm-PWE       0.18  0.22  0.0396

While the etm has a few "stop topics" that are specific to stop words (see, e.g., Figure 6), Δ-nvdm and lda have stop words in almost every topic. (The topics are not displayed here for space constraints.) The reason is that stop words co-occur in the same documents as every other word; therefore traditional topic models have difficulties telling apart content words and stop words. The etm-PWE recognizes the location of stop words in the embedding space; it sets them off in their own topics.

Figure 6: A topic containing stop words found by the etm-PWE on The New York Times. The etm is robust even in the presence of stop words.

      7 Conclusion

We developed the etm, a generative model of documents that marries lda with word embeddings. The etm assumes that topics and words live in the same embedding space, and that words are generated from a categorical distribution whose natural parameter is the inner product of the word embeddings and the embedding of the assigned topic.

The etm learns interpretable word embeddings and topics, even in corpora with large vocabularies. We studied the performance of the etm against several document models. The etm learns both coherent patterns of language and an accurate distribution of words.

      Acknowledgments

      DB and AD are supported by ONR N00014-17-1-2131, ONR N00014-15-1-2209, NIH 1U01MH115727-01, NSF CCF-1740833, DARPA SD2 FA8750-18-C-0130, Amazon, NVIDIA, and the Simons Foundation. FR received funding from the EU’s Horizon 2020 R&I programme under the Marie Skłodowska-Curie grant agreement 706760. AD is supported by a Google PhD Fellowship.

      Notes

1. Code is available at https://github.com/adjidieng/ETM.

      References

John Aitchison and Shir Ming Shen. 1980. Logistic normal distributions: Some properties and uses. Biometrika, 67(2):261–272.

Kayhan Batmanghelich, Ardavan Saeedi, Karthik Narasimhan, and Sam Gershman. 2016. Nonparametric spherical topic modeling with word embeddings. In Association for Computational Linguistics, volume 2016, page 537.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning.

David M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4):77–84.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. 2017. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.

David M. Blei and Jon D. Lafferty. 2007. A correlated topic model of Science. The Annals of Applied Statistics, 1(1):17–35.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Gerlof Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. In German Society for Computational Linguistics and Language Technology Conference.

Jordan Boyd-Graber, Yuening Hu, and David Mimno. 2017. Applications of topic models. Foundations and Trends in Information Retrieval, 11(2–3):143–296.

Stefan Bunk and Ralf Krestel. 2018. WELDA: Enhancing topic models by incorporating local word context. In ACM/IEEE Joint Conference on Digital Libraries.

Dallas Card, Chenhao Tan, and Noah A. Smith. 2017. A neural framework for generalized topic models. arXiv:1705.09296.

Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. 2006. Modeling general and specific aspects of documents with a probabilistic topic model. In Advances in Neural Information Processing Systems.

Yulai Cong, Bo C. Chen, Hongwei Liu, and Mingyuan Zhou. 2017. Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In International Conference on Machine Learning.

Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian LDA for topic models with word embeddings. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Samuel J. Gershman and Noah D. Goodman. 2014. Amortized inference in probabilistic reasoning. In Annual Meeting of the Cognitive Science Society.

Thomas L. Griffiths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. 2004. Integrating topics and syntax. In Advances in Neural Information Processing Systems.

Zellig S. Harris. 1954. Distributional structure. Word, 10(2–3):146–162.

Matthew D. Hoffman, David M. Blei, and Francis Bach. 2010. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.

Kamrun Naher Keya, Yannis Papanikolaou, and James R. Foulds. 2019. Neural embedding allocation: Distributed representations of topic models. arXiv preprint arXiv:1909.04702.

Diederik P. Kingma and Jimmy L. Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In International Conference on Learning Representations.

Jey H. Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Conference of the European Chapter of the Association for Computational Linguistics.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Neural Information Processing Systems.

Shaohua Li, Tat-Seng Chua, Jun Zhu, and Chunyan Miao. 2016. Generative topic embedding: A continuous representation of documents. In Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Yang Li and Tao Yang. 2018. Word Embedding for Understanding Natural Language: A Survey. Springer International Publishing.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In International Conference on Machine Learning.

Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems.

David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Conference on Empirical Methods in Natural Language Processing.

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Neural Information Processing Systems.

Christopher E. Moody. 2016. Mixing Dirichlet topic models and word embeddings to make LDA2vec. arXiv:1605.02019.

Dat Q. Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 3:299–313.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Conference on Empirical Methods on Natural Language Processing.

James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, and Alex J. Smola. 2010. Word features for latent Dirichlet allocation. In Advances in Neural Information Processing Systems.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.

Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Uncertainty in Artificial Intelligence.

Maja Rudolph, Francisco J. R. Ruiz, Stephan Mandt, and David M. Blei. 2016. Exponential family embeddings. In Advances in Neural Information Processing Systems.

David E. Rumelhart and Adele A. Abrahamson. 1973. A model for analogical reasoning. Cognitive Psychology, 5(1):1–28.

Bei Shi, Wai Lam, Shoaib Jameel, Steven Schockaert, and Kwun P. Lai. 2017. Jointly learning word embeddings and latent topics. In ACM SIGIR Conference on Research and Development in Information Retrieval.

Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. In International Conference on Learning Representations.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever
      , and
      Ruslan
      Salakhutdinov
      .
      2014
      .
      Dropout: a simple way to prevent neural networks from overfitting
      .
      Journal of Machine Learning Research
      ,
      15
      (
      1
      ):
      1929
      1958
      .
      Michalis K.
      Titsias
      and
      Miguel
      Lázaro-Gredilla
      .
      2014
      .
      Doubly stochastic variational Bayes for non-conjugate inference
      . In
      International Conference on Machine Learning
      .
      Hanna M.
      Wallach
      ,
      David M.
      Mimno
      , and
      Andrew
      McCallum
      .
      2009a
      .
      Rethinking LDA: Why priors matter
      . In
      Advances in Neural Information Processing Systems
      .
      Hanna M.
      Wallach
      ,
      Iain
      Murray
      ,
      Ruslan
      Salakhutdinov
      , and
      David
      Mimno
      .
      2009b
      .
      Evaluation methods for topic models
      . In
      International Conference on Machine Learning
      .
      Pengtao
      Xie
      ,
      Diyi
      Yang
      , and
      Eric
      Xing
      .
      2015
      .
      Incorporating word correlation knowledge into topic modeling
      . In
      Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
      .
      Hongteng
      Xu
      ,
      Wenlin
      Wang
      ,
      Wei
      Liu
      , and
      Lawrence
      Carin
      .
      2018
      .
      Distilled Wasserstein learning for word embedding and topic modeling
      . In
      Advances in Neural Information Processing Systems
      .
      Guangxu
      Xun
      ,
      Vishrawas
      Gopalakrishnan
      ,
      Fenglong
      Ma
      ,
      Yaliang
      Li
      ,
      Jing
      Gao
      , and
      Aidong
      Zhang
      .
      2016
      .
      Topic discovery for short texts using word embeddings
      . In
      IEEE International Conference on Data Mining
      .
      Guangxu
      Xun
      ,
      Yaliang
      Li
      ,
      Wayne Xin
      Zhao
      ,
      Jing
      Gao
      , and
      Aidong
      Zhang
      .
      2017
      .
      A correlated topic model using word embeddings.
      In
      Joint Conference on Artificial Intelligence
      .
      Hao
      Zhang
      ,
      Bo
      Chen
      ,
      Dandan
      Guo
      , and
      Mingyuan
      Zhou
      .
      2018
      .
      WHAI: Weibull hybrid autoencoding inference for deep topic modeling
      . In
      International Conference on Learning Representations
      .
      He
      Zhao
      ,
      Lan
      Du
      , and
      Wray
      Buntine
      .
      2017a
      .
      A word embeddings informed focused topic model
      . In
      Asian Conference on Machine Learning
      .
      He
      Zhao
      ,
      Lan
      Du
      ,
      Wray
      Buntine
      , and
      Gang
      Liu
      .
      2017b
      .
      MetaLDA: A topic model that efficiently incorporates meta information
      . In
      IEEE International Conference on Data Mining
      .
