
BTM - Biterm Topic Modelling for Short Text with R

This is an R package wrapping the C++ code available at https://github.com/xiaohuiyan/BTM for constructing a Biterm Topic Model (BTM). The model captures word-word co-occurrence patterns (i.e., biterms).

Topic modelling using biterms is particularly good at finding topics in short texts (such as short survey answers or Twitter data).

Installation

This R package is on CRAN, just install it with install.packages('BTM')

What

The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modelling word-word co-occurrence patterns (i.e., biterms).

A biterm consists of two words co-occurring in the same context, for example, in the same short text window.
BTM models the biterm occurrences in a corpus (unlike LDA models, which model the word occurrences in a document).
It’s a generative model. In the generation procedure, a biterm is generated by drawing two words independently from the same topic z. In other words, the distribution of a biterm b=(wi,wj) is defined as: P(b) = sum_z{P(wi|z)*P(wj|z)*P(z)}, where the sum runs over the k topics you want to extract.
Estimation of the topic model is done with the Gibbs sampling algorithm, which provides estimates for P(w|z)=phi and P(z)=theta.
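
To make the biterm probability formula concrete, the following base R sketch evaluates it for a toy model. The values of theta and phi are made up, the helper biterm_probability is a hypothetical name used only for illustration, and the layout of phi (one row per word, one column per topic) is an assumption of this sketch rather than something stated above.

```r
## Toy parameters (made-up values): 2 topics, vocabulary of 3 words
theta <- c(0.6, 0.4)                      # P(z), one entry per topic
phi   <- matrix(c(0.5, 0.3, 0.2,          # P(w|z = 1), first column
                  0.1, 0.2, 0.7),         # P(w|z = 2), second column
                nrow = 3,
                dimnames = list(c("kamer", "bed", "centrum"), NULL))

## P(b) = sum over topics z of P(wi|z) * P(wj|z) * P(z)
biterm_probability <- function(wi, wj, phi, theta) {
  sum(phi[wi, ] * phi[wj, ] * theta)
}

p <- biterm_probability("kamer", "bed", phi, theta)
print(p)
```

Both words are looked up in the same topic column before the sum over topics is taken, which mirrors the assumption that the two words of a biterm are drawn from the same topic.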

More details can be found in the following paper:

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013. https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf

Example

library(udpipe)
library(BTM)
data("brussels_reviews_anno", package = "udpipe")

## Taking only nouns of Dutch data
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]

## Building the model
set.seed(321)
model  <- BTM(x, k = 3, beta = 0.01, iter = 1000, trace = 100)

## Inspect the model - topic frequency + conditional term probabilities
model$theta
[1] 0.3406998 0.2413721 0.4179281

topicterms <- terms(model, top_n = 10)
topicterms
[[1]]
         token probability
1  appartement  0.06168297
2      brussel  0.04057012
3        kamer  0.02372442
4      centrum  0.01550855
5      locatie  0.01547671
6         stad  0.01229227
7        buurt  0.01181460
8     verblijf  0.01155985
9         huis  0.01111402
10         dag  0.01041345

[[2]]
         token probability
1  appartement  0.05687312
2      brussel  0.01888307
3        buurt  0.01883812
4        kamer  0.01465696
5     verblijf  0.01339812
6     badkamer  0.01285862
7   slaapkamer  0.01276870
8          dag  0.01213928
9          bed  0.01195945
10        raam  0.01164474

[[3]]
         token probability
1  appartement 0.061804812
2      brussel 0.035873377
3      centrum 0.022193831
4         huis 0.020091282
5        buurt 0.019935537
6     verblijf 0.018611710
7     aanrader 0.014614272
8        kamer 0.011447470
9      locatie 0.010902365
10      keuken 0.009448751

scores <- predict(model, newdata = x)
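
One way to use such scores is to assign each document to its highest-scoring topic. The sketch below assumes that the scores come as a matrix with one row per document (row names holding the document identifiers) and one column per topic. To keep it runnable without fitting a model, it uses a made-up stand-in matrix instead of real predict() output.

```r
## Toy stand-in for document/topic scores (made-up values):
## rows = documents, columns = topics
scores <- matrix(c(0.7, 0.2, 0.1,
                   0.1, 0.3, 0.6,
                   0.2, 0.5, 0.3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"), NULL))

## Column index of the highest score in each row
best_topic <- apply(scores, 1, which.max)

## One row per document with its most likely topic and that topic's score
assignment <- data.frame(doc_id = rownames(scores),
                         topic  = unname(best_topic),
                         score  = scores[cbind(seq_len(nrow(scores)), best_topic)],
                         stringsAsFactors = FALSE)
print(assignment)
```

Keeping the score next to the assigned topic makes it easy to spot documents where no single topic clearly dominates.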

Make a specific topic called the background

# If you set background to TRUE
# The first topic is set to a background topic that equals the empirical word distribution.
# This can be used to filter out common words.
set.seed(321)
model      <- BTM(x, k = 5, beta = 0.01, background = TRUE, iter = 1000, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms
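
As a rough illustration of what an empirical word distribution is, the sketch below computes the relative frequency of each word in a tiny made-up corpus using base R. It only illustrates the concept; how the package computes the background topic internally is not described here, and the toy data is an assumption of this example.

```r
## Toy tokens (made up), in the doc_id / lemma layout used in the examples
x_toy <- data.frame(doc_id = c("d1", "d1", "d2", "d2", "d2"),
                    lemma  = c("appartement", "brussel", "appartement",
                               "kamer", "appartement"),
                    stringsAsFactors = FALSE)

## Empirical word distribution: relative frequency of each word in the corpus
word_distribution <- sort(prop.table(table(x_toy$lemma)), decreasing = TRUE)
print(word_distribution)
```

Words that dominate this distribution, such as "appartement" in the toy data, are the kind of common words a background topic can absorb.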

Visualisation of your model

Visualisation can be done using the textplot package (https://github.com/bnosac/textplot), which is also available on CRAN (https://cran.r-project.org/package=textplot).
An example visualisation built on a model of all R packages from the Natural Language Processing and Machine Learning task views can be seen at https://www.bnosac.be/index.php/blog/98-biterm-topic-modelling-for-short-texts

library(textplot)
library(ggraph)
library(concaveman)
plot(model)

Provide your own set of biterms

An interesting use case of this package is to

cluster based on parts of speech tags like nouns and adjectives which can be found in the text in the neighbourhood of one another
cluster dependency relationships provided by NLP tools like udpipe (https://CRAN.R-project.org/package=udpipe)

This can be done by providing your own set of biterms to cluster upon.
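
As a minimal sketch of what such a set of biterms can look like, the following base R code builds a data frame of word pairs from made-up tokens. The column names doc_id, term1, term2 and cooc follow the dependency relationship example in this document. Pairing every word with every other word in the same document is a simplifying assumption of this sketch, and the toy tokens are invented.

```r
## Toy tokens per document (made up)
tokens <- list(d1 = c("mooi", "appartement", "centrum"),
               d2 = c("rustig", "buurt"))

## All unordered word pairs within each document, one row per pair
pairs <- lapply(names(tokens), function(id) {
  combos <- t(combn(tokens[[id]], 2))
  data.frame(doc_id = id,
             term1  = combos[, 1],
             term2  = combos[, 2],
             cooc   = 1,
             stringsAsFactors = FALSE)
})
biterms_toy <- do.call(rbind, pairs)
print(biterms_toy)
```

In practice the pairs would be restricted to words that are close to each other or linked by a relationship of interest, as the two examples in this document do.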

Example clustering cooccurrences of nouns/adjectives

library(data.table)
library(udpipe)
library(BTM)

## Annotate text with parts of speech tags
data("brussels_reviews", package = "udpipe")
anno <- subset(brussels_reviews, language %in% "nl")
anno <- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)
anno <- udpipe(anno, "dutch", trace = 10)

## Get cooccurrences of nouns / adjectives and proper nouns
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = upos %in% c("NOUN", "PROPN", "ADJ"),
                                  skipgram = 2),
                   by = list(doc_id)]

## Build the model
set.seed(123456)
x     <- subset(anno, upos %in% c("NOUN", "PROPN", "ADJ"))
x     <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, beta = 0.01, iter = 2000, background = TRUE,
             biterms = biterms, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms

Example clustering dependency relationships

library(udpipe)
library(tm)
library(data.table)
library(BTM)
data("brussels_reviews", package = "udpipe")
exclude <- stopwords("nl")

## Do annotation on Dutch text
anno <- subset(brussels_reviews, language %in% "nl")
anno <- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)
anno <- udpipe(anno, "dutch", trace = 10)
anno <- setDT(anno)
## Join each token to its dependency parent (columns suffixed with _parent)
anno <- merge(anno, anno,
              by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"),
              by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"),
              all.x = TRUE, all.y = FALSE, suffixes = c("", "_parent"), sort = FALSE)

## Specify a set of relationships you are interested in (e.g. objects of a verb)
anno$relevant <- anno$dep_rel %in% c("obj") & !is.na(anno$lemma_parent)
biterms <- subset(anno, relevant == TRUE)
biterms <- data.frame(doc_id = biterms$doc_id,
                      term1 = biterms$lemma,
                      term2 = biterms$lemma_parent,
                      cooc = 1,
                      stringsAsFactors = FALSE)
biterms <- subset(biterms, !term1 %in% exclude & !term2 %in% exclude)

## Put in x only terms which were used in the biterms object such that frequency stats of terms can be computed in BTM
anno <- anno[, keep := relevant | (token_id %in% head_token_id[relevant == TRUE]), by = list(doc_id, paragraph_id, sentence_id)]
x    <- subset(anno, keep == TRUE, select = c("doc_id", "lemma"))
x    <- subset(x, !lemma %in% exclude)

## Build the topic model
model <- BTM(data = x,
             biterms = biterms,
             k = 6, iter = 2000, background = FALSE, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be
