This is an R package wrapping the C++ code available at https://github.com/xiaohuiyan/BTM for constructing a Biterm Topic Model (BTM). This model models word-word co-occurrence patterns (e.g., biterms).
Topic modelling using biterms is particularly suited for finding topics in short texts (such as short survey answers or Twitter data).
This R package is on CRAN, just install it with `install.packages('BTM')`.
The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (e.g., biterms). A biterm is generated by drawing two words independently from the same topic z. In other words, the distribution of a biterm b = (wi, wj) is defined as P(b) = sum_z { P(wi|z) * P(wj|z) * P(z) }, where the sum runs over the k topics you want to extract. Estimates are provided for P(w|z) = phi and P(z) = theta. More detail can be found in the following paper:
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW 2013. https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
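As a small numeric illustration of the biterm probability above (purely hypothetical phi and theta values, not taken from a fitted model):

```r
## Hypothetical example with 2 topics: P(b) = sum_z P(wi|z) * P(wj|z) * P(z)
phi_wi <- c(0.02, 0.10)  # P(wi | z) for topic 1 and topic 2
phi_wj <- c(0.05, 0.08)  # P(wj | z) for topic 1 and topic 2
theta  <- c(0.60, 0.40)  # P(z), the topic proportions
p_biterm <- sum(phi_wi * phi_wj * theta)
p_biterm
## gives 0.0038
```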

```r
library(udpipe)
library(BTM)
data("brussels_reviews_anno", package = "udpipe")
## Taking only nouns of Dutch data
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
## Building the model
set.seed(321)
model <- BTM(x, k = 3, beta = 0.01, iter = 1000, trace = 100)
## Inspect the model - topic frequency + conditional term probabilities
model$theta
[1] 0.3406998 0.2413721 0.4179281
topicterms <- terms(model, top_n = 10)
topicterms
[[1]]
         token probability
1  appartement  0.06168297
2      brussel  0.04057012
3        kamer  0.02372442
4      centrum  0.01550855
5      locatie  0.01547671
6         stad  0.01229227
7        buurt  0.01181460
8     verblijf  0.01155985
9         huis  0.01111402
10         dag  0.01041345

[[2]]
         token probability
1  appartement  0.05687312
2      brussel  0.01888307
3        buurt  0.01883812
4        kamer  0.01465696
5     verblijf  0.01339812
6     badkamer  0.01285862
7   slaapkamer  0.01276870
8          dag  0.01213928
9          bed  0.01195945
10        raam  0.01164474

[[3]]
         token probability
1  appartement 0.061804812
2      brussel 0.035873377
3      centrum 0.022193831
4         huis 0.020091282
5        buurt 0.019935537
6     verblijf 0.018611710
7     aanrader 0.014614272
8        kamer 0.011447470
9      locatie 0.010902365
10      keuken 0.009448751

scores <- predict(model, newdata = x)
```
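The `scores` object holds, for each document in `newdata`, the likelihood of every topic. A minimal sketch for picking the single most likely topic per document, assuming a documents-by-topics matrix of probabilities with the doc_id as row names:

```r
## Assumes scores has one row per doc_id and one column per topic
best_topic <- apply(scores, MARGIN = 1, FUN = which.max)
table(best_topic)
```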
Make a specific topic called the background

```r
## If you set background to TRUE, the first topic is set to a background topic that
## equals the empirical word distribution. This can be used to filter out common words.
set.seed(321)
model <- BTM(x, k = 5, beta = 0.01, background = TRUE, iter = 1000, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms
## Visualise the model (uses the textplot, ggraph and concaveman packages)
library(textplot)
library(ggraph)
library(concaveman)
plot(model)
```
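Since the first topic now follows the empirical word distribution, its top terms can serve as a corpus-specific list of common words to filter out. A small sketch, assuming `terms()` returns a list with one data.frame (token, probability) per topic as shown above:

```r
## Topic 1 is the background topic when the model was fit with background = TRUE
background_terms <- terms(model, top_n = 25)[[1]]$token
head(background_terms)
```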
An interesting use case of this package is to cluster on word pairs you define yourself, for example co-occurrences of nouns/adjectives or dependency relationships as in the examples below. This can be done by providing your own set of biterms to cluster upon.
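Such a biterms data.frame contains one row per word pair within a document; its layout mirrors what the dependency-relationship example below builds by hand (the values here are made up for illustration):

```r
## Toy illustration of a user-supplied biterms data.frame:
## one row per word pair occurring in a document
biterms <- data.frame(doc_id = c("doc1", "doc1", "doc2"),
                      term1  = c("appartement", "kamer", "locatie"),
                      term2  = c("centrum", "bed", "buurt"),
                      cooc   = c(1, 1, 1),
                      stringsAsFactors = FALSE)
```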
Example clustering co-occurrences of nouns/adjectives
```r
library(data.table)
library(udpipe)
## Annotate text with parts of speech tags
data("brussels_reviews", package = "udpipe")
anno <- subset(brussels_reviews, language %in% "nl")
anno <- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)
anno <- udpipe(anno, "dutch", trace = 10)
## Get cooccurrences of nouns / adjectives and proper nouns
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = upos %in% c("NOUN", "PROPN", "ADJ"),
                                  skipgram = 2),
                   by = list(doc_id)]
## Build the model
set.seed(123456)
x <- subset(anno, upos %in% c("NOUN", "PROPN", "ADJ"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, beta = 0.01, iter = 2000, background = TRUE,
             biterms = biterms, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms
```
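To see which word pairs the `cooccurrence()` call above actually handed to `BTM()`, a quick check (the columns doc_id, term1, term2 and cooc follow the output of udpipe's cooccurrence function):

```r
## Word pairs with the highest within-document co-occurrence counts
head(biterms[order(biterms$cooc, decreasing = TRUE), ])
```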
Example clustering dependency relationships

```r
library(udpipe)
library(tm)
library(data.table)
data("brussels_reviews", package = "udpipe")
exclude <- stopwords("nl")
## Do annotation on Dutch text
anno <- subset(brussels_reviews, language %in% "nl")
anno <- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)
anno <- udpipe(anno, "dutch", trace = 10)
anno <- setDT(anno)
anno <- merge(anno, anno,
              by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"),
              by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"),
              all.x = TRUE, all.y = FALSE,
              suffixes = c("", "_parent"), sort = FALSE)
## Specify a set of relationships you are interested in (e.g. objects of a verb)
anno$relevant <- anno$dep_rel %in% c("obj") & !is.na(anno$lemma_parent)
biterms <- subset(anno, relevant == TRUE)
biterms <- data.frame(doc_id = biterms$doc_id,
                      term1 = biterms$lemma,
                      term2 = biterms$lemma_parent,
                      cooc = 1,
                      stringsAsFactors = FALSE)
biterms <- subset(biterms, !term1 %in% exclude & !term2 %in% exclude)
## Put in x only terms which were used in the biterms object such that frequency stats of terms can be computed in BTM
anno <- anno[, keep := relevant | (token_id %in% head_token_id[relevant == TRUE]),
             by = list(doc_id, paragraph_id, sentence_id)]
x <- subset(anno, keep == TRUE, select = c("doc_id", "lemma"))
x <- subset(x, !lemma %in% exclude)
## Build the topic model
model <- BTM(data = x, biterms = biterms, k = 6, iter = 2000, background = FALSE, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms
```

Need support in text mining? Contact BNOSAC: http://www.bnosac.be