
Term Frequency and Inverse Document Frequency (tf-idf) Using Tidy Data Principles

Julia Silge and David Robinson

2025-07-24

A central question in text mining and natural language processing is how to quantify what a document is about. Can we do this by looking at the words that make up the document? One measure of how important a word may be is its term frequency (tf), how frequently a word occurs in a document. There are words in a document, however, that occur many times but may not be important; in English, these are probably words like “the”, “is”, “of”, and so forth. We might take the approach of adding words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others. A list of stop words is not a sophisticated approach to adjusting term frequency for commonly used words.

Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf, the frequency of a term adjusted for how rarely it is used. It is intended to measure how important a word is to a document in a collection (or corpus) of documents. It is a rule-of-thumb or heuristic quantity; while it has proved useful in text mining, search engines, etc., its theoretical foundations are considered less than firm by information theory experts. The inverse document frequency for any given term is defined as

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]
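As a minimal sketch of this definition, the following base R snippet computes idf by hand for a toy corpus. The three tiny "documents" and the names `docs`, `n_containing`, and `idf` are made up for illustration and are not part of the Jane Austen data used below.

```r
# Toy corpus (hypothetical, for illustration only): three tiny documents
docs <- list(
  doc1 = c("the", "cat", "sat"),
  doc2 = c("the", "dog", "sat"),
  doc3 = c("the", "bird", "flew")
)

n_documents <- length(docs)

# For each distinct term, count how many documents contain it
terms <- sort(unique(unlist(docs)))
n_containing <- sapply(terms, function(term) {
  sum(sapply(docs, function(words) term %in% words))
})

# idf(term) = ln(n_documents / n_documents containing term)
idf <- log(n_documents / n_containing)

round(idf, 3)
```

In this toy corpus, “the” appears in every document, so its idf is the natural log of 1, which is zero. Terms that appear in only one document get the largest idf.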

We can use tidy data principles, as described in the main vignette, to approach tf-idf analysis and use consistent, effective tools to quantify how important various terms are in a document that is part of a collection.

Let’s look at the published novels of Jane Austen and examine first term frequency, then tf-idf. We can start just by using dplyr verbs such as group_by and join. What are the most commonly used words in Jane Austen’s novels? (Let’s also calculate the total words in each novel here, for later use.)

library(dplyr)
library(janeaustenr)
library(tidytext)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)

book_words

## # A tibble: 40,379 × 4
##    book              word      n  total
##    <fct>             <chr> <int>  <int>
##  1 Mansfield Park    the    6206 160460
##  2 Mansfield Park    to     5475 160460
##  3 Mansfield Park    and    5438 160460
##  4 Emma              to     5239 160996
##  5 Emma              the    5201 160996
##  6 Emma              and    4896 160996
##  7 Mansfield Park    of     4778 160460
##  8 Pride & Prejudice the    4331 122204
##  9 Emma              of     4291 160996
## 10 Pride & Prejudice to     4162 122204
## # ℹ 40,369 more rows

The usual suspects are here, “the”, “and”, “to”, and so forth. Let’s look at the distribution of n/total for each novel, the number of times a word appears in a novel divided by the total number of terms (words) in that novel. This is exactly what term frequency is.

library(ggplot2)

ggplot(book_words, aes(n / total, fill = book)) +
  geom_histogram(show.legend = FALSE) +
  scale_x_continuous(limits = c(NA, 0.0009)) +
  facet_wrap(vars(book), ncol = 2, scales = "free_y")

Histograms for word counts in Jane Austen's novels

There are very long tails to the right for these novels (those extremely common words!) that we have not shown in these plots. These plots exhibit similar distributions for all the novels, with many words that occur rarely and fewer words that occur frequently. The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the group of Jane Austen’s novels as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Let’s do that now.

book_words <- book_words %>%
  bind_tf_idf(word, book, n)

book_words

## # A tibble: 40,379 × 7
##    book              word      n  total     tf   idf tf_idf
##    <fct>             <chr> <int>  <int>  <dbl> <dbl>  <dbl>
##  1 Mansfield Park    the    6206 160460 0.0387     0      0
##  2 Mansfield Park    to     5475 160460 0.0341     0      0
##  3 Mansfield Park    and    5438 160460 0.0339     0      0
##  4 Emma              to     5239 160996 0.0325     0      0
##  5 Emma              the    5201 160996 0.0323     0      0
##  6 Emma              and    4896 160996 0.0304     0      0
##  7 Mansfield Park    of     4778 160460 0.0298     0      0
##  8 Pride & Prejudice the    4331 122204 0.0354     0      0
##  9 Emma              of     4291 160996 0.0267     0      0
## 10 Pride & Prejudice to     4162 122204 0.0341     0      0
## # ℹ 40,369 more rows

Notice that idf and thus tf-idf are zero for these extremely common words. These are all words that appear in all six of Jane Austen’s novels, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection. Let’s look at terms with high tf-idf in Jane Austen’s works.
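Before looking at those terms, the arithmetic behind the tf, idf, and tf_idf columns can be checked on a small example. The sketch below is not part of the Austen analysis: `toy_words`, `toy_tf_idf`, `by_hand`, and `n_docs` are hypothetical names, and the counts are made up. It compares the output of bind_tf_idf with the same quantities computed by hand, assuming one row per word per document.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for three tiny documents (not the Austen data)
toy_words <- tibble(
  book = c("A", "A", "B", "B", "C", "C"),
  word = c("the", "ship", "the", "garden", "the", "ship"),
  n    = c(8, 2, 6, 4, 5, 5)
)

# tf, idf, and tf_idf as computed by tidytext
toy_tf_idf <- toy_words %>%
  bind_tf_idf(word, book, n)

# The same quantities computed by hand
n_docs <- n_distinct(toy_words$book)

by_hand <- toy_words %>%
  group_by(book) %>%
  mutate(tf = n / sum(n)) %>%                   # share of each word within its document
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(book))) %>%  # ln(documents / documents containing word)
  ungroup() %>%
  mutate(tf_idf = tf * idf)

# Compare the two tf_idf columns
all.equal(toy_tf_idf$tf_idf, by_hand$tf_idf)
```

Here “the” appears in all three toy documents, so its idf and tf-idf are zero, just as for the common words in the Austen output above. Returning to the Austen novels, the terms with the highest tf-idf are: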

book_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))

## # A tibble: 40,379 × 6
##    book                word          n      tf   idf  tf_idf
##    <fct>               <chr>     <int>   <dbl> <dbl>   <dbl>
##  1 Sense & Sensibility elinor      623 0.00519  1.79 0.00931
##  2 Sense & Sensibility marianne    492 0.00410  1.79 0.00735
##  3 Mansfield Park      crawford    493 0.00307  1.79 0.00551
##  4 Pride & Prejudice   darcy       373 0.00305  1.79 0.00547
##  5 Persuasion          elliot      254 0.00304  1.79 0.00544
##  6 Emma                emma        786 0.00488  1.10 0.00536
##  7 Northanger Abbey    tilney      196 0.00252  1.79 0.00452
##  8 Emma                weston      389 0.00242  1.79 0.00433
##  9 Pride & Prejudice   bennet      294 0.00241  1.79 0.00431
## 10 Persuasion          wentworth   191 0.00228  1.79 0.00409
## # ℹ 40,369 more rows

Here we see all proper nouns, names that are in fact important in these novels. None of them occur in all of the novels, and they are important, characteristic words for each text. Some of the values for idf are the same for different terms because there are 6 documents in this corpus and we are seeing the numerical value for \(\ln(6/1)\), \(\ln(6/2)\), etc. Let’s look specifically at Pride and Prejudice.

book_words %>%
  filter(book == "Pride & Prejudice") %>%
  select(-total) %>%
  arrange(desc(tf_idf))

## # A tibble: 6,538 × 6
##    book              word          n       tf   idf  tf_idf
##    <fct>             <chr>     <int>    <dbl> <dbl>   <dbl>
##  1 Pride & Prejudice darcy       373 0.00305  1.79  0.00547
##  2 Pride & Prejudice bennet      294 0.00241  1.79  0.00431
##  3 Pride & Prejudice bingley     257 0.00210  1.79  0.00377
##  4 Pride & Prejudice elizabeth   597 0.00489  0.693 0.00339
##  5 Pride & Prejudice wickham     162 0.00133  1.79  0.00238
##  6 Pride & Prejudice collins     156 0.00128  1.79  0.00229
##  7 Pride & Prejudice lydia       133 0.00109  1.79  0.00195
##  8 Pride & Prejudice lizzy        95 0.000777 1.79  0.00139
##  9 Pride & Prejudice longbourn    88 0.000720 1.79  0.00129
## 10 Pride & Prejudice gardiner     84 0.000687 1.79  0.00123
## # ℹ 6,528 more rows

These words are, as measured by tf-idf, the most important to Pride and Prejudice, and most readers would likely agree.
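
The idf values printed in these tables can be reproduced by hand from the definition. With six novels in the corpus, a term that appears in \(k\) of them has an idf of \(\ln(6/k)\); rounded to three significant figures,

\[\ln(6/1) \approx 1.79, \quad \ln(6/2) \approx 1.10, \quad \ln(6/3) \approx 0.693, \quad \ln(6/6) = 0\]

So an idf of 1.79 marks a word found in only one novel, while the 0.693 for “elizabeth” corresponds to a word found in three of the six. As a worked example using the counts shown above (373 occurrences of “darcy” out of 122204 words in Pride & Prejudice), the tf-idf for “darcy” is approximately

\[\frac{373}{122204} \times \ln(6) \approx 0.0030523 \times 1.7918 \approx 0.00547\]

which matches the value in the first row of the table.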
