Weighted tidy log odds ratio ⚖️
juliasilge/tidylo
Authors: Julia Silge, Alex Hayes, Tyler Schnoebelen

License: MIT
How can we measure how the usage or frequency of some feature, such as words, differs across some group or set, such as documents? One option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability; we haven’t counted every feature the same number of times, so how do we know which differences are meaningful?
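To make the sampling-variability problem concrete, here is a minimal base R sketch of a plain (unweighted) log odds ratio. The counts, the word list, and the `odds()` helper are all hypothetical and invented for illustration, and the 0.5 added to each count is just one common way to avoid taking the log of zero.

```r
# Hypothetical toy counts: how often each word appears in two documents
counts <- data.frame(
  word  = c("the", "darcy", "zeugma"),
  doc_a = c(1000, 40, 2),
  doc_b = c(1000, 20, 0)
)

# Odds of each word within one document: count / (all other counts)
odds <- function(y) y / (sum(y) - y)

# Add 0.5 to every count so a zero count does not produce log(0)
counts$log_odds_ratio <-
  log(odds(counts$doc_a + 0.5)) - log(odds(counts$doc_b + 0.5))

# Sort from most "doc_a-like" to least
counts[order(-counts$log_odds_ratio), ]
```

In this toy data, "zeugma" (seen twice versus never) ranks above "darcy" (seen 40 times versus 20), even though there is far more evidence behind the second difference. This is the kind of ranking that a weighting scheme is meant to correct.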
Enter the weighted log odds, which tidylo provides an implementation for, using tidy data principles. In particular, here we use the method outlined in Monroe, Colaresi, and Quinn (2008) to weight the log odds ratio by a prior. By default, the prior is estimated from the data itself, an empirical Bayes approach, but an uninformative prior is also available.
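For orientation, the core quantities in Monroe, Colaresi, and Quinn (2008) can be written as follows. This is a summary of the paper's two-group comparison, not a statement of exactly what tidylo computes; the package's choice of comparison group and prior may differ in detail. Here $y_w^{(i)}$ is the count of feature $w$ in group $i$, $n^{(i)}$ is the total count in group $i$, $\alpha_w$ is the prior pseudo-count for feature $w$, and $\alpha_0 = \sum_w \alpha_w$.

$$
\hat{\delta}_w^{(i-j)} = \log\frac{y_w^{(i)} + \alpha_w}{n^{(i)} + \alpha_0 - y_w^{(i)} - \alpha_w} - \log\frac{y_w^{(j)} + \alpha_w}{n^{(j)} + \alpha_0 - y_w^{(j)} - \alpha_w}
$$

$$
\sigma^2\!\left(\hat{\delta}_w^{(i-j)}\right) \approx \frac{1}{y_w^{(i)} + \alpha_w} + \frac{1}{y_w^{(j)} + \alpha_w},
\qquad
\hat{\zeta}_w^{(i-j)} = \frac{\hat{\delta}_w^{(i-j)}}{\sqrt{\sigma^2\!\left(\hat{\delta}_w^{(i-j)}\right)}}
$$

The prior smooths the log odds ratio, and dividing by the approximate standard deviation shrinks differences that rest on very few observations.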
You can install the released version of tidylo from CRAN with:

```r
install.packages("tidylo")
```

Or you can install the development version from GitHub with devtools:

```r
# install.packages("devtools")
devtools::install_github("juliasilge/tidylo")
```
Using weighted log odds is a great approach for text analysis when we want to measure how word usage differs across a set of documents. Let’s explore the six published, completed novels of Jane Austen and use the tidytext package to count up the bigrams (sequences of two adjacent words) in each novel. This weighted log odds approach would work equally well for single words.
```r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)

tidy_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

bigram_counts <- tidy_bigrams %>%
  count(book, bigram, sort = TRUE)

bigram_counts
#> # A tibble: 300,903 × 3
#>    book                bigram     n
#>    <fct>               <chr>  <int>
#>  1 Mansfield Park      of the   712
#>  2 Mansfield Park      to be    612
#>  3 Emma                to be    586
#>  4 Mansfield Park      in the   533
#>  5 Emma                of the   529
#>  6 Pride & Prejudice   of the   439
#>  7 Emma                it was   430
#>  8 Pride & Prejudice   to be    422
#>  9 Sense & Sensibility to be    418
#> 10 Emma                in the   416
#> # … with 300,893 more rows
```
Now let’s use the `bind_log_odds()` function from the tidylo package to find the weighted log odds for each bigram. The weighted log odds computed by this function are also z-scores for the log odds; this quantity is useful for comparing frequencies across categories or sets, but its relationship to an odds ratio is not straightforward after the weighting.
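To give a feel for what a z-scored log odds looks like, here is a minimal base R sketch in the spirit of Monroe, Colaresi, and Quinn (2008). It is not tidylo's source code: the function name `weighted_log_odds_sketch()`, the toy counts, the "each set versus all other sets" comparison, and the choice of each feature's total count as the prior are all assumptions made for illustration.

```r
# Hypothetical toy counts: rows are features, columns are sets
counts <- matrix(
  c(1000, 40, 2,    # set A
    1000, 20, 0),   # set B
  nrow = 3,
  dimnames = list(c("the", "darcy", "zeugma"), c("A", "B"))
)

# Sketch only: compares each set against all other sets combined.
# `alpha` is the prior pseudo-count per feature; using the feature totals
# is an assumed, data-driven (empirical Bayes style) choice.
weighted_log_odds_sketch <- function(counts, alpha = rowSums(counts)) {
  n_set   <- colSums(counts)   # total count in each set
  y_total <- rowSums(counts)   # total count of each feature
  alpha0  <- sum(alpha)        # total prior mass
  z <- counts * 0              # same shape and dimnames, filled below
  for (i in seq_len(ncol(counts))) {
    y_i <- counts[, i]
    y_j <- y_total - y_i               # counts in all other sets
    n_i <- n_set[[i]]
    n_j <- sum(n_set) - n_i
    # Difference of prior-smoothed log odds
    delta <- log((y_i + alpha) / (n_i + alpha0 - y_i - alpha)) -
      log((y_j + alpha) / (n_j + alpha0 - y_j - alpha))
    # Approximate variance of that difference
    sigma2 <- 1 / (y_i + alpha) + 1 / (y_j + alpha)
    z[, i] <- delta / sqrt(sigma2)     # z-score
  }
  z
}

z <- weighted_log_odds_sketch(counts)
z
```

In this toy example the rarely seen "zeugma" has a larger smoothed log odds difference for set A than "darcy" does, but a smaller z-score, because its variance term is much larger. That illustrates why the result is useful for ranking features but no longer reads directly as an odds ratio.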
What are the bigrams with the highest weighted log odds for these books?
```r
library(tidylo)

bigram_log_odds <- bigram_counts %>%
  bind_log_odds(book, bigram, n)

bigram_log_odds %>%
  arrange(-log_odds_weighted)
#> # A tibble: 300,903 × 4
#>    book                bigram                n log_odds_weighted
#>    <fct>               <chr>             <int>             <dbl>
#>  1 Mansfield Park      sir thomas          266              27.2
#>  2 Pride & Prejudice   mr darcy            230              27.0
#>  3 Emma                mr knightley        239              25.9
#>  4 Sense & Sensibility mrs jennings        185              24.3
#>  5 Emma                mrs weston          208              24.2
#>  6 Mansfield Park      miss crawford       196              23.4
#>  7 Persuasion          captain wentworth   143              23.0
#>  8 Persuasion          mr elliot           133              22.2
#>  9 Emma                mr elton            174              22.1
#> 10 Mansfield Park      mrs norris          148              20.3
#> # … with 300,893 more rows
```
The bigrams more likely to come from each book, compared to the others, involve proper nouns. We can make a visualization as well.
```r
library(ggplot2)

bigram_log_odds %>%
  group_by(book) %>%
  slice_max(log_odds_weighted, n = 10) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, log_odds_weighted)) %>%
  ggplot(aes(log_odds_weighted, bigram, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(book), scales = "free") +
  labs(y = NULL)
```
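As noted earlier, the same approach works for single words. The sketch below repeats the workflow with one-word tokens. The object names `word_counts`, `word_log_odds`, and `word_log_odds_uninformative` are made up for this example, and the `uninformative = TRUE` argument is assumed to be available in your installed version of tidylo; check `?bind_log_odds` if it is not recognized.

```r
library(dplyr)
library(janeaustenr)
library(tidytext)
library(tidylo)

# Count single words per novel (unnest_tokens() tokenizes by word by default)
word_counts <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

# Default behavior: prior estimated from the data itself
word_log_odds <- word_counts %>%
  bind_log_odds(book, word, n)

word_log_odds %>%
  arrange(-log_odds_weighted)

# Assumed option: use an uninformative prior instead
word_log_odds_uninformative <- word_counts %>%
  bind_log_odds(book, word, n, uninformative = TRUE)
```

Comparing the top rows of the two results is a quick way to see how much the choice of prior changes the ranking for your own data.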
This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support here.
