tidylo: Weighted Tidy Log Odds Ratio ⚖️

Authors: Julia Silge, Alex Hayes, Tyler Schnoebelen
License: MIT

How can we measure how the usage or frequency of some feature, such as words, differs across some group or set, such as documents? One option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability; we haven’t counted every feature the same number of times, so how do we know which differences are meaningful?
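To make the sampling-variability concern concrete, here is a minimal base R sketch. The counts are hypothetical toy values, and the standard error uses the usual large-sample approximation for a log odds ratio (an assumption added here for illustration, not something this README specifies). A rare feature and a common feature can show nearly the same log odds ratio while carrying very different amounts of evidence.

```r
# Log odds ratio of a feature seen y_i times out of n_i tokens in group i
# versus y_j times out of n_j tokens in group j.
log_odds_ratio <- function(y_i, n_i, y_j, n_j) {
  log(y_i / (n_i - y_i)) - log(y_j / (n_j - y_j))
}

# Large-sample approximation to the standard error of that log odds ratio
# (assumed here for illustration).
log_odds_ratio_se <- function(y_i, n_i, y_j, n_j) {
  sqrt(1 / y_i + 1 / (n_i - y_i) + 1 / y_j + 1 / (n_j - y_j))
}

# Hypothetical counts: both groups contain 10,000 tokens.
lor_rare   <- log_odds_ratio(3, 10000, 1, 10000)      # rare feature: 3 vs 1
lor_common <- log_odds_ratio(300, 10000, 100, 10000)  # common feature: 300 vs 100

se_rare   <- log_odds_ratio_se(3, 10000, 1, 10000)
se_common <- log_odds_ratio_se(300, 10000, 100, 10000)

# Similar log odds ratios (both close to log(3)), very different uncertainty.
print(c(rare = lor_rare, common = lor_common))
print(c(rare = se_rare, common = se_common))
```

Ranking features by the raw log odds ratio would treat these two cases as nearly equivalent, even though the estimate for the rare feature rests on only four observations.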

Enter the weighted log odds, which tidylo implements using tidy data principles. In particular, here we use the method outlined in Monroe, Colaresi, and Quinn (2008) to weight the log odds ratio by a prior. By default, the prior is estimated from the data itself, an empirical Bayes approach, but an uninformative prior is also available.
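The idea of weighting by a prior can be sketched in a few lines of base R. This is an illustration of the general form of the estimator in Monroe, Colaresi, and Quinn (2008) for two groups, written from the paper's formulas; it is not tidylo's internal code, and the function name `weighted_log_odds`, the toy counts, and the two priors are all hypothetical.

```r
# Prior-weighted log odds for two groups (illustrative sketch only).
# y_i, y_j: feature counts in each group; alpha: prior pseudo-counts per feature.
weighted_log_odds <- function(y_i, y_j, alpha) {
  n_i <- sum(y_i)
  n_j <- sum(y_j)
  alpha_0 <- sum(alpha)

  # Log odds of each feature within each group, after adding the prior counts.
  log_odds_i <- log((y_i + alpha) / (n_i + alpha_0 - y_i - alpha))
  log_odds_j <- log((y_j + alpha) / (n_j + alpha_0 - y_j - alpha))

  # Difference in log odds, scaled by its approximate standard deviation,
  # which gives a z-score-like quantity.
  delta <- log_odds_i - log_odds_j
  sigma2 <- 1 / (y_i + alpha) + 1 / (y_j + alpha)
  delta / sqrt(sigma2)
}

# Hypothetical counts for three features in two groups.
y_i <- c(tea = 10, carriage = 5, the = 85)
y_j <- c(tea = 2, carriage = 12, the = 86)

# Prior taken from the pooled data (in the spirit of empirical Bayes)
# versus a flat prior of one pseudo-count per feature.
prior_empirical <- y_i + y_j
prior_flat <- rep(1, length(y_i))

z_empirical <- weighted_log_odds(y_i, y_j, prior_empirical)
z_flat <- weighted_log_odds(y_i, y_j, prior_flat)

print(z_empirical)
print(z_flat)
```

A heavier prior pulls each group's feature proportions toward the pooled proportions, so features with few observations need a larger difference to stand out.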

Installation

You can install the released version of tidylo from CRAN with:

install.packages("tidylo")

Or you can install the development version from GitHub with devtools:

# install.packages("devtools")
devtools::install_github("juliasilge/tidylo")

Example

Using weighted log odds is a great approach for text analysis when we want to measure how word usage differs across a set of documents. Let’s explore the six published, completed novels of Jane Austen and use the tidytext package to count up the bigrams (sequences of two adjacent words) in each novel. This weighted log odds approach would work equally well for single words.

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)

# Split the text of each novel into bigrams, dropping empty rows
tidy_bigrams <- austen_books() %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    filter(!is.na(bigram))

# Count how often each bigram appears in each book
bigram_counts <- tidy_bigrams %>%
    count(book, bigram, sort = TRUE)

bigram_counts
#> # A tibble: 300,903 × 3
#>    book                bigram     n
#>    <fct>               <chr>  <int>
#>  1 Mansfield Park      of the   712
#>  2 Mansfield Park      to be    612
#>  3 Emma                to be    586
#>  4 Mansfield Park      in the   533
#>  5 Emma                of the   529
#>  6 Pride & Prejudice   of the   439
#>  7 Emma                it was   430
#>  8 Pride & Prejudice   to be    422
#>  9 Sense & Sensibility to be    418
#> 10 Emma                in the   416
#> # … with 300,893 more rows

Now let’s use the bind_log_odds() function from the tidylo package to find the weighted log odds for each bigram. The weighted log odds computed by this function are also z-scores for the log odds; this quantity is useful for comparing frequencies across categories or sets, but its relationship to an odds ratio is not straightforward after the weighting.

What are the bigrams with the highest weighted log odds for these books?

library(tidylo)

# Add a log_odds_weighted column: set = book, feature = bigram, count = n
bigram_log_odds <- bigram_counts %>%
    bind_log_odds(book, bigram, n)

# Sort from highest to lowest weighted log odds
bigram_log_odds %>%
    arrange(-log_odds_weighted)
#> # A tibble: 300,903 × 4
#>    book                bigram                n log_odds_weighted
#>    <fct>               <chr>             <int>             <dbl>
#>  1 Mansfield Park      sir thomas          266              27.2
#>  2 Pride & Prejudice   mr darcy            230              27.0
#>  3 Emma                mr knightley        239              25.9
#>  4 Sense & Sensibility mrs jennings        185              24.3
#>  5 Emma                mrs weston          208              24.2
#>  6 Mansfield Park      miss crawford       196              23.4
#>  7 Persuasion          captain wentworth   143              23.0
#>  8 Persuasion          mr elliot           133              22.2
#>  9 Emma                mr elton            174              22.1
#> 10 Mansfield Park      mrs norris          148              20.3
#> # … with 300,893 more rows

The bigrams more likely to come from each book, compared to the others, involve proper nouns. We can make a visualization as well.

library(ggplot2)

# Plot the ten bigrams with the highest weighted log odds in each book
bigram_log_odds %>%
    group_by(book) %>%
    slice_max(log_odds_weighted, n = 10) %>%
    ungroup() %>%
    mutate(bigram = reorder(bigram, log_odds_weighted)) %>%
    ggplot(aes(log_odds_weighted, bigram, fill = book)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(vars(book), scales = "free") +
    labs(y = NULL)
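The example above relies on the default prior estimated from the data. As a hedged sketch of the uninformative alternative mentioned earlier, the snippet below runs bind_log_odds() twice on a small hypothetical table of counts. The argument name `uninformative` is an assumption based on the package documentation as best recalled here; check `?bind_log_odds` for the version you have installed. The toy data and object names are made up for illustration.

```r
library(dplyr)
library(tidylo)

# Hypothetical counts of three features in two groups
toy_counts <- tibble(
  group   = c("a", "a", "a", "b", "b", "b"),
  feature = c("x", "y", "z", "x", "y", "z"),
  n       = c(10L, 5L, 85L, 2L, 12L, 86L)
)

# Default: prior estimated from the data (empirical Bayes)
default_prior <- toy_counts %>%
  bind_log_odds(group, feature, n)

# Assumed argument name: request the uninformative prior instead
flat_prior <- toy_counts %>%
  bind_log_odds(group, feature, n, uninformative = TRUE)

# Put the two sets of weighted log odds side by side
comparison <- default_prior %>%
  select(group, feature, n, empirical = log_odds_weighted) %>%
  mutate(uninformative = flat_prior$log_odds_weighted)

comparison
```

Comparing the two columns shows how much the choice of prior shifts the scores for a given dataset.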

Community Guidelines

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support here.
