
In this vignette, we’ll walk through conducting an analysis of variance (ANOVA) test using infer. ANOVAs are used to analyze differences in group means.

Throughout this vignette, we’ll make use of the gss dataset supplied by infer, which contains a sample of data from the General Social Survey. See ?gss for more information on the variables included and their source. Note that this data (and our examples on it) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let’s suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:

dplyr::glimpse(gss)

To carry out an ANOVA, we’ll examine the association between age and political party affiliation in the United States. The age variable is a numerical variable measuring the respondents’ age at the time that the survey was taken, and partyid is a factor variable with unique values ind, rep, dem, other.
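A plot of age by party affiliation, like the boxplots discussed next, can be drawn along the following lines. This is a minimal sketch using base R graphics rather than the plotting code behind the original figure, which is not shown here; the name gss_plot is introduced only for illustration.

```r
library(infer)  # provides the gss dataset

# Keep the two variables of interest and drop any unused factor levels,
# so that only party groups actually present in the sample are drawn.
gss_plot <- droplevels(gss[, c("age", "partyid")])

# One boxplot of age per party affiliation; the result is also returned
# invisibly as a list of summary statistics, stored here in bp.
bp <- boxplot(age ~ partyid, data = gss_plot,
              xlab = "partyid", ylab = "age")
```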

If there were no relationship, we would expect to see each of these boxplots lining up along the y-axis. It looks like the average age of Democrats and Republicans is a bit higher than that of independent and other American voters. Is this difference just random noise, though?

First, to calculate the observed statistic, we can use specify(), hypothesize(), and calculate().

# calculate the observed statistic
observed_f_statistic <- gss |>
  specify(age ~ partyid) |>
  hypothesize(null = "independence") |>
  calculate(stat = "F")

The observed \(F\) statistic is 2.4842. Now, we want to compare this statistic to a null distribution, generated under the assumption that age and political party affiliation are not actually related, to get a sense of how likely it would be for us to see this observed statistic if there were actually no association between the two variables.
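For intuition about what an \(F\) statistic measures, recall that the classical one-way ANOVA \(F\) statistic is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it by hand in base R on simulated stand-in data (not the gss sample; all variable names are illustrative) and checks the result against anova(). We assume, without confirming it here, that calculate(stat = "F") reports this same quantity.

```r
set.seed(1)

# Simulated stand-in data: three groups of 20 observations each.
group <- factor(rep(c("a", "b", "c"), each = 20))
y <- rnorm(60, mean = c(40, 45, 50)[as.integer(group)], sd = 10)

k <- nlevels(group)   # number of groups
n <- length(y)        # total sample size

grand_mean  <- mean(y)
group_means <- tapply(y, group, mean)
group_sizes <- tapply(y, group, length)

# Between-group and within-group sums of squares.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((y - group_means[as.integer(group)])^2)

# F is the ratio of the two mean squares.
f_by_hand <- (ss_between / (k - 1)) / (ss_within / (n - k))

# The same statistic as reported by base R's ANOVA table.
f_from_anova <- anova(lm(y ~ group))[["F value"]][1]

all.equal(f_by_hand, f_from_anova)
```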

We can generate() an approximation of the null distribution using randomization. The randomization approach permutes the response and explanatory variables, so that each person’s party affiliation is matched up with a random age from the sample in order to break up any association between the two.

# generate the null distribution using randomization
null_dist <- gss |>
  specify(age ~ partyid) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "F")
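To make the permutation idea concrete, here is a rough base R sketch of the same procedure on simulated stand-in data (not the gss sample). The helper f_stat() and all variable names are hypothetical, and this is only meant to illustrate the shuffling logic, not to reproduce how infer implements generate().

```r
set.seed(2)

# Simulated stand-in data with no true group effect.
group <- factor(rep(c("a", "b", "c"), each = 20))
y <- rnorm(60, mean = 45, sd = 10)

# Hypothetical helper: the one-way ANOVA F statistic for a response and grouping.
f_stat <- function(response, groups) {
  anova(lm(response ~ groups))[["F value"]][1]
}

observed <- f_stat(y, group)

# Shuffle the response while keeping the group labels fixed, which breaks
# any association between the two, and recompute F for each shuffle.
null_f <- replicate(1000, f_stat(sample(y), group))

summary(null_f)
```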

Note that, in the line specify(age ~ partyid) above, we could use the equivalent syntax specify(response = age, explanatory = partyid).
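As a quick check of that equivalence, the following sketch computes the observed statistic with both forms and compares them. It assumes infer is attached, and it omits hypothesize(), which to our understanding is optional when only the observed statistic is needed. The names f_formula and f_named are illustrative.

```r
library(infer)

# Formula interface.
f_formula <- gss |>
  specify(age ~ partyid) |>
  calculate(stat = "F")

# Named-argument interface.
f_named <- gss |>
  specify(response = age, explanatory = partyid) |>
  calculate(stat = "F")

# Both forms should yield the same statistic.
all.equal(f_formula$stat, f_named$stat)
```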

To get a sense for what this distribution looks like, and where our observed statistic falls, we can use visualize():

# visualize the null distribution and test statistic!
null_dist |>
  visualize() +
  shade_p_value(observed_f_statistic,
                direction = "greater")

We could also visualize the observed statistic against the theoretical null distribution. To do so, use the assume() verb to define a theoretical null distribution and then pass it to visualize() like a null distribution outputted from generate() and calculate().

# visualize the theoretical null distribution and test statistic!
null_dist_theory <- gss |>
  specify(age ~ partyid) |>
  assume(distribution = "F")

visualize(null_dist_theory) +
  shade_p_value(observed_f_statistic,
                direction = "greater")

To visualize both the randomization-based and theoretical null distributions to get a sense of how the two relate, we can pipe the randomization-based null distribution into visualize(), and then further provide method = "both" to visualize().

# visualize both null distributions and the test statistic!
null_dist |>
  visualize(method = "both") +
  shade_p_value(observed_f_statistic,
                direction = "greater")

Either way, it looks like our observed test statistic would be quite unlikely if there were actually no association between age and political party affiliation. More exactly, we can approximate the p-value from the randomization-based approximation to the null distribution:

# calculate the p value from the observed statistic and null distribution
p_value <- null_dist |>
  get_p_value(obs_stat = observed_f_statistic,
              direction = "greater")

p_value

Thus, if there were really no relationship between age and political party affiliation, we estimate that the probability of seeing a statistic as or more extreme than 2.4842 is approximately 0.05.
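In essence, a randomization-based p-value with direction "greater" is the share of null statistics that are at least as large as the observed one. The toy sketch below shows that arithmetic in base R; the numbers are made up for illustration and are not taken from the gss analysis.

```r
# Hypothetical null statistics and observed statistic (illustrative values only).
null_stats <- c(0.2, 0.5, 0.9, 1.4, 2.1, 2.6, 3.0, 0.7, 1.1, 0.3)
observed   <- 2.0

# Proportion of null statistics as or more extreme than the observed one.
p_value_by_hand <- mean(null_stats >= observed)
p_value_by_hand  # 3 of the 10 values are >= 2.0, so this is 0.3
```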

To calculate the p-value using the true \(F\) distribution, we can use the pf() function from base R. This function allows us to situate the test statistic we calculated previously in the \(F\) distribution with the appropriate degrees of freedom.

pf(observed_f_statistic$stat, 3, 496, lower.tail = FALSE)
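The degrees of freedom 3 and 496 follow the usual one-way ANOVA rule: the number of groups minus one, and the sample size minus the number of groups. The sketch below spells this out. The sample size of 500 is an assumption inferred from the denominator degrees of freedom used above, and the variable names are illustrative.

```r
k <- 4     # party groups: ind, rep, dem, other
n <- 500   # assumed sample size, implied by 496 = n - k

df_between <- k - 1   # numerator degrees of freedom
df_within  <- n - k   # denominator degrees of freedom

# Upper-tail probability of the observed statistic reported earlier.
p_theory <- pf(2.4842, df_between, df_within, lower.tail = FALSE)
p_theory
```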

Note that, while the observed statistic stays the same, the resulting p-value differs slightly between these two approaches since the randomization-based empirical \(F\) distribution is an approximation of the true \(F\) distribution.


Tidy ANOVA (Analysis of Variance) with infer