Anscombe?
Imagine that we have some observations, and we want to use them to conclude something about the world around us. Statistics can help us in the common case when the observations are composed of a systematic component combined with random chance effects. A classic "toy" example is that of dice. Let us say that we wish to analyse the following result, obtained by rolling a die five times.
| Rolls |
|---|
| 2 |
| 3 |
| 6 |
| 6 |
| 6 |
Although, as a general rule, your first step should be to plot your data, there is little point in this instance. The dataset is so small that we can get a feel for it by just inspecting the numbers. The most striking feature is that we seem to have a preponderance of 6's, although this could, of course, be due to chance.
We could just treat the sequence {2,3,6,6,6} as a unique, arbitrary sequence of events. But this is rather pointless: data is usually analysed in order to seek general patterns and, by generalizing, to increase our understanding. In this case, we might wish to use the result to decide whether the die is loaded in favour of sixes, and hence whether it can be relied upon for playing a game.
Naively, we might hope to analyse this with a completely open mind; to approach the situation with no prior assumptions. A moment's thought should reveal this is impossible. For example, imagine that the die was a fraudster's dream: a sophisticated miniature machine that could be pre-programmed to give a particular sequence of numbers each day. It could, for example, be programmed to give a 2, followed by a 3, then three 6's, and spend the rest of the day rolling 1's. In this case, the results tally exactly with what we have observed, but they do not tell us anything about the subsequent behaviour of the die. Although it explains the data perfectly, most people would (quite reasonably) adopt the prior assumption that the "miniature machine" explanation was highly unlikely.
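To make the "miniature machine" explanation concrete, here is a minimal sketch in R. The function name `programmed_die` and the vector `observed` are invented purely for illustration; the point is only that a fully deterministic rule can reproduce the five observed rolls exactly while saying nothing useful about a real die.

```r
# Hypothetical "programmed die": a 2, a 3, three 6's, then 1's for the rest of the day.
# This is an illustration of the thought experiment, not a model anyone would propose.
programmed_die <- function(n) {
  daily_sequence <- c(2, 3, 6, 6, 6)
  # pad with 1's if more than five rolls are requested, then keep the first n
  c(daily_sequence, rep(1, max(0, n - 5)))[seq_len(n)]
}

observed <- c(2, 3, 6, 6, 6)

identical(programmed_die(5), observed)  # TRUE: the "explanation" fits perfectly
programmed_die(8)                       # the next three rolls would all be 1's
```

A perfect fit to the data is therefore not enough, on its own, to make an explanation believable.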
This is, of course, an extreme example, but it illustrates the point. Whether or not we realise it, we always examine data with prior notions of what is a reasonable explanation and what is not. Statistical analysis is the process of formalising these explanations, then using the data to choose between them. A good way to do this is by describing the assumptions that we have made in each case. For example, the following two assumptions are common to nearly all explanations that we might want to test.
The problem with making any assumptions is that they are just that: assumptions. They may or may not be true. When trying to convince others with our analysis, we are asking them to take our assumptions on trust. For this reason we should try to make widely accepted assumptions and, more importantly, ensure they are completely explicit. That way, others can decide for themselves if the analysis is to be trusted. We can encapsulate this most easily and concisely by formulating a model of the underlying process.
A common way of understanding the world around us is to describe it in terms of a model. The more
Many basic statistics books teach simple tests, such as the t-test or sign-test. These are all based on an underlying model.
So what does an appropriate model look like? There are various ways in which we can
Testing a particular model.
We can easily disprove this by a single observation. However, we can never prove it. This turns out to be generally true. It is impossible to prove that something is the case, because there could always be a
If the model contains an element of chance, how can we know whether ***
Once we have our models, we could either
The simplest is simulation (compare to likelihood)
One of the major ways in which we can use models is simulation. This will be a major way in which models are explored in this book. To do so, we need to convert the various models described above into simulations. The "fair die" model above provides a good, simple example. We will convert this model to a simulation in R. This involves learning a little about how R deals with numbers, so you should check that you are comfortable with the idea of functions in R, as described previously.
Sampling with replacement - describe here the idea of using random sampling for simulation
###The next 4 lines are equivalent, 5 numbers are selected from a list of 1..6
sample(x=1:6, size=5, replace=FALSE) #when sampling WITHOUT replacement, each number only appears once
[1] 1 5 4 3 6
sample(replace=FALSE, size=5, x=1:6) #you can change the order of the arguments
[1] 5 6 4 2 1
sample(x=1:6, size=5) #the same, because replace=FALSE by default
[1] 2 3 4 6 5
sample(1:6, 5) #we don't need x= and size= if arguments are in the same order as in the help file
[1] 1 6 3 5 4
### Now simulate a different model
sample(1:6, 5, TRUE) #sampling WITH replacement (the same number can appear twice)
[1] 3 6 2 1 3
sample(1:6, 5, TRUE)
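Sampling with replacement is the version that matches the "fair die" model, because each roll is independent of the last and the same face can come up again. As a minimal sketch, the call can be wrapped in a small function. The name `roll_fair_die` and the seed value are arbitrary choices made here for illustration; `set.seed` is used only so that the "random" rolls can be reproduced.

```r
# Hypothetical helper: roll a fair six-sided die n_rolls times.
# Each face 1..6 is equally likely and faces can repeat (replace = TRUE).
roll_fair_die <- function(n_rolls) {
  sample(1:6, size = n_rolls, replace = TRUE)
}

set.seed(42)                      # arbitrary seed, fixes the random number stream
first_try <- roll_fair_die(5)

set.seed(42)                      # resetting the seed repeats the same rolls
second_try <- roll_fair_die(5)

first_try
identical(first_try, second_try)  # TRUE
```

Without the calls to `set.seed`, each run would normally give a different set of five rolls, which is exactly the behaviour we want from a simulated die.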
We can try to disprove a particular model, or select between different models using some informed judgement
Various ways to test a model. E.g. compare results from the simulation with the observed ones
We now have a simple method of simulating data produced by the model. How can we
Now that we can simulate How do we ***. We are unlikely to get exactly the sequence we observed. A classic method is to use a sample statistic. If we got 3 fives or 3 ones we would also be surprised. Link to idea of probability space.
The sample statistics
tabulate(c(2,3,6,6,6)) #an example: we can see that the
max(tabulate(c(2,3,6,6,6))) #simply confirms what
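The statistic used in the two lines above is the largest number of times that any single face appears. As a sketch, it can be wrapped in a function so that it is applied in the same way to observed and simulated rolls. The name `max_face_count` is made up for this example, and `nbins = 6` is an added assumption so that all six faces are always counted, even if a 6 is never rolled.

```r
observed <- c(2, 3, 6, 6, 6)

# Hypothetical helper: the largest count of any one face in a set of rolls.
max_face_count <- function(rolls) {
  max(tabulate(rolls, nbins = 6))  # count faces 1..6, then take the biggest count
}

tabulate(observed, nbins = 6)      # 0 1 1 0 0 3: one 2, one 3, three 6's
max_face_count(observed)           # 3
max_face_count(c(5, 5, 5, 1, 2))   # also 3: three fives are as "surprising" as three sixes
```

Because the statistic ignores which face is repeated, three fives or three ones score the same as three sixes.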
replicate(1000, max(tabulate(sample(1:6, 5, TRUE))))
[1] 3 2 2 2 2 3 2 2 3 2 2 2 2 2 3 2 2 2 3 1 2 2 3 2 2 2 2 2 2 2 2 2 2 3 2
[36] 1 3 4 2 2 2 2 2 3 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2
[71] 1 3 2 2 2 2 1 2 3 2 2 2 3 2 2 3 2 2 2 3 2 2 3 2 2 1 2 2 2 3 2 2 1 2 3
[106] 3 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 2 3 3 2 1 2
[141] 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
[176] 2 3 2 2 2 2 2 2 2 2 2 2 4 2 4 2 2 2 1 2 2 2 2 3 2 3 3 2 2 2 2 2 3 2 2
[211] 2 3 2 2 2 2 2 2 3 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 1
[246] 2 2 3 2 3 2 2 3 2 3 2 1 2 2 2 1 2 2 2 2 3 2 3 2 2 2 3 2 3 2 2 2 2 2 2
[281] 2 2 2 2 3 2 2 2 3 2 2 3 2 2 1 2 2 1 3 3 2 2 2 2 2 2 3 2 2 2 3 2 2 2 3
[316] 4 3 2 1 1 3 2 2 2 3 3 1 3 2 2 1 2 4 2 3 2 2 2 1 2 2 2 2 2 2 3 2 1 2 2
[351] 1 3 2 2 3 2 2 2 2 3 1 4 2 3 3 3 2 4 3 2 2 1 2 2 2 2 2 2 3 2 2 1 2 3 2
[386] 3 2 2 4 2 2 2 1 1 2 3 3 3 2 2 2 2 2 3 2 2 1 2 3 1 2 2 2 2 2 2 2 2 3 2
[421] 2 1 2 3 2 2 2 2 1 3 2 2 2 2 2 3 1 1 2 2 2 2 2 2 3 2 2 2 3 3 2 2 2 3 2
[456] 2 1 2 2 2 2 2 2 3 2 2 3 1 3 2 2 3 2 3 2 2 2 2 1 2 3 2 3 2 2 3 2 4 2 2
[491] 3 2 2 3 2 2 2 4 3 1 2 2 3 2 2 2 2 2 2 4 1 2 2 1 2 2 2 2 2 3 1 2 2 2 2
[526] 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 3 1 1 2 3 2 2 2 2 2 2
[561] 4 2 1 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 1 2 3 2 2 2 3 2 2 2 2 2 2 2 2
[596] 2 2 3 2 2 2 3 2 3 2 2 2 2 2 1 2 2 3 3 3 2 2 2 2 2 2 3 2 4 1 2 2 2 2 2
[631] 3 2 2 2 2 2 2 2 3 2 2 3 3 2 2 1 1 2 2 3 2 4 2 1 2 2 1 2 2 2 2 2 2 3 2
[666] 2 2 2 3 2 2 2 3 2 2 2 2 2 2 1 2 2 2 2 2 2 2 3 2 2 3 3 2 2 2 3 3 2 2 3
[701] 3 3 2 2 2 2 2 1 2 2 3 2 2 2 2 3 2 3 2 3 2 1 2 2 2 2 2 2 3 2 1 2 2 2 2
[736] 3 3 2 3 2 2 2 3 2 2 2 1 2 2 2 3 2 3 3 2 2 3 1 2 2 2 2 2 4 2 2 2 2 2 2
[771] 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 3 3 3 2 2 2 2 2 2
[806] 2 2 3 1 2 2 4 2 2 1 4 2 3 3 2 2 2 3 2 1 2 2 3 2 2 2 1 2 2 2 2 2 2 2 1
[841] 2 2 1 2 1 2 3 2 2 2 3 3 2 3 1 2 2 2 2 2 2 3 3 2 2 2 3 2 2 2 2 1 1 2 2
[876] 1 1 2 2 2 3 1 2 2 2 1 2 2 2 2 2 2 2 2 2 3 2 2 2 1 2 3 2 2 3 2 1 2 3 1
[911] 3 2 3 2 3 3 1 2 2 2 3 2 1 2 2 2 2 2 3 4 2 2 2 2 3 2 2 2 4 4 2 1 1 2 2
[946] 3 3 2 2 3 2 2 2 3 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2 3 1 3 4 3 2 2 2 2 2 2
[981] 1 2 2 2 2 2 3 1 3 3 2 2 3 2 2 4 2 2 4 2
You should be able to see that there is a smattering of 1’s,
Prelude to introducing the concept of probability.
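A thousand simulated values are hard to take in by eye, so one possible next step is to summarise them. The sketch below repeats the simulation, stores the results, and then counts how often each value of the statistic occurs. The variable names and the seed are assumptions made for this example, and the counts will differ from the printed run above because the random numbers differ.

```r
set.seed(1)      # arbitrary seed so the run can be repeated
n_sims <- 1000   # number of simulated sets of five rolls

# For each simulated set of 5 fair rolls, record the largest count of any one face
sim_stats <- replicate(n_sims, max(tabulate(sample(1:6, 5, TRUE), nbins = 6)))

table(sim_stats)                          # how many simulations gave 1, 2, 3, ...

# Proportion of simulations at least as extreme as the observed value of 3
prop_at_least_3 <- mean(sim_stats >= 3)
prop_at_least_3
```

Under the fair-die model this proportion should come out somewhere around a fifth, although the exact figure varies from run to run. It estimates how often a fair die would give a result at least as striking as the one we observed.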