Averages of repeated trials converge to the expected value
An illustration of the law of large numbers using a particular run of rolls of a single die. As the number of rolls in this run increases, the average of the values of all the results approaches 3.5. Although each run would show a distinctive shape over a small number of throws (at the left), over a large number of rolls (to the right) the shapes would be extremely similar.
In probability theory, the law of large numbers is a mathematical law that states that the average of the results obtained from a large number of independent random samples converges to the true value, if it exists.[1] More formally, the law of large numbers states that given a sample of independent and identically distributed values, the sample mean converges to the true mean.
The law of large numbers is important because it guarantees stable long-term results for the averages of some random events.[1][2] For example, while a casino may lose money in a single spin of the roulette wheel, its earnings will tend towards a predictable percentage over a large number of spins. Any winning streak by a player will eventually be overcome by the parameters of the game. Importantly, the law applies (as the name indicates) only when a large number of observations are considered. There is no principle that a small number of observations will coincide with the expected value or that a streak of one value will immediately be "balanced" by the others (see the gambler's fallacy).
Throughout its history, many mathematicians have refined this law. Today, the law of large numbers is used in many fields including statistics, probability theory, economics, and insurance.[3]
For example, a single roll of a six-sided die produces one of the numbers 1, 2, 3, 4, 5, or 6, each with equal probability. Therefore, the expected value of the roll is:

$$\frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5$$
According to the law of large numbers, if a large number of six-sided dice are rolled, the average of their values (sometimes called the sample mean) will approach 3.5, with the precision increasing as more dice are rolled.
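The dice example can be tried out numerically. The following is a minimal sketch, assuming a seeded pseudo-random generator stands in for a fair die; the function name `running_average_of_rolls` is hypothetical and chosen only for illustration.

```python
import random


def running_average_of_rolls(num_rolls, seed=0):
    """Return the running averages of `num_rolls` simulated fair die rolls."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    total = 0
    averages = []
    for i in range(1, num_rolls + 1):
        total += rng.randint(1, 6)  # one roll of a six-sided die
        averages.append(total / i)  # average of the first i rolls
    return averages


if __name__ == "__main__":
    averages = running_average_of_rolls(100_000)
    for n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n = {n:>7}: average of first n rolls = {averages[n - 1]:.4f}")
```

Early averages typically wander noticeably, while the later ones settle close to 3.5; a different seed gives a different early shape but a similar long-run value.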
This image illustrates the convergence of relative frequencies to their theoretical probabilities. The probability of picking a red ball from a sack is 0.4 and black ball is 0.6. The left plot shows the relative frequency of picking a black ball, and the right plot shows the relative frequency of picking a red ball, both over 10,000 trials. As the number of trials increases, the relative frequencies approach their respective theoretical probabilities, demonstrating the law of large numbers.
For example, a fair coin toss is a Bernoulli trial. When a fair coin is flipped once, the theoretical probability that the outcome will be heads is equal to 1⁄2. Therefore, according to the law of large numbers, the proportion of heads in a "large" number of coin flips "should be" roughly 1⁄2. In particular, the proportion of heads after n flips will almost surely converge to 1⁄2 as n approaches infinity.
Although the proportion of heads (and tails) approaches 1⁄2, almost surely the absolute difference in the number of heads and tails will become large as the number of flips becomes large. That is, the probability that the absolute difference is a small number approaches zero as the number of flips becomes large. Also, almost surely the ratio of the absolute difference to the number of flips will approach zero. Intuitively, the expected difference grows, but at a slower rate than the number of flips.
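The contrast between the proportion of heads and the absolute difference can be sketched in a short simulation. This is an illustrative sketch only, assuming a seeded pseudo-random generator as the fair coin; `coin_flip_summary` is a hypothetical helper name.

```python
import random


def coin_flip_summary(num_flips, seed=0):
    """Flip a simulated fair coin and summarise the outcome.

    Returns (proportion of heads, |heads - tails|, |heads - tails| / num_flips).
    """
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    tails = num_flips - heads
    abs_diff = abs(heads - tails)
    return heads / num_flips, abs_diff, abs_diff / num_flips


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        proportion, abs_diff, ratio = coin_flip_summary(n)
        print(f"n = {n:>9}: proportion = {proportion:.4f}, "
              f"|heads - tails| = {abs_diff}, ratio = {ratio:.5f}")
```

In a typical run the proportion moves towards 1⁄2 and the ratio towards zero, even though the absolute difference itself tends to be larger for larger n.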
Another good example of the law of large numbers is the Monte Carlo method. These methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The larger the number of repetitions, the better the approximation tends to be. The reason that this method is important is mainly that, sometimes, it is difficult or impossible to use other approaches.[4]
The average of the results obtained from a large number of trials may fail to converge in some cases. For instance, the average of n results taken from the Cauchy distribution or some Pareto distributions (α<1) will not converge as n becomes larger; the reason is heavy tails.[5] The Cauchy distribution and the Pareto distribution represent two cases: the Cauchy distribution does not have an expectation,[6] whereas the expectation of the Pareto distribution (α<1) is infinite.[7] One way to generate the Cauchy-distributed example is where the random numbers equal the tangent of an angle uniformly distributed between −90° and +90°.[8] The median is zero, but the expected value does not exist, and indeed the average of n such variables has the same distribution as one such variable. It does not converge in probability toward zero (or any other value) as n goes to infinity.
If the trials embed a selection bias, typical in human economic/rational behaviour, the law of large numbers does not help in solving the bias. Even if the number of trials is increased, the selection bias remains.
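The Cauchy construction described above (the tangent of a uniformly distributed angle) is easy to experiment with. The sketch below is illustrative only; the helper names `cauchy_draws` and `running_mean_at` are hypothetical, and the seed is arbitrary.

```python
import math
import random
import statistics


def cauchy_draws(n, seed=0):
    """Draw n Cauchy variates as tan(angle), angle uniform between -90 and +90 degrees."""
    rng = random.Random(seed)
    return [math.tan(rng.uniform(-math.pi / 2, math.pi / 2)) for _ in range(n)]


def running_mean_at(samples, checkpoints):
    """Return {n: mean of the first n samples} for each n in `checkpoints`."""
    wanted = set(checkpoints)
    means = {}
    total = 0.0
    for i, x in enumerate(samples, start=1):
        total += x
        if i in wanted:
            means[i] = total / i
    return means


if __name__ == "__main__":
    samples = cauchy_draws(1_000_000)
    for n, mean in running_mean_at(samples, (100, 10_000, 1_000_000)).items():
        print(f"n = {n:>9}: running mean = {mean:+.3f}")
    print(f"sample median = {statistics.median(samples):+.4f}")
```

The sample median settles near zero, whereas the running mean has no value to settle on: occasional extreme draws can shift it by a large amount at any stage.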
Diffusion is an example of the law of large numbers. Initially, there are solute molecules on the left side of a barrier (magenta line) and none on the right. The barrier is removed, and the solute diffuses to fill the whole container.
Top: With a single molecule, the motion appears to be quite random.
Middle: With more molecules, there is clearly a trend where the solute fills the container more and more uniformly, but there are also random fluctuations.
Bottom: With an enormous number of solute molecules (too many to see), the randomness is essentially gone: The solute appears to move smoothly and systematically from high-concentration areas to low-concentration areas. In realistic situations, chemists can describe diffusion as a deterministic macroscopic phenomenon (see Fick's laws), despite its underlying random nature.
The Italian mathematician Gerolamo Cardano (1501–1576) stated without proof that the accuracies of empirical statistics tend to improve with the number of trials.[9][3] This was then formalized as a law of large numbers. A special form of the law of large numbers (for a binary random variable) was first proved by Jacob Bernoulli.[10][3] It took him over 20 years to develop a sufficiently rigorous mathematical proof which was published in his Ars Conjectandi (The Art of Conjecturing) in 1713. He named this his "golden theorem" but it became generally known as "Bernoulli's theorem". This should not be confused with Bernoulli's principle, named after Jacob Bernoulli's nephew Daniel Bernoulli. In 1837, S. D. Poisson further described it under the name "la loi des grands nombres" ("the law of large numbers").[11][12][3] Thereafter, it was known under both names, but the "law of large numbers" is most frequently used.
After Bernoulli and Poisson published their efforts, other mathematicians also contributed to refinement of the law, including Chebyshev,[13] Markov, Borel, Cantelli, Kolmogorov and Khinchin.[3] Markov showed that the law can apply to a random variable that does not have a finite variance under some other weaker assumption, and Khinchin showed in 1929 that if the series consists of independent identically distributed random variables, it suffices that the expected value exists for the weak law of large numbers to be true.[14][15] These further studies have given rise to two prominent forms of the law of large numbers. One is called the "weak" law and the other the "strong" law, in reference to two different modes of convergence of the cumulative sample means to the expected value; in particular, as explained below, the strong form implies the weak.[14]
There are two different versions of the law of large numbers that are described below. They are called the strong law of large numbers and the weak law of large numbers.[16][1] Stated for the case where X1, X2, ... is an infinite sequence of independent and identically distributed (i.i.d.) Lebesgue integrable random variables with expected value E(X1) = E(X2) = ... = μ, both versions of the law state that the sample average

$$\overline{X}_n = \frac{1}{n}(X_1 + \cdots + X_n)$$

converges to the expected value:

$$\overline{X}_n \to \mu \quad \text{as } n \to \infty. \tag{1}$$
(Lebesgue integrability of Xj means that the expected value E(Xj) exists according to Lebesgue integration and is finite. It does not mean that the associated probability measure is absolutely continuous with respect to Lebesgue measure.)
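Introductory probability texts often additionally assume identical finite variance $\operatorname{Var}(X_i) = \sigma^2$ (for all $i$) and no correlation between random variables. In that case, the variance of the average of n random variables is

$$\operatorname{Var}(\overline{X}_n) = \operatorname{Var}\left(\tfrac{1}{n}(X_1 + \cdots + X_n)\right) = \frac{1}{n^2}\operatorname{Var}(X_1 + \cdots + X_n) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n},$$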
which can be used to shorten and simplify the proofs. This assumption of finite variance is not necessary. Large or infinite variance will make the convergence slower, but the law of large numbers holds anyway.[17]
The difference between the strong and the weak version is concerned with the mode of convergence being asserted. For interpretation of these modes, see Convergence of random variables.
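For uncorrelated variables with common variance σ², the variance of the average of n of them equals σ²/n. As a small sanity check, the sketch below verifies this exactly for fair six-sided dice (where σ² = 35/12) by enumerating every outcome of n rolls with rational arithmetic. The function name is hypothetical, and the enumeration is only practical for small n.

```python
from fractions import Fraction
from itertools import product

# Variance of a single fair six-sided die: E[X^2] - E[X]^2 = 91/6 - 49/4.
SIGMA_SQUARED = Fraction(35, 12)


def variance_of_mean_exact(n):
    """Exact variance of the mean of n fair dice, enumerating all 6**n outcomes."""
    means = [Fraction(sum(rolls), n) for rolls in product(range(1, 7), repeat=n)]
    expected = sum(means) / len(means)
    return sum((m - expected) ** 2 for m in means) / len(means)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(f"n = {n}: Var(mean) = {variance_of_mean_exact(n)}, "
              f"sigma^2 / n = {SIGMA_SQUARED / n}")
```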
Simulation illustrating the law of large numbers. Each frame, a coin that is red on one side and blue on the other is flipped, and a dot is added in the corresponding column. A pie chart shows the proportion of red and blue so far. Notice that while the proportion varies significantly at first, it approaches 50% as the number of trials increases.
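The weak law of large numbers states that the sample average converges in probability towards the expected value:

$$\overline{X}_n \xrightarrow{P} \mu \quad \text{as } n \to \infty. \tag{2}$$

That is, for any positive number ε,

$$\lim_{n \to \infty} \Pr\left(\left|\overline{X}_n - \mu\right| < \varepsilon\right) = 1.$$

Interpreting this result, the weak law states that for any nonzero margin specified (ε), no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations will be close to the expected value; that is, within the margin.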
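The interpretation above can be illustrated by estimating, for a fixed margin ε, how often the sample average falls outside the margin as the sample size grows. This is a rough simulation sketch, assuming Uniform(0, 1) draws (so that μ = 0.5) and an illustrative margin of 0.05; the function name and its parameters are hypothetical.

```python
import random


def prob_mean_outside_margin(n, epsilon, trials=2_000, seed=0):
    """Estimate P(|mean of n Uniform(0, 1) draws - 0.5| >= epsilon) by simulation."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        sample_mean = sum(rng.random() for _ in range(n)) / n
        if abs(sample_mean - 0.5) >= epsilon:
            misses += 1
    return misses / trials  # fraction of repetitions outside the margin


if __name__ == "__main__":
    for n in (10, 100, 1_000):
        estimate = prob_mean_outside_margin(n, epsilon=0.05)
        print(f"n = {n:>5}: estimated P(|mean - 0.5| >= 0.05) = {estimate:.4f}")
```

The estimated probability of missing the margin shrinks towards zero as n grows, which is what convergence in probability asserts.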
As mentioned earlier, the weak law applies in the case of i.i.d. random variables, but it also applies in some other cases. For example, the variance may be different for each random variable in the series, keeping the expected value constant. If the variances are bounded, then the law applies, as shown by Chebyshev as early as 1867. (If the expected values change during the series, then we can simply apply the law to the average deviation from the respective expected values. The law then states that this converges in probability to zero.) In fact, Chebyshev's proof works so long as the variance of the average of the first n values goes to zero as n goes to infinity.[15] As an example, assume that each random variable in the series follows a Gaussian distribution (normal distribution) with mean zero, but with variance equal to $2n/\log(n+1)$, which is not bounded. At each stage, the average will be normally distributed (as the average of a set of normally distributed variables). The variance of the sum is equal to the sum of the variances, which is asymptotic to $n^2/\log n$. The variance of the average is therefore asymptotic to $1/\log n$ and goes to zero.
There are also examples of the weak law applying even though the expected value does not exist.
The strong law of large numbers states that the sample average converges almost surely to the expected value:

$$\overline{X}_n \xrightarrow{\text{a.s.}} \mu \quad \text{as } n \to \infty. \tag{3}$$

That is,

$$\Pr\left(\lim_{n \to \infty} \overline{X}_n = \mu\right) = 1.$$

What this means is that, with probability one, the average of the observations converges to the expected value as the number of trials n goes to infinity. The modern proof of the strong law is more complex than that of the weak law, and relies on passing to an appropriate sub-sequence.[17]
The strong law of large numbers can itself be seen as a special case of the pointwise ergodic theorem. This view justifies the intuitive interpretation of the expected value (for Lebesgue integration only) of a random variable when sampled repeatedly as the "long-term average".
The strong law is so called because random variables which converge strongly (almost surely) are guaranteed to converge weakly (in probability). However, the weak law is known to hold in certain conditions where the strong law does not hold, and then the convergence is only weak (in probability). See Differences between the weak law and the strong law.
The strong law applies to independent identically distributed random variables having an expected value (like the weak law). This was proved by Kolmogorov in 1930. It can also apply in other cases. Kolmogorov also showed, in 1933, that if the variables are independent and identically distributed, then for the average to converge almost surely on something (this can be considered another statement of the strong law), it is necessary that they have an expected value (and then of course the average will converge almost surely on that).[22]
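If the summands are independent but not identically distributed, then

$$\overline{X}_n - \operatorname{E}\left[\overline{X}_n\right] \xrightarrow{\text{a.s.}} 0,$$

provided that each Xk has a finite second moment and

$$\sum_{k=1}^{\infty} \frac{1}{k^2} \operatorname{Var}[X_k] < \infty.$$

This statement is known as Kolmogorov's strong law; see e.g. Sen & Singer (1993, Theorem 2.3.10).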
Differences between the weak law and the strong law
The weak law states that for a specified large n, the average $\overline{X}_n$ is likely to be near μ.[23] Thus, it leaves open the possibility that $|\overline{X}_n - \mu| > \varepsilon$ happens an infinite number of times, although at infrequent intervals. (Not necessarily $|\overline{X}_n - \mu| \neq 0$ for all n.)
The strong law shows that this almost surely will not occur. That is, with probability 1, for any ε > 0 the inequality $|\overline{X}_n - \mu| < \varepsilon$ holds for all large enough n.[24]
The strong law does not hold in the following cases, but the weak law does.[25][26]
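Let X be an exponentially distributed random variable with parameter 1. The random variable $\sin(X)e^{X}X^{-1}$ has no expected value according to Lebesgue integration, but using conditional convergence and interpreting the integral as a Dirichlet integral, which is an improper Riemann integral, we can say:

$$\operatorname{E}\left(\frac{\sin(X)e^{X}}{X}\right) = \int_{0}^{\infty} \frac{\sin(x)e^{x}}{x}\, e^{-x}\,dx = \frac{\pi}{2}$$

Let X be a geometrically distributed random variable with probability 0.5. The random variable $2^{X}(-1)^{X}X^{-1}$ does not have an expected value in the conventional sense because the infinite series is not absolutely convergent, but using conditional convergence, we can say:

$$\operatorname{E}\left(\frac{2^{X}(-1)^{X}}{X}\right) = \sum_{x=1}^{\infty} \frac{2^{x}(-1)^{x}}{x}\, 2^{-x} = -\ln(2)$$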
Let Xk be plus or minus $\sqrt{k/\log\log\log k}$ (starting at sufficiently large k so that the denominator is positive) with probability 1⁄2 for each.[22] The variance of Xk is then $k/\log\log\log k$. Kolmogorov's strong law does not apply because the partial sum in his criterion up to k = n is asymptotic to $\log n/\log\log\log n$ and this is unbounded. If we replace the random variables with Gaussian variables having the same variances, namely $k/\log\log\log k$, then the average at any point will also be normally distributed. The width of the distribution of the average will tend toward zero (standard deviation asymptotic to $1/\sqrt{2\log\log\log n}$), but for a given ε, there is a probability, which does not go to zero with n, that the average sometime after the nth trial will come back up to ε. Since the width of the distribution of the average is not zero, it must have a positive lower bound p(ε), which means there is a probability of at least p(ε) that the average will attain ε after n trials. It will happen with probability p(ε)/2 before some m which depends on n. But even after m, there is still a probability of at least p(ε) that it will happen. (This seems to indicate that p(ε) = 1 and the average will attain ε an infinite number of times.)
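The geometric example above rests on a series that converges conditionally but not absolutely. Assuming the random variable in that example is $2^{X}(-1)^{X}/X$ with $\Pr(X = x) = 2^{-x}$, the expectation reduces to the alternating series $\sum_{x \ge 1} (-1)^x / x$. The sketch below, with a hypothetical helper name, compares partial sums of this series with partial sums of the absolute values of its terms.

```python
import math


def partial_sums(n_terms):
    """Return (signed, absolute) partial sums of the series sum_{x >= 1} (-1)^x / x."""
    signed = 0.0
    absolute = 0.0
    for x in range(1, n_terms + 1):
        term = (-1) ** x / x
        signed += term          # conditionally convergent series
        absolute += abs(term)   # harmonic series, which diverges
    return signed, absolute


if __name__ == "__main__":
    print(f"-ln(2) = {-math.log(2):.6f}")
    for n in (10, 1_000, 100_000):
        signed, absolute = partial_sums(n)
        print(f"n = {n:>6}: signed sum = {signed:.6f}, sum of |terms| = {absolute:.3f}")
```

The signed partial sums approach −ln(2), while the sums of absolute values keep growing (roughly like ln n), which is why the expectation does not exist in the Lebesgue sense.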
There are extensions of the law of large numbers to collections of estimators, where the convergence is uniform over the collection; thus the name uniform law of large numbers.
Suppose f(x,θ) is some function defined for θ ∈ Θ, and continuous in θ. Then for any fixed θ, the sequence {f(X1,θ), f(X2,θ), ...} will be a sequence of independent and identically distributed random variables, such that the sample mean of this sequence converges in probability to E[f(X,θ)]. This is the pointwise (in θ) convergence.
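A particular example of a uniform law of large numbers states the conditions under which the convergence happens uniformly in θ. If[29][30]
Θ is compact,
f(x,θ) is continuous at each θ ∈ Θ for almost all x, and is a measurable function of x at each θ, and
there exists a dominating function d(x) such that E[d(X)] < ∞ and ‖f(x,θ)‖ ≤ d(x) for all θ ∈ Θ,
then E[f(X,θ)] is continuous in θ, and

$$\sup_{\theta \in \Theta} \left\| \frac{1}{n} \sum_{i=1}^{n} f(X_i, \theta) - \operatorname{E}[f(X, \theta)] \right\| \xrightarrow{P} 0.$$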
Borel's law of large numbers, named after Émile Borel, states that if an experiment is repeated a large number of times, independently under identical conditions, then the proportion of times that any specified event is expected to occur approximately equals the probability of the event's occurrence on any particular trial; the larger the number of repetitions, the better the approximation tends to be. More precisely, if E denotes the event in question, p its probability of occurrence, and Nn(E) the number of times E occurs in the first n trials, then with probability one,[31]

$$\frac{N_n(E)}{n} \to p \quad \text{as } n \to \infty.$$
This theorem makes rigorous the intuitive notion of probability as the expected long-run relative frequency of an event's occurrence. It is a special case of any of several more general laws of large numbers in probability theory.
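This proof uses the assumption of finite variance $\operatorname{Var}(X_i) = \sigma^2$ (for all $i$). The independence of the random variables implies no correlation between them, and we have that

$$\operatorname{Var}(\overline{X}_n) = \operatorname{Var}\left(\tfrac{1}{n}(X_1 + \cdots + X_n)\right) = \frac{1}{n^2}\operatorname{Var}(X_1 + \cdots + X_n) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$

The common mean μ of the sequence is the mean of the sample average:

$$\operatorname{E}(\overline{X}_n) = \mu.$$

Using Chebyshev's inequality on $\overline{X}_n$ results in

$$\Pr\left(\left|\overline{X}_n - \mu\right| \geq \varepsilon\right) \leq \frac{\sigma^2}{n\varepsilon^2},$$

and the right-hand side goes to zero as n approaches infinity for any fixed ε > 0, so the sample mean converges in probability to μ.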
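Chebyshev's inequality bounds the probability that the sample average misses μ by at least ε by σ²/(nε²). As a hedged illustration, the sketch below compares that bound with the exact probability for n fair coin flips (μ = 1/2, σ² = 1/4), computed from binomial coefficients. The function names and the choice ε = 1/10 are assumptions made for the example.

```python
from fractions import Fraction
from math import comb


def exact_tail_probability(n, epsilon):
    """Exact P(|heads / n - 1/2| >= epsilon) for n fair coin flips."""
    favourable = sum(
        comb(n, k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - Fraction(1, 2)) >= epsilon
    )
    return Fraction(favourable, 2 ** n)


def chebyshev_bound(n, epsilon, variance=Fraction(1, 4)):
    """Chebyshev's upper bound sigma^2 / (n * epsilon^2) on the same probability."""
    return variance / (n * epsilon ** 2)


if __name__ == "__main__":
    eps = Fraction(1, 10)
    for n in (10, 100, 1_000):
        exact = float(exact_tail_probability(n, eps))
        bound = float(chebyshev_bound(n, eps))
        print(f"n = {n:>5}: exact = {exact:.3e}, Chebyshev bound = {bound:.3e}")
```

The bound is loose (for small n it can exceed 1), but it goes to zero as n grows, which is all the proof needs.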
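In the proof using characteristic functions, the characteristic function of the sample mean converges pointwise to $e^{it\mu}$, the characteristic function of the constant μ, so by Lévy's continuity theorem $\overline{X}_n$ converges in distribution to μ. Since μ is a constant, convergence in distribution to μ and convergence in probability to μ are equivalent (see Convergence of random variables). Therefore,

$$\overline{X}_n \xrightarrow{P} \mu \quad \text{as } n \to \infty.$$

This shows that the sample mean converges in probability to the derivative of the characteristic function at the origin (up to the factor $i$, since $\varphi_X'(0) = i\mu$), as long as the latter exists.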
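We give a relatively simple proof of the strong law under the assumptions that the $X_i$ are iid, $\operatorname{E}[X_i] =: \mu < \infty$, $\operatorname{Var}(X_i) = \sigma^2 < \infty$, and $\operatorname{E}[X_i^4] =: \tau < \infty$.
Let us first note that without loss of generality we can assume that $\mu = 0$ by centering. In this case, the strong law says that

$$\Pr\left(\lim_{n \to \infty} \overline{X}_n = 0\right) = 1,$$

or, writing $S_n = X_1 + \cdots + X_n$,

$$\Pr\left(\omega : \lim_{n \to \infty} \frac{S_n(\omega)}{n} = 0\right) = 1.$$

It is equivalent to show that

$$\Pr\left(\omega : \lim_{n \to \infty} \frac{S_n(\omega)}{n} \neq 0\right) = 0.$$

Note that

$$\lim_{n \to \infty} \frac{S_n(\omega)}{n} \neq 0 \iff \exists\, \varepsilon > 0,\ \left|\frac{S_n(\omega)}{n}\right| \geq \varepsilon \ \text{infinitely often},$$

and thus to prove the strong law we need to show that for every $\varepsilon > 0$, we have

$$\Pr\left(\omega : |S_n(\omega)| \geq n\varepsilon \ \text{infinitely often}\right) = 0.$$

Define the events $A_n = \{\omega : |S_n| \geq n\varepsilon\}$, and if we can show that

$$\sum_{n=1}^{\infty} \Pr(A_n) < \infty,$$

then the Borel-Cantelli Lemma implies the result. So let us estimate $\Pr(A_n)$.
We compute

$$\operatorname{E}[S_n^4] = \operatorname{E}\left[\left(\sum_{i=1}^{n} X_i\right)^4\right] = \operatorname{E}\left[\sum_{1 \leq i,j,k,l \leq n} X_i X_j X_k X_l\right].$$

We first claim that every term of the form $X_i^3 X_j$, $X_i^2 X_j X_k$, $X_i X_j X_k X_l$, where all subscripts are distinct, must have zero expectation. This is because $\operatorname{E}[X_i^3 X_j] = \operatorname{E}[X_i^3]\operatorname{E}[X_j]$ by independence, and the last factor is zero, and similarly for the other terms. Therefore the only terms in the sum with nonzero expectation are $\operatorname{E}[X_i^4]$ and $\operatorname{E}[X_i^2 X_j^2]$. Since the $X_i$ are identically distributed, all of these are the same, and moreover $\operatorname{E}[X_i^2 X_j^2] = (\operatorname{E}[X_i^2])^2$.
There are $n$ terms of the form $\operatorname{E}[X_i^4]$ and $3n(n-1)$ terms of the form $(\operatorname{E}[X_i^2])^2$, and so

$$\operatorname{E}[S_n^4] = n\tau + 3n(n-1)\sigma^4.$$

Note that the right-hand side is a quadratic polynomial in $n$, and as such there exists a $C > 0$ such that $\operatorname{E}[S_n^4] \leq Cn^2$ for $n$ sufficiently large. By Markov,

$$\Pr(|S_n| \geq n\varepsilon) \leq \frac{1}{(n\varepsilon)^4} \operatorname{E}[S_n^4] \leq \frac{C}{\varepsilon^4 n^2}$$

for $n$ sufficiently large, and therefore this series is summable. Since this holds for any $\varepsilon > 0$, we have established the strong law of large numbers.[32] The proof can be strengthened immensely by dropping all finiteness assumptions on the second and fourth moments. It can also be extended, for example, to discuss partial sums of distributions without any finite moments. Such proofs use more intricate arguments to prove the same Borel-Cantelli predicate, a strategy attributed to Kolmogorov to conceptually bring the limit inside the probability parentheses.[33]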
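The fourth-moment count used in this proof, $\operatorname{E}[S_n^4] = n\tau + 3n(n-1)\sigma^4$ for centered iid summands, can be checked exactly in a simple special case. The sketch below assumes each $X_i$ is +1 or −1 with probability 1⁄2 (so μ = 0, σ² = 1 and τ = 1) and enumerates all sign patterns for small n. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def fourth_moment_of_sum(n):
    """Exact E[S_n^4] for S_n = X_1 + ... + X_n with each X_i = +1 or -1, probability 1/2 each."""
    total = sum(sum(signs) ** 4 for signs in product((-1, 1), repeat=n))
    return Fraction(total, 2 ** n)  # all 2**n sign patterns are equally likely


def predicted_fourth_moment(n, tau=1, sigma_squared=1):
    """The value n*tau + 3*n*(n - 1)*sigma^4 from the counting argument."""
    return n * tau + 3 * n * (n - 1) * sigma_squared ** 2


if __name__ == "__main__":
    for n in range(1, 9):
        print(f"n = {n}: enumerated = {fourth_moment_of_sum(n)}, "
              f"formula = {predicted_fourth_moment(n)}")
```

In this special case the formula gives 3n² − 2n, a quadratic in n, so Markov's inequality applied to the fourth power yields a bound of order 1/n², which is summable.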
The law of large numbers provides not only the expectation of an unknown distribution from a realization of the sequence, but also any feature of the probability distribution.[1] By applying Borel's law of large numbers, one could easily obtain the probability mass function. For each event in the objective probability mass function, one could approximate the probability of the event's occurrence with the proportion of times that the event occurs. The larger the number of repetitions, the better the approximation. As for the continuous case, take $C = (a - h, a + h]$ for small positive h. Thus, for large n:

$$\frac{N_n(C)}{n} \approx p = P(X \in C) = \int_{a-h}^{a+h} f(x)\,dx \approx 2hf(a)$$

With this method, one can cover the whole x-axis with a grid (with grid size 2h) and obtain a bar graph which is called a histogram.
One application of the law of large numbers is an important method of approximation known as the Monte Carlo method,[3] which uses a random sampling of numbers to approximate numerical results. The algorithm to compute an integral of f(x) on an interval [a, b] is as follows:[3]
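The histogram idea can be sketched as follows: the fraction of samples landing in a small interval (a − h, a + h], divided by the interval width 2h, approximates the density at a. The example assumes standard normal samples purely for illustration; the function names, the sample size and the value h = 0.1 are hypothetical choices.

```python
import math
import random


def estimate_density(samples, a, h):
    """Estimate f(a) as N_n(C) / (2 * h * n), where C = (a - h, a + h]."""
    count = sum(1 for x in samples if a - h < x <= a + h)
    return count / (2 * h * len(samples))


def standard_normal_pdf(x):
    """Density of the standard normal distribution, used here only for comparison."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [rng.gauss(0, 1) for _ in range(200_000)]
    for a in (-1.0, 0.0, 1.0):
        print(f"a = {a:+.1f}: estimate = {estimate_density(samples, a, 0.1):.4f}, "
              f"true density = {standard_normal_pdf(a):.4f}")
```

Smaller h reduces the bias of the approximation but leaves fewer samples per bin, so more samples are needed for the same accuracy.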
Simulate uniform random variables X1, X2, …, Xn. This can be done with software, or with a random number table that gives independent and identically distributed (i.i.d.) random variables U1, U2, …, Un on [0, 1]. Then let Xi = a + (b - a) Ui for i = 1, 2, …, n. Then X1, X2, …, Xn are independent and identically distributed uniform random variables on [a, b].
Evaluate f(X1), f(X2), …, f(Xn).
Take the average of f(X1), f(X2), …, f(Xn) by computing $(b-a)\frac{f(X_1) + f(X_2) + \cdots + f(X_n)}{n}$, and then by the strong law of large numbers this converges to $(b-a)\operatorname{E}(f(X_1)) = (b-a)\int_a^b f(x)\frac{1}{b-a}\,dx = \int_a^b f(x)\,dx$.
We can find the integral of on [-1, 2]. Using traditional methods to compute this integral is very difficult, so the Monte Carlo method can be used here.[3] Using the above algorithm, we get
when n = 25
and
when n = 250.
We observe that as n increases, the numerical value also increases. When we get the actual results for the integral we get
.
When the LLN was used, the approximation of the integral was closer to its true value, and thus more accurate.[3]
Another example is the integration of $f(x) = \frac{e^x - 1}{e - 1}$ over [0, 1].[34] Using the Monte Carlo method and the LLN, we can see that as the number of samples increases, the numerical value gets ever closer to 0.4180233, the exact value of the integral, $(e - 2)/(e - 1)$.[34]
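The three-step Monte Carlo algorithm above can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, and the demonstration assumes the integrand $f(x) = (e^x - 1)/(e - 1)$ on [0, 1], whose exact integral $(e - 2)/(e - 1) \approx 0.4180233$ agrees with the value quoted above.

```python
import math
import random


def monte_carlo_integral(f, a, b, n, seed=0):
    """Approximate the integral of f over [a, b] from n uniform random samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u = rng.random()       # step 1: U_i uniform on [0, 1)
        x = a + (b - a) * u    #         X_i = a + (b - a) U_i
        total += f(x)          # step 2: evaluate f(X_i)
    return (b - a) * total / n  # step 3: (b - a) times the average


def integrand(x):
    """Assumed example integrand (e^x - 1) / (e - 1)."""
    return (math.exp(x) - 1) / (math.e - 1)


if __name__ == "__main__":
    exact = (math.e - 2) / (math.e - 1)
    print(f"exact value = {exact:.7f}")
    for n in (25, 250, 25_000, 2_500_000):
        estimate = monte_carlo_integral(integrand, 0.0, 1.0, n)
        print(f"n = {n:>9}: estimate = {estimate:.7f}, error = {estimate - exact:+.7f}")
```

The error does not shrink monotonically in n; the law of large numbers only says that it tends to zero, typically at a rate of about 1/√n.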
Kroese, Dirk P.; Brereton, Tim; Taimre, Thomas; Botev, Zdravko I. (2014). "Why the Monte Carlo method is so important today". Wiley Interdisciplinary Reviews: Computational Statistics. 6 (6): 386–392. doi:10.1002/wics.1314. hdl:1959.4/unsworks_43203. S2CID 18521840.
Dekking, Michel, ed. (2005). A modern introduction to probability and statistics: understanding why and how. Springer texts in statistics. London [Heidelberg]: Springer. p. 187. ISBN 978-1-85233-896-1.
Mlodinow, L. (2008). The Drunkard's Walk. New York: Random House. p. 50.
Bernoulli, Jakob (1713). "4". Ars Conjectandi: Usum & Applicationem Praecedentis Doctrinae in Civilibus, Moralibus & Oeconomicis (in Latin). Translated by Sheynin, Oscar.
Poisson names the "law of large numbers" (la loi des grands nombres) in: Poisson, S. D. (1837). Probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilitiés (in French). Paris, France: Bachelier. p. 7. He attempts a two-part proof of the law on pp. 139–143 and pp. 277 ff.
Hacking, Ian (1983). "19th-century Cracks in the Concept of Determinism". Journal of the History of Ideas. 44 (3): 455–475. doi:10.2307/2709176. JSTOR 2709176.
Bhattacharya, Rabi; Lin, Lizhen; Patrangenaru, Victor (2016). A Course in Mathematical Statistics and Large Sample Theory. Springer Texts in Statistics. New York, NY: Springer New York. doi:10.1007/978-1-4939-4032-5. ISBN 978-1-4939-4030-1.
Grimmett, G. R.; Stirzaker, D. R. (1992). Probability and Random Processes (2nd ed.). Oxford: Clarendon Press. ISBN 0-19-853665-8.
Durrett, Richard (1995). Probability: Theory and Examples (2nd ed.). Duxbury Press.
Jacobsen, Martin (1992). Videregående Sandsynlighedsregning [Advanced Probability Theory] (in Danish) (3rd ed.). Copenhagen: HCØ-tryk. ISBN 87-91180-71-6.
Loève, Michel (1977). Probability theory 1 (4th ed.). Springer.
Newey, Whitney K.; McFadden, Daniel (1994). "36". Large sample estimation and hypothesis testing. Handbook of econometrics. Vol. IV. Elsevier Science. pp. 2111–2245.
Ross, Sheldon (2009). A first course in probability (8th ed.). Prentice Hall. ISBN 978-0-13-603313-4.
Sen, P. K.; Singer, J. M. (1993). Large sample methods in statistics. Chapman & Hall.
Apple CEO Tim Cook said something that would make statisticians cringe: "We don't believe in such laws as laws of large numbers. This is sort of, uh, old dogma, I think, that was cooked up by somebody [..]". However, as Business Insider explained, "the law of large numbers has nothing to do with large companies, large revenues, or large growth rates. The law of large numbers is a fundamental concept in probability theory and statistics, tying together theoretical probabilities that we can calculate to the actual outcomes of experiments that we empirically perform."