In probability and statistics, a Bernoulli process (named after Jacob Bernoulli) is a finite or infinite sequence of binary random variables, so it is a discrete-time stochastic process that takes only two values, canonically 0 and 1. The component Bernoulli variables $X_i$ are identically distributed and independent. Prosaically, a Bernoulli process is repeated coin flipping, possibly with an unfair coin (but with consistent unfairness). Every variable $X_i$ in the sequence is associated with a Bernoulli trial or experiment. They all have the same Bernoulli distribution. Much of what can be said about the Bernoulli process can also be generalized to more than two outcomes (such as the process for a six-sided die); this generalization is known as the Bernoulli scheme.
The problem of determining the process, given only a limited sample of Bernoulli trials, may be called the problem of checking whether a coin is fair.
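A Bernoulli process is a finite or infinite sequence of independent random variables $X_1, X_2, X_3, \ldots$, such that, for each $i$, the value of $X_i$ is either 0 or 1, and the probability $p$ that $X_i = 1$ is the same for all values of $i$.
In other words, a Bernoulli process is a sequence of independent identically distributed Bernoulli trials.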
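As a minimal sketch of this definition, a finite realization can be simulated with Python's standard `random` module. The function name `bernoulli_process` and its `seed` parameter are illustrative choices, not part of any standard API.

```python
import random

def bernoulli_process(p, n, seed=None):
    """Return n independent Bernoulli(p) trials as a list of 0/1 values."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so the test succeeds with probability p.
    return [1 if rng.random() < p else 0 for _ in range(n)]

if __name__ == "__main__":
    print(bernoulli_process(p=0.3, n=20, seed=1))
```

Each trial uses the same `p` and a fresh draw, which mirrors the "identically distributed and independent" requirement.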
Independence of the trials implies that the process is memoryless: past event frequencies have no influence on future event probabilities. In most instances the true value of $p$ is unknown, so past frequencies are used to estimate $p$ by probabilistic inference, and through it the probabilities of future events.
If the process is infinite, then from any point the future trials constitute a Bernoulli process identical to the whole process, the fresh-start property.
The two possible values of each $X_i$ are often called "success" and "failure". Thus, when expressed as a number 0 or 1, the outcome may be called the number of successes on the $i$th "trial".
Two other common interpretations of the values are true or false and yes or no. Under any interpretation of the two values, the individual variables $X_i$ may be called Bernoulli trials with parameter $p$.
In many applications time passes between trials, as the index $i$ increases. In effect, the trials $X_1, X_2, \ldots, X_i, \ldots$ happen at "points in time" $1, 2, \ldots, i, \ldots$. That passage of time and the associated notions of "past" and "future" are not necessary, however. Most generally, any $X_i$ and $X_j$ in the process are simply two from a set of random variables indexed by $\{1, 2, \ldots, n\}$ in the finite case, or by $\{1, 2, 3, \ldots\}$ in the infinite case.
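One experiment with only two possible outcomes, often referred to as "success" and "failure", usually encoded as 1 and 0, can be modeled as a Bernoulli distribution.[1] Several random variables and probability distributions besides the Bernoulli may be derived from the Bernoulli process:

- The number of successes in the first $n$ trials, which has a binomial distribution B($n$, $p$)
- The number of failures needed to get $r$ successes, which has a negative binomial distribution NB($r$, $p$)
- The number of failures needed to get one success, which has a geometric distribution NB(1, $p$), a special case of the negative binomial distribution

The negative binomial variables may be interpreted as random waiting times.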
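The derived variables can be read directly off a realization. The following sketch assumes the realization is a Python list of 0/1 values; the helper names are hypothetical and chosen only for illustration.

```python
def successes_in_first_n(xs, n):
    """Binomial-type count: number of successes among the first n trials."""
    return sum(xs[:n])

def failures_before_rth_success(xs, r):
    """Negative-binomial-type waiting time: failures seen before the r-th success.

    Returns None if the finite realization contains fewer than r successes.
    """
    successes = 0
    failures = 0
    for x in xs:
        if x == 1:
            successes += 1
            if successes == r:
                return failures
        else:
            failures += 1
    return None

if __name__ == "__main__":
    xs = [0, 1, 0, 0, 1, 1]
    print(successes_in_first_n(xs, 4))         # 1
    print(failures_before_rth_success(xs, 2))  # 3
```

With `r = 1` the second function gives the geometric waiting time.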
The Bernoulli process can be formalized in the language of probability spaces as a random sequence of independent realisations of a random variable that can take values of heads or tails. The state space for an individual value is denoted by $2 = \{H, T\}$.
Consider the countably infinite direct product of copies of $2 = \{H, T\}$. It is common to examine either the one-sided set $\Omega = 2^{\mathbb{N}} = \{H, T\}^{\mathbb{N}}$ or the two-sided set $\Omega = 2^{\mathbb{Z}}$. There is a natural topology on this space, called the product topology. The sets in this topology are finite sequences of coin flips, that is, finite-length strings of H and T (H stands for heads and T stands for tails), with the rest of the (infinitely long) sequence taken as "don't care". These sets of finite sequences are referred to as cylinder sets in the product topology. The set of all such strings generates a sigma algebra, specifically, a Borel algebra. This algebra is then commonly written as $(\Omega, \mathcal{B})$, where $\mathcal{B}$ is the sigma algebra generated by the finite-length sequences of coin flips (the cylinder sets).
If the chances of flipping heads or tails are given by the probabilities $\{p, 1-p\}$, then one can define a natural measure on the product space, given by $P = \{p, 1-p\}^{\mathbb{N}}$ (or by $P = \{p, 1-p\}^{\mathbb{Z}}$ for the two-sided process). In other words, if a discrete random variable $X$ has a Bernoulli distribution with parameter $p$, where $0 \le p \le 1$, then its probability mass function is given by
$$p_X(1) = P(X = 1) = p \quad \text{and} \quad p_X(0) = P(X = 0) = 1 - p.$$
We denote this distribution by Ber($p$).[1]
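Given a cylinder set, that is, a specific sequence of coin flip results $[\omega_1, \omega_2, \ldots, \omega_n]$ at times $1, 2, \ldots, n$, the probability of observing this particular sequence is given by
$$P([\omega_1, \omega_2, \ldots, \omega_n]) = p^k (1-p)^{n-k},$$
where $k$ is the number of times that H appears in the sequence, and $n-k$ is the number of times that T appears in the sequence. There are several different kinds of notations for the above; a common one is to write
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = p^k (1-p)^{n-k},$$
where each $X_i$ is a binary-valued random variable with $x_i = [\omega_i = H]$ in Iverson bracket notation, meaning either $x_i = 1$ if $\omega_i = H$ or $x_i = 0$ if $\omega_i = T$. This probability $P$ is commonly called the Bernoulli measure.[2]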
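The Bernoulli measure of a cylinder set depends only on the counts of H and T. A small sketch, assuming the cylinder set is given as a string (or tuple) of `"H"`/`"T"` symbols and using exact rational arithmetic; `cylinder_probability` is an illustrative name.

```python
from fractions import Fraction
from itertools import product

def cylinder_probability(outcomes, p):
    """Probability p^k (1-p)^(n-k) of one specific finite sequence of flips."""
    n = len(outcomes)
    k = sum(1 for w in outcomes if w == "H")  # number of heads
    return p**k * (1 - p) ** (n - k)

if __name__ == "__main__":
    p = Fraction(1, 3)
    print(cylinder_probability("HTH", p))  # 2/27
    # The cylinder sets of a fixed length partition the space, so they sum to 1.
    print(sum(cylinder_probability(s, p) for s in product("HT", repeat=3)))  # 1
```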
Note that, for $0 < p < 1$, the probability of any specific, infinitely long sequence of coin flips is exactly zero; this is because $\lim_{n \to \infty} p^n = 0$ for any $0 \le p < 1$, and likewise for $1 - p$. That is, any given infinite sequence has measure zero. Nevertheless, one can still say that some classes of infinite sequences of coin flips are far more likely than others; this is given by the asymptotic equipartition property.
To conclude the formal definition, a Bernoulli process is then given by the probability triple $(\Omega, \mathcal{B}, P)$, as defined above.
Let us assume the canonical process with $H$ represented by $1$ and $T$ represented by $0$. The law of large numbers states that the average of the sequence, i.e., $\bar{X}_n := \frac{1}{n} \sum_{i=1}^{n} X_i$, will approach the expected value almost certainly, that is, the events which do not satisfy this limit have zero probability. The expectation value of flipping heads, assumed to be represented by 1, is given by $p$. In fact, one has
$$\mathbb{E}[X_i] = P([X_i = 1]) = p$$
for any given random variable $X_i$ out of the infinite sequence of Bernoulli trials that compose the Bernoulli process.
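The law of large numbers can be observed numerically. The sketch below is a simulation, not a proof: it assumes a pseudo-random generator stands in for the process, and `sample_mean` is a hypothetical helper name.

```python
import random

def sample_mean(p, n, seed=0):
    """Average of n simulated Bernoulli(p) trials (heads = 1, tails = 0)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        total += 1 if rng.random() < p else 0
    return total / n

if __name__ == "__main__":
    # The printed averages should drift toward p = 0.3 as n grows.
    for n in (100, 10_000, 100_000):
        print(n, sample_mean(0.3, n))
```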
One is often interested in knowing how often one will observe H in a sequence of $n$ coin flips. This is given by simply counting: given $n$ successive coin flips, that is, given the set of all possible strings of length $n$, the number $N(k,n)$ of such strings that contain $k$ occurrences of H is given by the binomial coefficient
$$N(k,n) = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$
If the probability of flipping heads is given by $p$, then the total probability of seeing a string of length $n$ with $k$ heads is
$$P(k,n) = P([S_n = k]) = \binom{n}{k} p^k (1-p)^{n-k},$$
where $S_n = \sum_{i=1}^{n} X_i$. The probability measure thus defined is known as the binomial distribution.
As the formula above shows, for $n = 1$ the binomial distribution reduces to a Bernoulli distribution, so the Bernoulli distribution is exactly the special case of the binomial distribution with $n = 1$.
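The counting argument and the $n = 1$ special case can be checked by brute force for small $n$. This is a sketch using exact fractions; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

def count_strings(k, n):
    """Count length-n strings over {H, T} with exactly k heads by enumeration."""
    return sum(1 for s in product("HT", repeat=n) if s.count("H") == k)

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n flips with heads probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

if __name__ == "__main__":
    p = Fraction(1, 3)
    print(count_strings(2, 5), comb(5, 2))                 # 10 10
    print(sum(binomial_pmf(k, 5, p) for k in range(6)))    # 1
    print(binomial_pmf(1, 1, p), binomial_pmf(0, 1, p))    # 1/3 2/3
```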
Of particular interest is the question of the value of $P(k,n)$ for sufficiently long sequences of coin flips, that is, for the limit $n \to \infty$. In this case, one may make use of Stirling's approximation to the factorial, and write
$$n! = \sqrt{2\pi n}\; n^n e^{-n} \left(1 + \mathcal{O}\!\left(\frac{1}{n}\right)\right).$$
Inserting this into the expression for $P(k,n)$, one obtains the normal distribution; this is the content of the central limit theorem, and this is the simplest example thereof.
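A numerical comparison illustrates the approximation. The sketch below assumes the usual matching of mean $np$ and variance $np(1-p)$, and evaluates the binomial probability through `math.lgamma` to avoid overflow for large $n$; the helper names are illustrative.

```python
from math import exp, lgamma, log, pi, sqrt

def binomial_pmf(k, n, p):
    """Binomial probability computed in log space (requires 0 < p < 1)."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

def normal_density(k, n, p):
    """Gaussian density with mean n*p and variance n*p*(1-p), evaluated at k."""
    mean = n * p
    var = n * p * (1 - p)
    return exp(-((k - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)

if __name__ == "__main__":
    n, p = 1000, 0.3
    for k in (280, 300, 320):
        print(k, binomial_pmf(k, n, p), normal_density(k, n, p))
```

Near the peak the two columns agree closely; the agreement degrades in the tails, where the Gaussian is only a leading-order approximation.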
The combination of the law of large numbers, together with the central limit theorem, leads to an interesting and perhaps surprising result: the asymptotic equipartition property. Put informally, one notes that, yes, over many coin flips, one will observe H exactly a fraction $p$ of the time, and that this corresponds exactly with the peak of the Gaussian. The asymptotic equipartition property essentially states that this peak is infinitely sharp, with infinite fall-off on either side. That is, given the set of all possible infinitely long strings of H and T occurring in the Bernoulli process, this set is partitioned into two: those strings that occur with probability 1, and those that occur with probability 0. This partitioning is known as the Kolmogorov 0-1 law.
The size of this set is interesting, also, and can be explicitly determined: the logarithm of it is exactly the entropy of the Bernoulli process. Once again, consider the set of all strings of length $n$. The size of this set is $2^n$. Of these, only a certain subset are likely; the size of this set is $2^{nH}$ for $H \le 1$. By using Stirling's approximation, putting it into the expression for $P(k,n)$, solving for the location and width of the peak, and finally taking $n \to \infty$, one finds that
$$H = -p \log_2 p - (1-p) \log_2 (1-p).$$
This value is the Bernoulli entropy of a Bernoulli process. Here, H stands for entropy; not to be confused with the same symbol H standing for heads.
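The relation between the entropy and the size of the likely set can be illustrated numerically. As a simplifying assumption, the sketch counts only the strings with exactly `round(n * p)` heads, which captures the leading exponential growth rate; the function names are illustrative.

```python
from math import comb, log2

def bernoulli_entropy(p):
    """Entropy in bits per trial: -p log2 p - (1-p) log2 (1-p)."""
    if p in (0.0, 1.0):
        return 0.0  # limit value, since x log x -> 0 as x -> 0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def typical_growth_rate(n, p):
    """(1/n) log2 of the number of length-n strings with round(n*p) heads."""
    k = round(n * p)
    return log2(comb(n, k)) / n

if __name__ == "__main__":
    p = 0.3
    print(bernoulli_entropy(p))
    for n in (100, 1000, 10_000):
        print(n, typical_growth_rate(n, p))  # approaches the entropy from below
```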
John von Neumann posed a question about the Bernoulli process regarding the possibility of a given process being isomorphic to another, in the sense of the isomorphism of dynamical systems. The question long defied analysis, but was finally and completely answered with the Ornstein isomorphism theorem. This breakthrough resulted in the understanding that the Bernoulli process is unique and universal; in a certain sense, it is the single most random process possible; nothing is 'more' random than the Bernoulli process (although one must be careful with this informal statement; certainly, systems that are mixing are, in a certain sense, "stronger" than the Bernoulli process, which is merely ergodic but not mixing. However, such processes do not consist of independent random variables: indeed, many purely deterministic, non-random systems can be mixing).
The Bernoulli process can also be understood to be a dynamical system, as an example of an ergodic system and specifically, a measure-preserving dynamical system, in one of several different ways. One way is as a shift space, and the other is as an odometer. These are reviewed below.
One way to create a dynamical system out of the Bernoulli process is as a shift space. There is a natural translation symmetry on the product space $\Omega = 2^{\mathbb{N}}$ given by the shift operator
$$T(X_0, X_1, X_2, \cdots) = (X_1, X_2, \cdots).$$
The Bernoulli measure, defined above, is translation-invariant; that is, given any cylinder set $\sigma \in \mathcal{B}$, one has
$$P(T^{-1}(\sigma)) = P(\sigma),$$
and thus the Bernoulli measure is a Haar measure; it is an invariant measure on the product space.
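Shift invariance can be checked on cylinder sets. The sketch below assumes the one-sided process and represents a cylinder set as a dictionary from positions (0, 1, 2, ...) to fixed symbols, with unlisted positions free; this representation and the function names are illustrative choices.

```python
from fractions import Fraction

def measure(cylinder, p):
    """Bernoulli measure of a cylinder set {position: 'H' or 'T'}."""
    prob = Fraction(1)
    for symbol in cylinder.values():
        prob *= p if symbol == "H" else 1 - p
    return prob

def preimage_under_shift(cylinder):
    """T^{-1}(sigma): the same constraints moved one step right, position 0 free."""
    return {pos + 1: symbol for pos, symbol in cylinder.items()}

def preimage_measure_by_enumeration(cylinder, p):
    """Measure of T^{-1}(sigma) as a sum over both values of the freed position 0."""
    shifted = preimage_under_shift(cylinder)
    return sum(measure({0: first, **shifted}, p) for first in "HT")

if __name__ == "__main__":
    sigma = {0: "H", 2: "T"}
    p = Fraction(1, 3)
    print(measure(sigma, p))                          # 2/9
    print(preimage_measure_by_enumeration(sigma, p))  # 2/9
```

The two values coincide because the freed first coordinate contributes a factor $p + (1 - p) = 1$.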
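Instead of the probability measure $P : \mathcal{B} \to \mathbb{R}$, consider some arbitrary function $f : \mathcal{B} \to \mathbb{R}$. The pushforward
$$f \circ T^{-1}$$
defined by $(f \circ T^{-1})(\sigma) = f(T^{-1}(\sigma))$ is again some function $\mathcal{B} \to \mathbb{R}$. Thus, the map $T$ induces another map $\mathcal{L}_T$ on the space of all functions $\mathcal{B} \to \mathbb{R}$. That is, given some $f : \mathcal{B} \to \mathbb{R}$, one defines
$$\mathcal{L}_T f = f \circ T^{-1}.$$
The map $\mathcal{L}_T$ is a linear operator, as (obviously) one has $\mathcal{L}_T(f + g) = \mathcal{L}_T(f) + \mathcal{L}_T(g)$ and $\mathcal{L}_T(af) = a \mathcal{L}_T(f)$ for functions $f, g$ and constant $a$. This linear operator is called the transfer operator or the Ruelle–Frobenius–Perron operator. This operator has a spectrum, that is, a collection of eigenfunctions and corresponding eigenvalues. The largest eigenvalue is the Frobenius–Perron eigenvalue, and in this case, it is 1. The associated eigenvector is the invariant measure: in this case, it is the Bernoulli measure. That is, $\mathcal{L}_T(P) = P$.
If one restricts $\mathcal{L}_T$ to act on polynomials, then the eigenfunctions are (curiously) the Bernoulli polynomials![3][4] This coincidence of naming was presumably not known to Bernoulli.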

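The above can be made more precise. Given an infinite string of binary digits $b_0, b_1, \cdots$, write
$$y = \sum_{n=0}^{\infty} \frac{b_n}{2^{n+1}}.$$
The resulting $y$ is a real number in the unit interval $0 \le y \le 1$. The shift $T$ induces a homomorphism, also called $T$, on the unit interval. Since $T(b_0, b_1, b_2, \cdots) = (b_1, b_2, \cdots)$, one can see that $T(y) = 2y \bmod 1$. This map is called the dyadic transformation; for the doubly-infinite sequence of bits $\Omega = 2^{\mathbb{Z}}$, the induced homomorphism is the Baker's map.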
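The correspondence between the shift on bit strings and the dyadic transformation can be checked exactly on finite strings (treated as infinite strings padded with zeros, an assumption of this sketch). The helper names are illustrative.

```python
from fractions import Fraction

def bits_to_real(bits):
    """Map b_0, b_1, ... to the sum of b_n / 2^(n+1), using exact fractions."""
    return sum(Fraction(b, 2 ** (n + 1)) for n, b in enumerate(bits))

def shift(bits):
    """Shift operator: drop the first bit."""
    return bits[1:]

def dyadic(y):
    """Dyadic transformation y -> 2y mod 1."""
    return (2 * y) % 1

if __name__ == "__main__":
    bits = [1, 0, 1, 1]
    y = bits_to_real(bits)
    print(y)                          # 11/16
    print(bits_to_real(shift(bits)))  # 3/8
    print(dyadic(y))                  # 3/8
```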
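Consider now the space of functions in $y$. Given some $f(y)$, one can find that
$$[\mathcal{L}_T f](y) = \frac{1}{2} f\!\left(\frac{y}{2}\right) + \frac{1}{2} f\!\left(\frac{y+1}{2}\right).$$
Restricting the action of the operator $\mathcal{L}_T$ to polynomials, one finds that it has a discrete spectrum given by
$$\mathcal{L}_T B_n = 2^{-n} B_n,$$
where the $B_n$ are the Bernoulli polynomials. Indeed, the Bernoulli polynomials obey the identity
$$\frac{1}{2} B_n\!\left(\frac{y}{2}\right) + \frac{1}{2} B_n\!\left(\frac{y+1}{2}\right) = 2^{-n} B_n(y).$$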
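The eigenvalue relation can be verified exactly for the first few Bernoulli polynomials. In this sketch the operator is taken to act as $f(y) \mapsto \tfrac{1}{2} f(y/2) + \tfrac{1}{2} f((y+1)/2)$, the polynomials $B_0$ to $B_3$ are hard-coded from their standard closed forms, and the names are illustrative.

```python
from fractions import Fraction

# Standard closed forms of the first four Bernoulli polynomials.
BERNOULLI_POLYS = [
    lambda y: Fraction(1),
    lambda y: y - Fraction(1, 2),
    lambda y: y**2 - y + Fraction(1, 6),
    lambda y: y**3 - Fraction(3, 2) * y**2 + Fraction(1, 2) * y,
]

def transfer(f):
    """Transfer operator of the dyadic map applied to the function f."""
    return lambda y: (f(y / 2) + f((y + 1) / 2)) / 2

def is_eigenfunction(n, y):
    """Check L_T B_n = 2^(-n) B_n at the rational point y."""
    b_n = BERNOULLI_POLYS[n]
    return transfer(b_n)(y) == b_n(y) / 2**n

if __name__ == "__main__":
    points = [Fraction(a, 7) for a in range(8)]
    print(all(is_eigenfunction(n, y) for n in range(4) for y in points))  # True
```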
Note that the sum
$$y = \sum_{n=0}^{\infty} \frac{b_n}{3^{n+1}}$$
gives the Cantor function, as conventionally defined. This is one reason why the set $\{H, T\}^{\mathbb{N}}$ is sometimes called the Cantor set.
Another way to create a dynamical system is to define an odometer. Informally, this is exactly what it sounds like: just "add one" to the first position, and let the odometer "roll over" by using carry bits as the odometer rolls over. This is nothing more than base-two addition on the set of infinite strings. Since addition forms a group, and the Bernoulli process was already given a topology, above, this provides a simple example of a topological group.
In this case, the transformation $T$ is given by
$$T(1, \dots, 1, 0, X_{k+1}, X_{k+2}, \dots) = (0, \dots, 0, 1, X_{k+1}, X_{k+2}, \dots).$$
It leaves the Bernoulli measure invariant only for the special case of $p = 1/2$ (the "fair coin"); otherwise not. Thus, $T$ is a measure preserving dynamical system in this case; otherwise, it is merely a conservative system.
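The odometer step is binary "add one" with the carry propagating away from the first position. A minimal sketch on a finite prefix of the infinite string; the function name and the treatment of an all-ones prefix (the carry simply leaves the prefix) are assumptions of this illustration.

```python
def odometer_step(bits):
    """Add one at position 0 and carry to the right; returns a new list."""
    out = list(bits)
    for i, b in enumerate(out):
        if b == 0:
            out[i] = 1  # absorb the carry and stop
            return out
        out[i] = 0      # 1 + 1 = 0, carry moves to the next position
    return out          # all ones: the carry moves past the end of the prefix

if __name__ == "__main__":
    print(odometer_step([1, 1, 0, 1]))  # [0, 0, 1, 1]
    print(odometer_step([0, 1]))        # [1, 1]
```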
The term Bernoulli sequence is often used informally to refer to a realization of a Bernoulli process. However, the term has an entirely different formal definition as given below.
Suppose a Bernoulli process formally defined as a single random variable (see preceding section). For every infinite sequence $x$ of coin flips, there is a sequence of integers
$$\mathbb{Z}^x = \{n \in \mathbb{Z} : X_n(x) = 1\}$$
called the Bernoulli sequence[verification needed] associated with the Bernoulli process. For example, if $x$ represents a sequence of coin flips, then the associated Bernoulli sequence is the list of natural numbers or time-points for which the coin toss outcome is heads.
So defined, a Bernoulli sequence is also a random subset of the index set, the natural numbers.
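For a finite realization, the associated Bernoulli sequence is just the list of indices at which heads occurred. The sketch assumes heads is encoded as 1 and that trials are numbered from 1; both conventions and the function name are illustrative.

```python
def bernoulli_sequence(xs, start=1):
    """Return the time-points (counted from `start`) at which the outcome is 1."""
    return [i for i, x in enumerate(xs, start=start) if x == 1]

if __name__ == "__main__":
    print(bernoulli_sequence([0, 1, 1, 0, 1]))  # [2, 3, 5]
```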
Almost all Bernoulli sequences are ergodic sequences.[verification needed]
From any Bernoulli process one may derive a Bernoulli process with $p = 1/2$ by the von Neumann extractor, the earliest randomness extractor, which actually extracts uniform randomness.
Represent the observed process as a sequence of zeroes and ones, or bits, and group that input stream in non-overlapping pairs of successive bits, such as (11)(00)(10)... . Then for each pair, if the two bits are equal, discard the pair; if they are not equal, output the first bit.
This table summarizes the computation.
| input | output |
|---|---|
| 00 | discard |
| 01 | 0 |
| 10 | 1 |
| 11 | discard |
For example, an input stream of eight bits 10011011 would be grouped into pairs as (10)(01)(10)(11). Then, according to the table above, these pairs are translated into the output of the procedure: (1)(0)(1)() (=101).
In the output stream 0 and 1 are equally likely, as 10 and 01 are equally likely in the original, both having probability $p(1-p) = (1-p)p$. This extraction of uniform randomness does not require the input trials to be independent, only uncorrelated. More generally, it works for any exchangeable sequence of bits: all sequences that are finite rearrangements are equally likely.
The von Neumann extractor uses two input bits to produce either zero or one output bits, so the output is shorter than the input by a factor of at least 2. On average the computation discards a proportion $p^2 + (1-p)^2$ of the input pairs (00 and 11), which is near one when $p$ is near zero or one, and is minimized at 1/2 when $p = 1/2$ for the original process (in which case the output stream is 1/4 the length of the input stream on average).
Von Neumann (classical) main operation pseudocode:

    if (Bit1 ≠ Bit2) {
        output(Bit1)
    }

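The classical procedure can be sketched in Python as follows. The sketch assumes the input is a list of 0/1 integers and silently ignores a trailing unpaired bit; the function name is illustrative.

```python
def von_neumann_extract(bits):
    """Von Neumann extractor: keep the first bit of every unequal pair."""
    out = []
    # Walk over non-overlapping pairs (bits[0], bits[1]), (bits[2], bits[3]), ...
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            out.append(first)  # 10 -> 1, 01 -> 0
        # Equal pairs (00 and 11) are discarded.
    return out

if __name__ == "__main__":
    print(von_neumann_extract([1, 0, 0, 1, 1, 0, 1, 1]))  # [1, 0, 1]
```

The printed result matches the eight-bit example 10011011 worked through above.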
This decrease in efficiency, or waste of randomness present in the input stream, can be mitigated by iterating the algorithm over the input data. This way the output can be made to be "arbitrarily close to the entropy bound".[5]
The iterated version of the von Neumann algorithm, also known as advanced multi-level strategy (AMLS),[6] was introduced by Yuval Peres in 1992.[5] It works recursively, recycling "wasted randomness" from two sources: the sequence of discard-non-discard, and the values of discarded pairs (0 for 00, and 1 for 11). It relies on the fact that, given the sequence already generated, both of those sources are still exchangeable sequences of bits, and thus eligible for another round of extraction. While such generation of additional sequences can be iterated indefinitely to extract all available entropy, doing so requires an infinite amount of computational resources, so the number of iterations is typically fixed to a low value, either chosen in advance or calculated at runtime.
More concretely, on an input sequence, the algorithm consumes the input bits in pairs, generating output together with two new sequences (the AMLS paper notation is given in parentheses):
| input | output | new sequence 1(A) | new sequence 2(1) |
|---|---|---|---|
| 00 | none | 0 | 0 |
| 01 | 0 | 1 | none |
| 10 | 1 | 1 | none |
| 11 | none | 0 | 1 |
(If the length of the input is odd, the last bit is completely discarded.) Then the algorithm is applied recursively to each of the two new sequences, until the input is empty.
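The recursive description can be sketched in Python. This is an illustration under stated assumptions, not a reference implementation: it follows the pair table above, recurses separately on each of the two new sequences, appends their outputs in a fixed order (sequence 1, then sequence 2), and takes an optional hypothetical `max_depth` parameter to cap the number of levels. Because the ordering of the recycled sequences is a free choice, its output bits need not match other worked examples bit for bit.

```python
def peres_extract(bits, max_depth=None):
    """Iterated von Neumann extraction on a list of 0/1 integers."""
    if len(bits) < 2 or max_depth == 0:
        return []
    out, seq1, seq2 = [], [], []
    # A trailing unpaired bit is discarded by the range below.
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            out.append(first)   # classical von Neumann output
            seq1.append(1)      # pair was kept
        else:
            seq1.append(0)      # pair was discarded
            seq2.append(first)  # value of the discarded pair (0 for 00, 1 for 11)
    next_depth = None if max_depth is None else max_depth - 1
    # Recycle both derived sequences; each is at most half as long, so this ends.
    return out + peres_extract(seq1, next_depth) + peres_extract(seq2, next_depth)

if __name__ == "__main__":
    bits = [1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0]
    print(peres_extract(bits, max_depth=1))  # [1, 1, 1], same as classical
    print(peres_extract(bits))               # [1, 1, 1, 1, 1, 0, 1, 1]
```

With `max_depth=1` the function reduces to the classical von Neumann extractor; larger depths recover additional bits from the same input.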
Example: The input stream from the AMLS paper, 11001011101110 using 1 for H and 0 for T, is processed this way:
| step number | input | output | new sequence 1(A) | new sequence 2(1) |
|---|---|---|---|---|
| 0 | (11)(00)(10)(11)(10)(11)(10) | ()()(1)()(1)()(1) | (1)(1)(0)(1)(0)(1)(0) | (1)(0)()(1)()(1)() |
| 1 | (10)(11)(11)(01)(01)() | (1)()()(0)(0) | (0)(1)(1)(0)(0) | ()(1)(1)()() |
| 2 | (11)(01)(10)() | ()(0)(1) | (0)(1)(1) | (1)()() |
| 3 | (10)(11) | (1) | (1)(0) | ()(1) |
| 4 | (11)() | () | (0) | (1) |
| 5 | (10) | (1) | (1) | () |
| 6 | () | () | () | () |
Starting from step 1, the input is a concatenation of sequence 2 and sequence 1 from the previous step (the order is arbitrary but should be fixed). The final output is ()()(1)()(1)()(1)(1)()()(0)(0)()(0)(1)(1)()(1) (=1111000111), so from 14 bits of input 10 bits of output were generated, as opposed to 3 bits through the von Neumann algorithm alone. The constant output of exactly 2 bits per round per bit pair (compared with a variable none to 1 bit in classical VN) also allows for constant-time implementations which are resistant to timing attacks.
Von Neumann–Peres (iterated) main operation pseudocode:

    if (Bit1 ≠ Bit2) {
        output(1, Sequence1)
        output(Bit1)
    } else {
        output(0, Sequence1)
        output(Bit1, Sequence2)
    }

Another tweak was presented in 2016, based on the observation that the Sequence2 channel doesn't provide much throughput, and a hardware implementation with a finite number of levels can benefit from discarding it earlier in exchange for processing more levels of Sequence1.[7]