The Ergodic Hierarchy (EH) is a central part of ergodic theory. It is a hierarchy of properties that dynamical systems can possess. Its five levels are ergodicity, weak mixing, strong mixing, Kolmogorov, and Bernoulli. Although EH is a mathematical theory, its concepts have been widely used in the foundations of statistical physics, accounts of randomness, and discussions about the nature of chaos, as well as in other sciences such as economics. We introduce EH and discuss its applications.
The object of study in ergodic theory is a dynamical system. We first introduce some basic concepts with a simple example, from which we abstract the general definition of a dynamical system. For a brief history of the modern notion of a dynamical system and the associated concepts of EH see the Appendix, Section A.
A lead ball is hanging from the ceiling on a spring. We pull it down a bit and let it go. The ball begins to oscillate. The mechanical state of the ball is completely determined by a specification of the position \(x\) and the momentum \(p\) of its center of mass; that is, if we know \(x\) and \(p\), then we know all that there is to know about the mechanical state of the ball. If we now conjoin \(x\) and \(p\) in a vector space we obtain the so-called phase space \(X\) of the system (sometimes also referred to as ‘state space’).[1] This is illustrated in Figure 1 for the two-dimensional phase space of the state of the ball moving up and down (i.e., the phase space has one dimension for the ball’s position and one for its momentum).

Figure 1: The motion of a ball on a spring.
Each point of \(X\) represents a state of the ball (because it gives the ball’s position and momentum). Accordingly, the time evolution of the ball’s state is represented by a line in \(X\), a so-called phase space trajectory (from now on ‘trajectory’), showing where in phase space the system was at each instant of time. For instance, let us assume that at time \(t = 0\) the ball is located at point \(x_1\) and then moves to \(x_2\) where it arrives at time \(t = 5\). This motion is represented in \(X\) by the line segment connecting points \(\gamma_1\) and \(\gamma_2\). In other words, the motion of the ball is represented in \(X\) by the motion of a point representing the ball’s (instantaneous) state, and all the states that the ball is in over the course of a certain period of time jointly form a trajectory. The motion of this point has a name: it is the phase flow \(\phi_t\). The phase flow tells us where the ball is at some later time \(t\) if we specify where it is at \(t = 0\); or, metaphorically speaking, \(\phi_t\) drags the ball’s state around in \(X\) so that the movement of the state represents the motion of the real ball. In other words, \(\phi_t\) is a mathematical representation of the system’s time evolution. The state of the ball at time \(t = 0\) is commonly referred to as the initial condition. \(\phi_t\) then tells us, for every point in phase space, how this point evolves if it is chosen as an initial condition. In our concrete example, point \(\gamma_1\) is the initial condition and we have \(\gamma_2 = \phi_{t=5}(\gamma_1)\). More generally, let us call the ball’s initial condition \(\gamma_0\) and let \(\gamma(t)\) be its state at some later time \(t\). Then we have \(\gamma(t) = \phi_t(\gamma_0)\). This is illustrated in Figure 2a.

Figure 2: Evolution in Phase space.
Since \(\phi_t\) tells us for every point in \(X\) how it evolves in time, it also tells us how sets of points move around. For instance, choose an arbitrary set \(A\) in \(X\); then \(\phi_t(A)\) is the image of \(A\) after \(t\) time units under the dynamics of the system. This is illustrated in Figure 2b. Considering sets of points rather than single points is important when we think about physical applications of this mathematical formalism. We can never determine the exact initial condition of a system. No matter how precisely we measure \(\gamma_0\), there will always be some measurement error. So what we really want to know in practical applications is not how a precise mathematical point evolves, but rather how a set of points around the initial condition \(\gamma_0\) evolves. In our example with the ball the evolution is ‘tame’, in that the set keeps its original shape. As we will see below, this is not always the case.
An important feature of \(X\) is that it is endowed with a so-called measure \(\mu\). We are familiar with measures in many contexts: from a mathematical point of view, the length that we attribute to a part of a line, the surface we attribute to a part of a plane, and the volume we attribute to a segment of space are measures. A measure is simply a device to attribute a ‘size’ to a part of a space. Although \(X\) is an abstract mathematical space, the leading idea of a measure remains the same: it is a tool to quantify the size of a set. So we say that the set \(A\) has measure \(\mu(A)\) in much the same way as we say that a certain collection of points of ordinary space (for instance the ones that lie on the inside of a bottle) have a certain volume (for instance one litre).
From a more formal point of view, a measure assigns numbers to certain subsets of a set \(X\) (see Appendix B for a formal definition). This can be done in different ways and hence there are different measures. Consider the example of a plane. There is a measure that simply assigns to each appropriate region of a plane the area of that region. But now imagine that we pour a bucket of sugar on the plane. The sugar is not evenly distributed; there are little heaps in some places while there is almost no sugar in other places. A measure different from the area measure is one that assigns to a region a number that is equal to the amount of sugar on that region. One of these measures is particularly important, namely the so-called Lebesgue measure. This measure has an intuitive interpretation: it is just a precise formalisation of the measure we commonly use in geometry. The interval [0, 2] has Lebesgue measure 2 and the interval [3, 4] has Lebesgue measure 1. In two dimensions, a square whose sides have Lebesgue measure 2 has Lebesgue measure 4; etc. Although this sounds simple, the mathematical theory of measures is rather involved. We state the basics of measure theory in the Appendix, Section B, and avoid appeal to technical issues in measure theory in what follows.
The essential elements in the discussion so far were the phase space \(X\), the time evolution \(\phi_t\), and the measure \(\mu\). And these are also the ingredients for the definition of an abstract dynamical system. An abstract dynamical system is a triple \([X, \mu, T_t]\), where \(\{T_t \mid t \text{ is an instant of time}\}\) is a family of automorphisms, i.e., a family of transformations of \(X\) onto itself with the property that \(T_{t_1 + t_2}(x) = T_{t_1}(T_{t_2}(x))\) for all \(x \in X\) (Arnold and Avez 1968, 1); we say more about time below.[2] In the above example \(X\) is the phase space of the ball’s motion, \(\mu\) is the Lebesgue measure, and \(T_t\) is \(\phi_t\).
So far we have described \(T_t\) as giving the time evolution of a system. Now let us look at this from a more mathematical point of view: the effect of \(T_t\) is that it assigns to every point in \(X\) another point in \(X\) after \(t\) time units have elapsed. In the above example \(\gamma_1\) is mapped onto \(\gamma_2\) under \(\phi_t\) after \(t = 5\) seconds. Hence, from a mathematical point of view the time evolution of a system consists in a mapping of \(X\) onto itself, which is why the above definition takes \(T_t\) to be a family of mappings of \(X\) onto itself. Such a mapping is a prescription that tells you for every point \(x\) in \(X\) on which other point in \(X\) it is mapped (from now on we use \(x\) to denote any point in \(X\), and it no longer stands, as in the above example, for the position of the ball).
The systems studied in ergodic theory are forward deterministic. This means that if two identical copies of that system are in the same state at one instant of time, then they must be in the same state at all future instants of time. Intuitively speaking, this means that for any given time there is only one way in which the system can evolve forward. For a discussion of determinism see Earman (1986).
It should be pointed out that no particular interpretation is intended in an abstract dynamical system. We have motivated the definition with an example from mechanics, but dynamical systems are not tied to that context. They are mathematical objects in their own right, and as such they can be studied independently of particular applications. This makes them a versatile tool in many different domains. In fact, dynamical systems are used, among others, in fields as diverse as physics, biology, geology and economics.
There are many different kinds of dynamical systems. The three most important distinctions are the following.
Discrete versus continuous time. We may consider discrete instants of time or a continuum of instants of time. For ease of presentation, we shall say in the first case that time is discrete and in the second case that time is continuous. This is just a convenient terminology that has no implications for whether time is fundamentally discrete or continuous. In the above example with the ball time was continuous. But often it is convenient to regard time as discrete. If time is continuous, then \(t\) is a real number and the family of automorphisms is \(\{T_t \mid t \in \mathbb{R}\}\), where \(\mathbb{R}\) is the set of real numbers. If time is discrete, then \(t\) is in the set \(\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}\), and the family of automorphisms is \(\{T_t \mid t \in \mathbb{Z}\}\). In order to indicate that we are dealing with a discrete family rather than a continuous one we sometimes replace ‘\(T_t\)’ with ‘\(T_n\)’; this is just a notational convention of no conceptual importance.[3] In such systems the progression from one instant of time to the next is also referred to as a ‘step’. In population biology, for instance, we often want to know how a population grows over a typical breeding time (e.g. one year). In mathematical models of such a population the points in \(X\) represent the size of a population (rather than the position and the momentum of a ball, as in the above example), and the transformation \(T_n\) represents the growth of the population after \(n\) time units.
Discrete families of automorphisms have the interesting property that they are generated by one mapping. As we have seen above, all automorphisms satisfy \(T_{t_1 + t_2} = T_{t_1}(T_{t_2})\). From this it follows that \(T_n(x) = T^{n}_1(x)\), that is, \(T_n\) is the \(n\)-th iterate of \(T_1\). In this sense \(T_1\) generates \(\{T_t \mid t \in \mathbb{Z}\}\); or, in other words, \(\{T_t \mid t \in \mathbb{Z}\}\) can be ‘reduced’ to \(T_1\). For this reason one often drops the subscript ‘1’, simply calls the map ‘\(T\)’, and writes the dynamical system as the triple \([X, \mu, T]\), where it is understood that \(T = T_1\).
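This generating property can be sketched in a few lines of Python. The particular map used here, a rotation of the unit circle, is our illustrative choice and not part of the definition:

```python
def rotation(x, a=0.6180339887):
    """One time step T_1: rotate the point x on the circle [0, 1) by a."""
    return (x + a) % 1.0

def iterate(T, n, x):
    """Compute T_n(x) = T^n(x): apply the generating map T n times."""
    for _ in range(n):
        x = T(x)
    return x

# The group property T_{m+n}(x) = T_m(T_n(x)) falls out of iteration:
x0 = 0.25
assert iterate(rotation, 5, x0) == iterate(rotation, 2, iterate(rotation, 3, x0))
```

The assertion holds because applying \(T\) five times and applying it three times then two more times are literally the same sequence of operations.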
For ease of presentation we use discrete transformations from now on. The definitions and theorems we formulate below carry over to continuous transformations without further ado, and where this is not the case we explicitly say so and treat the two cases separately.
Measure preserving versus non-measure preserving transformations. Roughly speaking, a transformation is measure preserving if the size of a set (like set \(A\) in the above example) does not change over the course of time: a set can change its form but it cannot shrink or grow (with respect to the measure). Formally, \(T\) is a measure-preserving transformation on \(X\) if and only if (iff) for all sets \(A\) in \(X\): \(\mu(A) = \mu(T^{-1}(A))\), where \(T^{-1}(A)\) is the set of points that gets mapped onto \(A\) under \(T\); that is, \(T^{-1}(A) = \{x \in X \mid T(x) \in A\}\).[4] From now on we also assume that the transformations we consider are measure preserving.[5]
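The condition \(\mu(A) = \mu(T^{-1}(A))\) can be checked numerically. A minimal sketch, using the doubling map \(T(x) = 2x \bmod 1\) on \([0,1)\) with the Lebesgue measure (our choice of example map, not one discussed above): the fraction of uniformly sampled points whose image lands in \(A\) estimates \(\mu(T^{-1}(A))\), and measure preservation predicts that it matches \(\mu(A)\).

```python
import random

def doubling(x):
    """The doubling map T(x) = 2x mod 1, which preserves Lebesgue measure."""
    return (2.0 * x) % 1.0

def estimate_preimage_measure(T, a, b, n_samples=200_000, seed=1):
    """Monte Carlo estimate of mu(T^{-1}([a, b))): the fraction of
    uniformly drawn points x with T(x) in [a, b)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if a <= T(rng.random()) < b)
    return hits / n_samples

# mu([0.2, 0.7)) = 0.5; measure preservation says the preimage has the same size.
est = estimate_preimage_measure(doubling, 0.2, 0.7)
```

With enough samples the estimate sits within Monte Carlo error of \(0.5\).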
In sum, from now on, unless stated otherwise, we consider discrete measure-preserving transformations.
In order to introduce the concept of ergodicity we have to introduce the phase and the time mean of a function \(f\) on \(X\). Mathematically speaking, a function assigns each point in \(X\) a number. If the numbers are always real the function is a real-valued function; and if the numbers may be complex, then it is a complex-valued function. Intuitively we can think of these numbers as representing the physical quantities of interest. Recalling the example of the bouncing ball, \(f\) could for instance assign each point in the phase space \(X\) the kinetic energy the system has at that point; in this case we would have \(f = p^2/2m\), where \(m\) is the mass of the ball. For every function we can take two kinds of averages. The first is the infinite time average \(f^*\). The general idea of a time average is familiar from everyday contexts. You play the lottery on three consecutive Saturdays. On the first you win $10; on the second you win nothing; and on the third you win $50. Your average gain is ($10 + $0 + $50)/3 = $20. Technically speaking this is a time average. This simple idea can easily be put to use in a dynamical system: follow the system’s evolution over time (and remember that we are now talking about an average for discrete points of time), take the value of the relevant function at each step, add the values, and then divide by the number of steps. This yields
\[ \frac{1}{k} \sum_{i=0}^{k-1} f(T_i(x_0)), \]
where
\[ \sum_{i=0}^{k-1} f(T_i(x_0)) \]
is just an abbreviation for
\(f(x_0) + f(T_1(x_0)) + \ldots + f(T_{k-1}(x_0)).\) This is the finite time average for \(f\) after \(k\) steps. If the system’s state continues to evolve infinitely and we keep tracking the system forever, then we get the infinite time average:
\[ f^* = \lim_{k \rightarrow \infty} \frac{1}{k} \sum_{i=0}^{k-1} f(T_i(x_0)), \]
where the symbol ‘lim’ (from Latin ‘limes’, meaning border or limit) indicates that we let time tend towards infinity (in mathematical symbols: \(\infty\)). One point deserves special attention, since it will become crucial later on: the presence of \(x_0\) in the above expression. Time averages depend on where the system starts; i.e., they depend on the initial condition. If the process starts in a different state, the time average may well be different.
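As an illustration, here is a sketch that computes the finite time average of \(f(x) = x\) along an orbit of an irrational rotation of the circle (an example system of our choosing). For this map the average settles near \(1/2\) for typical initial conditions, though the code of course only computes the finite-\(k\) average, not the limit:

```python
def rotation(x, a=0.6180339887):
    """One step of an irrational rotation of the unit circle [0, 1)."""
    return (x + a) % 1.0

def time_average(T, f, x0, k):
    """Finite time average (1/k) * sum_{i=0}^{k-1} f(T_i(x0))."""
    total, x = 0.0, x0
    for _ in range(k):
        total += f(x)
        x = T(x)
    return total / k

avg = time_average(rotation, lambda x: x, x0=0.1, k=100_000)
```

Running the same computation with a different `x0` illustrates the dependence on the initial condition, although for this particular map the limit is the same for every starting point.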
Next we have the space average \(\bar{f}\). Let us again start with a colloquial example: the average height of the students in a particular school. This is easily calculated: just take each student’s height, add up all the numbers, and divide the result by the number of students. Technically speaking this is a space average. In the example the students in the school correspond to the points in \(X\); and the fact that we count each student once (we don’t, for instance, take John’s height into account twice and omit Jim’s) corresponds to the choice of a measure that gives equal ‘weight’ to each point in \(X\). The transformation \(T\) has no pendant in our example, and this is deliberate: space averages have nothing to do with the dynamics of the system (that’s what sets them off from time averages). The general mathematical definition of the space average is as follows:
\[ \bar{f} = \int_X f(x)\,d\mu, \]
where \(\int_X\) is the integral over the phase space \(X\).[6] If the space consists of discrete elements, like the students of the school (they are ‘discrete’ in that you can count them), then the integral becomes equivalent to a sum like the one we have when we determine the average height of a population. If \(X\) is continuous (as the phase space above) things are a bit more involved.
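For a continuous \(X\), the space average can be approximated numerically. A sketch with \(X = [0,1)\) and \(\mu\) the Lebesgue measure (the function \(f\) below is an arbitrary example of ours); note that, as the text stresses, no dynamics \(T\) appears anywhere:

```python
def space_average(f, n_cells=100_000):
    """Approximate the integral of f over [0, 1) with respect to Lebesgue
    measure by a midpoint Riemann sum over n_cells equal cells."""
    width = 1.0 / n_cells
    return sum(f((i + 0.5) * width) for i in range(n_cells)) * width

# For f(x) = x**2 the exact space average is 1/3.
result = space_average(lambda x: x * x)
```
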
With these concepts in place, we can now define ergodicity.[7] A dynamical system \([X, \mu, T]\) is ergodic iff
\[ f^* = \bar{f} \]
for all complex-valued Lebesgue integrable functions \(f\) almost everywhere, meaning for almost all initial conditions. The qualification ‘almost everywhere’ is non-trivial and is the source of a famous problem in the foundations of statistical mechanics, the so-called ‘measure zero problem’ (to which we turn in Section 3). So it is worth unpacking carefully what this condition involves. Not all sets have a finite size. In fact, there are sets of measure zero. This may sound abstract but is very natural. Take a ruler and measure the length of certain objects. You will find, for instance, that your pencil is 17cm long: in the language of mathematics this means that the one-dimensional Lebesgue measure of the pencil is 17. Now measure a geometrical point and answer the question: how long is the point? The answer is that such a point has no extension and so its length is zero. In mathematical parlance: a set consisting of a geometrical point is a measure zero set. The same goes for a set of two geometrical points: also two geometrical points together have no extension and hence have measure zero. Another example is the following: you have a device to measure the surface of objects in a plane. You find out that an A4 sheet has a surface of 623.7 square centimetres. Then you are asked what the surface of a line is. The answer is: zero. Lines don’t have surfaces. So with respect to the two-dimensional Lebesgue measure, lines are measure zero sets.
In the context of ergodic theory, ‘almost everywhere’ means, by definition, ‘everywhere in \(X\) except, perhaps, in a set of measure zero’. That is, whenever a claim is qualified as ‘almost everywhere’ it means that it could be false for some points in \(X\), but these points taken together have measure zero. Now we are in a position to explain what the phrase means in the definition of ergodicity. As we have seen above, the time average (but not the space average!) depends on the initial condition. If we say that \(f^* = \bar{f}\) almost everywhere we mean that all those initial conditions for which it turns out to be the case that \(f^* \ne \bar{f}\) taken together form a set of measure zero: they are like a line in the plane.
Armed with this understanding of the definition of ergodicity, we can now discuss some important properties of ergodic systems. Consider a subset \(A\) of \(X\). For instance, thinking again about the example of the oscillating ball, take the left half of the phase space. Then define the so-called characteristic function of \(A\), \(f_A\), as follows: \(f_A(x) = 1\) for all \(x\) in \(A\) and \(f_A(x) = 0\) for all \(x\) not in \(A\). Plugging this function into the definition of ergodicity yields: \(f^{*}_A = \mu(A)\). This means that the proportion of time that the system’s state spends in set \(A\) equals the measure of that set. To make this even more intuitive, assume that the measure is normalised: \(\mu(X) = 1\) (this is a very common and unproblematic assumption). If we then choose \(A\) so that \(\mu(A) = 1/2\), then we know that the system spends half of the time in \(A\); if \(\mu(A) = 1/4\), it spends a quarter of the time in \(A\); etc. As we will see below, this property of ergodic systems plays a crucial role in certain approaches to statistical mechanics.
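This property can be illustrated numerically. A sketch, using an irrational rotation of the circle as a stand-in ergodic system (our choice of example): the fraction of time the orbit spends in \(A = [0, 1/4)\) should approach \(\mu(A) = 1/4\).

```python
def rotation(x, a=0.6180339887):
    """One step of an irrational rotation, an ergodic map on [0, 1)."""
    return (x + a) % 1.0

def fraction_of_time_in(T, a, b, x0, k):
    """Time average of the characteristic function of A = [a, b):
    the fraction of the first k steps the orbit spends in A."""
    hits, x = 0, x0
    for _ in range(k):
        if a <= x < b:
            hits += 1
        x = T(x)
    return hits / k

frac = fraction_of_time_in(rotation, 0.0, 0.25, x0=0.3, k=100_000)
```

For large `k` the computed fraction lies close to \(1/4\), in line with \(f^{*}_A = \mu(A)\).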
Since we are free to choose \(A\) as we wish, we immediately get another important result: a system can be ergodic only if its trajectory may access all parts of \(X\) of positive measure, i.e., if the trajectory passes arbitrarily close to any point in \(X\) infinitely many times as time tends towards infinity. And this implies that the phase space of an ergodic system is metrically indecomposable (also called ‘irreducible’ or ‘inseparable’): every set invariant under \(T\) (i.e., every set that is mapped onto itself under \(T\)) has either measure 0 or 1. As a consequence, \(X\) cannot be divided into two or more subspaces (of non-zero measure) that are invariant under \(T\). Conversely, a non-ergodic system is metrically decomposable. Hence, metric indecomposability and ergodicity are equivalent. A metrically decomposable system is schematically illustrated in Figure 3.

Figure 3: Reducible system: no point in region \(P\) evolves into region \(Q\) and vice versa.
Finally, we would like to state a theorem that will become important in Section 4. One can prove that a system is ergodic iff
\[\tag{E} \lim_{n \rightarrow \infty} \frac{1}{n} \sum_{k=0}^{n-1} \mu(T_k B \cap A) = \mu(B)\mu(A) \]
holds for all subsets \(A\) and \(B\) of \(X\). Although this condition does not have an immediate intuitive interpretation, we will see below that it is crucial for the understanding of the kind of randomness we find in ergodic systems.
It turns out that ergodicity is only the bottom level of an entire hierarchy of dynamical properties. This hierarchy is called the ergodic hierarchy (EH), and the study of this hierarchy is the core task of a mathematical discipline called ergodic theory. This choice of terminology is somewhat misleading, since ergodicity is only the bottom level of this hierarchy and so EH contains much more than ergodicity and the scope of ergodic theory stretches far beyond ergodicity. Ergodic theory (thus understood) is part of dynamical systems theory, which studies a wider class of dynamical systems than ergodic theory.
EH is a nested classification of dynamical properties. The hierarchy is typically represented as consisting of the following five levels:
Bernoulli \(\subset\) Kolmogorov \(\subset\) Strong Mixing \(\subset\) Weak Mixing \(\subset\) Ergodic
The diagram is intended to indicate that all Bernoulli systems are Kolmogorov systems, all Kolmogorov systems are strong mixing systems, and so on. Hence all systems in EH are ergodic. However, the converse relations do not hold: not all ergodic systems are weak mixing, and so on. In what follows a system that is ergodic but not weak mixing is referred to as merely ergodic, and similarly for the next three levels.[8]
Figure 4: Mixing
Mixing can be intuitively explained by the following example, first used by Gibbs in introducing the concept of mixing. Begin with a glass of water, then add a shot of scotch; this is illustrated in Fig. 4a. The volume \(C\) of the cocktail (scotch + water) is \(\mu(C)\) and the volume of scotch that was added to the water is \(\mu(S)\), so that in \(C\) the concentration of scotch is \(\mu(S)/\mu(C)\).
Now stir. Mathematically, stirring is represented by the time evolution \(T\), meaning that \(T(S)\) is the region occupied by the scotch after one unit of mixing time. Intuitively we say that the cocktail is thoroughly mixed if the concentration of scotch equals \(\mu(S)/\mu(C)\) not only with respect to the whole volume of fluid, but with respect to any region \(V\) in that volume. Hence, the drink is thoroughly mixed at time \(n\) if
\[ \frac{\mu(T_n S \cap V)}{\mu(V)} = \frac{\mu(S)}{\mu(C)} \]
for any volume \(V\) (of non-zero measure). Now assume that the volume of the cocktail is one unit: \(\mu(C) = 1\) (which we can do without loss of generality since there is always a unit system in which the volume of the glass is one). Then the cocktail is thoroughly mixed iff
\[ \frac{\mu(T_n S \cap V)}{\mu(V)} = \mu(S) \]
for any region \(V\) (of non-zero measure). But how large must \(n\) be before the stirring ends with the cocktail well mixed? We now don’t require that the drink must be thoroughly mixed at any finite time, but only that it approaches a state of being thoroughly mixed as time tends towards infinity:
\[ \lim_{n \rightarrow \infty} \frac{\mu(T_n S \cap V)}{\mu(V)} = \mu(S) \]
for any region \(V\) (of non-zero measure). If we now associate the glass with the phase space \(X\) and replace the scotch \(S\) and the volume \(V\) with two arbitrary subsets \(A\) and \(B\) of \(X\), then we get the general definition of what is called strong mixing (often also referred to just as ‘mixing’): a system is strong mixing iff
\[\tag{S-M} \lim_{n \rightarrow \infty} \mu(T_n B \cap A) = \mu(B)\mu(A) \]
for all subsets \(A\) and \(B\) of \(X\). This requirement for mixing can be relaxed a bit by allowing for fluctuations.[9] That is, instead of requiring that the cocktail reach a uniform state of being mixed, we now only require that it be mixed on average. In other words, we allow that bubbles of either scotch or water may crop up every now and then, but they do so in a way that these fluctuations average out as time tends towards infinity. This translates into mathematics in a straightforward way. The deviation from the ideally mixed state at some time \(n\) is \(\mu(T_n B \cap A) - \mu(B)\mu(A)\). The requirement that the average of these deviations vanishes inspires the notion of weak mixing. A system is weak mixing iff
\[\tag{W-M} \lim_{n \rightarrow \infty} \frac{1}{n} \sum_{k=0}^{n-1} \lvert \mu(T_k B \cap A) - \mu(B)\mu(A) \rvert = 0 \]
for all subsets \(A\) and \(B\) of \(X\). The vertical strokes denote the so-called absolute value; for instance: \(\lvert 5 \rvert = \lvert -5 \rvert = 5\). One can prove that there is a strict implication relation between the three dynamical properties we have introduced so far: strong mixing implies weak mixing, but not vice versa; and weak mixing implies ergodicity, but not vice versa. Hence, strong mixing is a stronger condition than weak mixing, and weak mixing is a stronger condition than ergodicity.
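The mixing condition can be probed numerically. A sketch using the doubling map \(T(x) = 2x \bmod 1\), a standard example of a strong mixing system (the Monte Carlo setup is our illustration). Since the doubling map is not invertible, we estimate \(\mu(\{x \in B \mid T_n(x) \in A\})\), which mixing predicts should approach \(\mu(A)\mu(B)\):

```python
import random

def doubling(x):
    """The doubling map T(x) = 2x mod 1, a strong mixing transformation."""
    return (2.0 * x) % 1.0

def correlation(n, n_samples=400_000, seed=2):
    """Monte Carlo estimate of mu({x in B : T_n(x) in A}) for
    A = [0, 1/2) and B = [0, 1/2); mixing predicts this tends to
    mu(A) * mu(B) = 1/4 as n grows."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x = rng.random()
        in_B = x < 0.5
        for _ in range(n):
            x = doubling(x)
        if in_B and x < 0.5:
            hits += 1
    return hits / n_samples

est = correlation(n=10)
```

We keep \(n\) small here because iterating the doubling map in floating point discards one bit of the initial condition per step, so very long orbits are numerically meaningless.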
The next higher level in EH are K-systems. Unlike in the cases of ergodic and mixing systems, there is unfortunately no intuitive way of explaining the standard definition of such systems, and the definition is such that one cannot read off from it the characteristics of K-systems (we state this definition in the Appendix, Section C). The least unintuitive way to present K-systems is via a theorem due to Cornfeld et al. (1982, 283), who prove that a dynamical system is a K-system iff it is K-mixing. A system is K-mixing iff for any subsets \(A_0, A_1, \ldots, A_r\) of \(X\) (where \(r\) is a natural number of your choice) the following condition holds:
\[\tag{K-M} \lim_{n \rightarrow \infty} \sup_{B \in \sigma(n,r)} \lvert \mu(B \cap A_0) - \mu(B)\mu(A_0) \rvert = 0 \]
where \(\sigma(n, r)\) is the minimal \(\sigma\)-algebra generated by the set
\[ \{T_k A_j \mid k \ge n;\ j = 1, \ldots, r\}. \]
It is far from obvious what this so-called sigma algebra is and hence the content of this condition is not immediately transparent. We will come back to this issue in Section 5 where we provide an intuitive reading of this condition. What matters for the time being is its similarity to the mixing condition. Strong mixing is, trivially, equivalent to
\[ \lim_{n \rightarrow \infty} [\mu(T_n B \cap A) - \mu(B)\mu(A)] = 0. \]
So we see that K-mixing adds something to strong mixing.
In passing we would like to mention another important property of K-systems: one can prove that K-systems have positive Kolmogorov-Sinai entropy (KS-entropy); for details see the Appendix, Section C. The KS-entropy itself does not have an intuitive interpretation, but it relates to three other concepts of dynamical systems theory in an interesting way, and these do have intuitive interpretations. First, Lyapunov exponents are a measure for how fast two originally nearby trajectories diverge on average, and they are often used in chaos theory to characterise the chaotic nature of the dynamics of a system. Under certain circumstances (essentially, the system has to be differentiable and ergodic) one can prove that a dynamical system has a positive KS-entropy if and only if it has positive Lyapunov exponents (Lichtenberg and Liebermann 1992, 304). This result is known as Pesin’s theorem. Second, the algorithmic complexity of a sequence is the length of the shortest computer programme needed to reproduce the sequence. Some sequences are simple; e.g. a string of a million ‘1’s is simple: the programme needed to reproduce it basically is ‘write ‘1’ a million times’, which is very short. Others are complex: there is no pattern in the sequence 5%8£yu@*mS!}<74^F that one could exploit, and so a programme reproducing that sequence essentially reads ‘write 5%8£yu@*mS!}<74^F’, which is similar in length to the sequence itself. In the discrete case a trajectory can be represented as a sequence of symbols which corresponds to the states of the system along this trajectory. It is then the case that if a system is a K-system, then its KS-entropy equals the algorithmic complexity of almost all its trajectories (Brudno 1978). This is now known as Brudno’s theorem (Alekseev and Yakobson 1981). For a contemporary discussion see Towsner (2020). Third, the Shannon entropy is a common measure for the uncertainty of a future outcome: the higher the entropy the more uncertain we are about what is going to happen.
One can prove that, given certain plausible assumptions, the KS-entropy is equivalent to a generalised version of the Shannon entropy, and can hence be regarded as a measure for the uncertainty of future events given past events (Frigg 2004).
Bernoulli systems mark the highest level in EH. To define Bernoulli systems we first have to introduce the notion of a partition of \(X\) (sometimes also called the ‘coarse graining of \(X\)’). A partition of \(X\) is a division of \(X\) into different parts (the so-called ‘atoms of the partition’) so that these parts don’t overlap and jointly cover \(X\) (i.e., they are mutually exclusive and jointly exhaustive). For instance, in Figure 1 there is a partition of the phase space that has two atoms (the left and the right part). More formally, \(\alpha = \{\alpha_1, \ldots, \alpha_m\}\) is a partition of \(X\) (and the \(\alpha_i\) its atoms) iff (i) the intersection of any two atoms of the partition is the empty set, and (ii) the union of all atoms is \(X\) (up to measure zero). Furthermore, it is important to notice that a partition remains a partition under the dynamics of the system. That is, if \(\alpha\) is a partition, then \(T_n\alpha = \{T_n\alpha_1, \ldots, T_n\alpha_m\}\) is also a partition for all \(n\).
There are, of course, many different ways of partitioning a phase space. In what follows we are going to study how different partitions relate to each other. An important concept in this connection is independence. Let \(\alpha\) and \(\beta\) be two partitions of \(X\). By definition, these partitions are independent iff \(\mu(\alpha_i \cap \beta_j) = \mu(\alpha_i)\mu(\beta_j)\) for all atoms \(\alpha_i\) of \(\alpha\) and all atoms \(\beta_j\) of \(\beta\). We will explain the intuitive meaning of this definition (and justify calling it ‘independence’) in Section 4; for the time being we just use it as a formal definition.
With these notions in hand we can now define a Bernoulli transformation: a transformation \(T\) is a Bernoulli transformation iff there exists a partition \(\alpha\) of \(X\) so that the images of \(\alpha\) under \(T\) at different instants of time are independent; that is, the partitions \(\ldots, T_{-1}\alpha, T_0\alpha, T_1\alpha, \ldots\) are all independent.[10] In other words, \(T\) is a Bernoulli transformation iff
\[\tag{B} \mu(\delta_i \cap \beta_j) = \mu(\delta_i)\mu(\beta_j) \]
for all atoms \(\delta_i\) of \(T_k\alpha\) and all atoms \(\beta_j\) of \(T_l\alpha\) for all \(k \ne l\). We then refer to \(\alpha\) as the Bernoulli partition, and we call a dynamical system \([X, \mu, T]\) a Bernoulli system if \(T\) is a Bernoulli automorphism, i.e., a Bernoulli transformation mapping \(X\) onto itself.
Let us illustrate this with a well-known example, the baker’s transformation (so named because of its similarity to the kneading of dough). This transformation maps the unit square onto itself. Using standard Cartesian coordinates the transformation can be written as follows:
\[\begin{align}T(x, y) &= (2x, \frac{y}{2}) \text{ for } 0 \le x \lt \frac{1}{2}, \text{ and} \\ T(x, y) &= (2x-1, \frac{y}{2} + \frac{1}{2}) \text{ for } \frac{1}{2} \le x \le 1 \end{align}\]
In words, for all points \((x, y)\) in the unit square that have an \(x\)-coordinate smaller than \(1/2\), the transformation \(T\) doubles the value of \(x\) and halves the value of \(y\). For all the points \((x, y)\) that have an \(x\)-coordinate greater than or equal to \(1/2\), \(T\) transforms \(x\) into \(2x-1\) and \(y\) into \(y/2 + 1/2\). This is illustrated in Fig. 5a.

Figure 5a: The Baker’s transformation
Now regard the two areas shown in the left-hand part of the above figure as the two atoms of a partition \(\alpha = \{\alpha_1, \alpha_2\}\). It is then easy to see that \(\alpha\) and \(T\alpha\) are independent: \(\mu(\alpha_1 \cap T\alpha_2) = \mu(\alpha_1)\mu(T\alpha_2)\), and similarly for all other atoms of \(\alpha\) and \(T\alpha\). This is illustrated in Figure 5b.

Figure 5b: The independence of \(\alpha\) and \(T\alpha\).
One can prove that independence holds for all other iterates of \(\alpha\) as well. So the baker’s transformation together with the partition \(\alpha\) is a Bernoulli transformation.
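The independence condition (B) can also be checked numerically. The following Monte Carlo sketch assumes, as in standard presentations, that the atoms \(\alpha_1\) and \(\alpha_2\) are the left and right halves of the unit square; it estimates \(\mu(\alpha_1 \cap T\alpha_2)\) and compares it with \(\mu(\alpha_1)\mu(T\alpha_2)\):

```python
import random

def baker(x, y):
    """One step of the baker's transformation."""
    if x < 0.5:
        return 2 * x, y / 2
    return 2 * x - 1, y / 2 + 0.5

def estimate_independence(n=100_000, seed=1):
    """Monte Carlo estimates of mu(alpha_1 ∩ T alpha_2) and mu(alpha_1)mu(T alpha_2)."""
    rng = random.Random(seed)
    joint = in_a1 = in_ta2 = 0
    for _ in range(n):
        wx, wy = rng.random(), rng.random()
        zx, zy = baker(wx, wy)  # z = T(w); T preserves the uniform measure mu
        a1 = zx < 0.5           # z lies in alpha_1 (left half of the square)
        ta2 = wx >= 0.5         # z lies in T(alpha_2) iff w lay in alpha_2 (right half)
        joint += a1 and ta2
        in_a1 += a1
        in_ta2 += ta2
    return joint / n, (in_a1 / n) * (in_ta2 / n)

joint, product = estimate_independence()
print(joint, product)  # both come out close to 1/4, as condition (B) requires
```

Sampling a point \(w\) uniformly and mapping it forward exploits the fact that \(T\) is measure-preserving, so \(z = T(w)\) is again uniformly distributed.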
In the literature Bernoulli systems are often introduced using so-called shift maps (or Bernoulli shifts). We here briefly indicate how shift maps are related to Bernoulli systems with the example of the baker’s transformation; for a more general discussion see the Appendix, Section D. Choose a point in the unit square and write its \(x\) and \(y\) coordinates as binary numbers: \(x = 0.a_1 a_2 a_3\ldots\) and \(y = 0.b_1 b_2 b_3\ldots\), where all the \(a_i\) and \(b_i\) are either 0 or 1. Now put both strings together back to back with a dot in the middle to form one infinite string: \(S = \ldots b_3 b_2 b_1 . a_1 a_2 a_3\ldots\), which may represent the state of the system just as a ‘standard’ two-dimensional vector does. Some straightforward algebra then shows that
\[ T(0.a_1 a_2 a_3\ldots , 0.b_1 b_2 b_3\ldots) = (0.a_2 a_3 a_4\ldots , 0.a_1 b_1 b_2 b_3\ldots). \]From this we see that in our ‘one string’ representation of the point the operation of \(T\) amounts to shifting the dot one position to the right: \(TS = \ldots b_3 b_2 b_1 a_1 . a_2 a_3\ldots\) Hence, the baker’s transformation is equivalent to a shift on an infinite string of zeros and ones.[11]
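The bit-shift identity can be verified on truncated binary expansions. A small sketch, using exact rational arithmetic to avoid floating-point round-off (the particular bit strings are arbitrary):

```python
from fractions import Fraction

def baker(x, y):
    """Exact baker's transformation on Fractions."""
    if x < Fraction(1, 2):
        return 2 * x, y / 2
    return 2 * x - 1, y / 2 + Fraction(1, 2)

def from_bits(bits):
    """The binary expansion 0.b1 b2 b3 ... as an exact fraction."""
    return sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))

a = [1, 0, 1, 1, 0, 0, 1, 0]   # x = 0.a1 a2 a3 ...
b = [0, 1, 1, 0, 1, 0, 0, 1]   # y = 0.b1 b2 b3 ...
x2, y2 = baker(from_bits(a), from_bits(b))

# applying T shifts the dot: new x-bits are a2 a3 ..., new y-bits are a1 b1 b2 ...
assert x2 == from_bits(a[1:])
assert y2 == from_bits([a[0]] + b)
print("shift identity verified")
```

The same check passes for any choice of bits, which is just the displayed identity in finite precision.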
There are two further notions that are crucial to the theory of Bernoulli systems, the property of being weak Bernoulli and very weak Bernoulli. These properties play a crucial role in showing that certain transformations are in fact Bernoulli. The baker’s transformation is one of the few examples that have a geometrically simple Bernoulli partition, and so one often cannot prove directly that a system is a Bernoulli system. One then shows that a certain geometrically simple partition is weak Bernoulli and uses a theorem due to Ornstein to the effect that if a system is weak Bernoulli then there exists a Bernoulli partition for that system. The mathematics of these notions and the associated proofs of equivalence are intricate and a presentation of them is beyond the scope of this entry. The interested reader is referred to Ornstein (1974) or Shields (1973).
The concepts of EH, and in particular ergodicity itself, play important roles in the foundations of statistical mechanics (SM). In this section we review what these roles are.
A discussion of SM faces an immediate problem. Foundational debates in many other fields of physics can take as their point of departure a generally accepted formalism. Things are different in SM. Unlike, say, relativity theory, SM has not yet found a generally accepted theoretical framework, let alone a canonical formulation.[12] What we find in SM is a plethora of different approaches and schools, each with its own programme and mathematical apparatus.[13] However, all these schools use slight variants of one of two theoretical frameworks, one of which can be associated with Boltzmann (1877) and the other with Gibbs (1902), and can thereby be classified either as ‘Boltzmannian’ or ‘Gibbsian’. For this reason we divide our presentation of SM into two parts, one for each of these families of approaches. For further discussions of SM see the entry Philosophy of Statistical Mechanics.
We first introduce the main elements of the Boltzmannian framework and then turn to the use of ergodicity in it. Every system can possess various macrostates \(M_1 ,\ldots ,M_k\). These macrostates are characterised by the values of macroscopic variables; in the case of a gas these are pressure, temperature, and volume.[14] One state is the system’s initial state, and another is its equilibrium state. We label these states \(M_p\) and \(M_{eq}\) respectively. We write ‘\(M_p\)’ for the initial state because this state is associated with the so-called ‘past hypothesis’, which we discuss below.
It is one of the fundamental posits of the Boltzmann approach that macrostates supervene on microstates, meaning that a change in a system’s macrostate must be accompanied by a change in its microstate (for a discussion of supervenience see McLaughlin and Bennett 2005, and references therein). For instance, it is not possible to change the pressure of a system and at the same time keep its microstate constant. Hence, to every given microstate \(x\) there corresponds exactly one macrostate. Let us refer to this macrostate as \(M(x)\). This determination relation is not one-to-one; in fact many different \(x\) can correspond to the same macrostate. We now group together all microstates \(x\) that correspond to the same macrostate, which yields a partitioning of the phase space into non-overlapping regions, each corresponding to a macrostate. For this reason we also use the same letters, \(M_1 ,\ldots ,M_k\), to refer to macrostates and the corresponding regions in phase space. This is illustrated in Figure 6a.
Figure 6: The macrostate structure of \(X\).
We are now in a position to introduce the Boltzmann entropy. To this end recall that we have a measure \(\mu\) on the phase space that assigns to every set a particular volume, hence a fortiori also to macrostates. With this in mind, the Boltzmann entropy of a macrostate \(M_j\) can be defined as \(S_B = k_B\log [\mu(M_j)]\), where \(k_B\) is the Boltzmann constant. The important feature of the logarithm is that it is a monotonic function: the larger \(\mu(M_j)\), the larger its logarithm. From this it follows that the largest macrostate also has the highest entropy.
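The point is easy to see in a hypothetical toy model: take a ‘gas’ of \(n\) two-state particles, let the macrostate \(M_j\) be ‘exactly \(j\) particles in the first state’, and count the compatible microstates as the macrostate’s volume. A minimal sketch:

```python
import math

k_B = 1.380649e-23  # Boltzmann constant, J/K

def boltzmann_entropy(volume):
    """S_B = k_B * log(mu(M_j)), with volume standing in for mu(M_j)."""
    return k_B * math.log(volume)

n = 100  # toy system of 100 two-state particles
# the volume of macrostate M_j is the number of microstates realising it
volumes = [math.comb(n, j) for j in range(n + 1)]
entropies = [boltzmann_entropy(v) for v in volumes]

# monotonicity of the logarithm: the largest macrostate (j = n/2)
# is exactly the one with the highest Boltzmann entropy
print(entropies.index(max(entropies)))  # 50
```

Because the log is monotonic, ranking macrostates by entropy and ranking them by volume always give the same order.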
One can show that, at least in the case of dilute gases, the Boltzmann entropy coincides with the thermodynamic entropy (in the sense that both have the same functional dependence on the basic state variables), and so it is plausible to say that the equilibrium state is the macrostate for which the Boltzmann entropy is maximal (since thermodynamics posits that entropy be maximal for equilibrium states). By assumption the system starts off in a low entropy state, the initial state \(M_p\) (the gas being squeezed into the left half of the box). The problem of explaining the approach to equilibrium then amounts to answering the question: why does a system originally in \(M_p\) eventually move into \(M_{eq}\) and then stay there? (See Figure 6b.)
In the 1870s Boltzmann offered an important answer to this question.[15] At the heart of his answer lies the idea of assigning probabilities to macrostates according to their size. So Boltzmann adopted the following postulate: \(p(M_j) = c\mu(M_j)\) for all \(j = 1,\ldots, k\), where \(c\) is a normalisation constant assuring that the probabilities add up to one. Granted this postulate, it follows immediately that the most likely state is the equilibrium state (since the equilibrium state occupies the largest chunk of the phase space). From this point of view it seems natural to understand the approach to equilibrium as the evolution from an unlikely macrostate to a more likely macrostate and finally to the most likely macrostate. This, Boltzmann argued, was a statistical justification of the Second Law of thermodynamics.
But Boltzmann knew that simply postulating \(p(M_j) = c\mu(M_j)\) would not solve the problem unless the postulate could be justified in terms of the dynamics of the system. This is where ergodicity enters the scene. As we have seen above, ergodic systems have the property of spending a fraction of time in each part of the phase space that is proportional to its size (with respect to \(\mu\)). As we have also seen, the equilibrium state is the largest macrostate. In fact, the equilibrium state is much larger than the other states. So if we assume that the system is ergodic, then it is in equilibrium most of the time! It is then natural to interpret \(p(M_j)\) as a time average: \(p(M_j)\) is the fraction of time that the system spends in state \(M_j\) over the course of time. We now have the main elements of Boltzmann’s framework in front of us: (a) partition the phase space of the system into macrostates and show that the equilibrium state is by far the largest state; (b) adopt a time average interpretation of probability; and (c) assume that the system in question is ergodic. It then follows that the system is most likely to be found in equilibrium, which justifies (a probabilistic version of) the second law of thermodynamics.
Three objections have been levelled against this line of thought. First, it has been pointed out that assuming ergodicity is too strong in two ways. The first way is that it turns out to be extremely difficult to prove that the systems of interest really are ergodic. One of the simplest models of a monoatomic gas is the hard-ball model. The so-called Boltzmann-Sinai Ergodic Hypothesis says that this model is ergodic. A general proof of this hypothesis became available only a bit over a decade ago, when Simanyi (2010 [Other Internet Resources]) was able to prove it.
The second way in which ergodicity seems to be too strong is that even if eventually we can come by proofs of ergodicity for the relevant systems, the assumption is too strong because there are systems that are known not to be ergodic and yet they behave in accordance with the Second Law. Bricmont (2001) investigates the Kac Ring Model and a system of \(n\) uncoupled anharmonic oscillators of identical mass, and points out that both systems exhibit thermodynamic behaviour and yet they fail to be ergodic. Hence, ergodicity is not necessary for thermodynamic behaviour. Earman and Redei (1996, p. 70) and van Lith (2001, p. 585) argue that if ergodicity is not necessary for thermodynamic behaviour, then ergodicity cannot provide a satisfactory explanation for this behaviour. Either there must be properties other than ergodicity that explain thermodynamic behaviour in cases in which the system is not ergodic, or there must be an altogether different explanation for the approach to equilibrium even for systems which are ergodic.
In response to this objection, Vranas (1998) and Frigg and Werndl (2011) argue that most systems that fail to be ergodic are ‘almost ergodic’ in a specifiable way, and this is good enough. We discuss Vranas’ approach below when discussing Gibbsian SM since that is the context in which he has put forward his suggestion. Werndl and Frigg (2015a, 2015b) offer an alternative definition of Boltzmannian equilibrium and exploit the ergodic decomposition theorem to show that even if a system is not ergodic it will spend most of the time in equilibrium, as envisaged by Boltzmann (roughly, the ergodic decomposition theorem says that the phase space of every measure-preserving system can be partitioned into parts so that the dynamics is ergodic on each part; for details see Petersen 1983). Frigg (2009) suggested exploiting the fact that almost all Hamiltonian systems are non-integrable, and that these systems have so-called Arnold webs, i.e., large regions of phase space on which the motion of the system is ergodic. Lavis (2005) re-examined the Kac ring model and pointed out that even though the system is not ergodic, it has an ergodic decomposition, which is sufficient to guarantee the approach to equilibrium. He also challenged the assumption, implicit in the above criticism, that providing an explanation for the approach to equilibrium amounts to identifying one (and only one!) property that all systems have in common. In fact, it may be the case that different properties are responsible for the approach to equilibrium in different systems, and there is no reason to rule out such explanations. In sum, the tenor of all these responses is that even though ergodicity simpliciter may not have the resources to explain the approach to equilibrium, somewhat qualified properties do.
The second objection is that even if ergodicity obtains, this is not sufficient to give us what we need. As we have seen above, ergodicity comes with the qualification ‘almost everywhere’. This qualification is usually understood as suggesting that sets of measure zero can be ignored without detriment. The idea is that points falling in a set of measure zero are ‘sparse’ and can therefore be neglected. The question of whether or not this move is legitimate is known as the ‘measure zero problem’.
Simply neglecting sets of measure zero seems to be problematic for various reasons. First, sets of measure zero can be rather ‘big’; for instance, the rational numbers have measure zero within the real numbers. Moreover, a set of measure zero need not be (or even appear) negligible if sets are compared with respect to properties other than their measures. For instance, we can judge the ‘size’ of a set by its cardinality or Baire category rather than by its measure, which leads us to different conclusions about the set’s size (Sklar 1993, pp. 182–88). It is also a mistake to assume that an event with measure zero cannot occur. In fact, having measure zero and being impossible are distinct notions. Whether or not the system at some point was in one of the special initial conditions for which the space and time mean fail to be equal is a factual question that cannot be settled by appeal to measures.
In response two things can be said. First, discounting sets of measure zero is standard practice in physics and the problem is not specific to ergodic theory. So unless there is a good reason to suspect that specific measure zero states are in fact important, one might argue that the onus of proof is on those who think that discounting them in this case is illegitimate. Second, the fact that SM works in so many cases suggests that such states indeed are scarce.
The third objection is rarely explicitly articulated, but it is clearly in the background of contemporary Boltzmannian approaches to SM such as Albert’s (2000), which reject Boltzmann’s starting point, namely the postulate \(p(M_j) = c\mu(M_j)\). Albert introduces an alternative postulate, essentially providing transition probabilities between two macrostates conditional on the so-called Past Hypothesis, the posit that the universe came into existence in a low entropy state (the Big Bang). Albert then argues that in such an account ergodicity becomes an idle wheel, and hence he rejects it as completely irrelevant to the foundations of SM. This, however, may well be too hasty. Although it is true that ergodicity simpliciter cannot justify Albert’s probability postulate, another dynamical assumption is needed in order for this postulate to be true (Frigg 2010).
At the basis of Gibbs’ approach stands a conceptual shift. The object of study in the Boltzmannian framework is an individual system, consisting of a large but finite number of micro constituents. By contrast, within the Gibbs framework the object of study is a so-called ensemble: an imaginary collection of infinitely many copies of the same system (they are the same in that they have the same phase space, dynamics and measure), which happen to be in different states. An ensemble of gases, for instance, consists of infinitely many copies of the same gas but in different states: one is concentrated in the left corner of the box, one is evenly distributed, etc. It is important to emphasise that ensembles are fictions, or ‘mental copies of the one system under consideration’ (Schrödinger 1952, 3); or alternatively they can be thought of as collections of possible states of the entire system. Hence, it is important not to confuse ensembles with collections of micro-objects such as the molecules of a gas!
The instantaneous state of one system of the ensemble is specified by one point in its phase space. The state of the ensemble as a whole is therefore specified by a density function \(\varrho\) on the system’s phase space. From a technical point of view, \(\varrho\) is a function just like the \(f\) that we encountered in Section 1. We furthermore assume that \(\varrho\) is a probability density: the probability of finding the state of a system chosen at random from the entire ensemble in a region \(R\) is \(p(R) = \int_R \varrho \,d\mu\). To make this more intuitive consider the following analogy. You play a special kind of darts: you fix a plank to the wall, which serves as your dart board. For some reason you know that the probability of your dart landing at a particular place on the board is given by the curve shown in Figure 7. You are then asked what the probability is that your next dart lands in the left half of the board. The answer is 1/2 since one half of the surface underneath the curve is on the left side. The dart board then plays the role of the system’s state space, a region of the board (here the left half) plays the role of \(R\), and throwing a dart plays the role of picking a system from the ensemble.

Figure 7: Dart board
The importance of this is that it allows us to calculate expectation values. Assume that the game is such that you get one Pound if the dart hits the left half and three Pounds if it lands on the right half. What is your expected gain? The answer is \(1/2 \times 1\) Pound \(+ 1/2 \times 3\) Pounds \(= 2\) Pounds. This is the expectation value. The same idea is at work in SM. Physical magnitudes like, for instance, pressure are associated with functions \(f\) on the phase space. We then calculate the expectation value of these magnitudes, which, in general, is given by \(\langle f \rangle = \int f\varrho \,d\mu\). In the context of Gibbsian SM these expectation values are also referred to as phase averages or ensemble averages. They are of central importance because in many contexts they serve as predictions for what is observed in experiments.
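The dart calculation is just a phase average in miniature. A minimal numerical sketch, which for simplicity assumes a uniform density over the board (the curve in Figure 7 is not uniform, but any density that splits its mass evenly between the halves gives the same answer):

```python
def phase_average(f, rho, a=0.0, b=1.0, n=100_000):
    """Approximate <f> = ∫ f(x) rho(x) dx over [a, b] by a midpoint Riemann sum."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) * rho(a + (i + 0.5) * dx)
               for i in range(n)) * dx

# the dart game: one Pound on the left half of the board, three on the right
payoff = lambda x: 1.0 if x < 0.5 else 3.0
uniform = lambda x: 1.0  # assumed density; normalised on [0, 1]
print(phase_average(payoff, uniform))  # ≈ 2.0, the expected gain in Pounds
```

Replacing `payoff` with a physical observable and `uniform` with the ensemble density \(\varrho\) gives the ensemble averages used in Gibbsian SM.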
By definition, a probability density \(\varrho\) is stationary if it does not change over time. Given that observable quantities are associated with phase averages and that equilibrium is defined in terms of the constancy of the macroscopic parameters characterising the system, it is natural to regard the stationarity of the distribution as a necessary condition for equilibrium because stationary distributions yield constant averages. For this reason Gibbs refers to stationarity as the ‘condition of statistical equilibrium’.
Among all stationary distributions those satisfying a further requirement, the Gibbsian maximum entropy principle, play a special role. The Gibbs entropy (sometimes called ‘ensemble entropy’) is defined as
\[ S_G (\varrho) = -k_B\int \varrho\log(\varrho)\,d\mu . \]The Gibbsian maximum entropy principle then requires that \(S_G(\varrho)\) be maximal, given the constraints that are imposed on the system.[16]
The last clause is essential because different constraints single out different distributions. A common choice is to keep both the energy and the particle number in the system fixed. One can prove that under these circumstances \(S_G (\varrho)\) is maximal for the so-called microcanonical distribution (or microcanonical ensemble). If we choose to hold the number of particles constant while allowing for energy fluctuations around a given mean value we obtain the so-called canonical distribution; if we also allow the particle number to fluctuate around a given mean value we find the so-called grand-canonical distribution.[17]
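A discrete sketch makes the microcanonical case tangible: with only a normalisation constraint over a fixed set of equal-measure cells (a crude stand-in for the accessible energy hypersurface), the uniform distribution maximises the entropy, and random competitors never beat it:

```python
import math
import random

def gibbs_entropy(p):
    """Discrete analogue of the Gibbs entropy, -sum p_i log p_i (units with k_B = 1)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

m = 8  # number of accessible, equal-measure cells (an illustrative choice)
uniform = [1.0 / m] * m  # discrete stand-in for the microcanonical distribution

rng = random.Random(0)
for _ in range(1000):
    w = [rng.random() for _ in range(m)]
    total = sum(w)
    p = [wi / total for wi in w]  # an arbitrary normalised distribution
    assert gibbs_entropy(p) <= gibbs_entropy(uniform) + 1e-12

print(gibbs_entropy(uniform))  # log(8) ≈ 2.079, the maximum
```

Adding a mean-energy constraint instead of fixing the energy would single out an exponential (canonical) distribution, but the uniform case already shows how the constraints do the work.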
This formalism is enormously successful in that correct predictions can be derived for a vast class of systems. But the success of this formalism is rather puzzling. The first and most obvious question concerns the relation of systems and ensembles. The probability distribution in the Gibbs approach is defined over an ensemble, the formalism provides ensemble averages, and equilibrium is regarded as a property of an ensemble. But what we are really interested in is the behaviour of a single system! What could the properties of an ensemble—a fictional entity consisting of infinitely many mental copies of the real system—tell us about the one real system on the laboratory table? And more specifically, why do averages over an ensemble coincide with the values found in measurements performed on an actual physical system in equilibrium? There is no obvious reason why this should be so, and it turns out that ergodicity plays a central role in answering these questions.
Common textbook wisdom justifies the use of phase averages as follows. As we have seen, the Gibbs formalism associates physical quantities with functions on the system’s phase space. Making an experiment measuring one of these quantities takes time and it is assumed that what measurement devices register is not the instantaneous value of the function in question, but rather its time average over the duration of the measurement. Hence, time averages are what is empirically accessible. Then, so the argument continues, although measurements take an amount of time that is short by human standards, it is long compared to microscopic time scales on which typical molecular processes take place. For this reason it is assumed that the measured finite time average is approximately equal to the infinite time average of the measured function. If we now assume that the system is ergodic, then time averages equal phase averages. The latter can easily be obtained from the formalism. Hence we have found the sought-after connection: the Gibbs formalism provides phase averages which, due to ergodicity, are equal to infinite time averages, and these are, to a good approximation, equal to the finite time averages obtained from measurements.
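The equality of time and phase averages for an ergodic system can be seen numerically. The sketch below uses an irrational rotation of the circle, a standard textbook example of an ergodic map (not discussed in this entry), and an arbitrary observable:

```python
import math

def rotation(x, gamma):
    """Irrational rotation of the circle [0, 1): a standard ergodic map."""
    return (x + gamma) % 1.0

gamma = (math.sqrt(5) - 1) / 2            # golden-ratio rotation number (irrational)
f = lambda x: math.sin(2 * math.pi * x) ** 2  # an illustrative observable

# long finite time average along a single trajectory
x, total, n = 0.123, 0.0, 200_000
for _ in range(n):
    total += f(x)
    x = rotation(x, gamma)
time_average = total / n

phase_average = 0.5  # ∫ sin^2(2 pi x) dx over [0, 1]
print(abs(time_average - phase_average))  # very small: the two averages agree
```

For a non-ergodic map (e.g. a rational rotation) the time average would instead depend on the initial condition and generally differ from the phase average.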
This argument is problematic for at least two reasons. First, from the fact that measurements take some time it does not follow that what is actually measured are time averages. For instance, it could be the case that the value provided to us by the measurement device is simply the value assumed by the function at the last moment of the measurement, irrespective of what its previous values were (e.g. it’s simply the last pointer reading registered). So we would need an argument for the conclusion that measurements indeed produce time averages. Second, even if we take for granted that measurements do produce finite time averages, equating these averages with infinite time averages is problematic. Even if the duration of the measurement is long by experimental standards (which need not be the case), finite and infinite averages may assume very different values. That is not to say that they necessarily have to be different; they could coincide. But whether or not they do is an empirical question, which depends on the specifics of the system under investigation. So care is needed when replacing finite with infinite time averages, and one cannot identify them without further argument.
Malament and Zabell (1980) respond to this challenge by suggesting a way of explaining the success of equilibrium theory that still invokes ergodicity, but avoids appeal to time averages. This solves the above mentioned problems, but suffers from the difficulty that many systems that are successfully dealt with by the formalism of SM are not ergodic. To circumvent this difficulty Vranas (1998) suggested replacing ergodicity with what he calls \(\varepsilon\)-ergodicity. Intuitively a system is \(\varepsilon\)-ergodic if it is ergodic not on the entire phase space, but on a very large part of it, with those parts on which it is not ergodic having measure \(\varepsilon\), where \(\varepsilon\) is very small. The leading idea behind his approach is to challenge the commonly held belief that if a system is even just a ‘little bit’ non-ergodic, then it behaves in a completely ‘un-ergodic’ way. Vranas points out that there is a middle ground and then argues that this middle ground actually provides us with everything we need. This is a promising proposal, but it faces three challenges. First, it needs to be shown that all relevant systems really are \(\varepsilon\)-ergodic. Second, the argument so far has only been developed for the microcanonical ensemble, but one would like to know whether, and if so how, it works for the canonical and the grand-canonical ensembles. Third, it is still based on the assumption that equilibrium is characterised by a stationary distribution, which, as we will see below, is an obstacle when it comes to formulating a workable Gibbsian non-equilibrium theory.
The second response begins with Khinchin’s work. Khinchin (1949) pointed out that the problems of the ergodic programme are due to the fact that it focuses on too general a class of systems. Rather than studying dynamical systems at a general level, we should focus on those cases that are relevant in statistical mechanics. This involves two restrictions. First, we only have to consider systems with a large number of degrees of freedom; second, we only need to take into account a special class of phase functions, the so-called ‘sum functions’. These functions are a sum of one-particle functions, i.e., functions that take into account only the position and momentum of one particle. Under these assumptions Khinchin proved that as \(n\) becomes larger, the measure of those regions on the energy hypersurface[18] where the time and the space means differ by more than a small amount tends towards zero. Roughly speaking, this result says that for large \(n\) the system behaves, for all practical purposes, as if it was ergodic.
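The mechanism behind Khinchin-style results can be illustrated with a toy calculation: the spread of a (normalised) sum function over randomly sampled microstates shrinks roughly like \(1/\sqrt{n}\) as the number of particles grows. The one-particle observable below is a hypothetical stand-in, not Khinchin’s setting:

```python
import random
import statistics

def sum_function_spread(n, trials=400, seed=0):
    """Spread of a normalised sum function f = (1/n) * sum of one-particle
    functions over randomly sampled microstates of an n-particle system."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        # hypothetical one-particle observable: uniform on [0, 1]
        values.append(sum(rng.random() for _ in range(n)) / n)
    return statistics.pstdev(values)

# the spread shrinks as n grows, roughly like 1/sqrt(n)
print([round(sum_function_spread(n), 4) for n in (10, 100, 1000)])
```

For large \(n\) the sum function is nearly constant over the sampled microstates, which is the intuition behind the system behaving ‘as if it was ergodic’ for such observables.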
The problem with this result is that it is valid only for sum functions, and in particular only if the energy function of the system is itself a sum function, which is not the case when particles interact. So the question is how this result can be generalised to more realistic cases. This problem stands at the starting point of a research programme now known as the thermodynamic limit, championed, among others, by Lanford, Mazur, Ruelle, and van der Linden (see van Lith (2001) for a survey). Its leading question is whether one can still prove ‘Khinchin-like’ results in the case of energy functions with interaction terms.[19] Results of this kind can be proven in the limit for \(n \rightarrow \infty\), if also the volume \(V\) of the system tends towards infinity in such a way that the number-density \(n/V\) remains constant.
So far we have only dealt with equilibrium, and things get worse once we turn to non-equilibrium. The main problem is that it is a consequence of the formalism that the Gibbs entropy is a constant! This precludes a characterisation of the approach to equilibrium in terms of increasing Gibbs entropy, which is what one would expect if we were to treat the Gibbs entropy as the SM counterpart of the thermodynamic entropy. The standard way around this problem is to coarse-grain the phase space, and then define the so-called coarse-grained Gibbs entropy. Put simply, coarse-graining the phase space amounts to putting a grid on the phase space and declaring that all points within one cell of the grid are indistinguishable. This procedure turns a continuous phase space into a discrete collection of cells, and the state of the system is then specified by saying in which cell the system’s state is. If we define the Gibbs entropy on this grid, it turns out (for purely mathematical reasons) that the entropy is no longer a constant and can actually increase or decrease. If one then assumes that the system is mixing, it follows from the so-called convergence theorem of ergodic theory that the coarse-grained Gibbs entropy approaches a maximum. However, this solution is fraught with controversy, the two main bones of contention being the justification of coarse-graining and the assumption that the system is mixing.
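The effect of coarse-graining is easy to exhibit numerically. The sketch below uses the baker’s transformation from above as a stand-in for a mixing dynamics: an ensemble concentrated in a corner of phase space is iterated forward, and the Gibbs entropy computed on a grid of cells increases toward its maximum:

```python
import math
import random

def baker(x, y):
    """One step of the baker's transformation (a mixing map)."""
    if x < 0.5:
        return 2 * x, y / 2
    return 2 * x - 1, y / 2 + 0.5

def coarse_grained_entropy(points, g=8):
    """Gibbs entropy of the ensemble on a g x g grid of cells (k_B = 1)."""
    counts = {}
    for x, y in points:
        cell = (min(int(x * g), g - 1), min(int(y * g), g - 1))
        counts[cell] = counts.get(cell, 0) + 1
    n = len(points)
    return -sum(c / n * math.log(c / n) for c in counts.values())

rng = random.Random(42)
# ensemble initially concentrated in a small corner of the unit square
pts = [(rng.random() / 4, rng.random() / 4) for _ in range(20_000)]

s0 = coarse_grained_entropy(pts)
for _ in range(10):
    pts = [baker(x, y) for x, y in pts]
s10 = coarse_grained_entropy(pts)
print(s0, s10)  # the coarse-grained entropy rises toward log(64)
```

The fine-grained Gibbs entropy of the same ensemble would stay constant under the (measure-preserving) dynamics; the increase is entirely an effect of the grid, which is precisely the point of contention mentioned above.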
In sum, ergodicity plays a central role in many attempts to justify the posits of SM. And even where a simplistic use of ergodicity is eventually unsuccessful, somewhat modified notions prove fruitful in an analysis of the problem and in the search for better solutions.
EH is often presented as a hierarchy of increasing degrees of randomness in deterministic systems: the higher up in this hierarchy a system is placed, the more random its behaviour.[20] However, the definitions of the different levels of EH do not make explicit appeal to randomness; nor does the usual way of presenting EH involve a specification of the notion of randomness that is supposed to underlie the hierarchy. So there is a question about what notion of randomness underlies EH and in what sense exactly EH is a hierarchy of random behaviour.
Berkovitz, Frigg and Kronz (2006) discuss this problem and argue that EH is best understood as a hierarchy of random behaviour if randomness is explicated in terms of unpredictability, where unpredictability is accounted for in terms of probabilistic relevance. Different patterns of probabilistic relevance, in turn, are spelled out in terms of different types of decay of correlation between a system’s states at different times. Let us introduce these elements one at a time.
Properties of systems can be associated with different parts of the phase space. In the ball example, for instance, the property having positive momentum is associated with the right half of the phase space; that is, it is associated with the set \(\{x \in X \mid p \gt 0\}\). Generalising this idea we say that to every subset \(A\) of a system’s phase space there corresponds a property \(P_A\) so that the system possesses that property at time \(t\) iff the system’s state \(x\) is in \(A\) at \(t\). The subset \(A\) may be arbitrary and the property corresponding to \(A\) may not be intuitive, unlike, for example, the property of having positive momentum. But nothing in the analysis to follow hangs on a property being ‘intuitive’. We then define the event \(A^t\) as the obtaining of \(P_A\) at time \(t\).
At every time \(t\) there is a matter of fact whether \(P_A\) obtains, which is determined by the dynamics of the system. However, we may not know whether or not this is the case. We therefore introduce epistemic probabilities expressing our uncertainty about whether \(P_A\) obtains: \(p(A^t)\) reflects an agent’s degree of belief in \(P_A\)’s obtaining at time \(t\). In the same way we can introduce conditional probabilities: \(p(A^t \mid B^{t_1})\) is our degree of belief in the system having \(P_A\) at \(t\) given that it had \(P_B\) at an earlier time \(t_1\), where \(B\) is also a subset of the system’s phase space. By the usual rule of conditional probability we have \(p(A^t \mid B^{t_1}) = p(A^t \amp B^{t_1}) / p(B^{t_1})\). This can of course be generalised to more than one event: \(p(A^t \mid B_{1}^{t_1} \amp \ldots \amp B_{r}^{t_r})\) is our degree of belief in the system having \(P_A\) at \(t\) given that it had \(P_{B_1}\) at \(t_1\), \(P_{B_2}\) at \(t_2 ,\ldots\), and \(P_{B_{r}}\) at \(t_r\), where \(B_1 ,\ldots ,B_r\) are subsets of the system’s phase space (and \(r\) a natural number), and \(t_1 ,\ldots ,t_r\) are successive instants of time (i.e., \(t \gt t_1 \gt \ldots \gt t_r\)).
Intuitively, an event in the past is relevant if taking the past event into account makes a difference to our predictions, or more specifically if it lowers or raises the probability of a future event. In other words, \(p(A^t \mid B^{t_1}) - p(A^t)\) is a measure of the relevance of \(B^{t_1}\) to predicting \(A^t\): \(B^{t_1}\) is positively relevant if \(p(A^t \mid B^{t_1}) - p(A^t) \gt 0\), negatively relevant if \(p(A^t \mid B^{t_1}) - p(A^t) \lt 0\), and irrelevant if \(p(A^t \mid B^{t_1}) - p(A^t) = 0\). For technical reasons it turns out to be easier to work with a slightly different but equivalent notion of relevance, which is obtained from the above by multiplying it by \(p(B^{t_1})\). Therefore we adopt the following definition. The relevance of \(B^{t_1}\) for \(A^t\) is
\[\tag{R} R(B^{t_1}, A^t) = p(A^t \amp B^{t_1}) - p(A^t)p(B^{t_1}). \]The generalisation of this definition to cases with more than one set \(B\) (as above) is straightforward.
Relevance serves to explicate unpredictability. Intuitively, the less relevant past events are for \(A^t\), the less predictable the system is. This basic idea can then be refined in various ways. First, the type of unpredictability we obtain depends on the type of events to which (R) is applied. For instance, the degree of the unpredictability of \(A^t\) increases if its probability is independent not only of \(B^{t_1}\) or other ‘isolated’ past events, but rather of the entire past. Second, the unpredictability of an event \(A^t\) increases if the probabilistic dependence of that event on past events \(B^{t_1}\) decreases rapidly with the increase of the temporal distance between the events. Third, the probability of \(A^t\) may be independent of past events simpliciter, or it may be independent of such events only on average. These ideas underlie the analysis of EH as a hierarchy of unpredictability.
Before we can provide such an analysis, two further steps are needed. First, if the probabilities are to be useful for understanding randomness in a dynamical system, the probability assignment has to reflect the properties of the system. So we have to connect the above probabilities to features of the system. The natural choice is the system’s measure \(\mu\).[21] So we postulate that the probability of an event \(A^t\) is equal to the measure of the set \(A\): \(p(A^{t}) = \mu(A)\) for all \(t\). This can be generalised to joint probabilities as follows:
\[\tag{P} p(A^{t} \wedge B^{t_1}) = \mu(A \cap T_{t_1 \rightarrow t}B), \] for all instants of time \(t \gt t_1\) and all subsets \(A\) and \(B\) of the system’s phase space. \(T_{t_1 \rightarrow t}B\) is the image of the set \(B\) under the dynamics of the system from \(t_1\) to \(t\). We refer to this postulate as the Probability Postulate (P), which is illustrated in Figure 8. Again, this condition is naturally generalised to cases of joint probabilities of \(A^t\) with multiple events \(B^{t_i}\). Granted (P) and its generalisation, (R) reflects the dynamical properties of systems.

Figure 8: Condition (P).
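To see the Probability Postulate (P) at work, consider a toy discrete system of our own choosing: the circle rotation \(T(x) = x + 1/4 \pmod 1\) with the Lebesgue measure. By (P), the joint probability of being in \(A\) now and in \(B\) \(n\) steps earlier is the measure of \(A \cap T_n B\), which for this simple map we can approximate with a midpoint Riemann sum:

```python
ALPHA = 0.25  # rotation angle; hypothetical choice for illustration

def in_A(x):
    return x < 0.5   # A = [0, 1/2)

def in_B(x):
    return x < 0.5   # B = [0, 1/2)

def joint_measure(n, grid=100_000):
    # Estimate mu(A ∩ T_n B) for the rotation T(x) = x + ALPHA (mod 1)
    # by a midpoint Riemann sum over the unit interval. A point x lies in
    # T_n B iff its n-step preimage x - n*ALPHA (mod 1) lies in B.
    hits = 0
    for i in range(grid):
        x = (i + 0.5) / grid
        pre = (x - n * ALPHA) % 1.0
        if in_A(x) and in_B(pre):
            hits += 1
    return hits / grid

print(joint_measure(1))  # 0.25: A ∩ T_1 B = [1/4, 1/2)
print(joint_measure(2))  # 0.0:  A ∩ T_2 B is empty
```

Note that the joint probabilities oscillate with \(n\) rather than settling down, as one expects for a rotation, which is ergodic (for irrational angles) but not mixing.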
Before briefly introducing the next element of the analysis, let us mention that there is a question about whether the association of probabilities with the measure of the system is reasonable. Prima facie, a measure on a phase space can have a purely geometrical interpretation and need not have anything to do with the quantification of uncertainty. For instance, we can use a measure to determine the length of a table, but this measure need not have anything to do with uncertainty. Whether or not such an association is legitimate depends on the case at hand and the interpretation of the measure. However, for systems of interest in statistical physics it is natural, and indeed standard, to assume that the probability of the system’s state being in a particular subset \(A\) of the phase space \(X\) is proportional to the measure of \(A\).
The last element to be introduced is the notion of the correlation between two subsets \(A\) and \(B\) of the system’s phase space, which is defined as follows:
\[\tag{C} C(A, B) = \mu(A \cap B) - \mu(A)\mu(B). \] If the value of \(C(A, B)\) is positive (negative), there is positive (negative) correlation between \(A\) and \(B\); if it is zero, then \(A\) and \(B\) are uncorrelated. It then follows immediately from the above that
\[\tag{RC} R(B^{t_1}, A^t) = C(T_{t_1\rightarrow t}B, A). \] (RC) constitutes the basis for the interpretation of EH as a hierarchy of objective randomness. Granted this equation, the subjective probabilistic relevance of the event \(B^{t_1}\) for the event \(A^t\) reflects objective dynamical properties of the system, since for different transformations \(T\), \(R(B^{t_1}, A^t)\) will indicate different kinds of probabilistic relevance of \(B^{t_1}\) for \(A^t\).
To put (RC) to use, it is important to notice that the equations defining the various levels of EH above can be written in terms of correlations. Taking into account that we are dealing with discrete systems (and hence we have \(T_{t_1\rightarrow t}B = T_k B\), where \(k\) is the number of time steps it takes to get from \(t_1\) to \(t\)), the conditions for ergodicity, weak mixing, and strong mixing read:

\[ \lim_{n\rightarrow\infty} \frac{1}{n}\sum_{k=0}^{n-1} C(T_k B, A) = 0, \]

\[ \lim_{n\rightarrow\infty} \frac{1}{n}\sum_{k=0}^{n-1} \lvert C(T_k B, A)\rvert = 0, \]

\[ \lim_{n\rightarrow\infty} C(T_n B, A) = 0, \]

respectively; the K-mixing condition requires that \(C(B', A)\) vanishes in the limit \(n \rightarrow \infty\) uniformly over all sets \(B'\) in \(\sigma(n, r)\), and the Bernoulli condition requires that \(C(T_k A_i, A_j) = 0\) for all sets \(A_i, A_j\) of the Bernoulli partition and all \(k \ge 1\).
Applying (RC) to these expressions, we can explicate the nature of the unpredictability that each of the different levels of EH involves.
Let us start at the top of EH. In Bernoulli systems the probabilities of the present state are totally independent of whatever happened in the past, even if the past is only one time step back. So knowing the past of the system does not improve our predictive abilities in the least; the past is simply irrelevant to predicting the future. This fact is often summarised in the slogan that Bernoulli systems are as random as a coin toss. We should emphasise, however, that this is true only for events in the Bernoulli partition; the characterisation of a Bernoulli system is silent about what random properties partitions other than the Bernoulli partition have.
K-mixing is more difficult to analyse. We now have to tackle the question of how to understand \(\sigma(n, r)\), the minimal \(\sigma\)-algebra generated by the set
\[ \{T_k A_j \mid k \ge n ; j = 1, \ldots ,r\} \] that we sidestepped earlier on. What matters for our analysis is that \(\sigma(n, r)\) contains sets of the following type:

\[ T_k A_{j_0} \cap T_{k +1}A_{j_1} \cap T_{k +2}A_{j_2} \cap \ldots, \]

where the indices \(j_i\) range over \(1, \ldots ,r\). Since we are free to choose the sets \(A_1, \ldots, A_r\) as we please, we can always choose them so that they are the past history of the system: the system was in \(A_{j_0}\) \(k\) time steps back, in \(A_{j_1}\) \(k+1\) time steps back, etc. Call this the (coarse-grained) remote past of the system—‘remote’ because we only consider states that are more than \(k\) time steps back. The K-mixing condition then says that the system’s entire remote past history becomes irrelevant to predicting what happens in the future as time tends towards infinity. Typically, Bernoulli systems are compared with K-systems by focussing on the events in the Bernoulli partition. With respect to that partition, K is weaker than Bernoulli. The difference concerns both the limit and the remote history. In a Bernoulli system the future is independent of the entire past (not only the remote past), and this is true without taking a limit (in the case of K-mixing, independence only obtains in the limit). However, this only holds for the Bernoulli partition; it may or may not hold for other partitions—the definition of a Bernoulli system says nothing about that case.[22]
The interpretation of strong mixing is now straightforward. It says that for any two sets \(A\) and \(B\), having been in \(B\) \(k\) time steps back becomes irrelevant to the probability of being in \(A\) sometime in the future as time tends towards infinity (i.e., as \(n\) tends to infinity). In other words, past events \(B\) become increasingly irrelevant to the probability of \(A\) as the temporal distance between \(A\) and \(B\) becomes larger. This condition is weaker than K-mixing because it only states that the future is independent of isolated events in the remote past, while K-mixing implies independence of the entire remote past history.
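This decay of correlations can be illustrated numerically. The following Monte Carlo sketch is our own construction (the sets, sample size, and seed are arbitrary choices): it uses Arnold’s cat map on the unit torus, a standard example of a strongly mixing system, and estimates \(C(T_n B, A)\) for \(A = B = \{(x, y) : x \lt 1/2\}\).

```python
import random

def cat_map(x, y):
    # Arnold's cat map on the unit torus: (x, y) -> (x + y, x + 2y) mod 1.
    # It is invertible and preserves the Lebesgue measure.
    return (x + y) % 1.0, (x + 2.0 * y) % 1.0

def estimate_correlation(n, samples=100_000, seed=0):
    # Monte Carlo estimate of C(T_n B, A) = mu(A ∩ T_n B) - mu(A) mu(B)
    # for A = B = {(x, y) : x < 1/2}. Because the map is invertible and
    # measure preserving, mu(A ∩ T_n B) = mu(T^{-n}A ∩ B), which we estimate
    # by drawing a uniform point, checking it lies in B, and checking whether
    # its image after n steps lies in A.
    rng = random.Random(seed)
    joint = hits_a = hits_b = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        b = x < 0.5                      # starting point lies in B
        for _ in range(n):
            x, y = cat_map(x, y)
        a = x < 0.5                      # image after n steps lies in A
        joint += a and b
        hits_a += a
        hits_b += b
    return joint / samples - (hits_a / samples) * (hits_b / samples)

print(estimate_correlation(0))   # close to mu(A ∩ B) - mu(A)mu(B) = 0.25
print(estimate_correlation(10))  # close to 0: the correlation has decayed
```

At \(n = 0\) the estimate is close to \(1/4\), while after ten steps it is already indistinguishable from zero at this sample size, in line with the fast decay of correlations in this system.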
In weakly mixing systems the past may be relevant to predicting the future, even in the remote past. The weak mixing condition only says that this influence must be weak enough for it to be the case that the absolute value of the correlations between a future event and past events vanishes on average; but this does not mean that all individual correlations vanish. So in weakly mixing systems events in the past can remain relevant to the future.
Ergodicity, finally, implies no decay of correlations at all. The ergodicity condition only says that the average of the correlations (this time without an absolute value) of all past events with a future event is zero. But this is compatible with there being strong correlations between every instant in the past and the future, provided that positive and negative correlations average out. So in ergodic systems the past does not become irrelevant. For this reason, ergodic systems are not random at all (in the sense of randomness introduced above).
How relevant are these insights to understanding the behaviour of actual systems? A frequently heard objection (which we have already encountered in Section 4) is that EH and more generally ergodic theory are irrelevant since most systems (including those that we are ultimately interested in) are not ergodic at all.[23]
This charge is less acute than it appears at first glance. First, it is important to emphasise that it is not the sheer number of applications that makes a physical concept important, but whether there are some important systems that are ergodic. And there are examples of such systems. For example, so-called ‘hard-ball systems’ (and some more sophisticated variants of them) are effective idealizations of the dynamics of gas molecules, and these systems seem to be ergodic; for details, see Berkovitz, Frigg and Kronz (2006, Section 3.2), Vranas (1998), and Frigg and Werndl (2011).
Furthermore, EH can be used to characterise randomness and chaos in both ergodic and non-ergodic systems. Even if a system as a whole is not ergodic (i.e., if it fails to be ergodic with respect to the entire phase space \(X\)), there can be (and usually there are) subsets of \(X\) on which the system is ergodic. This is what Lichtenberg and Liebermann (1992, p. 295) have in mind when they observe that ‘[i]n a sense, ergodicity is universal, and the central question is to define the subspace over which it exists’. In fact, non-ergodic systems may have subsets that are not only ergodic, but even Bernoulli! It then becomes interesting to ask what these subsets are, what their measures are, and what topological features they have. These are questions studied in parts of dynamical systems theory, most notably KAM theory. Hence, KAM theory does not demonstrate that ergodic theory is not useful in analysing the dynamical behaviour of real physical systems. Indeed, KAM systems have regions in which the system manifests either merely ergodic or Bernoulli behaviour, and accordingly EH is useful for characterising the dynamical properties of such systems (Berkovitz, Frigg and Kronz 2006, Section 4). Further, as we have mentioned in Section 4.1, almost all Hamiltonian systems are non-integrable, and accordingly they have large regions of the phase space in which their motion is ergodic-like. So EH is a useful tool in studying the dynamical properties of systems even if the system fails to be ergodic tout court.
Another objection is that EH is irrelevant in practice because most levels of EH (in fact, all except Bernoulli) are defined in terms of infinite time limits and hence remain silent about what happens in finite time. But all we ever observe are finite times, and so EH is irrelevant to physics as practised by actual scientists.
This charge can be dispelled by a closer look at the definition of a limit, which shows that infinite limits in fact have important implications for the dynamical behaviour of the system in finite times. The definition of a limit is as follows (where \(f\) is an arbitrary function of time): \(\lim_{t\rightarrow \infty} f(t) = c\) iff for every \(\varepsilon \gt 0\) there exists a \(t' \gt 0\) such that for all \(t \gt t'\) we have \(\lvert f(t) - c\rvert \lt \varepsilon\). In words: for every number \(\varepsilon\), no matter how small, there is a finite time \(t'\) after which the values of \(f\) differ from \(c\) by less than \(\varepsilon\). That is, once we are past \(t'\) the values of \(f\) never move more than \(\varepsilon\) away from \(c\). With this in mind, strong mixing, for instance, says that for a given threshold \(\varepsilon\) there exists a finite time \(t_n\) (\(n\) units of time after the current time) after which \(C(T_n B, A)\) is always smaller than \(\varepsilon\). We are free to choose \(\varepsilon\) to be an empirically relevant margin, and so we know that if a system is mixing, we should expect the correlations between the states of the system after \(t_n\) and its current state to be below \(\varepsilon\). The upshot is that in strong mixing systems, being in a state \(B\) at some past time becomes increasingly irrelevant to the probability of being in the state \(A\) now, as the temporal distance between \(A\) and \(B\) becomes larger. Thus, the fact that a system is strong mixing clearly has implications for its dynamical behaviour in finite times. Furthermore, often (although not always) convergence proofs provide effective bounds on rates of convergence, and these bounds can be used to inform expectations about behaviour at a given time.
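The logic of the finite-time threshold can be made concrete with a toy correlation sequence. Assuming, purely for illustration, that a system's correlations happen to decay as \(C(T_n B, A) = 0.25 \cdot 2^{-n}\), the finite time after which they stay below a chosen \(\varepsilon\) can be found by simple search:

```python
def first_time_below(eps, decay=lambda n: 0.25 * 2.0 ** (-n)):
    # Smallest n with C(T_n B, A) < eps, for the assumed (hypothetical)
    # monotone decay profile C(T_n B, A) = 0.25 * 2^(-n).
    n = 0
    while decay(n) >= eps:
        n += 1
    return n

print(first_time_below(0.1))    # 2: 0.25/4  = 0.0625   < 0.1
print(first_time_below(0.01))   # 5: 0.25/32 = 0.0078125 < 0.01
```

The point is that for any positive \(\varepsilon\), however small, the threshold returned is a finite number; nothing about the limit's being taken at infinity prevents it from constraining finite-time behaviour.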
Since different levels of EH correspond to different degrees of randomness, each explicated in terms of a different type of asymptotic decay of correlations between states of systems at different times, one might suspect that a similar pattern can be found in the rates of decay. That is, one might be tempted to think that EH can equally be characterised as a hierarchy of increasing rates of decay of correlations: a K-system, for instance, which exhibits exponential divergence of trajectories, would be characterised by an exponential rate of decay of correlations, while an SM-system would exhibit a polynomial rate of decay.
This, unfortunately, does not work. Natural as it may seem, EH cannot be interpreted as a hierarchy of increasing rates of decay of correlations. It is a mathematical fact that there is no particular rate of decay associated with each level of EH. For instance, one can construct K-systems in which the decay is as slow as one wishes it to be. So the rate of decay is a feature of particular systems rather than of a level of EH.
The question of how to characterise chaos has been controversially discussed ever since the inception of chaos theory; for a survey see Smith (1998, Ch. 10). An important family of approaches defines chaos using EH. Belot and Earman (1997, 155) state that being strong mixing is a necessary condition, and being a K-system a sufficient condition, for a system to be chaotic. The view that being a K-system is the mark of chaos and that any lower degree of randomness is not chaotic is frequently motivated by two ideas. The first is the idea that chaotic behaviour involves dynamical instability in the form of exponential divergence of nearby trajectories. Thus, since a system involves an exponential divergence of nearby trajectories only if it is a K-system, it is concluded that being a K-system is the crucial condition. It is noteworthy, however, that SM is compatible with there being polynomial divergence of nearby trajectories, and that such divergence sometimes exceeds exponential divergence in the short run. Thus, if chaos is to be closely associated with the rate of divergence of nearby trajectories, SM systems also seem to be candidates for exhibiting chaotic behaviour.
The second common motivation for the view that being a K-system is the mark of chaos is the idea that the shift from zero to positive KS-entropy marks the transition from ‘regular’ to ‘chaotic’ behaviour. This may suggest that having positive KS-entropy is both a necessary and a sufficient condition for chaotic behaviour. Thus, since K-systems have positive KS-entropy while SM systems don’t, it is concluded that K-systems are chaotic whereas SM-systems are not. Why is KS-entropy a mark of chaos? There are three motivations, corresponding to three different interpretations of KS-entropy. First, KS-entropy could be interpreted as entailing dynamical instability in the sense of divergence of nearby trajectories (see Lichtenberg & Liebermann, 1992, p. 304). Second, KS-entropy could be connected to algorithmic complexity (Brudno 1978). Yet, while such complexity is sometimes mentioned as an indication of chaos, it is more difficult to connect it to physical intuitions about chaos. Third, KS-entropy could be interpreted as a generalised version of Shannon’s information-theoretic entropy (see Frigg 2004). According to this approach, positive KS-entropy entails a degree of unpredictability that is sufficiently high to deserve the title ‘chaotic’.[24]
Werndl (2009b) argues that a careful review of all systems that one commonly regards as chaotic shows that strong mixing is the crucial criterion: a system is chaotic just in case it is strong mixing. As she is careful to point out, this claim needs to be qualified: systems are rarely mixing on the entire phase space, but neither are they chaotic on the entire phase space. The crucial move is to restrict attention to those regions of phase space where the system is chaotic, and it then turns out that in these same regions the systems are also strong mixing. Hence Werndl concludes that strong mixing is the hallmark of chaos. And surprisingly this is true also of dissipative systems (i.e., systems that are not measure preserving). These systems have attractors, and they are chaotic on their attractors rather than on the entire phase space. The crucial point then is that one can define an invariant (preserved) measure on the attractor and show that the system is strongly mixing with respect to that measure. So strong mixing can define chaos in both conservative and dissipative systems.
The search for necessary and sufficient conditions for chaos presupposes that there is a clear-cut divide between chaotic and non-chaotic systems. EH may challenge this view, as every attempt to draw a line somewhere to demarcate the chaotic from the non-chaotic systems is bound to be somewhat arbitrary. Ergodic systems are pretty regular, mixing systems are less regular, and the higher positions in the hierarchy exhibit still more haphazard behaviour. But is there one particular point where the transition from ‘non-chaos’ to chaos takes place? Based on the argument that EH is a hierarchy of increasing degrees of randomness and that degrees of randomness correspond to different degrees of unpredictability (see Section 5), Berkovitz, Frigg and Kronz (2006, Section 5.3) suggest that chaos may well be viewed as a matter of degree rather than an all-or-nothing affair. Bernoulli systems are very chaotic, K-systems are slightly less chaotic, SM-systems are still less chaotic, and ergodic systems are non-chaotic. This suggestion connects well with the idea that chaos is closely related to unpredictability.
The ergodic hierarchy has also been used to understand quantum chaos. Castagnino and Lombardi (2007) analyze the problem of quantum chaos as a particular case of the classical limit of quantum mechanics and identify mixing in the classical limit as the condition that a quantum system must satisfy to be nonintegrable. Gomez and Castagnino (2014, 2015) generalize the entire ergodic hierarchy to the quantum context and argue that EH thus generalized is a helpful tool to understand quantum chaos; Fortin and Lombardi (2018) use EH to understand decoherence; and Gomez (2018) discusses the KS-entropy in quantum mixing systems.
Mixing, finally, has also been invoked in understanding the effects of structural model error. Frigg, Bradley, Du and Smith (2014) argue that the distinction between parameter error and structural model error is crucial, and that the latter has a significant and hitherto unappreciated impact on the predictive ability of a model. Mayo-Wilson (2015) points out that to put this observation on a solid foundation we need a notion of structural chaos. He proposes such a notion by appealing to topological mixing.
EH is often regarded as relevant for explicating the nature of randomness in deterministic dynamical systems. It is not clear, however, what notion of randomness this claim invokes. The formal definitions of EH do not make explicit appeal to randomness, and the usual ways of presenting EH do not involve any specification of the notion of randomness that is supposed to underlie EH. As suggested in Section 5, EH can be interpreted as a hierarchy of randomness if degrees of randomness are explicated in terms of degrees of unpredictability, which in turn are explicated in terms of (coherent) conditional degrees of belief. In order for these degrees of belief to be indicative of the system’s dynamical properties, they have to be updated according to the system’s dynamical law. The idea is then that the different levels of EH, except for merely ergodic systems, correspond to different kinds of unpredictability, which correspond to different patterns of decay of correlations between systems’ past states and their present states. Merely ergodic systems seem to display no randomness, as the correlations between their past and present states need not decay at all.
Ergodic theory plays an important role in statistical physics, and EH, or some modification of it, constitutes an important measure of randomness in both Hamiltonian and dissipative systems. It is sometimes argued that EH is by and large irrelevant for physics because real physical systems are not ergodic. But this charge is unwarranted, and a closer look at non-ergodic systems reveals a rather different picture: EH can fruitfully be used in the foundations of statistical mechanics, analyses of randomness, and chaos theory. More recently it has also played a role in understanding laws of nature (Filomeno 2019, List and Pivato 2019).