1. Introduction
This paper investigates a formal measure and a corresponding criterion we developed in order to capture the notion of wholes or entities within Bayesian networks in general and multivariate Markov chains in particular. The main focus of this paper is to establish some formal properties of this criterion.
The main intuition behind wholes or entities is that combinations of some events/phenomena in space(-time) can be considered as more of a single or coherent “thing” than combinations of other events in space(-time). For example, the two halves of a soap bubble (the authors thank Eric Smith for pointing out the example of a soap bubble) together seem to form more of a single thing than one half of a floating soap bubble together with a piece of rock on the ground. Similarly, the soap bubble at time $t_1$ and the “same” soap bubble at time $t_2$ seem more like temporal parts of the same thing than the soap bubble at $t_1$ and the piece of rock at $t_2$. We are trying to formally define and quantify what it is that makes some spatially and temporally extended combinations of parts entities but not others.
We envisage spatiotemporal entities as a way to address not only the problem of spatial identity but also that of temporal identity (also called identity over time [1]). In other words, in addition to determining which events in “space” (e.g., which values of different degrees of freedom) belong to the same structure, spatiotemporal entities should allow the identification of the structure at a time $t_2$ that is the future (or past, if $t_2 < t_1$) of a structure at time $t_1$. Given a notion of identity over time, it becomes possible to capture which things persist and in what way they persist. Without a notion of identity over time, persistence seems undefined: the problem is how to decide whether something persisted from $t_1$ to $t_2$ if we cannot tell what at $t_2$ would count as the future of the original thing.
In everyday experience, problems concerning identity over time are not of great concern. Humans routinely and unconsciously connect perceived events to spatially and temporally extended entities. Nonetheless, the problem has been known since ancient times, in particular with respect to artefacts that exchange their parts over time. A famous example is the Ship of Theseus, which has all of its planks exchanged over time. This leads to the question whether it is still the same ship. From the point of view of physics and chemistry, living organisms also exchange their parts (e.g., the constituting atoms or molecules) over time. In the long term we hope our theory can help to understand identity over time for these cases. For the moment, we are particularly interested in identity over time in formal settings like cellular automata, multivariate Markov chains, and more generally dynamical Bayesian networks. In these cases a formal notion of spatiotemporal entities (i.e., one defining spatial and temporal identity) would allow us to investigate the persistence of entities/individuals formally. The persistence (and disappearance) of individuals is in turn fundamental to Darwinian evolution [2,3]. This suggests that spatiotemporal entities may be important for understanding the emergence of Darwinian evolution in dynamical systems.
Another area in which a formal solution to the problem of identity over time, and thereby entities (in the following, if not stated otherwise, we always mean spatiotemporal entities when we refer to entities), might become important is a theory of intelligent agents that are space-time embedded as described by Orseau and Ring [4]. Agents are examples of entities fulfilling further properties, e.g., exhibition of actions and goal-directedness (cf. e.g., [5]). Using the formalism of reinforcement learning, Legg and Hutter [6] propose a definition of intelligence. Orseau and Ring [4] argue that this definition is insufficient. They dismiss the usual assumption that the environment of the reinforcement agent cannot overwrite the agent’s memory (which in this case is seen as the memory/tape of a Turing machine). They conclude that in the most realistic case there is only ever one memory in which the agent’s (and the environment’s) data are embedded. They note that the difference between agent and environment then disappears, and furthermore that the policy of the agent cannot be freely chosen anymore, only the initial condition. In order to measure intelligence according to Legg and Hutter [6] we must be able to define reward functions. This seems difficult without the capability to distinguish the agent according to some criterion. Towards the end of their publication, Orseau and Ring [4] propose to define a “heart” pattern and use the duration of its existence as a reward. This seems too specific an approach to us since it basically defines identity over time (of the heart pattern) as invariance. In more general settings a pattern that maintains a more general criterion of identity over time would be desirable. Ideally, this criterion would also not need a specifically designed heart pattern. Another advantage would be that reward functions other than lifetime could be used if the agent were identifiable. An entity criterion in the sense of this paper would be a step in this direction.
1.1. Illustration
In order to introduce the contributions of this paper we further illustrate the setting of our work. This illustration should only be taken as a motivation for what follows and not be confused with a result. The reason we do not use a concrete example is simply that we lack the necessary computational means (which are considerable, as we will discuss in Section 5).
Let us assume we are given the entire time-evolution (what we will call a trajectory) of some known multivariate dynamical system or stochastic process, for example, a trajectory of a one-dimensional elementary cellular automaton showing a glider collision like Figure 1a (this is produced by the rule 62 elementary cellular automaton with time increasing from left to right; however, this does not matter here; for more about this system see e.g., Boccara et al. [7]).
We take the point of view, argued for in previous work [8], that entities are phenomena that occur within trajectories and that they can be represented by (spatiotemporal) patterns. Patterns in this sense fix a part of the variables in a trajectory to definite values and leave the rest undetermined. In Figure 1b–d we show such patterns that occur in Figure 1a, with the undetermined variables coloured grey and the determined ones taking on the values of the trajectory. Visually speaking, a pattern is a snippet from a trajectory that it occurs in.
From Figure 1a we would probably expect that what we are seeing are two gliders colliding and forming a third. However, it may also be that one of the gliders absorbs the other, maintains some form of identity, and only changes its appearance (e.g., it “grows”). This highlights the problem of identity over time. While the spatial identity of such patterns has been treated multiple times in the literature, their identity over time is rarely dealt with.
Our approach evaluates the “integration” of spatiotemporally extended patterns at once. According to our proposed entity-criterion, a pattern is an ι-entity if, due to the dynamics of the system, every part of this pattern (which is again a pattern) makes all other parts more probable. Identity over time is then included since future parts have to make past parts more probable and vice versa. In principle this would allow us to detect if one of the gliders absorbs another one without losing its identity. For example, this could result in an entity as in Figure 1e.
In order to detect entities, the straightforward approach is to evaluate the entity-criterion for every spatiotemporal pattern in a given trajectory. Evaluating our entity-criterion of positive complete local integration (CLI) for a given pattern corresponds to splitting the pattern into parts in every possible way and calculating whether all the resulting parts make each other more probable. This means evaluating the specific local integration (SLI) with respect to all partitions of the set of variables occupied by the pattern.
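To get a sense of the cost of this straightforward approach, note that the number of partitions of an n-element index set is the Bell number B(n). A minimal Python sketch (the function name `bell` is ours, not from the paper) computes it via the Bell triangle:

```python
def bell(n: int) -> int:
    """Bell number B(n): the number of partitions of an n-element set,
    computed with the Bell triangle (Aitken's array)."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]  # each new row starts with the last entry of the old one
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[0]

# The count explodes quickly, which is one reason evaluating the
# entity-criterion for every pattern of a trajectory is expensive:
print([bell(n) for n in (1, 2, 3, 5, 10)])  # → [1, 2, 5, 52, 115975]
```

The pattern itself additionally ranges over all subsets of the trajectory's variables and all their value assignments, so the partition count is only one factor of the total cost.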
1.2. Contributions
This paper contains four contributions.
We first give a more formal definition of patterns. Since each pattern uniquely specifies a set of trajectories (those trajectories that the pattern occurs in) one might be tempted to reduce the analysis to that of sets of trajectories. We show that this is not possible since not all sets of trajectories have a pattern that specifies them.
Second, we try to get a general intuition for the patterns whose parts make all other parts more probable. For this we show how to construct patterns that, for a given probability of the whole pattern, achieve the least upper bound of specific local integration (SLI). These turn out to be patterns for which each part occurs if and only if the whole pattern occurs. We also construct a pattern that, again for a given probability of the whole pattern, has negative SLI. These patterns (which may achieve the greatest lower bound of SLI) occur if either the whole pattern occurs or the pattern occurs up to exactly one part of it, which does not occur.
Third, we prove the disintegration theorem. This is the main contribution. We saw that patterns are snippets of trajectories. We can also look at the whole trajectory as a single pattern. Like all patterns the trajectory can be split up into parts, i.e., partitioned, resulting in a set of patterns. Among the partitions we find examples such as those in
Figure 1g,h. These are very particular partitions picking out the gliders among all possible parts. This suggests that finding such special partitions provides a (possibly different) notion of entities.
One intuition we might have is that entities are the most “independent” parts of a trajectory. In other words, we could look for the partition whose parts make the other parts less probable. The disintegration theorem then shows that this approach again leads to the ι-entities. This shows that ι-entities do not only have an intuitive motivation but also play a particular role in the structure of probabilities of entire trajectories.
It is not directly the parts of the partitions that minimise SLI for a trajectory which are ι-entities. To get ι-entities we first classify all partitions of the trajectory according to their SLI value. Then within each such class we choose the partitions for which no refining partition (a refining partition is one that further partitions any of the parts of the original partition) achieves an even lower value of SLI.
So according to the disintegration theorem, an ι-entity is not only a pattern that is integrated with respect to every possible partition of the pattern but also a pattern that occurs in partitions that minimise (in a certain sense) the integration of trajectories.
A side effect of the disintegration theorem is that we naturally get a kind of hierarchy of ι-entities called the disintegration hierarchy. For each trajectory and its different levels of SLI we find different decompositions of the trajectory into ι-entities.
Fourth, we calculate the ι-entities and disintegration hierarchy for two simple example systems. Our example systems show that in general the partitions at a particular disintegration level are not unique. This means that there are overlapping ι-entities at those levels. Furthermore, the same ι-entity can occur on multiple levels of the disintegration hierarchy.
We do not thoroughly discuss the disintegration hierarchies in this paper and postpone this to future publications. Here we only note that many entities in the real world occur within hierarchies as well. For example, animals are entities that are composed of cells which are themselves entities.
1.3. Related Work
We now give a quick overview of related work. More in-depth discussions will be provided after we formally introduce our definitions.
To our knowledge the measure of CLI has been proposed for the first time by us in [
8]. However, this publication contained none of the formal or numerical results in the present paper. From a formal perspective the measures of SLI and CLI are a combination of existing concepts. SLI localises multi-information [
9,
10] in the way proposed by Lizier [
11] for other information theoretic measures. In order to get the CLI we apply the weakest-link approach proposed by Tononi and Sporns [
12], Balduzzi and Tononi [
13] to SLI.
Conceptually, our work is most closely related to Beer [14]. The notion of spatiotemporal patterns used there to capture blocks, blinkers, and gliders is equivalent to the patterns we define more formally here. This work also contains an informal entity-criterion that directly deals with identity over time (not only space). It differs significantly from our proposal as it depends on the re-occurrence of certain transitions at later times in a pattern, whereas our criterion only depends on the probabilities of parts of the patterns without the need for any re-occurrences.
The organisations of chemical organisation theory [15] may also be interpreted as entity-criteria. In Fontana and Buss [15] these are defined in the following way:
The observer will conclude that the system is an organisation to the extent that there is a compressed description of its objects and of their relations.
The direct intuition is different from ours, and it is not clear to us to what extent our entity-criterion is equivalent to this. This will be further investigated in the future.
It is worth noting that viewing entities/objects/individuals as patterns occurring within a trajectory is in contrast to an approach that models them as sets of random variables/stochastic processes (e.g., a set of cells in a CA in contrast to a set of specific values of a set of cells). An example of the latter approach is the information theoretic individuals of Krakauer et al. [16]. These individuals are identified using an information theoretic notion of autonomy due to Bertschinger et al. [17]. The latter notion of autonomy is also somewhat related to the idea of integration here. Autonomy contains a term that measures the degree to which a random variable representing an individual at timestep $t$ determines the random variable representing it at $t+1$. Similarly, CLI requires that every part of an entity pattern makes every other part more probable; in the extreme case this means that every part determines that every other part of the pattern also occurs. However, formally, autonomy evaluates random variables and not patterns directly.
At the most basic level, the intuition behind entities is that some spatiotemporal patterns are more special than others. Defining (and usually finding) more important spatiotemporal patterns or structures (also called coherent structures) has a long history in the theory of cellular automata and distributed dynamical systems. As Shalizi et al. [18] have argued, most of the earlier definitions and methods [19,20,21,22] require previous knowledge about the patterns being looked for. They are therefore not suitable for a general definition of entities. More recent definitions based on information theory [18,23,24] no longer have this limitation. The difference to our entity-criterion is that they do not treat identity over time. They are well suited to identify gliders at each time-step, for example, but if two gliders collide and give rise to a third glider as in Figure 1a these methods (by design) say nothing about the identity of the third glider, i.e., they cannot distinguish between a glider absorbing another one and two gliders producing a new one. While we have not been able to show that our approach actually makes such distinctions for gliders, it could do so in principle.
We note here that the approach of identifying individuals by Friston [
25] using Markov blankets has the same shortcoming as the spatiotemporal filters. For each individual time-step it returns a partition of all degrees of freedom into internal, sensory, active, and external degrees. However, it does not provide a way to resolve ambiguities in the case of multiple such partitions colliding.
Among research related to integrated information theory (IIT) there are approaches (a first one by Balduzzi [26] and a more recent one by Hoel et al. [27]) that can be used to determine specific spatiotemporal patterns in a trajectory. They can therefore be interpreted to define a notion of entities even if that is not their main goal. These approaches are aimed at establishing the optimal spatiotemporal coarse-graining to describe the dynamics of a system. For a given trajectory we can then identify the patterns that instantiate a macro-state/coarse-grain that is optimal according to their criterion.
In contrast to our approach the spatiotemporal grains are determined by their interactions with other grains. In our case the entities are determined first and foremost by their internal relations.
The consequence seems to be that a pattern can be an entity in one trajectory and not an entity in another even if it occurs in both. In our conception a pattern is an entity in all trajectories it occurs in.
2. Notation and Background
In this section we briefly introduce our notation for sets of random variables (Since every set of jointly distributed random variables can be seen as a Bayesian network and vice versa we use these terms interchangeably.) and their partition lattices.
In general, we use the convention that upper-case letters are random variables, lower-case letters are specific values/outcomes of random variables, and calligraphic letters are state spaces that random variables take values in. Furthermore:
Definition 1. Let $X_V := \{X_i\}_{i \in V}$ be a set of random variables with totally ordered finite index set $V$ and state spaces $\{\mathcal{X}_i\}_{i \in V}$ respectively. Then for $A \subseteq V$ define:
- 1.
$X_A := (X_i)_{i \in A}$ as the joint random variable composed of the random variables indexed by $A$, where $A$ is ordered according to the total order of $V$,
- 2.
$\mathcal{X}_A := \prod_{i \in A} \mathcal{X}_i$ as the state space of $X_A$,
- 3.
$x_A \in \mathcal{X}_A$ as a value of $X_A$,
- 4.
$p_A : \mathcal{X}_A \to [0,1]$ as the probability distribution (or more precisely probability mass function) of $X_A$, which is the joint probability distribution over the random variables indexed by $A$. If $A = \{i\}$, i.e., a singleton set, we drop the braces and just write $p_i$,
- 5.
$p_{A,B}(x_A, x'_B)$ as the probability distribution over $\mathcal{X}_A \times \mathcal{X}_B$. Note that in general, for arbitrary $A, B \subseteq V$, $x_A \in \mathcal{X}_A$, and $x'_B \in \mathcal{X}_B$, this can be rewritten as a distribution over the intersection of $A$ and $B$ and the respective complements. The variables in the intersection have to coincide:
$$p_{A,B}(x_A, x'_B) = \delta_{x_{A \cap B}, x'_{A \cap B}} \; p_{A \cup B}(x_{A \setminus B}, x_{A \cap B}, x'_{B \setminus A}).$$
Here δ is the Kronecker delta (see Appendix A). If $A$ and $B$ are clear from context we also write $p(x_A, x'_B)$ to keep expressions shorter,
- 6.
$p_{A|B}(x_A \mid x_B) := p_{A,B}(x_A, x_B) / p_B(x_B)$, with $p_B(x_B) > 0$, as the conditional probability distribution over $\mathcal{X}_A$ given $x_B$. We also just write $p(x_A \mid x_B)$ if it is clear from context what variables we are conditioning on.
If we are given $p_V$ we can obtain every $p_A$ with $A \subseteq V$ through marginalisation. In the notation of Definition 1 this is formally written:
$$p_A(x_A) = \sum_{x_{V \setminus A} \in \mathcal{X}_{V \setminus A}} p_V(x_A, x_{V \setminus A}).$$
Next we define the partition lattice of a set of random variables. Partition lattices occur as a structure of the set of possible ways to split an object/pattern into parts. Subsets of the partition lattices play an important role in the disintegration theorem.
Definition 2 (Partition lattice of a set of random variables)
. Let $X_V$ be a set of random variables.
- 1.
Then its partition lattice $\mathfrak{L}(V)$ is the set of partitions of $V$ partially ordered by refinement (see also Appendix B).
- 2.
For two partitions $\pi, \rho \in \mathfrak{L}(V)$ we write $\pi \leq \rho$ if π refines ρ and $\pi \lessdot \rho$ if π covers ρ. The latter means that $\pi \leq \rho$, $\pi \neq \rho$, and there is no $\xi \in \mathfrak{L}(V)$ with $\xi \notin \{\pi, \rho\}$ such that $\pi \leq \xi \leq \rho$.
- 3.
We write $\hat{0}$ for the zero element of a partially ordered set (including lattices) and $\hat{1}$ for the unit element.
- 4.
Given a partition $\pi \in \mathfrak{L}(V)$ and a subset $A \subseteq V$ we define the restricted partition $\pi|_A$ of π to $A$ via:
$$\pi|_A := \{ b \cap A : b \in \pi, \, b \cap A \neq \emptyset \}.$$
For some examples of partition lattices see Appendix B, and for more background see e.g., Grätzer [28]. For our purpose it is important to note that the partitions of sets of random variables or Bayesian networks we are investigating are partitions of the index set $V$ and not partitions of the state spaces $\mathcal{X}_i$.
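For small index sets the partition lattice can be enumerated directly. The following sketch (function names are ours) generates all partitions of an index set, tests refinement, and computes the restricted partition described in Definition 2:

```python
def partitions(elems):
    """Enumerate all set partitions of a list of hashable elements.
    Each partition is a list of blocks (Python sets)."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in partitions(rest):
        # insert `first` into each existing block, or into a new singleton block
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]
        yield part + [{first}]

def refines(pi, rho):
    """True iff pi refines rho: every block of pi lies inside a block of rho."""
    return all(any(b <= c for c in rho) for b in pi)

def restrict(pi, A):
    """Restricted partition pi|_A: intersect blocks with A, drop empty ones."""
    return [b & A for b in pi if b & A]

parts = list(partitions([1, 2, 3]))  # Bell(3) = 5 partitions
```

The finest partition refines every other one, e.g. `refines([{1}, {2}, {3}], [{1, 2, 3}])` holds, while the converse does not.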
3. Patterns, Entities, Specific, and Complete Local Integration
This section contains the formal part of this contribution.
First we introduce patterns. Patterns are the main structures of interest in this publication. Entities are seen as special kinds of patterns. The measures of specific local integration and complete local integration, which we use in our criterion for ι-entities, quantify notions of “oneness” of patterns. We give a brief motivation and show that while each pattern defines a set of “trajectories” of a set of random variables, not every such set is defined by a pattern. This justifies studying patterns for their own sake.
Then we briefly motivate the use of specific and complete local integration (SLI and CLI) for an entity criterion on patterns. We then turn to more formal aspects of SLI and CLI. We first prove an upper bound for SLI and construct a candidate for a lower bound. We then go on to define the disintegration hierarchy and its refinement-free version. These structures are used to prove the main result, the disintegration theorem. This relates the SLI of whole trajectories of a Bayesian network to the CLI of parts of these trajectories and vice versa.
3.1. Patterns
This section introduces the notion of patterns. These form the basic candidate structures for entities.
The structures we are trying to capture by entities should be analogous to spatially and temporally extended objects we encounter in everyday life (e.g., soap bubbles, living organisms). These objects seem to occur in the single history of the universe that also contains us. The purpose of patterns is then to capture arbitrary structures that occur within single trajectories or histories of a multivariate discrete dynamical system (see Figure 2 for an example of a Bayesian network of such a system; any cellular automaton is also such a system).
We emphasise the single trajectory since many structures of interest (e.g., gliders) occur in some trajectories in some “places”, in other trajectories in other “places” (compare e.g., Figure 1a and Figure 3a), and in some trajectories not at all. We explicitly want to be able to capture such trajectory-dependent structures and therefore choose patterns. Examples of formal structures for which it makes no sense to say that they occur within a trajectory are the random variables in a Bayesian network and, as we will see, general sets of trajectories of the Bayesian network.
Unlike entities, which we conceive of as special patterns that fulfil further criteria, patterns are formed by any combination of events at arbitrary times and positions. As an example, we might think of a cellular automaton again. The time evolutions over multiple steps of the cells attributed to a glider (see [14] for a principled way to attribute cells to gliders) as in Figure 1b,e should be patterns, but so should arbitrary choices of events in a trajectory as in Figure 3b.
In the more general context of (finite) Bayesian networks there may be no interpretation of time or space. Nonetheless, we can define that a trajectory in this case fixes every random variable to a particular value. We then define patterns formally in the following way.
Definition 3 (Patterns and trajectories)
. Let $X_V$ be a set of random variables with index set $V$ and state spaces $\{\mathcal{X}_i\}_{i \in V}$ respectively.
- 1.
A pattern at $A \subseteq V$ is an assignment $x_A = (x_i)_{i \in A}$ with $x_i \in \mathcal{X}_i$, i.e., an element $x_A \in \mathcal{X}_A$. If there is no danger of confusion we also just write $x_A$ for the pattern at $A$.
- 2.
The elements $x_V$ of the joint state space $\mathcal{X}_V$ are isomorphic to the patterns at $V$, which fix the complete set of random variables. Since they will be used repeatedly we refer to them as the trajectories of $X_V$.
- 3.
A pattern $x_A$ is said to occur in trajectory $\bar{x}_V$ if $\bar{x}_A = x_A$.
- 4.
Each pattern $x_A$ uniquely defines (or captures) a set of trajectories via
$$\mathfrak{T}(x_A) := \{ \bar{x}_V \in \mathcal{X}_V : \bar{x}_A = x_A \},$$
i.e., the set of trajectories that $x_A$ occurs in.
- 5.
It is convenient to allow the empty pattern $x_\emptyset$, for which we define $\mathfrak{T}(x_\emptyset) := \mathcal{X}_V$.
Remarks:
Since every pattern defines a subset of $\mathcal{X}_V$, one could think that every subset of $\mathcal{X}_V$ is also captured by a pattern. In that case studying patterns in a set of random variables $X_V$ would be the same as studying subsets of its set of trajectories $\mathcal{X}_V$. However, the set of subsets of $\mathcal{X}_V$ defined by patterns and the set of all subsets (i.e., the power set) of $\mathcal{X}_V$ are not identical. Formally:
While patterns define subsets of $\mathcal{X}_V$, not every subset of $\mathcal{X}_V$ is captured by a pattern. The difference between the two sets is characterised in Theorem 1 below. We first present a simple example of a subset that cannot be captured by a pattern.
Let $V = \{1, 2\}$ and $X_V = \{X_1, X_2\}$ be the set of random variables with $\mathcal{X}_1 = \mathcal{X}_2 = \{0, 1\}$. Then $\mathcal{X}_V = \{(0,0), (0,1), (1,0), (1,1)\}$. Now let $A = V$, choose pattern $x_V = (0,0)$ and pattern $x'_V = (1,1)$. Then let
$$\mathfrak{X} := \mathfrak{T}(x_V) \cup \mathfrak{T}(x'_V) = \{(0,0), (1,1)\}.$$
In this case we can easily list the set of all patterns of $X_V$: the empty pattern $x_\emptyset$; the patterns $x_1 = 0$, $x_1 = 1$, $x_2 = 0$, $x_2 = 1$ at the singleton sets; and the four patterns $(0,0), (0,1), (1,0), (1,1)$ at $V$. We can then verify that $\mathfrak{X}$ is not among the sets of trajectories these patterns capture. Before we formally characterise the difference, we define some extra terminology.
Definition 4. Let $X_V$ be a set of random variables with index set $V$ and state spaces $\{\mathcal{X}_i\}_{i \in V}$ respectively. For a subset $\mathfrak{X} \subseteq \mathcal{X}_V$ the set of all patterns at $A \subseteq V$ that occur in one of the trajectories in $\mathfrak{X}$ is defined as
$$\mathfrak{X}_A := \{ x_A \in \mathcal{X}_A : \exists \bar{x}_V \in \mathfrak{X} \text{ with } \bar{x}_A = x_A \}.$$
So in the previous example, $\mathfrak{X}_{\{1\}} = \{0, 1\}$ and $\mathfrak{X}_{\{1,2\}} = \mathfrak{X}$. We then get the following theorem, which establishes the difference between the subsets of $\mathcal{X}_V$ captured by patterns and general subsets.
Theorem 1. Given a set of random variables $X_V$, a subset $\mathfrak{X} \subseteq \mathcal{X}_V$ cannot be represented by a pattern of $X_V$ if and only if there exists $A \subseteq V$ with $\mathfrak{X}_A \subset \mathcal{X}_A$ (proper subset) and $|\mathfrak{X}_A| > 1$, i.e., if neither all patterns at $A$ are possible nor a unique pattern at $A$ is specified by $\mathfrak{X}$.
We saw that in the previous example the subset $\mathfrak{X} = \{(0,0), (1,1)\}$ cannot be captured by a pattern. For $A = \{1\}$ we have $\mathfrak{X}_A = \mathcal{X}_A$ and for $A = \{2\}$ we have $\mathfrak{X}_A = \mathcal{X}_A$, so these do not fulfil the conditions of Theorem 1. However, for $A = \{1,2\}$ we have $\mathfrak{X}_A \subset \mathcal{X}_A$ and $|\mathfrak{X}_A| = 2 > 1$, so the conditions of Theorem 1 are fulfilled and, as expected, $\mathfrak{X}$ cannot be captured by a pattern.
The proof of the following corollary shows how to construct a subset that cannot be represented by a pattern for all sets of random variables with $|\mathcal{X}_V| > 2$.
Corollary 1. Given a set of random variables $X_V$, if $|\mathcal{X}_V| > 2$ then the set of subsets of $\mathcal{X}_V$ captured by patterns is a proper subset of the power set of $\mathcal{X}_V$. Proof. Choose $\mathfrak{X} = \mathcal{X}_V \setminus \{\bar{x}_V\}$ for some $\bar{x}_V \in \mathcal{X}_V$. Then for $A = V$ we have $\mathfrak{X}_A \subset \mathcal{X}_A$ and $|\mathfrak{X}_A| = |\mathcal{X}_V| - 1 > 1$. So $\mathfrak{X}$ cannot be represented by a pattern according to Theorem 1. ☐
This means that in every set of random variables that consists of more than a single binary random variable there are subsets of $\mathcal{X}_V$ that cannot be captured by a pattern. We can interpret this result in the following way. Patterns were constructed to be structures that occur within trajectories. It then turned out that each pattern also defines a subset of all trajectories of a system. So for sets of trajectories captured by patterns it could make sense to say they “occur” within one trajectory. However, there are sets of trajectories that are not captured by patterns. For these sets of trajectories it would not be well-defined to say that they occur within a trajectory. This is the reason we choose to investigate patterns specifically and not sets of trajectories.
3.2. Motivation of Complete Local Integration as an Entity Criterion
We proposed to use patterns as the candidate structures for entities since patterns comprise arbitrary structures that occur within single trajectories of multivariate systems. Here we heuristically motivate our choice of using positive complete local integration as a criterion to select entities among patterns. In general such a criterion would give us, for any Bayesian network, a subset of the patterns.
So what is an entity? We can rephrase the problem of finding an entity criterion by saying an entity is composed of parts that share the same identity. So if we can define when parts share the same identity we also define entities by finding all parts that share identity with some given part. For the moment, let us decompose (as is often done [33]) the problem of identity into two parts: spatial identity and temporal identity. Our solution will make no distinction between these two aspects in the end. We note here that conceiving of entities (or objects) as composites of spatial and temporal parts, as we do in this paper, is referred to as four-dimensionalism or perdurantism in philosophical discussions (see e.g., [34]). The opposing view holds that entities are spatial only and endure over time. This view is called endurantism. Here we will not go into the details of this discussion.
The main intuition behind complete local integration is that every part of an entity should make every other part more probable.
This seems to hold for example for the spatial identity of living organisms. Parts of living organisms rarely exist without the rest of the living organisms also existing. For example, an arm existing without the corresponding rest of a human body is rare compared to an arm existing together with the rest of a human body. The body (without arm) seems to make the existence of the arm more probable and vice versa. Similar relations between parts seem to hold for all living organisms but also for some non-living structures. The best examples of non-living structures we know of for which this is obvious are soap bubbles. Half soap bubbles (or thirds, quarters, ...) only ever exist for split seconds whereas entire soap bubbles can persist for up to minutes. Any part of a soap bubble seems to make the existence of the rest more probable. Similarly, parts of hurricanes or tornadoes are rare. So what about spatial parts of structures that are not so entity-like? Does the existence of an arm make things more probable that are not parts of the corresponding body? For example, does the arm make the existence of some piece of rock more probable? Maybe to a small degree, as without the existence of any rocks in the universe humans are probably impossible. However, this effect is much smaller than the increase of probability of the existence of the rest of the body due to the arm.
These arguments concerned the spatial identity problem. However, for temporal identity similar arguments hold. The existence of a living organism at one point in time makes it more probable that there is a living organism (in the vicinity) at a subsequent (and preceding) point in time. If we look at structures that are not entity-like with respect to the temporal dimension we find a different situation. An arm at some instance of time does not make the existence of a rock at a subsequent instance much more probable. It does make the existence of a human body at a subsequent instance much more probable. So the human body at the second instance seems to be more like a future part of the arm than the rock.
Switching now to patterns in sets of random variables, we can easily formalise such intuitions. We required that for an entity every part of the structure, which is now a pattern $x_O$, makes every other part more probable. A part of a pattern $x_O$ is a pattern $x_A$ with $\emptyset \neq A \subset O$. If we require that every part of a pattern makes every other part more probable then we can write that $x_O$ is an entity if:
$$\forall A \text{ with } \emptyset \neq A \subset O : \; p(x_A \mid x_{O \setminus A}) > p(x_A).$$
If we write $\mathfrak{L}_2(O)$ for the set of all bipartitions of $O$ we can rewrite this further as
$$\forall \pi \in \mathfrak{L}_2(O) : \; p(x_O) > \prod_{b \in \pi} p(x_b).$$
We can interpret this form as requiring that for every possible partition $\pi = \{b_1, b_2\}$ into two parts the probability of the whole pattern $x_O$ is bigger than its probability would be if the two parts were independent. To see this, note that if the two parts $x_{b_1}, x_{b_2}$ were independent we would have
$$p(x_O) = p(x_{b_1}) \, p(x_{b_2}),$$
which would give us $p(x_O) = \prod_{b \in \pi} p(x_b)$ for this partition.
From this point of view the choice of bipartitions only seems arbitrary. For example, the existence of a partition $\pi = \{b_1, b_2, b_3\}$ into three parts such that
$$p(x_O) \leq \prod_{b \in \pi} p(x_b)$$
seems to suggest that the pattern $x_O$ is not an entity but instead a composite of three parts. We can therefore generalise Equation (16) to include all partitions $\pi \in \mathfrak{L}(O)$ (see Definition 2) of $O$ except the unit partition $\hat{1} = \{O\}$. Then we would say that $x_O$ is an entity if
$$\forall \pi \in \mathfrak{L}(O) \setminus \{\hat{1}\} : \; p(x_O) > \prod_{b \in \pi} p(x_b).$$
This criterion already results in the same entities as the measure we propose. However, in order to connect with information theory, log-likelihoods, and related literature we formally introduce the logarithm into this equation. We then arrive at the following entity-criterion:
$$\min_{\pi \in \mathfrak{L}(O) \setminus \{\hat{1}\}} \log \frac{p(x_O)}{\prod_{b \in \pi} p(x_b)} > 0,$$
where the left hand side is the complete local integration (CLI), the function minimised is the specific local integration (SLI), and the inequality provides the criterion for ι-entities. For reference, we define these notions formally. We begin with SLI, which quantifies, for a given partition $\pi$ of a pattern, to what extent the probability of the whole pattern is bigger than its probability would be if the blocks of the partition were independent.
Definition 5 (Specific local integration (SLI)). Given a Bayesian network $\{X_V\}$ and a pattern $x_O$, the specific local integration of $x_O$ with respect to a partition π of O is defined as
$$\mathrm{mi}_\pi(x_O) := \log \frac{p(x_O)}{\prod_{b \in \pi} p(x_b)}.$$
In this paper we use the convention that $\log \frac{0}{0} := 0$.
Definition 6 ((Complete) local integration). Given a Bayesian network $\{X_V\}$ and a pattern $x_O$ of this network, the complete local integration of $x_O$ is the minimum SLI over the non-unit partitions:
$$\iota(x_O) := \min_{\pi \in \mathfrak{L}(O) \setminus \{\{O\}\}} \mathrm{mi}_\pi(x_O),$$
where $\mathfrak{L}(O)$ is the lattice of partitions of O. We call a pattern $x_O$ completely locally integrated if $\iota(x_O) > 0$.
Definition 7 (ι-entity). Given a multivariate Markov chain, a pattern $x_O$ is a ι-entity if $\iota(x_O) > 0$. The entire set of ι-entities is then defined as follows.
Definition 8 (ι-entity-set). Given a multivariate Markov chain, the ι-entity-set is the entity-set consisting of all completely locally integrated patterns, i.e., all patterns with positive complete local integration. Next, we look at some interpretations that the introduction of the logarithm allows.
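To make Definitions 5-8 concrete, the following minimal sketch computes SLI and CLI for an assumed toy joint distribution over three binary variables; the distribution and the function names `sli` and `cli` are our own illustration, not part of the formal development.

```python
import math
from itertools import product

# Assumed toy joint distribution: two strongly correlated "all equal"
# states, remaining probability spread uniformly over the other states.
p = {x: (0.4 if x in ((0, 0, 0), (1, 1, 1)) else 0.2 / 6)
     for x in product((0, 1), repeat=3)}

def marginal(block, pattern):
    """Probability of the restriction of `pattern` to the nodes in `block`."""
    return sum(pr for x, pr in p.items()
               if all(x[i] == pattern[i] for i in block))

def partitions(items):
    """All partitions of the list `items`, each given as a list of blocks."""
    if len(items) == 1:
        return [[items]]
    first, rest = items[0], items[1:]
    out = []
    for s in partitions(rest):
        out += [s[:i] + [[first] + s[i]] + s[i + 1:] for i in range(len(s))]
        out.append([[first]] + s)
    return out

def sli(pattern, pi):
    """Specific local integration of `pattern` w.r.t. partition `pi` (Def. 5)."""
    whole = marginal(sorted(pattern), pattern)
    return math.log(whole / math.prod(marginal(b, pattern) for b in pi))

def cli(pattern):
    """Complete local integration: minimum SLI over non-unit partitions (Def. 6)."""
    non_unit = [pi for pi in partitions(sorted(pattern)) if len(pi) > 1]
    return min(sli(pattern, pi) for pi in non_unit)

pattern = {0: 1, 1: 1, 2: 1}
print(cli(pattern) > 0)  # positive CLI: the pattern is a iota-entity
```

Here the all-ones pattern has positive CLI and therefore qualifies as a ι-entity; weakening the correlation in `p` drives the CLI towards zero.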
3.3. Properties of Specific Local Integration
This section investigates the specific local integration (SLI) (see Definition 5). After giving its expression for deterministic systems it proves upper bounds constructively and constructs an example of negative SLI.
3.3.1. Deterministic Case
Theorem 2 (Deterministic specific local integration). Given a deterministic Bayesian network (Definition A10), a uniform initial distribution over $X_{V_0}$ ($V_0$ is the set of nodes without parents), and a pattern $x_O$, the SLI of $x_O$ with respect to partition π can be expressed more specifically. Let $N$ be the total number of trajectories and let $n(x_A)$ refer to the number of trajectories in which a pattern $x_A$ occurs. Then
$$\mathrm{mi}_\pi(x_O) = (|\pi| - 1)\log N + \log n(x_O) - \sum_{b \in \pi} \log n(x_b).$$
The first term in Equation (30) is always positive if the partition and the set of random variables are not trivial (i.e., have cardinality larger than one), and it is a constant for partitions of a given cardinality. The second term is always non-negative for patterns $x_O$ that actually occur in the system and rises with the number of trajectories that lead to them. The third term is always non-positive and becomes more and more negative the higher the number of trajectories that lead to the parts of the pattern occurring.
This shows that to maximise SLI for fixed partition cardinality we need to find patterns that have a high number of trajectories leading to them and a low number of occurrences for all their parts. Since the number of occurrences of the parts cannot be lower than the number of occurrences of the whole, we should get a maximum SLI for patterns whose parts occur only if the whole occurs. This turns out to be true also for non-deterministic systems, as we prove in Theorem 4.
Conversely, if we can increase the number of occurrences of the parts of the pattern without increasing the occurrences of the whole pattern, we minimise the SLI. This leads to the intuition that as often as possible as many parts as possible (i.e., all but one) should co-occur without the whole pattern occurring. This consistently leads to negative SLI, as we will show for the non-deterministic case in Theorem 5.
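The trajectory-counting expression above can be checked on a small assumed deterministic toy system; the dynamics, pattern, and partition below are chosen purely for illustration.

```python
import math
from itertools import product

# Assumed toy deterministic system: two binary cells, one time step, and
# a next state that copies cell 0 into both cells; the initial distribution
# is uniform over the 4 initial states, so p(x_A) = n(x_A)/N.
def step(state):
    a, b = state
    return (a, a)  # both cells copy cell 0

trajs = [(s, step(s)) for s in product((0, 1), repeat=2)]
N = len(trajs)

def count(pattern):
    """Number of trajectories in which the pattern, a (time, cell) -> value
    map, occurs."""
    return sum(1 for tr in trajs
               if all(tr[t][c] == v for (t, c), v in pattern.items()))

def sli_det(pattern, pi):
    """SLI via trajectory counts:
    (|pi|-1)*log N + log n(x_O) - sum_b log n(x_b)."""
    return ((len(pi) - 1) * math.log(N)
            + math.log(count(pattern))
            - sum(math.log(count({k: pattern[k] for k in b})) for b in pi))

pattern = {(0, 0): 1, (1, 0): 1, (1, 1): 1}  # cell 0 at t=0 and both cells at t=1
pi = [[(0, 0)], [(1, 0), (1, 1)]]            # a bipartition of the pattern
print(sli_det(pattern, pi))                  # log 2: each part makes the other certain
```

The result agrees with computing SLI directly from probabilities, since with a uniform initial distribution every pattern probability is a trajectory count divided by N.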
3.3.2. Upper Bounds
In this section we present the upper bounds of SLI. These are of general interest, but the constructive proof also provides an intuition for what kind of patterns have large SLI.
We first show constructively that if we can choose the Bayesian network and the pattern then SLI can be arbitrarily large. This construction sets the probabilities of all blocks equal to the probability of the pattern, which implies that each of the parts of the pattern occurs only if the entire pattern occurs. The simplest example is one binary random variable determining another to always be in the same state; then the two patterns with both variables equal have this property. In the subsequent theorem we show that this property in general gives the upper bound of SLI if the cardinality of the partition is fixed. A simple extension of this example is used in the proof of the least upper bound. First we prove that there are Bayesian networks that achieve a particular SLI value; this will be used in the proofs that follow. For this we first define the anti-patterns, which are patterns that differ from a given pattern at every random variable that is specified.
Definition 9 (Anti-pattern). Given a pattern $x_O$, define its set of anti-patterns as the set of patterns $\bar{x}_O$ that have values different from those of $x_O$ on all variables in O.
Theorem 3 (Construction of a pattern with maximum SLI). Given a probability $q$ and a positive natural number n, there is a Bayesian network and a pattern $x_O$ with $p(x_O) = q$ together with a partition π of cardinality n such that $\mathrm{mi}_\pi(x_O) = -(n-1)\log q$. Proof. We construct a Bayesian network which realises two conditions on the probability. From these two conditions (which can also be realised by other Bayesian networks) we can then derive the theorem.
Choose a Bayesian network with binary random variables $X_i$ for all $i \in V$. Choose all nodes in O dependent only on a node $j \in O$; the dependence of the nodes in $V \setminus O$ is arbitrary:
for all $i \in O$ let the parents of i lie within O, i.e., nodes in O have no parents in the complement of O,
for a specific $j \in O$ and all other $i \in O \setminus \{j\}$ let j be the only parent of i, i.e., all nodes in O apart from j have j as a parent,
for all $i \in O \setminus \{j\}$ let the mechanism copy the parent state, i.e., the state of all nodes in O is always the same as the state of node j,
also choose the probability that node j takes the value specified by the pattern to be q.
Then it is straightforward to see that:
$p(x_O) = q$,
$p(x_b) = q$ for every non-empty block $b \subseteq O$.
Note that there are many Bayesian networks that realise the latter two conditions for some pattern. These latter two conditions are the only requirements for the following calculation.
Next note that the two conditions imply that a pattern on O has positive probability only if it is either $x_O$ itself or one of its anti-patterns. Then for every partition π of O with $|\pi| = n$ we have
$$\mathrm{mi}_\pi(x_O) = \log \frac{q}{q^{n}} = -(n-1)\log q.$$
☐
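The construction in the proof can be verified numerically; the following sketch uses assumed example values for q and n and represents only the two joint states that carry probability.

```python
import math

# Sketch of the construction (q and n are assumed example values): a binary
# driver variable with P(1) = q is copied into n further variables, so every
# block of the all-ones pattern has probability q, and SLI for the partition
# into n + 1 singletons attains (|pi| - 1) * log(1/q).
q, n = 0.3, 3
p = {(1,) * (n + 1): q, (0,) * (n + 1): 1 - q}  # only two states occur

def marginal(block):
    """Probability that all variables in `block` equal 1."""
    return sum(pr for x, pr in p.items() if all(x[i] == 1 for i in block))

whole = marginal(range(n + 1))                           # = q
blocks = math.prod(marginal([i]) for i in range(n + 1))  # = q**(n+1)
sli = math.log(whole / blocks)
print(sli)  # equals n * log(1/q) up to rounding
```

Each singleton block has probability q because it can only occur together with the whole pattern, which is exactly the condition characterising the upper bound in the next theorem.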
Theorem 4 (Upper bound of SLI). For any Bayesian network and pattern $x_O$ with fixed $p(x_O)$:
- 1.
The tight upper bound of the SLI with respect to any partition π with fixed cardinality $|\pi|$ is $-(|\pi|-1)\log p(x_O)$.
- 2.
The upper bound is achieved if and only if for all $b \in \pi$ we have $p(x_b) = p(x_O)$.
- 3.
The upper bound is achieved if and only if for all $b \in \pi$ we have that $x_b$ occurs if and only if $x_O$ occurs.
Proof. - ad 1
Note that for any $b \in \pi$ the block pattern $x_b$ occurs whenever $x_O$ occurs, and therefore $p(x_b) \ge p(x_O)$. Plugging this into Equation (41) for every $b \in \pi$ we get
$$\mathrm{mi}_\pi(x_O) = \log p(x_O) - \sum_{b \in \pi} \log p(x_b) \le \log p(x_O) - |\pi| \log p(x_O) = -(|\pi| - 1)\log p(x_O).$$
This shows that $-(|\pi|-1)\log p(x_O)$ is indeed an upper bound. To show that it is tight we have to show that for a given probability and partition cardinality there are Bayesian networks with patterns such that this upper bound is achieved. The construction of such a Bayesian network and a pattern was presented in Theorem 3.
- ad 2
If for all $b \in \pi$ we have $p(x_b) = p(x_O)$ then clearly $\mathrm{mi}_\pi(x_O) = -(|\pi|-1)\log p(x_O)$ and the least upper bound is achieved. If on the other hand the least upper bound is achieved, then because $p(x_b) \ge p(x_O)$ (Equation (44)) any deviation of any of the $p(x_b)$ from $p(x_O)$ leads to $\mathrm{mi}_\pi(x_O) < -(|\pi|-1)\log p(x_O)$, such that for all $b \in \pi$ we must have $p(x_b) = p(x_O)$.
- ad 3
By definition, for any $b \in \pi$ the block pattern $x_b$ is a part of $x_O$, such that $x_b$ always occurs if $x_O$ occurs. Now assume $x_b$ can occur while $x_O$ does not occur. In that case there is a positive probability for a pattern that coincides with $x_O$ on b but differs from it elsewhere, i.e., $p(x_b) > p(x_O)$. Recalling Equation (43) we then see that the upper bound is not attained, which contradicts the assumption that it is achieved, so $x_b$ cannot occur without $x_O$ occurring as well.
☐
Remarks:
Note that this is the least upper bound for Bayesian networks in general. For a specific Bayesian network there might be no pattern that achieves this bound.
The least upper bound of SLI increases with the improbability of the pattern and with the number of parts that it is split into. As $p(x_O) \to 0$ the bound $-(|\pi|-1)\log p(x_O)$ grows without limit.
Using this least upper bound it is easy to see the least upper bound for the SLI of a pattern across all partitions. We just have to note that the cardinality of a partition of O is at most $|O|$.
Since CLI is the minimum value of SLI with respect to arbitrary partitions, the least upper bound of SLI is also an upper bound for CLI. It may not be the least upper bound, however.
3.3.3. Negative SLI
This section shows that the SLI of a pattern with respect to a partition can be negative independently of the probability of the pattern (as long as it is not 1) and of the cardinality of the partition (as long as that is not 1). The construction which achieves this also serves as an example of patterns with low SLI. We conjecture that this construction might provide the greatest lower bound but have not been able to prove this yet. An intuitive description of the construction is that patterns which either occur as a whole or are missing exactly one part always have negative SLI.
Theorem 5. For any given probability $q < 1$ and cardinality $n > 1$ of a partition π there exists a Bayesian network with a pattern $x_O$ such that $p(x_O) = q$ and $\mathrm{mi}_\pi(x_O) < 0$. Proof. We construct the probability distribution of the nodes in O directly and ignore the behaviour of the Bayesian network outside of O. In any case the restriction to O is also by itself a Bayesian network. We define the distribution as follows (see the remarks below for some intuitions behind these definitions and Definition 9 for the anti-patterns):
Here d parameterises the probability of any remaining pattern occurring. We will carry it through the calculation but then end up setting it to zero.
Next we calculate the SLI. First note that, according to 1. and 2., the block probabilities coincide for all blocks of π, so the product of block probabilities can be written in terms of this common value. Then note that, according to 3., this common value can be computed explicitly. Plugging this into the SLI definition, we can use Bernoulli's inequality (the authors thank von Eitzen [39] for pointing this out; an example reference for Bernoulli's inequality is Bullen [40]) to prove that the result is negative for $q < 1$ and $n > 1$. Bernoulli's inequality states that $(1 + x)^n \ge 1 + nx$ for $x \ge -1$ and n a natural number. Replacing x by the appropriate negative quantity we see that the argument of the logarithm is smaller than one, which gives us negative SLI. ☐
Remarks:
The achieved value in Equation (53) is also our best candidate for a greatest lower bound of SLI for given probability and partition cardinality. However, we have not been able to prove this yet.
The construction equidistributes the probability (left to be distributed after the probability q of the whole pattern occurring is chosen) over the patterns that are almost the same as the pattern. These are almost the same in a precise sense: they differ in exactly one of the blocks of the partition, i.e., they differ by as little as can possibly be resolved/revealed by the partition.
In order to achieve the negative SLI of Equation (64) the only requirement is that Equation (59) is satisfied. Our construction shows one way in which this can be achieved.
For a pattern and partition for which the relevant quotient is not a natural number, the same bound might still be achieved; however, a little extra effort has to go into construction 3. of the proof such that Equation (59) still holds. This is not necessary for our purpose here as we only want to show the existence of patterns achieving the negative value.
Since it is the minimum value of SLI with respect to arbitrary partitions the candidate for the greatest lower bound of SLI is also a candidate for the greatest lower bound of CLI.
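A minimal numeric instance of the negative-SLI construction (two binary variables, a bipartition into singletons, and an assumed value of q) confirms the negative sign.

```python
import math

# Numeric instance of the negative-SLI construction (assumed example): the
# whole pattern (1, 1) occurs with probability q; otherwise exactly one of
# its two parts occurs, each case with probability (1 - q) / 2.
q = 0.5
p = {(1, 1): q, (1, 0): (1 - q) / 2, (0, 1): (1 - q) / 2, (0, 0): 0.0}
p1 = p[(1, 1)] + p[(1, 0)]   # marginal P(x1 = 1) = (1 + q) / 2
p2 = p[(1, 1)] + p[(0, 1)]   # marginal P(x2 = 1) = (1 + q) / 2
sli = math.log(q / (p1 * p2))
print(sli < 0)  # True: the parts inflate the marginals without the whole occurring
```

The marginals of both parts exceed q, so the product probability exceeds the whole-pattern probability and the logarithm is negative, exactly as in the construction of Theorem 5.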
3.4. Disintegration
In this section we define the disintegration hierarchy and its refinement-free version. We then prove the disintegration theorem, which is the main formal result of this paper. It exposes a connection between partitions minimising the SLI of a trajectory and the CLI of the blocks of such partitions. More precisely, for a given trajectory the finest partitions among those leading to a particular value of SLI consist only of singletons and completely locally integrated blocks. Conversely, each completely locally integrated pattern is a block in such a finest partition leading to a particular value of SLI. The theorem therefore reveals that ι-entities can not only be motivated heuristically, as we tried to do in Section 3.2, but in fact play a special role within the trajectories they occur in. Furthermore, this theorem allows additional interpretations of the ι-entities, which will be discussed in Section 3.5.
The main tools we use for the proof, the disintegration hierarchy and especially its refinement-free version, are also interesting structures in their own right since they define a hierarchy among the partitions of trajectories that we did not anticipate. In the case of the refinement-free version the disintegration theorem tells us that this hierarchy among partitions of trajectories turns out to be a hierarchy of splits of the trajectory into ι-entities.
Definition 10 (Disintegration hierarchy). Given a Bayesian network $\{X_V\}$ and a trajectory $x_V$, the disintegration hierarchy of $x_V$ is the sequence $\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3, \dots$ of sets of partitions of V defined recursively by
$$\mathcal{D}_i := \operatorname*{arg\,min}_{\pi \in \mathfrak{L}(V) \setminus (\mathcal{D}_1 \cup \dots \cup \mathcal{D}_{i-1})} \mathrm{mi}_\pi(x_V),$$
where $\mathfrak{L}(V)$ is the lattice of partitions of V. We call $\mathcal{D}_i$ the i-th disintegration level.
Remark:
Note that arg min returns all partitions that achieve the minimum SLI if there is more than one.
Since the Bayesian networks we use are finite, the partition lattice is finite, the set of attained SLI values is finite, and the number of disintegration levels is finite.
In most cases the Bayesian network contains some symmetries among its mechanisms which cause multiple partitions to attain the same SLI value.
For each trajectory the disintegration hierarchy thus partitions the set of partitions of V into subsets of equal SLI. The levels of the hierarchy have increasing SLI.
Definition 11. Let $\mathfrak{L}(V)$ be the lattice of partitions of the set V and let $\mathcal{D}$ be a subset of $\mathfrak{L}(V)$. Then for every element $\pi \in \mathcal{D}$ we can define the set of refinements of π within $\mathcal{D}$, that is, the set of partitions in $\mathcal{D}$ that are refinements of π. Definition 12 (Refinement-free disintegration hierarchy). Given a Bayesian network, a trajectory $x_V$, and its disintegration hierarchy, the refinement-free disintegration hierarchy of $x_V$ is the sequence of sets of partitions of V whose i-th level consists of those partitions of the i-th disintegration level that have no strict refinements at their own or any preceding level.
Remark:
Each level in the refinement-free disintegration hierarchy consists only of those partitions that have refinements neither at their own nor at any of the preceding levels. So each partition that occurs in the refinement-free disintegration hierarchy at the i-th level is a finest partition that achieves such a low level of SLI or, equivalently, such a high level of disintegration.
As we will see below, the blocks of the partitions in the refinement-free disintegration hierarchy are the main reason for defining it.
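Both hierarchies can be sketched for a small assumed example; here we treat a single joint state of three binary variables as the trajectory and enumerate all partitions recursively. The distribution and names are our own illustration.

```python
import math
from itertools import groupby, product

# Assumed 3-variable joint distribution; one joint state plays the role of
# the trajectory x_V.
p = {x: (0.4 if x in ((0, 0, 0), (1, 1, 1)) else 0.2 / 6)
     for x in product((0, 1), repeat=3)}
traj = (1, 1, 1)

def marginal(block):
    return sum(pr for x, pr in p.items() if all(x[i] == traj[i] for i in block))

def partitions(items):
    """All partitions of `items`, each given as a list of blocks."""
    if len(items) == 1:
        return [[items]]
    first, rest = items[0], items[1:]
    out = []
    for s in partitions(rest):
        out += [s[:i] + [[first] + s[i]] + s[i + 1:] for i in range(len(s))]
        out.append([[first]] + s)
    return out

def sli(pi):
    return math.log(p[traj] / math.prod(marginal(b) for b in pi))

# Disintegration hierarchy: group all partitions of V by SLI, increasing.
parts = sorted(partitions([0, 1, 2]), key=sli)
levels = [list(g) for _, g in groupby(parts, key=lambda pi: round(sli(pi), 9))]

def refines(pi, rho):
    """True if pi is a strict refinement of rho."""
    return pi != rho and all(any(set(b) <= set(c) for c in rho) for b in pi)

# Refinement-free hierarchy: drop any partition that has a strict refinement
# at its own or a preceding (lower-SLI) level.
rf = []
for i, level in enumerate(levels):
    candidates = [rho for lv in levels[:i + 1] for rho in lv]
    rf.append([pi for pi in level
               if not any(refines(rho, pi) for rho in candidates)])
print([len(l) for l in rf])
```

In this highly symmetric example no partition is eliminated by the refinement filter, because all strict refinements of a partition happen to lie at later (higher-SLI) levels; in larger systems the filter does remove partitions.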
Theorem 6 (Disintegration theorem). Let $\{X_V\}$ be a Bayesian network, $x_V$ one of its trajectories, and consider the associated refinement-free disintegration hierarchy.
- 1.
Then for every partition π at every level of the refinement-free disintegration hierarchy we find for every block $b \in \pi$ that there are only the following possibilities:
- (a)
b is a singleton, i.e., $b = \{v\}$ for some $v \in V$, or
- (b)
$x_b$ is completely locally integrated, i.e., $\iota(x_b) > 0$.
- 2.
Conversely, for any completely locally integrated pattern $x_A$, there is a partition π at some level of the refinement-free disintegration hierarchy such that $A \in \pi$.
Proof. - ad 1
We prove the theorem by contradiction. For this assume that there is a block b in a partition π of the refinement-free disintegration hierarchy which is neither a singleton nor completely integrated. Assume b is not a singleton, i.e., there exist $v, w \in b$ with $v \neq w$. Also assume that b is not completely integrated, i.e., there exists a partition ξ of b with $|\xi| > 1$ such that $\mathrm{mi}_\xi(x_b) \le 0$. Note that a singleton cannot be completely locally integrated as it does not allow for a non-unit partition. So together the two assumptions imply $\mathrm{mi}_\xi(x_b) \le 0$ with $|\xi| > 1$. However, then the partition ρ obtained from π by replacing the block b with the blocks of ξ is a strict refinement of π with
$$\mathrm{mi}_\rho(x_V) = \mathrm{mi}_\pi(x_V) + \mathrm{mi}_\xi(x_b) \le \mathrm{mi}_\pi(x_V).$$
We treat the cases of "<" and "=" separately. First, let $\mathrm{mi}_\xi(x_b) < 0$. Then ρ has strictly smaller SLI than π and lies at a preceding disintegration level, so π has a refinement at a preceding level. Second, let $\mathrm{mi}_\xi(x_b) = 0$. Then ρ has the same SLI as π and lies at the same disintegration level, so π has a refinement at its own level. In both cases π cannot be an element of the refinement-free disintegration hierarchy, which contradicts the assumption.
- ad 2
By assumption $x_A$ is completely locally integrated. Then let π be the partition of V consisting of the block A together with singletons $\{v\}$ for all $v \in V \setminus A$. Since π is a partition of V it is an element of some disintegration level. Then π is also an element of the corresponding refinement-free disintegration level, as we will see in the following. This is because any refinement of π must (by construction of π) break up A into further blocks, which means that the specific local integration of all such partitions is higher. They must then be at a disintegration level of strictly higher SLI. Therefore, π has no refinement at its own or a preceding disintegration level. More formally, let ρ be a strict refinement of π. Since π only contains singletons apart from A, the partition ρ must split the block A into multiple blocks which form a non-unit partition ξ of A. Since $\iota(x_A) > 0$ we know that $\mathrm{mi}_\xi(x_A) > 0$, so that
$$\mathrm{mi}_\rho(x_V) = \mathrm{mi}_\pi(x_V) + \mathrm{mi}_\xi(x_A) > \mathrm{mi}_\pi(x_V).$$
Therefore ρ is on a disintegration level of strictly higher SLI than that of π; but this is true for any strict refinement of π, so π belongs to the refinement-free disintegration hierarchy and A is one of its blocks.
☐
We mentioned in Section 3.2 that the expectation value of SLI is the (specific) multi-information. A positive SLI value of a pattern with respect to a partition implies a positive expectation value, i.e., positive specific multi-information, with respect to that partition. Therefore every ι-entity implies positive specific multi-informations with respect to any partition. We put this into the following corollary.
Corollary 2. Under the conditions of Theorem 6, for every partition π at every level of the refinement-free disintegration hierarchy we find for every block $b \in \pi$ that there are only the following possibilities:
- 1.
b is a singleton, i.e., $b = \{v\}$ for some $v \in V$, or
- 2.
$X_b$ is completely (not only locally) integrated, i.e., its specific multi-information is positive with respect to every non-unit partition.
Proof. Since the specific multi-information is a Kullback–Leibler divergence we know from Gibbs' inequality that it is non-negative, and that it vanishes if and only if the joint distribution coincides with the product of its block marginals for all patterns. To see that the specific multi-information is a Kullback–Leibler divergence note that it is the divergence of the joint distribution of the block random variables from the product of their marginals. Now let a specific pattern $x_b$ be a ι-entity. Then for every non-unit partition ξ of b we have $\mathrm{mi}_\xi(x_b) > 0$, which implies that the joint distribution differs from the product of the block marginals, and therefore the Kullback–Leibler divergence is positive, which implies positive specific multi-information with respect to ξ. ☐
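The relation between pointwise SLI and its expectation can be illustrated with an assumed two-variable toy distribution: an individual pattern can have negative SLI while the expectation, being a Kullback–Leibler divergence, remains non-negative.

```python
import math

# Assumed toy joint distribution over two binary variables: the states
# (0, 1) and (1, 0) are favoured, so the pattern (1, 1) has negative SLI.
p = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}
px = {v: sum(pr for x, pr in p.items() if x[0] == v) for v in (0, 1)}
py = {v: sum(pr for x, pr in p.items() if x[1] == v) for v in (0, 1)}

def sli(x):
    """Pointwise SLI of pattern x w.r.t. the partition into singletons."""
    return math.log(p[x] / (px[x[0]] * py[x[1]]))

# Expectation of SLI over all patterns = multi-information = KL divergence
# of the joint from the product of its marginals, hence non-negative.
expected = sum(p[x] * sli(x) for x in p)
print(sli((1, 1)) < 0, expected >= 0)
```

This is the content of Gibbs' inequality used in the proof: single patterns may be negatively integrated, but the average over all patterns cannot be.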
3.5. Disintegration Interpretation
In
Section 3.2 we motivated our choice of positive complete local integration as a criterion for entities. This motivation is purely heuristic and starts from the intuition that an entity is a structure for which every part makes every other part more probable. While this heuristic argument seems sufficiently intuitive to be of a certain value we would much rather have a formal reason why an entity criterion is a “good” entity criterion. In other words we would ideally have a formal problem that is best solved by the entities satisfying the criterion. An example of a measure that has such an associated interpretation is the mutual information whose maximum over the input distributions is the channel capacity. Without a formal problem associated to
ι-entities there remains a risk that they (and maybe the whole concept of entities and identity over time) are artefacts of an ill-conceived conceptual approach.
Currently, we are not aware of an analogous formal problem that is solved by ι-entities. However, the different viewpoint provided by the disintegration theorem may be a first step towards finding such a problem. We will now discuss some alternative interpretations of SLI and see how CLI can be seen from a different perspective due to the disintegration theorem. These interpretations also exhibit why we chose to include the logarithm in the definition of SLI.
Using the disintegration theorem (Theorem 6) allows us to take another point of view on ι-entities. The theorem states that for each trajectory of a multivariate Markov chain the refinement-free disintegration hierarchy only contains partitions whose non-singleton blocks are completely integrated patterns, i.e., they only contain ι-entities. At the same time the blocks of all those partitions together are all ι-entities that occur in that trajectory.
A partition in the refinement-free disintegration hierarchy is always a minimal/finest partition reaching such a low specific local integration.
Each ι-entity is then a block of such a partition for some trajectory of the multivariate Markov chain.
Let us recall the interpretation from coding theory above. If we want to find the optimal encoding for the entire multivariate Markov chain this means finding the optimal encoding for the random variable whose values are the trajectories. The optimal code has the codeword lengths $-\log p(x_V)$ for each trajectory $x_V$. The partitions in the lowest level of the refinement-free disintegration hierarchy for a trajectory have minimal specific local integration, i.e., their SLI is minimal among all partitions. At the same time these partitions are the finest partitions that achieve this low specific local integration. This implies on the one hand that the codeword lengths of the product codes associated to these partitions are the shortest possible for this trajectory among all partitions. On the other hand these partitions split up the trajectory into as many parts as possible while generating these shortest codewords. In this combined sense the partitions in the lowest level generate the "best" product codes for the particular trajectory.
Note that the expected codeword length of the product code, which is the more important measure for encoding in general, might not be short at all, i.e., the product code might not be an efficient code for arbitrary trajectories. The product codes based on partitions in the lowest refinement-free disintegration level are specifically adapted to assign a short codeword to the particular trajectory, i.e., to a single trajectory or story of this system. As product codes they are constructed/forced to describe the trajectory as a composition of stochastically independent parts. More precisely, they are constructed in the way that would be optimal if the parts were stochastically independent.
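The decomposition underlying this argument can be stated in one line: with base-2 logarithms, the product codeword length for a trajectory equals its optimal codeword length plus its SLI in bits. The numbers below are assumed toy values.

```python
import math

# Product-code length vs SLI (assumed toy values): for any partition, the
# product codeword length of a trajectory x_V is
#   sum_b -log2 p(x_b) = -log2 p(x_V) + SLI(x_V) in bits,
# so minimising SLI over partitions minimises this length.
p_whole = 0.4                  # p(x_V) of the favoured trajectory
p_blocks = [0.5, 0.45]         # block marginals of some partition
product_length = sum(-math.log2(pb) for pb in p_blocks)
sli_bits = math.log2(p_whole) - sum(math.log2(pb) for pb in p_blocks)
print(abs(product_length - (-math.log2(p_whole) + sli_bits)) < 1e-12)  # True
```

The identity also shows why negative SLI would be desirable for compression in general, while the minimising partitions discussed here merely make the product code as short as possible for the single favoured trajectory.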
Nonetheless, the product codes exist (they can be generated using Huffman coding or arithmetic coding [37] based on the product probability) and are uniquely decodable. Their parts/blocks are the ι-entities. We mentioned before that we would like to find a problem that is solved by ι-entities. This is then equivalent to finding a problem that is solved by the corresponding product codes. Can we construct such a problem? This question is still open. A possible direction for finding such a problem may be the following line of reasoning. Say for some reason one trajectory is more important than any other and we want to "tell its story" as a story of as many (stochastically) independent parts as possible (parts that are maybe not really stochastically independent), i.e., say we wanted to encode the trajectory as if it were a combination of as many stochastically independent parts/events as possible. And because this trajectory is more important than all other trajectories we wanted its codeword to be the shortest possible. Then we would use the product codes of partitions in the refinement-free disintegration hierarchy because those combine exactly these two conditions. The pseudo-stochastically-independent parts would then be the blocks of these partitions, which according to the disintegration theorem are exactly the ι-entities occurring in the trajectory.
Speculating about where the two conditions may arise in an actual problem, we mention that the trajectory/history that we (real living humans) live in is more important to us than all other possible trajectories of our universe (if there are any). What happens/happened in this trajectory needs to be communicated more often than what happens/happened in counterfactual trajectories. Furthermore, a good reason to think of a system as composed of as many parts as possible is that this reduces the number of parameters that need to be learned, which in turn improves the learning speed (see e.g., [41]). So the entities that mankind has partitioned its history into might be related to the ι-entities of the universe's history. These would compose the shortest product codes for what actually happened. The disintegration level might be chosen to optimise rates of model learning.
Recall that this kind of product code is not the optimal code in general (which would be the one with shortest expected codeword length). It is possibly more of a naive code that does not require deep understanding of the dynamical system but instead can be learned fast and works. The language of physics for example might be more optimal in the sense of shortest expected codeword lengths reflecting a desire to communicate efficiently about all counterfactual possibilities as well.
3.6. Related Approaches
We now discuss in some more detail than in
Section 1.3 the approaches of Beer [
14] and Balduzzi [
26].
In Beer [
14] the construction of the entities proceeds roughly as follows. First the maps from the Moore neighbourhood to the next state of a cell are classified into five classes of local processes. Then these are used to reveal the dynamical structure in the transitions from one time-slice (or temporal part) of a pattern to the next. The example patterns used are the famous block, blinker, and glider, and they are considered together with their temporal extension. Using both the processes and the spatial patterns/values/components (the black and white values of cells are called components), networks characterising the organisation of the spatiotemporally extended patterns are constructed. These can then be investigated for their organisational closure. Organisational closure occurs if the same process-component relations reoccur at a later time. Boundaries of the spatiotemporal patterns are identified by determining the cells around the pattern that have to be fixed to get re-occurrence of the organisation.
Beer [
14] mentions that the current version of this method of identifying entities has its limitations. If the closure is perturbed or delayed and then recovered, the entity still loses its identity according to this definition. Two possible alternatives are also suggested. The first is to define the potential for closure as sufficient for the ascription of identity. This is questioned as well, since a sequence of perturbations can take the entity further and further away from its "defining" organisation and make it hard to still speak of a defining organisation at all. The second alternative is to define that the persistence of any organisational closure indicates identity. It is suggested that this would allow blinkers to transform into gliders.
We note that the entity criterion we propose does not require similar choices to be made since it is not based on the re-occurrence of any organisation. Later time-slices of ι-entities need no organisational (or any other) similarity to earlier ones. Another, possibly only small, advantage is that our criterion is formalised and reasonably simple to state. Whether this is possible for the organisational-closure-based entities remains to be seen.
This is related to the philosophical discussion about identity across possible worlds [
33].
Some further parallels can be drawn between the present work and Balduzzi [
26], especially if we take into account the disintegration theorem. Given a trajectory (entire time-evolution) of the system, in both cases a partition is sought which fulfills a particular trajectory-wide optimality criterion. Also in both cases, each block of the trajectory-wide partition fulfills a condition with respect to its own partitions. For our conditions the disintegration theorem exposes the direct connection between the trajectory-wide and the block-specific conditions. Such a connection is not known for other approaches. The main reason for this might be the simpler formal expression of CLI and SLI compared to the IIT approaches.
To what extent our approach and the IIT approaches lead to coinciding or contradicting results is beyond the scope of this paper and constitutes future work. One avenue to pursue here is the differences with respect to entities occurring in multiple trajectories as well as the possibility of overlapping entities within single trajectories.
5. Discussion
In
Section 3.1 we have argued for the use of patterns as candidates for entities. Patterns can be composed of arbitrary spatially and temporally extended parts of trajectories. We have seen in Theorem 1 that they are distinct from arbitrary subsets of trajectories. The important insight here is that patterns are structures that occur within trajectories but this cannot be said of sets of trajectories.
One of the main target applications of patterns is in time-unrolled Bayesian networks of cellular automata like those in
Figure 1. Patterns in such Bayesian networks become spatiotemporal patterns like those used to describe the glider, block, and blinker in the Game of Life cellular automaton by Beer [
14]. We would also like to investigate whether the latter spatiotemporal patterns are
ι-entities. However, at the present state of the computational models and without approximations, this was out of reach computationally. We will discuss this further below.
In Section 3.3 we defined SLI and in Section 3.3.1 gave its expression for deterministic Bayesian networks (including cellular automata) as well. We also established the least upper bound of SLI with respect to a partition π of cardinality n for a pattern with probability q. This upper bound is achieved if each of the blocks in the partition occurs if and only if the whole pattern occurs. This is compatible with our interpretation of entities since in this case clearly the occurrence of any part of the pattern leads necessarily to the occurrence of the entire pattern (and not only vice versa).
We also presented a candidate for a greatest lower bound of SLI with respect to a partition of cardinality n for a pattern with probability q. Whether or not this is the greatest lower bound, it exhibits a case for which SLI is always negative. This happens if either the whole pattern occurs (with probability q) or one of the "almost equal" patterns occurs, each with identical probability. A pattern is almost equal to the original pattern with respect to the partition in this sense if it differs at exactly one of the blocks. This construction makes as many parts as possible (i.e., all but one) occur as many times as possible without the whole pattern occurring. This creates large marginalised probabilities for each part/block, which means that their product probability also becomes large.
Beyond these quantitative observations, an interpretation of the greatest-lower-bound candidate seems difficult. A more intuitive candidate for the opposite of an integrated pattern would be a pattern with independent parts, i.e., zero SLI, but quantitatively such patterns are not at the opposite end of the SLI spectrum. A more satisfying interpretation of the presented candidate is still to be found.
We also proved the disintegration theorem, which states that the refinement-free partitions of a trajectory among those partitions achieving a particular SLI value consist of ι-entities (and singletons) only, where an ι-entity is a pattern with positive CLI. This theorem allows us to interpret the ι-entities in new ways and may lead to a more formal or quantitative justification of ι-entities. It is already a first step in this direction since it establishes a special role of the ι-entities within the trajectories they occur in. A further justification would tell us what in turn the refinement-free partitions can be used for. We have discussed a possible direction for further investigation in detail in Section 3.5, which tried to connect the ι-entities with a coding problem.
In Section 4 we investigated SLI and CLI in three simple example sets of random variables. We found that if the random variables are all independently distributed the corresponding entities are just all the possible values of each of the single random variables. This is what we would expect from an entity criterion: there are no entities with any further extent than a single random variable, and each value corresponds to a different entity.
For the simple Markov chain composed of two independent and constant processes we presented the entire disintegration hierarchy and the Hasse diagrams of each disintegration level ordered by refinement. The Hasse diagrams reflected the highly symmetric dynamics of the Markov chain via multiple identical components. For the refinement-free disintegration hierarchy we then get multiple partitions at the same disintegration level as well. Different partitions of the trajectory imply overlapping blocks, which in the case of the refinement-free partitions are ι-entities. So in general the ι-entities at a particular disintegration level are not necessarily unique and can overlap. We also saw in Figure 11 that the same ι-entities can occur on multiple disintegration levels.
The ι-entities of this Markov chain included the expected three-timestep constant patterns within each of the two independent processes. They also included the two-timestep parts of these constant patterns. This may be less expected. It shows that parts of ι-entities can be ι-entities themselves. We note that these "sub-entities" (those that are parts of larger entities) are always on a different disintegration level than their "super-entities" (the larger entities). We can speculate that the existence of such sub- and super-entities on different disintegration levels may find an interpretation through multicellular organisms or similar structures. However, the examples here only serve as basic models for the potential phenomena and are still far too simplistic to warrant any concrete interpretation in this direction.
We also looked at a version of this Markov chain perturbed by noise. We found that the entities of the unperturbed chain remain the most strongly integrated entities in the perturbed one. At the same time new entities occur. So we observe that in the perturbed chain the entities vary from one trajectory to another (Figure 18, Figure 19 and Figure 20). We also observe spatially extended entities, i.e., entities that extend across both (formerly independent) processes. We also observe entities that switch from one process to the other (from top row to bottom row or vice versa). The capacity of entities to exhibit this behaviour may be necessary to capture the movement or metabolism of entities in more realistic scenarios. In Biehl et al. [8] we argued that these properties are important and showed that they hold for a crude approximation of CLI (namely for SLI with respect to a particular partition) but not for the full CLI measure.
We established that the ι-entities:
correspond to fixed single random variables for a set of independent random variables,
can vary from one trajectory to another,
can change the degrees of freedom that they occupy over time,
can be ambiguous at a fixed level of disintegration due to symmetries of the system,
can overlap at the same level of disintegration due to this ambiguity,
can overlap across multiple levels of disintegration, i.e., parts of ι-entities can be ι-entities again.
In general the examples we investigated concretely are too small to sufficiently support the concept of positive CLI as an entity criterion. Due to the extreme computational burden, this may remain the case for a while. For a straightforward calculation of the minimum SLI of a trajectory of a Bayesian network with $|V|$ nodes we have to calculate the SLI with respect to all $B_{|V|}$ partitions of V. According to (de Bruijn [43], p. 108) the Bell numbers $B_n$ grow super-exponentially. Furthermore, to evaluate the SLI we need the joint probability distribution of the Bayesian network. Naively, this means we need the probability (a real number between 0 and 1) of each trajectory. If we only have binary random variables, the number of trajectories is $2^{|V|}$, which makes the straightforward computation of disintegration hierarchies unrealistic even for quite small systems. If we take a seven by seven grid of the Game of Life cellular automaton and want to look at three time-steps we have $|V| = 7 \times 7 \times 3 = 147$. If we use 32-bit floating-point numbers this gives us around $10^{30}$ petabytes of storage needed for this probability distribution. We are sceptical that the exact evaluation of reasonably large systems can be achieved even with non-naive methods. This suggests that formal proofs may be the more promising way to investigate SLI and CLI further.
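To illustrate these growth rates, a short sketch computes the first Bell numbers via the Bell triangle and the naive storage estimate for the 147-variable example; the sizes are computed from the stated parameters, not measured.

```python
# Bell numbers via the Bell triangle, plus the naive storage estimate for
# the joint distribution of a 7x7 binary grid over 3 time steps.
def bell(n):
    """n-th Bell number B_n: the number of partitions of an n-element set."""
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]          # each row starts with the previous row's last entry
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[-1]

print([bell(n) for n in range(1, 8)])  # [1, 2, 5, 15, 52, 203, 877]

n_vars = 7 * 7 * 3                 # 147 binary random variables
n_trajs = 2 ** n_vars              # number of trajectories
petabytes = n_trajs * 4 / 2 ** 50  # 4 bytes per 32-bit float
print(f"{petabytes:.1e}")          # about 6.3e+29 petabytes
```

Even at seven nodes the Bell number is already 877, and it exceeds $10^{115}$ at $n = 147$, which is why the minimisation over all partitions is the second, independent obstacle besides storage.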