1. Introduction
This paper investigates a formal measure and a corresponding criterion we developed in order to capture the notion of wholes or entities within Bayesian networks in general and multivariate Markov chains in particular. The main focus of this paper is to establish some formal properties of this criterion.
The main intuition behind wholes or entities is that combinations of some events/phenomena in space(-time) can be considered as more of a single or coherent “thing” than combinations of other events in space(-time). For example, the two halves of a soap bubble (the authors thank Eric Smith for pointing out the example of a soap bubble) together seem to form more of a single thing than one half of a floating soap bubble together with a piece of rock on the ground. Similarly, the soap bubble at time $t_1$ and the “same” soap bubble at time $t_2$ seem more like temporal parts of the same thing than the soap bubble at $t_1$ and the piece of rock at $t_2$. We are trying to formally define and quantify what it is that makes some spatially and temporally extended combinations of parts entities but not others.
We envisage spatiotemporal entities as a way to address not only the problem of spatial identity but also that of temporal identity (also called identity over time [1]). In other words, in addition to determining which events in “space” (e.g., which values of different degrees of freedom) belong to the same structure, spatiotemporal entities should allow the identification of the structure at a time $t_2$ that is the future (or past, if $t_2 < t_1$) of a structure at time $t_1$. Given a notion of identity over time, it becomes possible to capture which things persist and in what way they persist. Without a notion of identity over time, persistence seems undefined: the problem is how to decide whether something persisted from $t_1$ to $t_2$ if we cannot tell what at $t_2$ would count as the future of the original thing.
In everyday experience, problems concerning identity over time are not of great concern. Humans routinely and unconsciously connect perceived events to spatially and temporally extended entities. Nonetheless, the problem has been known since ancient times, in particular with respect to artefacts that exchange their parts over time. A famous example is the Ship of Theseus, which has all of its planks exchanged over time. This leads to the question whether it is still the same ship. From the point of view of physics and chemistry, living organisms also exchange their parts (e.g., the constituting atoms or molecules) over time. In the long term we hope our theory can help to understand identity over time for these cases. For the moment, we are particularly interested in identity over time in formal settings like cellular automata, multivariate Markov chains, and more generally dynamical Bayesian networks. In these cases a formal notion of spatiotemporal entities (i.e., one defining spatial and temporal identity) would allow us to investigate the persistence of entities/individuals formally. The persistence (and disappearance) of individuals is in turn fundamental to Darwinian evolution [2,3]. This suggests that spatiotemporal entities may be important for understanding the emergence of Darwinian evolution in dynamical systems.
Another area in which a formal solution to the problem of identity over time, and thereby entities (in the following, if not stated otherwise, we always mean spatiotemporal entities when we refer to entities), might become important is a theory of intelligent agents that are space-time embedded as described by Orseau and Ring [4]. Agents are examples of entities fulfilling further properties, e.g., exhibition of actions and goal-directedness (cf. e.g., [5]). Using the formalism of reinforcement learning, Legg and Hutter [6] propose a definition of intelligence. Orseau and Ring [4] argue that this definition is insufficient. They dismiss the usual assumption that the environment of the reinforcement agent cannot overwrite the agent’s memory (which in this case is seen as the memory/tape of a Turing machine). They conclude that in the most realistic case there is only ever one memory in which the agent’s (and the environment’s) data are embedded. They note that the difference between agent and environment then disappears, and furthermore that the policy of the agent cannot be freely chosen anymore, only the initial condition. In order to measure intelligence according to Legg and Hutter [6] we must be able to define reward functions. This seems difficult without the capability to distinguish the agent according to some criterion. Towards the end of their publication, Orseau and Ring [4] propose to define a “heart” pattern and use the duration of its existence as a reward. This seems too specific an approach to us since it basically defines identity over time (of the heart pattern) as invariance. In more general settings a pattern that maintains a more general criterion of identity over time would be desirable. Ideally, this criterion would also not need a specifically designed heart pattern. Another advantage would be that reward functions other than lifetime could be used if the agent were identifiable. An entity criterion in the sense of this paper would be a step in this direction.
1.1. Illustration
In order to introduce the contributions of this paper we further illustrate the setting of our work. This illustration should only be taken as a motivation for what follows and not be confused with a result. The reason we do not use a concrete example is simply that we lack the necessary computational means (which are considerable, as we will discuss in Section 5).
Let us assume we are given the entire time-evolution (what we will call a trajectory) of some known multivariate dynamical system or stochastic process, for example, a trajectory of a one-dimensional elementary cellular automaton showing a glider collision like Figure 1a (this is produced by the rule 62 elementary cellular automaton with time increasing from left to right; however, this does not matter here; for more about this system see e.g., Boccara et al. [7]).
We take the point of view, argued for in previous work [8], that entities are phenomena that occur within trajectories and that they can be represented by (spatiotemporal) patterns. Patterns in this sense fix a part of the variables in a trajectory to definite values and leave the rest undetermined. In Figure 1b–d we show such patterns that occur in Figure 1a, with the undetermined variables coloured grey and the determined ones taking on the values of the trajectory. Visually speaking, a pattern is a snippet from a trajectory that it occurs in.
From Figure 1a we would probably expect that what we are seeing are two gliders colliding and forming a third. However, it may also be that one of the gliders absorbs the other, maintains some form of identity, and only changes its appearance (e.g., it “grows”). This highlights the problem of identity over time. While the spatial identity of such patterns has been treated multiple times in the literature, their identity over time is rarely dealt with.
Our approach evaluates the “integration” of spatiotemporally extended patterns at once. According to our proposed entity-criterion, a pattern is an ι-entity if, due to the dynamics of the system, every part of this pattern (which is again a pattern) makes all other parts more probable. Identity over time is then included since future parts have to make past parts more probable and vice versa. In principle this would allow us to detect if one of the gliders absorbs another one without losing its identity. For example, this could result in an entity as in Figure 1e.
In order to detect entities, the straightforward approach is to evaluate the entity-criterion for every spatiotemporal pattern in a given trajectory. Evaluating our entity-criterion of positive complete local integration (CLI) for a given pattern corresponds to splitting the pattern into parts in every possible way and calculating whether all the resulting parts make each other more probable. This means evaluating the specific local integration (SLI) with respect to all partitions of the set of variables occupied by the pattern.
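To get a sense of the cost of this straightforward approach, note that the number of partitions of an n-element index set is the Bell number B(n). A minimal Python sketch (the function name `bell` is ours, not from the paper) computes it via the Bell triangle:

```python
def bell(n: int) -> int:
    """Bell number B(n): the number of partitions of an n-element set,
    computed with the Bell triangle (Aitken's array)."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]  # each new row starts with the last entry of the old one
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[0]

# The count explodes quickly, which is one reason evaluating the
# entity-criterion for every pattern of a trajectory is expensive:
print([bell(n) for n in (1, 2, 3, 5, 10)])  # → [1, 2, 5, 52, 115975]
```

The pattern itself additionally ranges over all subsets of the trajectory's variables and all their value assignments, so the partition count is only one factor of the total cost.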
1.2. Contributions
This paper contains four contributions.
We first give a more formal definition of patterns. Since each pattern uniquely specifies a set of trajectories (those trajectories that the pattern occurs in) one might be tempted to reduce the analysis to that of sets of trajectories. We show that this is not possible since not all sets of trajectories have a pattern that specifies them.
Second, we try to get a general intuition for the patterns whose parts make all other parts more probable. For this we show how to construct patterns that, for a given probability of the whole pattern, achieve the least upper bound of specific local integration (SLI). These turn out to be patterns for which each part occurs if and only if the whole pattern occurs. We also construct a pattern that, again for a given probability of the whole pattern, has negative SLI. These patterns (which may achieve the greatest lower bound of SLI) occur if either the whole pattern occurs or the pattern occurs up to exactly one part of it, which does not occur.
Third, we prove the disintegration theorem. This is the main contribution. We saw that patterns are snippets of trajectories. We can also look at the whole trajectory as a single pattern. Like all patterns the trajectory can be split up into parts, i.e., partitioned, resulting in a set of patterns. Among the partitions we find examples such as those in
Figure 1g,h. These are very particular partitions picking out the gliders among all possible parts. This suggests that finding such special partitions provides a (possibly different) notion of entities.
One intuition we might have is that entities are the most “independent” parts of a trajectory. In other words, we could look for the partition whose parts make the other parts less probable. The disintegration theorem then shows that this approach again leads to the ι-entities. This shows that ι-entities do not only have an intuitive motivation but also play a particular role in the structure of probabilities of entire trajectories.
It is not directly the parts of the partitions that minimise SLI for a trajectory which are ι-entities. To get ι-entities we first classify all partitions of the trajectory according to their SLI value. Then within each such class we choose the partitions for which no refining partition (a refining partition is one that further partitions any of the parts of the original partition) achieves an even lower value of SLI.
So according to the disintegration theorem, an ι-entity is not only a pattern that is integrated with respect to every possible partition of the pattern but also a pattern that occurs in partitions that minimise (in a certain sense) the integration of trajectories.
A side effect of the disintegration theorem is that we naturally get a kind of hierarchy of ι-entities called the disintegration hierarchy. For each trajectory and its different levels of SLI we find different decompositions of the trajectory into ι-entities.
Fourth, we calculate the ι-entities and disintegration hierarchy for two simple example systems. Our example systems show that in general the partitions at a particular disintegration level are not unique. This means that there are overlapping ι-entities at those levels. Furthermore, the same ι-entity can occur on multiple levels of the disintegration hierarchy.
We do not thoroughly discuss the disintegration hierarchies in this paper and postpone this to future publications. Here we only note that many entities in the real world occur within hierarchies as well. For example, animals are entities that are composed of cells which are themselves entities.
1.3. Related Work
We now give a quick overview of related work. More in-depth discussions will be provided after we formally introduce our definitions.
To our knowledge the measure of CLI has been proposed for the first time by us in [
8]. However, this publication contained none of the formal or numerical results in the present paper. From a formal perspective the measures of SLI and CLI are a combination of existing concepts. SLI localises multi-information [
9,
10] in the way proposed by Lizier [
11] for other information theoretic measures. In order to get the CLI we apply the weakest-link approach proposed by Tononi and Sporns [
12], Balduzzi and Tononi [
13] to SLI.
Conceptually, our work is most closely related to Beer [14]. The notion of spatiotemporal patterns used there to capture blocks, blinkers, and gliders is equivalent to the patterns we define more formally here. This work also contains an informal entity-criterion that directly deals with identity over time (not only space). It differs significantly from our proposal as it depends on the re-occurrence of certain transitions at later times in a pattern, whereas our criterion only depends on the probabilities of parts of the patterns without the need for any re-occurrences.
The organisations of chemical organisation theory [15] may also be interpreted as entity-criteria. In Fontana and Buss [15] these are defined in the following way:
The observer will conclude that the system is an organisation to the extent that there is a compressed description of its objects and of their relations.
The direct intuition is different from ours, and it is not clear to us to what extent our entity-criterion is equivalent to this. This will be further investigated in the future.
It is worth noting that viewing entities/objects/individuals as patterns occurring within a trajectory is in contrast to an approach that models them as sets of random variables/stochastic processes (e.g., a set of cells in a CA in contrast to a set of specific values of a set of cells). An example of the latter approach is the information theoretic individuals of Krakauer et al. [16]. These individuals are identified using an information theoretic notion of autonomy due to Bertschinger et al. [17]. The latter notion of autonomy is also somewhat related to the idea of integration here. Autonomy contains a term that measures the degree to which a random variable representing an individual at timestep $t$ determines the random variable representing it at $t+1$. Similarly, CLI requires that every part of an entity pattern makes every other part more probable; in the extreme case this means that every part determines that every other part of the pattern also occurs. However, formally, autonomy evaluates random variables and not patterns directly.
At the most basic level, the intuition behind entities is that some spatiotemporal patterns are more special than others. Defining (and usually finding) more important spatiotemporal patterns or structures (also called coherent structures) has a long history in the theory of cellular automata and distributed dynamical systems. As Shalizi et al. [18] have argued, most of the earlier definitions and methods [19,20,21,22] require previous knowledge about the patterns being looked for. They are therefore not suitable for a general definition of entities. More recent definitions based on information theory [18,23,24] no longer have this limitation. The difference to our entity-criterion is that they do not treat identity over time. They are well suited to identify gliders at each time-step, for example, but if two gliders collide and give rise to a third glider as in Figure 1a these methods (by design) say nothing about the identity of the third glider, i.e., they cannot distinguish between a glider absorbing another one and two gliders producing a new one. While we have not been able to show that our approach actually makes such distinctions for gliders, it could do so in principle.
We note here that the approach of identifying individuals by Friston [
25] using Markov blankets has the same shortcoming as the spatiotemporal filters. For each individual time-step it returns a partition of all degrees of freedom into internal, sensory, active, and external degrees. However, it does not provide a way to resolve ambiguities in the case of multiple such partitions colliding.
Among research related to integrated information theory (IIT) there are approaches (a first one by Balduzzi [26] and a more recent one by Hoel et al. [27]) that can be used to determine specific spatiotemporal patterns in a trajectory. They can therefore be interpreted to define a notion of entities even if that is not their main goal. These approaches are aimed at establishing the optimal spatiotemporal coarse-graining to describe the dynamics of a system. For a given trajectory we can then identify the patterns that instantiate a macro-state/coarse-grain that is optimal according to their criterion.
In contrast to our approach the spatiotemporal grains are determined by their interactions with other grains. In our case the entities are determined first and foremost by their internal relations.
The consequence seems to be that a pattern can be an entity in one trajectory and not an entity in another even if it occurs in both. In our conception a pattern is an entity in all trajectories it occurs in.
2. Notation and Background
In this section we briefly introduce our notation for sets of random variables (Since every set of jointly distributed random variables can be seen as a Bayesian network and vice versa we use these terms interchangeably.) and their partition lattices.
In general, we use the convention that upper-case letters are random variables, lower-case letters are specific values/outcomes of random variables, and calligraphic letters are state spaces that random variables take values in. Furthermore:
Definition 1. Let $X_V := \{X_i\}_{i \in V}$ be a set of random variables with totally ordered finite index set $V$ and state spaces $\{\mathcal{X}_i\}_{i \in V}$ respectively. Then for $A \subseteq V$ define:
- 1.
$X_A := (X_i)_{i \in A}$ as the joint random variable composed of the random variables indexed by $A$, where $A$ is ordered according to the total order of $V$,
- 2.
$\mathcal{X}_A := \prod_{i \in A} \mathcal{X}_i$ as the state space of $X_A$,
- 3.
$x_A \in \mathcal{X}_A$ as a value of $X_A$,
- 4.
$p_A : \mathcal{X}_A \to [0,1]$ as the probability distribution (or more precisely probability mass function) of $X_A$, which is the joint probability distribution over the random variables indexed by $A$. If $A = \{i\}$, i.e., a singleton set, we drop the braces and just write $p_i$,
- 5.
$p_{A,B}(x_A, x'_B)$ as the probability distribution over $\mathcal{X}_A \times \mathcal{X}_B$. Note that in general, for arbitrary $A, B \subseteq V$, $x_A \in \mathcal{X}_A$, and $x'_B \in \mathcal{X}_B$, this can be rewritten as a distribution over the intersection of $A$ and $B$ and the respective complements. The variables in the intersection have to coincide:
$$p_{A,B}(x_A, x'_B) = \delta_{x_{A \cap B}, x'_{A \cap B}} \; p_{A \cup B}(x_{A \setminus B}, x_{A \cap B}, x'_{B \setminus A}).$$
Here δ is the Kronecker delta (see Appendix A). If $A$ and $B$ are clear from context we also write $p(x_A, x'_B)$ to keep expressions shorter,
- 6.
$p_{A|B}(x_A \mid x_B) := p_{A,B}(x_A, x_B) / p_B(x_B)$, with $p_B(x_B) > 0$, as the conditional probability distribution over $\mathcal{X}_A$ given $x_B$. We also just write $p(x_A \mid x_B)$ if it is clear from context what variables we are conditioning on.
If we are given $p_V$ we can obtain every $p_A$ with $A \subseteq V$ through marginalisation. In the notation of Definition 1 this is formally written:
$$p_A(x_A) = \sum_{x_{V \setminus A} \in \mathcal{X}_{V \setminus A}} p_V(x_A, x_{V \setminus A}).$$
Next we define the partition lattice of a set of random variables. Partition lattices occur as a structure of the set of possible ways to split an object/pattern into parts. Subsets of the partition lattices play an important role in the disintegration theorem.
Definition 2 (Partition lattice of a set of random variables)
. Let $X_V$ be a set of random variables.
- 1.
Then its partition lattice $\mathfrak{L}(V)$ is the set of partitions of $V$ partially ordered by refinement (see also Appendix B).
- 2.
For two partitions $\pi, \rho \in \mathfrak{L}(V)$ we write $\pi \leq \rho$ if π refines ρ and $\pi \lessdot \rho$ if π covers ρ. The latter means that $\pi \leq \rho$, $\pi \neq \rho$, and there is no $\xi \in \mathfrak{L}(V)$ with $\xi \notin \{\pi, \rho\}$ such that $\pi \leq \xi \leq \rho$.
- 3.
We write $\hat{0}$ for the zero element of a partially ordered set (including lattices) and $\hat{1}$ for the unit element.
- 4.
Given a partition $\pi \in \mathfrak{L}(V)$ and a subset $A \subseteq V$ we define the restricted partition $\pi|_A$ of π to $A$ via:
$$\pi|_A := \{ b \cap A : b \in \pi, \, b \cap A \neq \emptyset \}.$$
For some examples of partition lattices see Appendix B, and for more background see e.g., Grätzer [28]. For our purpose it is important to note that the partitions of sets of random variables or Bayesian networks we are investigating are partitions of the index set $V$ and not partitions of the state spaces $\mathcal{X}_i$.
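For small index sets the partition lattice can be enumerated directly. The following sketch (function names are ours) generates all partitions of an index set, tests refinement, and computes the restricted partition described in Definition 2:

```python
def partitions(elems):
    """Enumerate all set partitions of a list of hashable elements.
    Each partition is a list of blocks (Python sets)."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in partitions(rest):
        # insert `first` into each existing block, or into a new singleton block
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]
        yield part + [{first}]

def refines(pi, rho):
    """True iff pi refines rho: every block of pi lies inside a block of rho."""
    return all(any(b <= c for c in rho) for b in pi)

def restrict(pi, A):
    """Restricted partition pi|_A: intersect blocks with A, drop empty ones."""
    return [b & A for b in pi if b & A]

parts = list(partitions([1, 2, 3]))  # Bell(3) = 5 partitions
```

The finest partition refines every other one, e.g. `refines([{1}, {2}, {3}], [{1, 2, 3}])` holds, while the converse does not.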
3. Patterns, Entities, Specific, and Complete Local Integration
This section contains the formal part of this contribution.
First we introduce patterns. Patterns are the main structures of interest in this publication. Entities are seen as special kinds of patterns. The measures of specific local integration and complete local integration, which we use in our criterion for ι-entities, quantify notions of “oneness” of patterns. We give a brief motivation and show that while each pattern defines a set of “trajectories” of a set of random variables, not every such set is defined by a pattern. This justifies studying patterns for their own sake.
Then we briefly motivate the use of specific and complete local integration (SLI and CLI) for an entity criterion on patterns. We then turn to more formal aspects of SLI and CLI. We first prove an upper bound for SLI and construct a candidate for a lower bound. We then go on to define the disintegration hierarchy and its refinement-free version. These structures are used to prove the main result, the disintegration theorem. This relates the SLI of whole trajectories of a Bayesian network to the CLI of parts of these trajectories and vice versa.
3.1. Patterns
This section introduces the notion of patterns. These form the basic candidate structures for entities.
The structures we are trying to capture by entities should be analogous to spatially and temporally extended objects we encounter in everyday life (e.g., soap bubbles, living organisms). These objects seem to occur in the single history of the universe that also contains us. The purpose of patterns is then to capture arbitrary structures that occur within single trajectories or histories of a multivariate discrete dynamical system (see Figure 2 for an example of a Bayesian network of such a system; any cellular automaton is also such a system).
We emphasise the single trajectory since many structures of interest (e.g., gliders) occur in some trajectories in some “places”, in other trajectories in other “places” (compare e.g., Figure 1a and Figure 3a), and in some trajectories not at all. We explicitly want to be able to capture such trajectory-dependent structures and therefore choose patterns. Examples of formal structures for which it makes no sense to say that they occur within a trajectory are the random variables in a Bayesian network and, as we will see, general sets of trajectories of the Bayesian network.
Unlike entities, which we conceive of as special patterns that fulfil further criteria, patterns are formed by any combination of events at arbitrary times and positions. As an example, we might think of a cellular automaton again. The time evolutions over multiple steps of the cells attributed to a glider (see [14] for a principled way to attribute cells to gliders) as in Figure 1b,e should be patterns, but so should arbitrary choices of events in a trajectory as in Figure 3b.
In the more general context of (finite) Bayesian networks there may be no interpretation of time or space. Nonetheless, we can define that a trajectory in this case fixes every random variable to a particular value. We then define patterns formally in the following way.
Definition 3 (Patterns and trajectories)
. Let $X_V$ be a set of random variables with index set $V$ and state spaces $\{\mathcal{X}_i\}_{i \in V}$ respectively.
- 1.
A pattern at $A \subseteq V$ is an assignment $x_A = (x_i)_{i \in A}$ with $x_i \in \mathcal{X}_i$, i.e., an element $x_A \in \mathcal{X}_A$. If there is no danger of confusion we also just write $x_A$ for the pattern at $A$.
- 2.
The elements $x_V$ of the joint state space $\mathcal{X}_V$ are isomorphic to the patterns at $V$, which fix the complete set of random variables. Since they will be used repeatedly we refer to them as the trajectories of $X_V$.
- 3.
A pattern $x_A$ is said to occur in trajectory $\bar{x}_V$ if $\bar{x}_A = x_A$.
- 4.
Each pattern $x_A$ uniquely defines (or captures) a set of trajectories via
$$\mathfrak{T}(x_A) := \{ \bar{x}_V \in \mathcal{X}_V : \bar{x}_A = x_A \},$$
i.e., the set of trajectories that $x_A$ occurs in.
- 5.
It is convenient to allow the empty pattern $x_\emptyset$, for which we define $\mathfrak{T}(x_\emptyset) := \mathcal{X}_V$.
Remarks:
Since every pattern defines a subset of $\mathcal{X}_V$, one could think that every subset of $\mathcal{X}_V$ is also captured by a pattern. In that case studying patterns in a set of random variables $X_V$ would be the same as studying subsets of its set of trajectories $\mathcal{X}_V$. However, the set of subsets of $\mathcal{X}_V$ defined by patterns and the set of all subsets (i.e., the power set) of $\mathcal{X}_V$ are not identical. Formally:
While patterns define subsets of $\mathcal{X}_V$, not every subset of $\mathcal{X}_V$ is captured by a pattern. The difference between the two sets is characterised in Theorem 1 below. We first present a simple example of a subset that cannot be captured by a pattern.
Let $V = \{1, 2\}$ and $X_V = \{X_1, X_2\}$ be the set of random variables with $\mathcal{X}_1 = \mathcal{X}_2 = \{0, 1\}$. Then $\mathcal{X}_V = \{(0,0), (0,1), (1,0), (1,1)\}$. Now let $A = V$, choose pattern $x_V = (0,0)$ and pattern $x'_V = (1,1)$. Then let
$$\mathfrak{X} := \mathfrak{T}(x_V) \cup \mathfrak{T}(x'_V) = \{(0,0), (1,1)\}.$$
In this case we can easily list the set of all patterns of $X_V$: the empty pattern $x_\emptyset$; the patterns $x_1 = 0$, $x_1 = 1$, $x_2 = 0$, $x_2 = 1$ at the singleton sets; and the four patterns $(0,0), (0,1), (1,0), (1,1)$ at $V$. We can then verify that $\mathfrak{X}$ is not among the sets of trajectories these patterns capture. Before we formally characterise the difference, we define some extra terminology.
Definition 4. Let $X_V$ be a set of random variables with index set $V$ and state spaces $\{\mathcal{X}_i\}_{i \in V}$ respectively. For a subset $\mathfrak{X} \subseteq \mathcal{X}_V$ the set of all patterns at $A \subseteq V$ that occur in one of the trajectories in $\mathfrak{X}$ is defined as
$$\mathfrak{X}_A := \{ x_A \in \mathcal{X}_A : \exists \bar{x}_V \in \mathfrak{X} \text{ with } \bar{x}_A = x_A \}.$$
So in the previous example, $\mathfrak{X}_{\{1\}} = \{0, 1\}$ and $\mathfrak{X}_{\{1,2\}} = \mathfrak{X}$. We then get the following theorem, which establishes the difference between the subsets of $\mathcal{X}_V$ captured by patterns and general subsets.
Theorem 1. Given a set of random variables $X_V$, a subset $\mathfrak{X} \subseteq \mathcal{X}_V$ cannot be represented by a pattern of $X_V$ if and only if there exists $A \subseteq V$ with $\mathfrak{X}_A \subset \mathcal{X}_A$ (proper subset) and $|\mathfrak{X}_A| > 1$, i.e., if neither all patterns at $A$ are possible nor a unique pattern at $A$ is specified by $\mathfrak{X}$.
We saw that in the previous example the subset $\mathfrak{X} = \{(0,0), (1,1)\}$ cannot be captured by a pattern. For $A = \{1\}$ we have $\mathfrak{X}_A = \mathcal{X}_A$ and for $A = \{2\}$ we have $\mathfrak{X}_A = \mathcal{X}_A$, so these do not fulfil the conditions of Theorem 1. However, for $A = \{1,2\}$ we have $\mathfrak{X}_A \subset \mathcal{X}_A$ and $|\mathfrak{X}_A| = 2 > 1$, so the conditions of Theorem 1 are fulfilled and, as expected, $\mathfrak{X}$ cannot be captured by a pattern.
The proof of the following corollary shows how to construct a subset that cannot be represented by a pattern for all sets of random variables with $|\mathcal{X}_V| > 2$.
Corollary 1. Given a set of random variables $X_V$, if $|\mathcal{X}_V| > 2$ then the set of subsets of $\mathcal{X}_V$ captured by patterns is a proper subset of the power set of $\mathcal{X}_V$. Proof. Choose $\mathfrak{X} = \mathcal{X}_V \setminus \{\bar{x}_V\}$ for some $\bar{x}_V \in \mathcal{X}_V$. Then for $A = V$ we have $\mathfrak{X}_A \subset \mathcal{X}_A$ and $|\mathfrak{X}_A| = |\mathcal{X}_V| - 1 > 1$. So $\mathfrak{X}$ cannot be represented by a pattern according to Theorem 1. ☐
This means that in every set of random variables that consists of more than a single binary random variable there are subsets of $\mathcal{X}_V$ that cannot be captured by a pattern. We can interpret this result in the following way. Patterns were constructed to be structures that occur within trajectories. It then turned out that each pattern also defines a subset of all trajectories of a system. So for sets of trajectories captured by patterns it could make sense to say they “occur” within one trajectory. However, there are sets of trajectories that are not captured by patterns. For these sets of trajectories it would not be well-defined to say that they occur within a trajectory. This is the reason we choose to investigate patterns specifically and not sets of trajectories.
3.2. Motivation of Complete Local Integration as an Entity Criterion
We proposed to use patterns as the candidate structures for entities since patterns comprise arbitrary structures that occur within single trajectories of multivariate systems. Here we heuristically motivate our choice of using positive complete local integration as a criterion to select entities among patterns. In general such a criterion would give us, for any Bayesian network, a subset of the patterns.
So what is an entity? We can rephrase the problem of finding an entity criterion by saying an entity is composed of parts that share the same identity. So if we can define when parts share the same identity we also define entities by finding all parts that share identity with some given part. For the moment, let us decompose (as is often done [33]) the problem of identity into two parts: spatial identity and temporal identity. Our solution will make no distinction between these two aspects in the end. We note here that conceiving of entities (or objects) as composites of spatial and temporal parts, as we do in this paper, is referred to as four-dimensionalism or perdurantism in philosophical discussions (see e.g., [34]). The opposing view holds that entities are spatial only and endure over time. This view is called endurantism. Here we will not go into the details of this discussion.
The main intuition behind complete local integration is that every part of an entity should make every other part more probable.
This seems to hold for example for the spatial identity of living organisms. Parts of living organisms rarely exist without the rest of the living organisms also existing. For example, an arm existing without the corresponding rest of a human body is rare compared to an arm existing together with the rest of a human body. The body (without arm) seems to make the existence of the arm more probable and vice versa. Similar relations between parts seem to hold for all living organisms but also for some non-living structures. The best examples of non-living structures we know of for which this is obvious are soap bubbles. Half soap bubbles (or thirds, quarters, ...) only ever exist for split seconds whereas entire soap bubbles can persist for up to minutes. Any part of a soap bubble seems to make the existence of the rest more probable. Similarly, parts of hurricanes or tornadoes are rare. So what about spatial parts of structures that are not so entity-like? Does the existence of an arm make things more probable that are not parts of the corresponding body? For example, does the arm make the existence of some piece of rock more probable? Maybe to a small degree, as without the existence of any rocks in the universe humans are probably impossible. However, this effect is much smaller than the increase of probability of the existence of the rest of the body due to the arm.
These arguments concerned the spatial identity problem. However, for temporal identity similar arguments hold. The existence of a living organism at one point in time makes it more probable that there is a living organism (in the vicinity) at a subsequent (and preceding) point in time. If we look at structures that are not entity-like with respect to the temporal dimension we find a different situation. An arm at some instance of time does not make the existence of a rock at a subsequent instance much more probable. It does make the existence of a human body at a subsequent instance much more probable. So the human body at the second instance seems to be more like a future part of the arm than the rock.
Switching now to patterns in sets of random variables, we can easily formalise such intuitions. We required that for an entity every part of the structure, which is now a pattern $x_O$, makes every other part more probable. A part of a pattern $x_O$ is a pattern $x_A$ with $\emptyset \neq A \subset O$. If we require that every part of a pattern makes every other part more probable then we can write that $x_O$ is an entity if:
$$\forall A \text{ with } \emptyset \neq A \subset O : \; p(x_A \mid x_{O \setminus A}) > p(x_A).$$
If we write $\mathfrak{L}_2(O)$ for the set of all bipartitions of $O$ we can rewrite this further as
$$\forall \pi \in \mathfrak{L}_2(O) : \; p(x_O) > \prod_{b \in \pi} p(x_b).$$
We can interpret this form as requiring that for every possible partition $\pi = \{b_1, b_2\}$ into two parts the probability of the whole pattern $x_O$ is bigger than its probability would be if the two parts were independent. To see this, note that if the two parts $x_{b_1}, x_{b_2}$ were independent we would have
$$p(x_O) = p(x_{b_1}) \, p(x_{b_2}),$$
which would give us $p(x_O) = \prod_{b \in \pi} p(x_b)$ for this partition.
From this point of view the choice of bipartitions only seems arbitrary. For example, the existence of a partition $\pi = \{b_1, b_2, b_3\}$ into three parts such that
$$p(x_O) \leq \prod_{b \in \pi} p(x_b)$$
seems to suggest that the pattern $x_O$ is not an entity but instead a composite of three parts. We can therefore generalise Equation (16) to include all partitions $\pi \in \mathfrak{L}(O)$ (see Definition 2) of $O$ except the unit partition $\hat{1} = \{O\}$. Then we would say that $x_O$ is an entity if
$$\forall \pi \in \mathfrak{L}(O) \setminus \{\hat{1}\} : \; p(x_O) > \prod_{b \in \pi} p(x_b).$$
This criterion already results in the same entities as the measure we propose. However, in order to connect with information theory, log-likelihoods, and related literature we formally introduce the logarithm into this equation. We then arrive at the following entity-criterion:
$$\min_{\pi \in \mathfrak{L}(O) \setminus \{\hat{1}\}} \log \frac{p(x_O)}{\prod_{b \in \pi} p(x_b)} > 0,$$
where the left hand side is the complete local integration (CLI), the function minimised is the specific local integration (SLI), and the inequality provides the criterion for ι-entities. For reference, we define these notions formally. We begin with SLI, which quantifies, for a given partition $\pi$ of a pattern, to what extent the probability of the whole pattern is bigger than its probability would be if the blocks of the partition were independent.
Definition 5 (Specific local integration (SLI)). Given a Bayesian network $\{X_V\}$ and a pattern $x_O$, the specific local integration of $x_O$ with respect to a partition π of O is defined as
$$\mathrm{mi}_\pi(x_O) := \log \frac{p(x_O)}{\prod_{b \in \pi} p(x_b)}.$$
In this paper we use the convention that $\log \frac{0}{0} := 0$.
Definition 6 ((Complete) local integration). Given a Bayesian network $\{X_V\}$ and a pattern $x_O$ of this network, the complete local integration of $x_O$ is the minimum SLI over the non-unit partitions:
$$\iota(x_O) := \min_{\pi \in \mathfrak{L}(O) \setminus \{\{O\}\}} \mathrm{mi}_\pi(x_O),$$
where $\mathfrak{L}(O)$ is the lattice of partitions of O. We call a pattern $x_O$ completely locally integrated if $\iota(x_O) > 0$.
Definition 7 (ι-entity). Given a multivariate Markov chain, a pattern $x_O$ is a ι-entity if $\iota(x_O) > 0$. The entire set of ι-entities is then defined as follows.
Definition 8 (ι-entity-set). Given a multivariate Markov chain, the ι-entity-set is the entity-set consisting of all completely locally integrated patterns, i.e., all patterns with positive complete local integration. Next, we look at some interpretations that the introduction of the logarithm allows.
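To make Definitions 5-8 concrete, the following minimal sketch computes SLI and CLI for an assumed toy joint distribution over three binary variables; the distribution and the function names `sli` and `cli` are our own illustration, not part of the formal development.

```python
import math
from itertools import product

# Assumed toy joint distribution: two strongly correlated "all equal"
# states, remaining probability spread uniformly over the other states.
p = {x: (0.4 if x in ((0, 0, 0), (1, 1, 1)) else 0.2 / 6)
     for x in product((0, 1), repeat=3)}

def marginal(block, pattern):
    """Probability of the restriction of `pattern` to the nodes in `block`."""
    return sum(pr for x, pr in p.items()
               if all(x[i] == pattern[i] for i in block))

def partitions(items):
    """All partitions of the list `items`, each given as a list of blocks."""
    if len(items) == 1:
        return [[items]]
    first, rest = items[0], items[1:]
    out = []
    for s in partitions(rest):
        out += [s[:i] + [[first] + s[i]] + s[i + 1:] for i in range(len(s))]
        out.append([[first]] + s)
    return out

def sli(pattern, pi):
    """Specific local integration of `pattern` w.r.t. partition `pi` (Def. 5)."""
    whole = marginal(sorted(pattern), pattern)
    return math.log(whole / math.prod(marginal(b, pattern) for b in pi))

def cli(pattern):
    """Complete local integration: minimum SLI over non-unit partitions (Def. 6)."""
    non_unit = [pi for pi in partitions(sorted(pattern)) if len(pi) > 1]
    return min(sli(pattern, pi) for pi in non_unit)

pattern = {0: 1, 1: 1, 2: 1}
print(cli(pattern) > 0)  # positive CLI: the pattern is a iota-entity
```

Here the all-ones pattern has positive CLI and therefore qualifies as a ι-entity; weakening the correlation in `p` drives the CLI towards zero.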
3.3. Properties of Specific Local Integration
This section investigates the specific local integration (SLI) (see Definition 5). After giving its expression for deterministic systems it proves upper bounds constructively and constructs an example of negative SLI.
3.3.1. Deterministic Case
Theorem 2 (Deterministic specific local integration). Given a deterministic Bayesian network (Definition A10), a uniform initial distribution over $X_{V_0}$ ($V_0$ is the set of nodes without parents), and a pattern $x_O$, the SLI of $x_O$ with respect to partition π can be expressed more specifically. Let $N$ be the total number of trajectories and let $n(x_A)$ refer to the number of trajectories in which a pattern $x_A$ occurs. Then
$$\mathrm{mi}_\pi(x_O) = (|\pi| - 1)\log N + \log n(x_O) - \sum_{b \in \pi} \log n(x_b).$$
The first term in Equation (30) is always positive if the partition and the set of random variables are not trivial (i.e., have cardinality larger than one), and it is a constant for partitions of a given cardinality. The second term is always non-negative for patterns $x_O$ that actually occur in the system and rises with the number of trajectories that lead to them. The third term is always non-positive and becomes more and more negative the higher the number of trajectories that lead to the parts of the pattern occurring.
This shows that to maximise SLI for fixed partition cardinality we need to find patterns that have a high number of trajectories leading to them and a low number of occurrences for all their parts. Since the number of occurrences of the parts cannot be lower than the number of occurrences of the whole, we should get a maximum SLI for patterns whose parts occur only if the whole occurs. This turns out to be true also for non-deterministic systems, as we prove in Theorem 4.
Conversely, if we can increase the number of occurrences of the parts of the pattern without increasing the occurrences of the whole pattern, we minimise the SLI. This leads to the intuition that as often as possible as many parts as possible (i.e., all but one) should co-occur without the whole pattern occurring. This consistently leads to negative SLI, as we will show for the non-deterministic case in Theorem 5.
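The trajectory-counting expression above can be checked on a small assumed deterministic toy system; the dynamics, pattern, and partition below are chosen purely for illustration.

```python
import math
from itertools import product

# Assumed toy deterministic system: two binary cells, one time step, and
# a next state that copies cell 0 into both cells; the initial distribution
# is uniform over the 4 initial states, so p(x_A) = n(x_A)/N.
def step(state):
    a, b = state
    return (a, a)  # both cells copy cell 0

trajs = [(s, step(s)) for s in product((0, 1), repeat=2)]
N = len(trajs)

def count(pattern):
    """Number of trajectories in which the pattern, a (time, cell) -> value
    map, occurs."""
    return sum(1 for tr in trajs
               if all(tr[t][c] == v for (t, c), v in pattern.items()))

def sli_det(pattern, pi):
    """SLI via trajectory counts:
    (|pi|-1)*log N + log n(x_O) - sum_b log n(x_b)."""
    return ((len(pi) - 1) * math.log(N)
            + math.log(count(pattern))
            - sum(math.log(count({k: pattern[k] for k in b})) for b in pi))

pattern = {(0, 0): 1, (1, 0): 1, (1, 1): 1}  # cell 0 at t=0 and both cells at t=1
pi = [[(0, 0)], [(1, 0), (1, 1)]]            # a bipartition of the pattern
print(sli_det(pattern, pi))                  # log 2: each part makes the other certain
```

The result agrees with computing SLI directly from probabilities, since with a uniform initial distribution every pattern probability is a trajectory count divided by N.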
3.3.2. Upper Bounds
In this section we present the upper bounds of SLI. These are of general interest, but the constructive proof also provides an intuition for what kind of patterns have large SLI.
We first show constructively that if we can choose the Bayesian network and the pattern then SLI can be arbitrarily large. This construction sets the probabilities of all blocks equal to the probability of the pattern, which implies that each of the parts of the pattern occurs only if the entire pattern occurs. The simplest example is one binary random variable determining another to always be in the same state; then the two patterns with both variables equal have this property. In the subsequent theorem we show that this property in general gives the upper bound of SLI if the cardinality of the partition is fixed. A simple extension of this example is used in the proof of the least upper bound. First we prove that there are Bayesian networks that achieve a particular SLI value; this will be used in the proofs that follow. For this we first define the anti-patterns, which are patterns that differ from a given pattern at every random variable that is specified.
Definition 9 (Anti-pattern). Given a pattern $x_O$, define its set of anti-patterns as the set of patterns $\bar{x}_O$ that have values different from those of $x_O$ on all variables in O.
Theorem 3 (Construction of a pattern with maximum SLI). Given a probability $q$ and a positive natural number n, there is a Bayesian network and a pattern $x_O$ with $p(x_O) = q$ together with a partition π of cardinality n such that $\mathrm{mi}_\pi(x_O) = -(n-1)\log q$. Proof. We construct a Bayesian network which realises two conditions on the probability. From these two conditions (which can also be realised by other Bayesian networks) we can then derive the theorem.
Choose a Bayesian network with binary random variables $X_i$ for all $i \in V$. Choose all nodes in O dependent only on a node $j \in O$; the dependence of the nodes in $V \setminus O$ is arbitrary:
for all $i \in O$ let the parents of i lie within O, i.e., nodes in O have no parents in the complement of O,
for a specific $j \in O$ and all other $i \in O \setminus \{j\}$ let j be the only parent of i, i.e., all nodes in O apart from j have j as a parent,
for all $i \in O \setminus \{j\}$ let the mechanism copy the parent state, i.e., the state of all nodes in O is always the same as the state of node j,
also choose the probability that node j takes the value specified by the pattern to be q.
Then it is straightforward to see that:
$p(x_O) = q$,
$p(x_b) = q$ for every non-empty block $b \subseteq O$.
Note that there are many Bayesian networks that realise the latter two conditions for some pattern. These latter two conditions are the only requirements for the following calculation.
Next note that the two conditions imply that a pattern on O has positive probability only if it is either $x_O$ itself or one of its anti-patterns. Then for every partition π of O with $|\pi| = n$ we have
$$\mathrm{mi}_\pi(x_O) = \log \frac{q}{q^{n}} = -(n-1)\log q.$$
☐
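The construction in the proof can be verified numerically; the following sketch uses assumed example values for q and n and represents only the two joint states that carry probability.

```python
import math

# Sketch of the construction (q and n are assumed example values): a binary
# driver variable with P(1) = q is copied into n further variables, so every
# block of the all-ones pattern has probability q, and SLI for the partition
# into n + 1 singletons attains (|pi| - 1) * log(1/q).
q, n = 0.3, 3
p = {(1,) * (n + 1): q, (0,) * (n + 1): 1 - q}  # only two states occur

def marginal(block):
    """Probability that all variables in `block` equal 1."""
    return sum(pr for x, pr in p.items() if all(x[i] == 1 for i in block))

whole = marginal(range(n + 1))                           # = q
blocks = math.prod(marginal([i]) for i in range(n + 1))  # = q**(n+1)
sli = math.log(whole / blocks)
print(sli)  # equals n * log(1/q) up to rounding
```

Each singleton block has probability q because it can only occur together with the whole pattern, which is exactly the condition characterising the upper bound in the next theorem.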
Theorem 4 (Upper bound of SLI). For any Bayesian network and pattern $x_O$ with fixed $p(x_O)$:
- 1.
The tight upper bound of the SLI with respect to any partition π with fixed cardinality $|\pi|$ is $-(|\pi|-1)\log p(x_O)$.
- 2.
The upper bound is achieved if and only if for all $b \in \pi$ we have $p(x_b) = p(x_O)$.
- 3.
The upper bound is achieved if and only if for all $b \in \pi$ we have that $x_b$ occurs if and only if $x_O$ occurs.
Proof. - ad 1
Note that for any $b \in \pi$ the block pattern $x_b$ occurs whenever $x_O$ occurs, and therefore $p(x_b) \ge p(x_O)$. Plugging this into Equation (41) for every $b \in \pi$ we get
$$\mathrm{mi}_\pi(x_O) = \log p(x_O) - \sum_{b \in \pi} \log p(x_b) \le \log p(x_O) - |\pi| \log p(x_O) = -(|\pi| - 1)\log p(x_O).$$
This shows that $-(|\pi|-1)\log p(x_O)$ is indeed an upper bound. To show that it is tight we have to show that for a given probability and partition cardinality there are Bayesian networks with patterns such that this upper bound is achieved. The construction of such a Bayesian network and a pattern was presented in Theorem 3.
- ad 2
If for all $b \in \pi$ we have $p(x_b) = p(x_O)$ then clearly $\mathrm{mi}_\pi(x_O) = -(|\pi|-1)\log p(x_O)$ and the least upper bound is achieved. If on the other hand the least upper bound is achieved, then because $p(x_b) \ge p(x_O)$ (Equation (44)) any deviation of any of the $p(x_b)$ from $p(x_O)$ leads to $\mathrm{mi}_\pi(x_O) < -(|\pi|-1)\log p(x_O)$, such that for all $b \in \pi$ we must have $p(x_b) = p(x_O)$.
- ad 3
By definition, for any $b \in \pi$ the block pattern $x_b$ is a part of $x_O$, such that $x_b$ always occurs if $x_O$ occurs. Now assume $x_b$ can occur while $x_O$ does not occur. In that case there is a positive probability for a pattern that coincides with $x_O$ on b but differs from it elsewhere, i.e., $p(x_b) > p(x_O)$. Recalling Equation (43) we then see that the upper bound is not attained, which contradicts the assumption that it is achieved, so $x_b$ cannot occur without $x_O$ occurring as well.
☐
Remarks:
Note that this is the least upper bound for Bayesian networks in general. For a specific Bayesian network there might be no pattern that achieves this bound.
The least upper bound of SLI increases with the improbability of the pattern and with the number of parts that it is split into. As $p(x_O) \to 0$ the bound $-(|\pi|-1)\log p(x_O)$ grows without limit.
Using this least upper bound it is easy to see the least upper bound for the SLI of a pattern across all partitions. We just have to note that the cardinality of a partition of O is at most $|O|$.
Since CLI is the minimum value of SLI with respect to arbitrary partitions, the least upper bound of SLI is also an upper bound for CLI. It may not be the least upper bound, however.
3.3.3. Negative SLI
This section shows that the SLI of a pattern with respect to a partition can be negative independently of the probability of the pattern (as long as it is not 1) and of the cardinality of the partition (as long as that is not 1). The construction which achieves this also serves as an example of patterns with low SLI. We conjecture that this construction might provide the greatest lower bound but have not been able to prove this yet. An intuitive description of the construction is that patterns which either occur as a whole or are missing exactly one part always have negative SLI.
Theorem 5. For any given probability $q < 1$ and cardinality $n > 1$ of a partition π there exists a Bayesian network with a pattern $x_O$ such that $p(x_O) = q$ and $\mathrm{mi}_\pi(x_O) < 0$. Proof. We construct the probability distribution of the nodes in O directly and ignore the behaviour of the Bayesian network outside of O. In any case the restriction to O is also by itself a Bayesian network. We define the distribution as follows (see the remarks below for some intuitions behind these definitions and Definition 9 for the anti-patterns):
Here d parameterises the probability of any remaining pattern occurring. We will carry it through the calculation but then end up setting it to zero.
Next we calculate the SLI. First note that, according to 1. and 2., the block probabilities coincide for all blocks of π, so the product of block probabilities can be written in terms of this common value. Then note that, according to 3., this common value can be computed explicitly. Plugging this into the SLI definition, we can use Bernoulli's inequality (the authors thank von Eitzen [39] for pointing this out; an example reference for Bernoulli's inequality is Bullen [40]) to prove that the result is negative for $q < 1$ and $n > 1$. Bernoulli's inequality states that $(1 + x)^n \ge 1 + nx$ for $x \ge -1$ and n a natural number. Replacing x by the appropriate negative quantity we see that the argument of the logarithm is smaller than one, which gives us negative SLI. ☐
Remarks:
The achieved value in Equation (53) is also our best candidate for a greatest lower bound of SLI for given probability and partition cardinality. However, we have not been able to prove this yet.
The construction equidistributes the probability (left to be distributed after the probability q of the whole pattern occurring is chosen) over the patterns that are almost the same as the pattern. These are almost the same in a precise sense: they differ in exactly one of the blocks of the partition, i.e., they differ by as little as can possibly be resolved/revealed by the partition.
In order to achieve the negative SLI of Equation (64) the only requirement is that Equation (59) is satisfied. Our construction shows one way in which this can be achieved.
For a pattern and partition for which the relevant quotient is not a natural number, the same bound might still be achieved; however, a little extra effort has to go into construction 3. of the proof such that Equation (59) still holds. This is not necessary for our purpose here as we only want to show the existence of patterns achieving the negative value.
Since it is the minimum value of SLI with respect to arbitrary partitions the candidate for the greatest lower bound of SLI is also a candidate for the greatest lower bound of CLI.
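A minimal numeric instance of the negative-SLI construction (two binary variables, a bipartition into singletons, and an assumed value of q) confirms the negative sign.

```python
import math

# Numeric instance of the negative-SLI construction (assumed example): the
# whole pattern (1, 1) occurs with probability q; otherwise exactly one of
# its two parts occurs, each case with probability (1 - q) / 2.
q = 0.5
p = {(1, 1): q, (1, 0): (1 - q) / 2, (0, 1): (1 - q) / 2, (0, 0): 0.0}
p1 = p[(1, 1)] + p[(1, 0)]   # marginal P(x1 = 1) = (1 + q) / 2
p2 = p[(1, 1)] + p[(0, 1)]   # marginal P(x2 = 1) = (1 + q) / 2
sli = math.log(q / (p1 * p2))
print(sli < 0)  # True: the parts inflate the marginals without the whole occurring
```

The marginals of both parts exceed q, so the product probability exceeds the whole-pattern probability and the logarithm is negative, exactly as in the construction of Theorem 5.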
3.4. Disintegration
In this section we define the disintegration hierarchy and its refinement-free version. We then prove the disintegration theorem, which is the main formal result of this paper. It exposes a connection between partitions minimising the SLI of a trajectory and the CLI of the blocks of such partitions. More precisely, for a given trajectory the finest partitions among those leading to a particular value of SLI consist only of singletons and completely locally integrated blocks. Conversely, each completely locally integrated pattern is a block in such a finest partition leading to a particular value of SLI. The theorem therefore reveals that ι-entities can not only be motivated heuristically, as we tried to do in Section 3.2, but in fact play a special role within the trajectories they occur in. Furthermore, this theorem allows additional interpretations of the ι-entities, which will be discussed in Section 3.5.
The main tools we use for the proof, the disintegration hierarchy and especially its refinement-free version, are also interesting structures in their own right since they define a hierarchy among the partitions of trajectories that we did not anticipate. In the case of the refinement-free version the disintegration theorem tells us that this hierarchy among partitions of trajectories turns out to be a hierarchy of splits of the trajectory into ι-entities.
Definition 10 (Disintegration hierarchy). Given a Bayesian network $\{X_V\}$ and a trajectory $x_V$, the disintegration hierarchy of $x_V$ is the sequence $\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3, \dots$ of sets of partitions of V defined recursively by
$$\mathcal{D}_i := \operatorname*{arg\,min}_{\pi \in \mathfrak{L}(V) \setminus (\mathcal{D}_1 \cup \dots \cup \mathcal{D}_{i-1})} \mathrm{mi}_\pi(x_V),$$
where $\mathfrak{L}(V)$ is the lattice of partitions of V. We call $\mathcal{D}_i$ the i-th disintegration level.
Remark:
Note that arg min returns all partitions that achieve the minimum SLI if there is more than one.
Since the Bayesian networks we use are finite, the partition lattice is finite, the set of attained SLI values is finite, and the number of disintegration levels is finite.
In most cases the Bayesian network contains some symmetries among its mechanisms which cause multiple partitions to attain the same SLI value.
For each trajectory the disintegration hierarchy thus partitions the set of partitions of V into subsets of equal SLI. The levels of the hierarchy have increasing SLI.
Definition 11. Let $\mathfrak{L}(V)$ be the lattice of partitions of the set V and let $\mathcal{D}$ be a subset of $\mathfrak{L}(V)$. Then for every element $\pi \in \mathcal{D}$ we can define the set of refinements of π within $\mathcal{D}$, that is, the set of partitions in $\mathcal{D}$ that are refinements of π. Definition 12 (Refinement-free disintegration hierarchy). Given a Bayesian network, a trajectory $x_V$, and its disintegration hierarchy, the refinement-free disintegration hierarchy of $x_V$ is the sequence of sets of partitions of V whose i-th level consists of those partitions of the i-th disintegration level that have no strict refinements at their own or any preceding level.
Remark:
Each level in the refinement-free disintegration hierarchy consists only of those partitions that have refinements neither at their own nor at any of the preceding levels. So each partition that occurs in the refinement-free disintegration hierarchy at the i-th level is a finest partition that achieves such a low level of SLI or, equivalently, such a high level of disintegration.
As we will see below, the blocks of the partitions in the refinement-free disintegration hierarchy are the main reason for defining it.
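Both hierarchies can be sketched for a small assumed example; here we treat a single joint state of three binary variables as the trajectory and enumerate all partitions recursively. The distribution and names are our own illustration.

```python
import math
from itertools import groupby, product

# Assumed 3-variable joint distribution; one joint state plays the role of
# the trajectory x_V.
p = {x: (0.4 if x in ((0, 0, 0), (1, 1, 1)) else 0.2 / 6)
     for x in product((0, 1), repeat=3)}
traj = (1, 1, 1)

def marginal(block):
    return sum(pr for x, pr in p.items() if all(x[i] == traj[i] for i in block))

def partitions(items):
    """All partitions of `items`, each given as a list of blocks."""
    if len(items) == 1:
        return [[items]]
    first, rest = items[0], items[1:]
    out = []
    for s in partitions(rest):
        out += [s[:i] + [[first] + s[i]] + s[i + 1:] for i in range(len(s))]
        out.append([[first]] + s)
    return out

def sli(pi):
    return math.log(p[traj] / math.prod(marginal(b) for b in pi))

# Disintegration hierarchy: group all partitions of V by SLI, increasing.
parts = sorted(partitions([0, 1, 2]), key=sli)
levels = [list(g) for _, g in groupby(parts, key=lambda pi: round(sli(pi), 9))]

def refines(pi, rho):
    """True if pi is a strict refinement of rho."""
    return pi != rho and all(any(set(b) <= set(c) for c in rho) for b in pi)

# Refinement-free hierarchy: drop any partition that has a strict refinement
# at its own or a preceding (lower-SLI) level.
rf = []
for i, level in enumerate(levels):
    candidates = [rho for lv in levels[:i + 1] for rho in lv]
    rf.append([pi for pi in level
               if not any(refines(rho, pi) for rho in candidates)])
print([len(l) for l in rf])
```

In this highly symmetric example no partition is eliminated by the refinement filter, because all strict refinements of a partition happen to lie at later (higher-SLI) levels; in larger systems the filter does remove partitions.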
Theorem 6 (Disintegration theorem). Let $\{X_V\}$ be a Bayesian network, $x_V$ one of its trajectories, and consider the associated refinement-free disintegration hierarchy.
- 1.
Then for every partition π at every level of the refinement-free disintegration hierarchy we find for every block $b \in \pi$ that there are only the following possibilities:
- (a)
b is a singleton, i.e., $b = \{v\}$ for some $v \in V$, or
- (b)
$x_b$ is completely locally integrated, i.e., $\iota(x_b) > 0$.
- 2.
Conversely, for any completely locally integrated pattern $x_A$, there is a partition π at some level of the refinement-free disintegration hierarchy such that $A \in \pi$.
Proof. - ad 1
We prove the theorem by contradiction. For this assume that there is a block b in a partition π of the refinement-free disintegration hierarchy which is neither a singleton nor completely integrated. Assume b is not a singleton, i.e., there exist $v, w \in b$ with $v \neq w$. Also assume that b is not completely integrated, i.e., there exists a partition ξ of b with $|\xi| > 1$ such that $\mathrm{mi}_\xi(x_b) \le 0$. Note that a singleton cannot be completely locally integrated as it does not allow for a non-unit partition. So together the two assumptions imply $\mathrm{mi}_\xi(x_b) \le 0$ with $|\xi| > 1$. However, then the partition ρ obtained from π by replacing the block b with the blocks of ξ is a strict refinement of π with
$$\mathrm{mi}_\rho(x_V) = \mathrm{mi}_\pi(x_V) + \mathrm{mi}_\xi(x_b) \le \mathrm{mi}_\pi(x_V).$$
We treat the cases of "<" and "=" separately. First, let $\mathrm{mi}_\xi(x_b) < 0$. Then ρ has strictly smaller SLI than π and lies at a preceding disintegration level, so π has a refinement at a preceding level. Second, let $\mathrm{mi}_\xi(x_b) = 0$. Then ρ has the same SLI as π and lies at the same disintegration level, so π has a refinement at its own level. In both cases π cannot be an element of the refinement-free disintegration hierarchy, which contradicts the assumption.
- ad 2
By assumption $x_A$ is completely locally integrated. Then let π be the partition of V consisting of the block A together with singletons $\{v\}$ for all $v \in V \setminus A$. Since π is a partition of V it is an element of some disintegration level. Then π is also an element of the corresponding refinement-free disintegration level, as we will see in the following. This is because any refinement of π must (by construction of π) break up A into further blocks, which means that the specific local integration of all such partitions is higher. They must then be at a disintegration level of strictly higher SLI. Therefore, π has no refinement at its own or a preceding disintegration level. More formally, let ρ be a strict refinement of π. Since π only contains singletons apart from A, the partition ρ must split the block A into multiple blocks which form a non-unit partition ξ of A. Since $\iota(x_A) > 0$ we know that $\mathrm{mi}_\xi(x_A) > 0$, so that
$$\mathrm{mi}_\rho(x_V) = \mathrm{mi}_\pi(x_V) + \mathrm{mi}_\xi(x_A) > \mathrm{mi}_\pi(x_V).$$
Therefore ρ is on a disintegration level of strictly higher SLI than that of π; but this is true for any strict refinement of π, so π belongs to the refinement-free disintegration hierarchy and A is one of its blocks.
☐
We mentioned in Section 3.2 that the expectation value of SLI is the (specific) multi-information. A positive SLI value of a pattern with respect to a partition implies a positive expectation value, i.e., positive specific multi-information, with respect to that partition. Therefore every ι-entity implies positive specific multi-informations with respect to any partition. We put this into the following corollary.
Corollary 2. Under the conditions of Theorem 6, for every partition π at every level of the refinement-free disintegration hierarchy we find for every block $b \in \pi$ that there are only the following possibilities:
- 1.
b is a singleton, i.e., $b = \{v\}$ for some $v \in V$, or
- 2.
$X_b$ is completely (not only locally) integrated, i.e., its specific multi-information is positive with respect to every non-unit partition.
Proof. Since the specific multi-information is a Kullback–Leibler divergence we know from Gibbs' inequality that it is non-negative, and that it vanishes if and only if the joint distribution coincides with the product of its block marginals for all patterns. To see that the specific multi-information is a Kullback–Leibler divergence note that it is the divergence of the joint distribution of the block random variables from the product of their marginals. Now let a specific pattern $x_b$ be a ι-entity. Then for every non-unit partition ξ of b we have $\mathrm{mi}_\xi(x_b) > 0$, which implies that the joint distribution differs from the product of the block marginals, and therefore the Kullback–Leibler divergence is positive, which implies positive specific multi-information with respect to ξ. ☐
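The relation between pointwise SLI and its expectation can be illustrated with an assumed two-variable toy distribution: an individual pattern can have negative SLI while the expectation, being a Kullback–Leibler divergence, remains non-negative.

```python
import math

# Assumed toy joint distribution over two binary variables: the states
# (0, 1) and (1, 0) are favoured, so the pattern (1, 1) has negative SLI.
p = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}
px = {v: sum(pr for x, pr in p.items() if x[0] == v) for v in (0, 1)}
py = {v: sum(pr for x, pr in p.items() if x[1] == v) for v in (0, 1)}

def sli(x):
    """Pointwise SLI of pattern x w.r.t. the partition into singletons."""
    return math.log(p[x] / (px[x[0]] * py[x[1]]))

# Expectation of SLI over all patterns = multi-information = KL divergence
# of the joint from the product of its marginals, hence non-negative.
expected = sum(p[x] * sli(x) for x in p)
print(sli((1, 1)) < 0, expected >= 0)
```

This is the content of Gibbs' inequality used in the proof: single patterns may be negatively integrated, but the average over all patterns cannot be.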
3.5. Disintegration Interpretation
In
Section 3.2 we motivated our choice of positive complete local integration as a criterion for entities. This motivation is purely heuristic and starts from the intuition that an entity is a structure for which every part makes every other part more probable. While this heuristic argument seems sufficiently intuitive to be of a certain value we would much rather have a formal reason why an entity criterion is a “good” entity criterion. In other words we would ideally have a formal problem that is best solved by the entities satisfying the criterion. An example of a measure that has such an associated interpretation is the mutual information whose maximum over the input distributions is the channel capacity. Without a formal problem associated to
ι-entities there remains a risk that they (and maybe the whole concept of entities and identity over time) are artefacts of an ill-conceived conceptual approach.
Currently, we are not aware of an analogous formal problem that is solved by ι-entities. However, the different viewpoint provided by the disintegration theorem may be a first step towards finding such a problem. We will now discuss some alternative interpretations of SLI and see how CLI can be seen from a different perspective due to the disintegration theorem. These interpretations also exhibit why we chose to include the logarithm in the definition of SLI.
Using the disintegration theorem (Theorem 6) allows us to take another point of view on ι-entities. The theorem states that for each trajectory of a multivariate Markov chain the refinement-free disintegration hierarchy only contains partitions whose non-singleton blocks are completely integrated patterns, i.e., they only contain ι-entities. At the same time the blocks of all those partitions together are all ι-entities that occur in that trajectory.
A partition in the refinement-free disintegration hierarchy is always a minimal/finest partition reaching such a low specific local integration.
Each ι-entity is then a block of such a partition for some trajectory of the multivariate Markov chain.
Let us recall the interpretation from coding theory above. If we want to find the optimal encoding for the entire multivariate Markov chain this means finding the optimal encoding for the random variable whose values are the trajectories. The optimal code has the codeword lengths $-\log p(x_V)$ for each trajectory $x_V$. The partitions in the lowest level of the refinement-free disintegration hierarchy for a trajectory have minimal specific local integration, i.e., their SLI is minimal among all partitions. At the same time these partitions are the finest partitions that achieve this low specific local integration. This implies on the one hand that the codeword lengths of the product codes associated to these partitions are the shortest possible for this trajectory among all partitions. On the other hand these partitions split up the trajectory into as many parts as possible while generating these shortest codewords. In this combined sense the partitions in the lowest level generate the "best" product codes for the particular trajectory.
Note that the expected codeword length of the product code, which is the more important measure for encoding in general, might not be short at all, i.e., the product code might not be an efficient code for arbitrary trajectories. The product codes based on partitions in the lowest refinement-free disintegration level are specifically adapted to assign a short codeword to the particular trajectory, i.e., to a single trajectory or story of this system. As product codes they are constructed/forced to describe the trajectory as a composition of stochastically independent parts. More precisely, they are constructed in the way that would be optimal if the parts were stochastically independent.
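The decomposition underlying this argument can be stated in one line: with base-2 logarithms, the product codeword length for a trajectory equals its optimal codeword length plus its SLI in bits. The numbers below are assumed toy values.

```python
import math

# Product-code length vs SLI (assumed toy values): for any partition, the
# product codeword length of a trajectory x_V is
#   sum_b -log2 p(x_b) = -log2 p(x_V) + SLI(x_V) in bits,
# so minimising SLI over partitions minimises this length.
p_whole = 0.4                  # p(x_V) of the favoured trajectory
p_blocks = [0.5, 0.45]         # block marginals of some partition
product_length = sum(-math.log2(pb) for pb in p_blocks)
sli_bits = math.log2(p_whole) - sum(math.log2(pb) for pb in p_blocks)
print(abs(product_length - (-math.log2(p_whole) + sli_bits)) < 1e-12)  # True
```

The identity also shows why negative SLI would be desirable for compression in general, while the minimising partitions discussed here merely make the product code as short as possible for the single favoured trajectory.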
Nonetheless, the product codes exist (they can be generated using Huffman coding or arithmetic coding [37] based on the product probability) and are uniquely decodable. Their parts/blocks are the ι-entities. We mentioned before that we would like to find a problem that is solved by ι-entities. This is then equivalent to finding a problem that is solved by the corresponding product codes. Can we construct such a problem? This question is still open. A possible direction for finding such a problem may be the following line of reasoning. Say for some reason one trajectory is more important than any other and we want to "tell its story" as a story of as many (stochastically) independent parts as possible (parts that are maybe not really stochastically independent), i.e., say we wanted to encode the trajectory as if it were a combination of as many stochastically independent parts/events as possible. And because this trajectory is more important than all other trajectories we wanted its codeword to be the shortest possible. Then we would use the product codes of partitions in the refinement-free disintegration hierarchy because those combine exactly these two conditions. The pseudo-stochastically-independent parts would then be the blocks of these partitions, which according to the disintegration theorem are exactly the ι-entities occurring in the trajectory.
Speculating about where the two conditions may arise in an actual problem, we mention that the trajectory/history that we (real living humans) live in is more important to us than all other possible trajectories of our universe (if there are any). What happens/happened in this trajectory needs to be communicated more often than what happens/happened in counterfactual trajectories. Furthermore, a good reason to think of a system as composed of as many parts as possible is that this reduces the number of parameters that need to be learned, which in turn improves the learning speed (see e.g., [41]). So the entities that mankind has partitioned its history into might be related to the ι-entities of the universe's history. These would compose the shortest product codes for what actually happened. The disintegration level might be chosen to optimise rates of model learning.
Recall that this kind of product code is not the optimal code in general (which would be the one with shortest expected codeword length). It is possibly more of a naive code that does not require deep understanding of the dynamical system but instead can be learned fast and works. The language of physics for example might be more optimal in the sense of shortest expected codeword lengths reflecting a desire to communicate efficiently about all counterfactual possibilities as well.
3.6. Related Approaches
We now discuss in some more detail than in
Section 1.3 the approaches of Beer [
14] and Balduzzi [
26].
In Beer [
14] the construction of the entities proceeds roughly as follows. First the maps from the Moore neighbourhood to the next state of a cell are classified into five classes of local processes. Then these are used to reveal the dynamical structure in the transitions from one time-slice (or temporal part) of a pattern to the next. The example patterns used are the famous block, blinker, and glider, and they are considered together with their temporal extension. Using both the processes and the spatial patterns/values/components (the black and white values of cells are called components), networks characterising the organisation of the spatiotemporally extended patterns are constructed. These can then be investigated for their organisational closure. Organisational closure occurs if the same process-component relations reoccur at a later time. Boundaries of the spatiotemporal patterns are identified by determining the cells around the pattern that have to be fixed to get re-occurrence of the organisation.
Beer [
14] mentions that the current version of this method of identifying entities has its limitations. If the closure is perturbed or delayed and then recovered, the entity still loses its identity according to this definition. Two possible alternatives are also suggested. The first is to define the potential for closure as sufficient for the ascription of identity. This is questioned as well, since a sequence of perturbations can take the entity further and further away from its "defining" organisation and make it hard to still speak of a defining organisation at all. The second alternative is to define that the persistence of any organisational closure indicates identity. It is suggested that this would allow blinkers to transform into gliders.
We note that the entity criterion we propose does not require similar choices to be made since it is not based on the re-occurrence of any organisation. Later time-slices of ι-entities need no organisational (or any other) similarity to earlier ones. Another, possibly only small, advantage is that our criterion is formalised and reasonably simple to state. Whether this is possible for the organisational-closure-based entities remains to be seen.
This is related to the philosophical discussion about identity across possible worlds [
33].
Some further parallels can be drawn between the present work and Balduzzi [
26], especially if we take into account the disintegration theorem. Given a trajectory (entire time-evolution) of the system, in both cases a partition is sought which fulfills a particular trajectory-wide optimality criterion. Also in both cases, each block of the trajectory-wide partition fulfills a condition with respect to its own partitions. For our conditions the disintegration theorem exposes the direct connection between the trajectory-wide and the block-specific conditions. Such a connection is not known for other approaches. The main reason for this might be the simpler formal expression of CLI and SLI compared to the IIT approaches.
To what extent our approach and the IIT approaches lead to coinciding or contradicting results is beyond the scope of this paper and constitutes future work. One avenue to pursue here is the differences with respect to entities occurring in multiple trajectories as well as the possibility of overlapping entities within single trajectories.
5. Discussion
In
Section 3.1 we have argued for the use of patterns as candidates for entities. Patterns can be composed of arbitrary spatially and temporally extended parts of trajectories. We have seen in Theorem 1 that they are distinct from arbitrary subsets of trajectories. The important insight here is that patterns are structures that occur within trajectories but this cannot be said of sets of trajectories.
One of the main target applications of patterns is in time-unrolled Bayesian networks of cellular automata like those in
Figure 1. Patterns in such Bayesian networks become spatiotemporal patterns like those used to describe the glider, block, and blinker in the Game of Life cellular automaton by Beer [
14]. We would also like to investigate whether the latter spatiotemporal patterns are
ι-entities. However, at the present state of the computational models and without approximations, this was out of reach computationally. We will discuss this further below.
In Section 3.3 we defined SLI and in Section 3.3.1 gave its expression for deterministic Bayesian networks (including cellular automata) as well. We also established the least upper bound of SLI with respect to a partition π of cardinality n for a pattern with probability q. This upper bound is achieved if each of the blocks in the partition occurs if and only if the whole pattern occurs. This is compatible with our interpretation of entities since in this case clearly the occurrence of any part of the pattern leads necessarily to the occurrence of the entire pattern (and not only vice versa).
We also presented a candidate for a greatest lower bound of SLI with respect to a partition of cardinality n for a pattern with probability q. Whether or not this is the greatest lower bound, it exhibits a case for which SLI is always negative. This happens if either the whole pattern occurs (with probability q) or one of the "almost equal" patterns occurs, each with identical probability. A pattern is almost equal to the original pattern with respect to the partition in this sense if it differs at exactly one of the blocks. This construction makes as many parts as possible (i.e., all but one) occur as many times as possible without the whole pattern occurring. This creates large marginalised probabilities for each part/block, which means that their product probability also becomes large.
Beyond these quantitative observations, an interpretation of the greatest-lower-bound candidate seems difficult. A more intuitive candidate for the opposite of an integrated pattern would be a pattern with independent parts, i.e., zero SLI, but quantitatively such patterns are not at the opposite end of the SLI spectrum. A more satisfying interpretation of the presented candidate is still to be found.
We also proved the disintegration theorem, which states that the refinement-free partitions of a trajectory among those partitions achieving a particular SLI value consist of ι-entities (and singletons) only, where an ι-entity is a pattern with positive CLI. This theorem allows us to interpret the ι-entities in new ways and may lead to a more formal or quantitative justification of ι-entities. It is already a first step in this direction since it establishes a special role of the ι-entities within the trajectories they occur in. A further justification would tell us what in turn the refinement-free partitions can be used for. We have discussed a possible direction for further investigation in detail in Section 3.5, which tried to connect the ι-entities with a coding problem.
In Section 4 we investigated SLI and CLI in three simple example sets of random variables. We found that if the random variables are all independently distributed the corresponding entities are just all the possible values of each of the single random variables. This is what we would expect from an entity criterion: there are no entities with any further extent than a single random variable, and each value corresponds to a different entity.
For the simple Markov chain composed of two independent and constant processes we presented the entire disintegration hierarchy and the Hasse diagrams of each disintegration level ordered by refinement. The Hasse diagrams reflected the highly symmetric dynamics of the Markov chain via multiple identical components. For the refinement-free disintegration hierarchy we then get multiple partitions at the same disintegration level as well. Different partitions of the trajectory imply overlapping blocks, which in the case of the refinement-free partitions are ι-entities. So in general the ι-entities at a particular disintegration level are not necessarily unique and can overlap. We also saw in Figure 11 that the same ι-entities can occur on multiple disintegration levels.
The ι-entities of this Markov chain included the expected three-timestep constant patterns within each of the two independent processes. They also included the two-timestep parts of these constant patterns. This may be less expected. It shows that parts of ι-entities can be ι-entities themselves. We note that these "sub-entities" (those that are parts of larger entities) are always on a different disintegration level than their "super-entities" (the larger entities). We can speculate that the existence of such sub- and super-entities on different disintegration levels may find an interpretation through multicellular organisms or similar structures. However, the examples here only serve as basic models for the potential phenomena and are still far too simplistic to warrant any concrete interpretation in this direction.
We also looked at a version of this Markov chain perturbed by noise. We found that the entities of the unperturbed chain remain the most strongly integrated entities in the perturbed one. At the same time new entities occur. So we observe that in the perturbed chain the entities vary from one trajectory to another (Figure 18, Figure 19 and Figure 20). We also observe spatially extended entities, i.e., entities that extend across both (formerly independent) processes. We also observe entities that switch from one process to the other (from top row to bottom row or vice versa). The capacity of entities to exhibit this behaviour may be necessary to capture the movement or metabolism of entities in more realistic scenarios. In Biehl et al. [8] we argued that these properties are important and showed that they hold for a crude approximation of CLI (namely for SLI with respect to a particular partition) but not for the full CLI measure.
We established that the ι-entities:
correspond to fixed single random variables for a set of independent random variables,
can vary from one trajectory to another,
can change the degrees of freedom that they occupy over time,
can be ambiguous at a fixed level of disintegration due to symmetries of the system,
can overlap at the same level of disintegration due to this ambiguity,
can overlap across multiple levels of disintegration, i.e., parts of ι-entities can be ι-entities again.
In general the examples we investigated concretely are too small to sufficiently support the concept of positive CLI as an entity criterion. Due to the extreme computational burden, this may remain the case for a while. For a straightforward calculation of the minimum SLI of a trajectory of a Bayesian network with $|V|$ nodes we have to calculate the SLI with respect to all $B_{|V|}$ partitions of V. According to (de Bruijn [43], p. 108) the Bell numbers $B_n$ grow super-exponentially. Furthermore, to evaluate the SLI we need the joint probability distribution of the Bayesian network. Naively, this means we need the probability (a real number between 0 and 1) of each trajectory. If we only have binary random variables, the number of trajectories is $2^{|V|}$, which makes the straightforward computation of disintegration hierarchies unrealistic even for quite small systems. If we take a seven by seven grid of the Game of Life cellular automaton and want to look at three time-steps we have $|V| = 7 \times 7 \times 3 = 147$. If we use 32-bit floating-point numbers this gives us around $10^{30}$ petabytes of storage needed for this probability distribution. We are sceptical that the exact evaluation of reasonably large systems can be achieved even with non-naive methods. This suggests that formal proofs may be the more promising way to investigate SLI and CLI further.
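To illustrate these growth rates, a short sketch computes the first Bell numbers via the Bell triangle and the naive storage estimate for the 147-variable example; the sizes are computed from the stated parameters, not measured.

```python
# Bell numbers via the Bell triangle, plus the naive storage estimate for
# the joint distribution of a 7x7 binary grid over 3 time steps.
def bell(n):
    """n-th Bell number B_n: the number of partitions of an n-element set."""
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]          # each row starts with the previous row's last entry
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[-1]

print([bell(n) for n in range(1, 8)])  # [1, 2, 5, 15, 52, 203, 877]

n_vars = 7 * 7 * 3                 # 147 binary random variables
n_trajs = 2 ** n_vars              # number of trajectories
petabytes = n_trajs * 4 / 2 ** 50  # 4 bytes per 32-bit float
print(f"{petabytes:.1e}")          # about 6.3e+29 petabytes
```

Even at seven nodes the Bell number is already 877, and it exceeds $10^{115}$ at $n = 147$, which is why the minimisation over all partitions is the second, independent obstacle besides storage.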