## Motivation

The main idea behind this article is the following:

How do we quantify the maximum amount of information we can get out of incomplete observations, without assuming anything about the underlying dynamical system that produced them?

*This is a very basic tutorial where visualisations serve as replacements for the underlying mathematical concepts. Interested readers are referred to, for example, the works of Gary Froyland*.

*Note: the interactive visualisations in this article may take some time to load. Please be patient!*

## Reconstructing dynamics from observations

Imagine you are given a time series consisting of \(n\) observations. It may look like this:

This time series is a record of some physical phenomenon, measured at a regular time interval \(\Delta t\). You ask the question: what can I say about the long-term probabilities of my system being in any state represented by the range of values of my observations? This is not an easy question.

### Dynamical systems approach

If a time series was produced by some system that is at least partly deterministic, it is possible to reconstruct a “shadow” view of the true dynamics from that time series. This is done by representing each observation as a point in m-dimensional space, where values along all coordinate axis are the same time series data, but lagged relative to each other with some lag \(\tau\).

This “shadow” view of the dynamics is called an embedding. For the time series above, the embedding could look like this:

### How do points in the embedding relate to the time series?

The answer to this question is given by Takens’s theorem.

Roughly speaking, instead of viewing each point by itself, we’re now considering the point *along with two of its neighbouring points in the time series*. For any generic embedding, we would write the coordinates of the embedding points as \(p_t = (x(t), x(t-\tau), x(t - 2\tau))\). In our case, \(\tau = 1\), so we write \(p_t = (x(t), x(t-1), x(t-2))\) with \(t\) running from \(1\) to \(n-2\). The information about the two last points is lost when shifting the vectors relative to each other.

Convince yourself that each point in this three-dimensional space is actually represented by the current value, the next value and the next-next value in the original time series (Figure 1). You can do this by hovering over points in Figure 1 and Figure 2, comparing coordinates between the plots.

### Why bother with the embedding?

A single point in the time series carries information only about itself. Considering points individually would be natural if we knew they were the result of some completely random process, where any value in the time series was independent of all the previous values.

Should the observations be the result of some process that is at least partly deterministic, a reasonable expectation would be that values of the time series are not independent of preceding values. Equivalently, states of the system are influenced by preceding states of the system. This is the motivation for constructing the embedding. It is more useful (yields more information) to regard multidimensional *states* consisting of information about several past values, rather than single values of the time series.

## From individual observations to collections of states

With the concept of an embedding in place, we can start visualising how our system would evolve through time. Instead of following single values of the time series, we view a collection of states consisting of *current and historical values* of the system evolving with time. For this reason, we call the space in which we have embedded our time series the *state space* (or phase space for continous systems).

Our system was sampled at regular, but finitely small time intervals. The information *between* the points is then lost, and we need some way to figure out how our system most likely moved between those points.

In the plot above, points are connected by straight lines. This is a case of linear interpolation, which is commonly used to represent missing values in data sets. For some systems, this may be a reasonable approximation to reality. But most system don’t evolve in straight lines (between any two instances of time arbitrarily far apart, the relationship between past and present values of the time series is not constant).

For anything but the simplest phenomena, we expect there to have been much more variability between states. Look at Figure 1, for example. There is no obvious pattern to the observations, so how could we justify linearly interpolating between the points?

For our system, any line connecting two points in the state spaces would likely be squiggly, wandering around and creating irregular lines in the state space. If we had taken our measurement at different times, and with a different time interval \(\Delta t\), the state space would look very different.

This is where it all comes together. We now ask the following question:

**Can we estimate the long-term probability of our system being in any state, given the finite number of observations of our system? **

The answer is yes.

### Naive approach

To get an estimate of probability of your system being in a given state \(S_x\), you could start counting how many points are similar to \(S_x\). With enough observations, this would be a reliable way to estimate the long-term probability of your system being in state \(S_x\). For a long enough time series, we would apply this procedure for all states \(S_i\) to obtain a probability distribution over the state space.

However, in real applications and experiments, data often consist of incomplete observations sampled over finite time. In our case, we only have \(n = 28\) observations. Estimating state probabilities based on counting would in this case not be reliable, but strongly biased towards the observations that *are actually in the data* (sampling bias). So how do we obtain a more reliable estimate of long-term probabilities given finite, sparsely sampled data?

To answer this, we need to address what determinism actually means.

## A note on determinism and governing equations

If our observations were the result of some deterministic process, then these underlying processes can (in principle) be described by a differential equations. These equations relate all dynamical variables possibly influencing our system. Our observations can be considered the output of these equations/processes.

This is what we mean when we say the embedding is a “shadow view” of the true dynamics. The observations do not reveal exactly what the underlying system is like, but they are directly related to its structure, because they are the output of the underlying system. In general, we may not be able to recreate the governing equations, but we may study the properties of the system using its geometry inferred from the observations.

Takens’ theorem guarantees us that the embedding formed by our single time series is a faithful representation of these equations (whatever they actually are). This means that the information obtained by sampling only a single coordinate axis of the total state space, which may be infinitely-dimensional, can be used to reconstruct the underlying dynamics.

The core of the following sections is to perform a finite approximation of the governing equations based on the dynamical information inherent in our observations. The resulting approximation comes in the form of a Markov matrix acting on separate chunks of the state space, rather than a set of equations acting on single points of the state space.

With this in mind, we can return to the problem of estimating probabilities of states in our embedding.

## How to estimate long-term probabilities of states based on data?

To answer this question, we need to do a mathematical trick. Because our data are finite, we need to discretize our state space. This means dividing it into non-overlapping chunks that we can consider separately.

Instead of considering only *where points are located* in the state space, we use the information inherent in our sampled states to see how volumes of the state space are *moved around and distorted* by the action of the underlying governing equations. In other words: how are collections of states (given by defined regions in the state space) moved around as time goes by?

The obvious thing to do would be to divide the state space into equally sized boxes. However, it is more convenient to divide the space into simplices^{1}. As the name suggests, the *simplex* is the simplest convex volume. In our 3-dimensional case, the simplices are tetrahedra, but the concept generalises to any dimension. We let the sizes and shapes of the simplices be dictated by our data by performing a Delaunay triangulation of the points in our state space:

The triangulated state space now forms a set set of disjoint simplices, and the convex hull of all these simplices forms the set of possible states our system can visit (as far as we know, given our observations).

## Approximating the underlying governing equations (transfer operator)

We can, roughly speaking, approximate the action of the underlying map (governing equations giving rise to the processes recorded by our time series) by projecting the vertices of each simplex one time step forward. This operation maps simplices into simplices, but changes their shape and volume.

How simplices are deformed and translated into their images (forward projections) can be regarded as a recording of how the underlying map acts on specific volumes of the state space.

Consider the left panel of the visualisation below. Here, each point in the inital embedding is color coded according to its time index, and every simplex in the triangulation is assigned a color.

The right panel also shows the original embedding points, but overlain by the *images* of the simplices of the triangulation. Notice how, in the right panel, the embedding points are surrounded by different simplices. Projected simplices have different volumes and shapes than the simplices they were projected from.

By tracking how simplices are distorted and moved around by the map, we can estimate transition probabilities between regions of the state space. We do this formally below.

Notice how the state corresponding to the first time index does not have any tetrahedron associated with it. This is because no state is mapped to it when shifting the time indices one step forward.

## Transition probabilities

The reason for discretizing the state space using simplices and not a regular grid of rectangular boxes is that, in general, it is not possible to construct a rectangular grid whose vertices are the original points of the embedding. In that case, we would not have any information about how the vertices of the boxes are moved around by the governing equations.

Computing the intersecting volume between two simplices is computationally demanding, but straightforward and well-documented in the literature. We have made a Julia library for computing exact simplex intersections in N dimensions, (Simplices.jl).

From the triangulation and its forward projection, we can represent the underlying map that produced the observations by a row-stochastic Markov matrix \(M\). The entries of \(M\) can be interpreted as follows: The entry (i, j) is the probability for the system to jump from a state contained in simplex \(S_i\) to a state contained in simplex \(S_j\).

The following visualisation demonstrates how \(M\) is estimated for the triangulation shown above.

Before we estimated \(M\), we ensured that the embedding formed an invariant set under the forward map. In practice, this involves checking that the point in the embedding corresponding to the last time index does not fall outside the convex hull of the preceding points. If that was the case, image simplices could fall outside the convex hull of the original simplices. In terms of the Markov matrix, this means that we would be losing information (not all rows would sum to 1).

## An ergodic probability density over the state space

We can now obtain a probability distribution on the simplices forming our discretized state space by repeated application of \(M\) to any initial distribution. The resulting distribution represents the long term visitation frequency of trajectories in each simplex (left panel below). The probability density is is invariant in the sense that it is the left eigenvector of \(M\) with eigenvalue 1. In that sense, \(M\) is stationary.

In the right panel above, we have simulated trajectories of the underlying map by introducing points in the different regions of the state space according to the probability that trajectories will visit that region in the long term. Empty simplices have zero probability of visitation, while simplices containing points have non-zero probabilities of visitations.

## Applications

Generalising to systems of multiple variables, we use the concepts introduced above in a forthcoming paper on estimating transfer entropy. Transfer entropy is an asymmetric measure of directed information flow between time series, and is typically estimated using counting of nearest neighbours (the “naive” approach discussed above). Nearest-neighbour estimation of probabilities is less computationally demanding than the simplex-based estimation, and should be used when computing speed is critical or when long, data-rich time series are available. However, our goal is to improve estimation accuracy for short and noisy time series, which are common across many disciplines.

We have developed Julia packages for computing an invariant distribution using the transfer operator, and for estimating Transfer Operator Transfer Entropy (TOTE). An R package of wrapper functions is forthcoming.

Stay tuned for a more detailed description of the TOTE method for estimating transfer entropy.

## Footnotes

For a more thorough description of the simplex approach, consult

*Froyland, Gary. “Computing physical invariant measures.” Int. Symp. Nonlinear Theory and its Applications, Japan, Research Society of Nonlinear Theory and its Applications (IEICE). Vol. 2. 1997.*.↩