## Excess entropy

**TODO:** This is just me jotting down notes as I'm writing a paper. You can safely disregard this for now. Nothing of value will be learned from reading this.
These ramblings are mostly on the paper "Structural information in
two-dimensional patterns: Entropy convergence and excess entropy" by Feldman
and Crutchfield.

Entropy is another word for information, which is another word for 'surprise', which is another word for probabilities. All measures related to Shannon entropy require, at their core, a probability distribution for the system under study. Either you have (or assume) a known distribution, such as a Boltzmann distribution, or, if you can't justify one, you must estimate it, usually by observing the system for some time and keeping track of how often you see each symbol.

Let's say we have a system, X, that generates the following: 010101010101010101010101010101010101010101010101. That is, it generates 2 symbols, 0 and 1, in a periodic fashion. How surprised are you when you predict that the next symbol is going to be 0? Not very surprised? I guess that's fair. How surprised are you though? Well, if you only look at how often each symbol occurs and ignore the order, you should be about -P(0) * log2(P(0)) - P(1) * log2(P(1)) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1 bit surprised. Said differently, you are about 1 bit uncertain about the system. This makes sense: you only need 1 bit of information (which phase of the cycle you are in) to gain full predictive power of the system. The system has 1 bit of memory.
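
To check the arithmetic above, here is a minimal sketch that estimates the single-symbol probabilities by counting and plugs them into the Shannon formula. The function name `symbol_entropy` is just something I made up for this note.

```python
from collections import Counter
from math import log2


def symbol_entropy(sequence):
    """Shannon entropy (in bits) of the empirical single-symbol distribution."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * log2(n / total) for n in counts.values())


# A periodic 0/1 series like the one in the text.
series = "01" * 24
print(symbol_entropy(series))  # 1.0
```

Note that this only counts symbol frequencies, so it cannot tell the periodic series apart from fair coin flips. That gap is what the block-based measures further down are for.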

Excess entropy is built from differences between entropy rates: it adds up how much the entropy rate estimated from finite blocks exceeds the true entropy rate. An entropy rate is entropy per symbol, also called entropy density. To get it, you first measure the entropy of a block. This is called the block entropy, and it's well defined for 1D blocks of length L:

H(L) = - sum_{s^L} Pr(s^L) * log2 Pr(s^L)

where the sum runs over all blocks s^L of L consecutive symbols.
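
A minimal sketch of how H(L) could be estimated from an observed series, assuming we simply count overlapping length-L windows and treat the relative frequencies as Pr(s^L). The name `block_entropy` and the sliding-window choice are my own assumptions, not something taken from [1].

```python
from collections import Counter
from math import log2


def block_entropy(sequence, L):
    """Estimate H(L) in bits from overlapping length-L windows of a sequence."""
    if L == 0:
        return 0.0  # the empty block carries no information
    blocks = [sequence[i:i + L] for i in range(len(sequence) - L + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((n / total) * log2(n / total) for n in counts.values())


series = "01" * 500
for L in range(1, 5):
    print(L, block_entropy(series, L))
```

For the periodic series, H(L) stays close to 1 bit no matter how large L gets: there are only ever two distinct blocks of any length, one starting with 0 and one starting with 1.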

What happens if we let L grow large? H(L) approaches the entropy of the whole series. But if we divide the block entropy by the block length, L, we get a density measurement:

h_mu = lim L->inf H(L)/L

This is also called the entropy rate, Shannon entropy rate, etc. It turns out
we can do something clever with this: we can measure the density by looking at
how much entropy is present at a site, conditioned on the neighbor sites. This
does seem to make sense, since we are in fact interested in a sort of localized measurement.
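
To make the two views concrete, here is a rough sketch comparing H(L)/L with the conditional form H(L) - H(L-1), which is the entropy of one site given the L-1 sites before it. It also sums the excess of each finite-L estimate over the limiting rate, which is the entropy-convergence form of excess entropy as I understand it from [1]. The names `entropy_gain` and `excess_entropy` are mine, and using the last estimate as a stand-in for the true h_mu is an assumption that only works when the estimates have converged by `L_max`.

```python
from collections import Counter
from math import log2


def block_entropy(sequence, L):
    """Estimate H(L) in bits from overlapping length-L windows."""
    if L == 0:
        return 0.0
    blocks = [sequence[i:i + L] for i in range(len(sequence) - L + 1)]
    total = len(blocks)
    return -sum((n / total) * log2(n / total) for n in Counter(blocks).values())


def entropy_gain(sequence, L):
    """h_mu(L) = H(L) - H(L-1): entropy of a site given the L-1 previous sites."""
    return block_entropy(sequence, L) - block_entropy(sequence, L - 1)


def excess_entropy(sequence, L_max):
    """Sum of h_mu(L) - h_mu for L = 1..L_max, with h_mu approximated by h_mu(L_max)."""
    h_mu = entropy_gain(sequence, L_max)  # assumption: converged by L_max
    return sum(entropy_gain(sequence, L) - h_mu for L in range(1, L_max + 1))


series = "01" * 500
for L in range(1, 6):
    print(L, block_entropy(series, L) / L, entropy_gain(series, L))
print("excess entropy estimate:", excess_entropy(series, 6))
```

For the periodic series, H(L)/L creeps towards 0 only like 1/L, while the conditional estimate drops to roughly 0 already at L = 2. The excess entropy estimate comes out at about 1 bit, matching the '1 bit of memory' from earlier.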

According to [1], it can be shown that the excess entropy is equivalent to
the mutual information between a length L block of 'future' variables and a
length L block of 'past' variables, in the limit of large L. Lizier noted in a Twitter thread:

> Lizier: If you meant 2D excess entropy + time, that's easier - can be done simply as MI between previous and next values of extended spatial templates (e.g. 2x2 blocks) 4/5
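
For the 1D case, the mutual-information view can be sketched as follows: slide a window of length 2L over the series, call the first half the 'past' and the second half the 'future', and compute I = H(past) + H(future) - H(past, future). The function names are hypothetical, and this is only the 1D analogue, not the 2D-plus-time template scheme from the thread.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


def past_future_mi(sequence, L):
    """Mutual information (bits) between adjacent length-L past and future blocks."""
    pairs = [(sequence[i:i + L], sequence[i + L:i + 2 * L])
             for i in range(len(sequence) - 2 * L + 1)]
    h_past = entropy(Counter(past for past, _ in pairs))
    h_future = entropy(Counter(future for _, future in pairs))
    h_joint = entropy(Counter(pairs))
    return h_past + h_future - h_joint


series = "01" * 500
for L in range(1, 4):
    print(L, past_future_mi(series, L))
```

For the periodic series this gives roughly 1 bit for every L: knowing the past block tells you exactly which of the two possible future blocks follows.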

### 2D Ising models

Feldman and Crutchfield measured 2D excess entropy using both the mutual-information form and the entropy-convergence form, and varied the coupling strength J between the spins.
I plan on doing something similar: I will measure the 'total' structural entropy in a given probability distribution while varying the temperature and the coupling, over a large range of temperatures and a number of random initializations.
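
To have something to measure on, I need spin configurations. Below is a minimal sketch of how they could be generated, assuming a nearest-neighbor Ising model on a square lattice with periodic boundaries, k_B = 1, and single-spin-flip Metropolis updates. Metropolis is my own choice of sampler here, not necessarily what [1] used, and `metropolis_sweep` and `sample_lattice` are hypothetical names.

```python
import math
import random


def metropolis_sweep(spins, J, T, rng):
    """One Metropolis sweep (n*n attempted flips) on an n x n periodic lattice."""
    n = len(spins)
    for _ in range(n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        neighbors = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j]
                     + spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
        # Energy change of flipping spin (i, j) for E = -J * sum over neighbor pairs.
        delta_e = 2.0 * J * spins[i][j] * neighbors
        if delta_e <= 0 or rng.random() < math.exp(-delta_e / T):
            spins[i][j] = -spins[i][j]


def sample_lattice(n, J, T, sweeps, seed):
    """Random initialization followed by a fixed number of Metropolis sweeps."""
    rng = random.Random(seed)
    spins = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
    for _ in range(sweeps):
        metropolis_sweep(spins, J, T, rng)
    return spins


# One lattice per temperature; different seeds would give the random initializations.
lattices = {T: sample_lattice(16, 1.0, T, 50, seed=0) for T in (1.5, 2.3, 3.5)}
```

The lattice size, sweep count, and temperatures are placeholders. A real run would need enough sweeps to equilibrate, especially near the critical temperature.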

### Practical issues

Generally speaking, all information measures boil down to the probability of observing a certain pattern. This pattern can have a size, in which case the entropy is called 'block entropy', at least if the block is in 1D. 1D makes conditional probabilities easy to deal with, and this is essentially the 'problem' in defining entropy in 2D. The entropy rate can be expressed as H[S_L | S_{L-1}, S_{L-2}, ..., S_{L-N}], that is, S_L conditioned on the adjacent spins. But what are the adjacent spins in 2D?
It turns out you can condition the Shannon entropy of a site on a specific pattern of neighboring sites.
Anyway, it all boils down to cooking up the probabilities of seeing patterns: p(s_L). L can be the block length in 1D or the block dimensions in 2D.
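
A minimal sketch of 'cooking up the probabilities' in 2D, assuming a rectangular template that is slid over every site of the lattice with periodic boundaries. The name `pattern_entropy` is hypothetical, and the rectangular template is just the simplest choice. Other template shapes would work the same way.

```python
from collections import Counter
from math import log2


def pattern_entropy(lattice, height, width):
    """Entropy (bits) of height x width patterns on a lattice with periodic boundaries."""
    rows, cols = len(lattice), len(lattice[0])
    counts = Counter(
        tuple(lattice[(r + dr) % rows][(c + dc) % cols]
              for dr in range(height) for dc in range(width))
        for r in range(rows) for c in range(cols)
    )
    total = rows * cols
    return -sum((n / total) * log2(n / total) for n in counts.values())


# A checkerboard is the 2D cousin of the 0101... series.
checkerboard = [[(r + c) % 2 for c in range(8)] for r in range(8)]
print(pattern_entropy(checkerboard, 1, 1))  # 1.0
print(pattern_entropy(checkerboard, 2, 2))  # 1.0
```

Just like in 1D, the entropy of the checkerboard stays at 1 bit when the template grows, because only two distinct patterns of any size ever occur.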

### References

[1] "Structural information in two-dimensional patterns: Entropy convergence and excess entropy" by Feldman and Crutchfield.