Entropy is another word for information, which is another word for 'surprise', which is another word for probabilities. All measurements related to Shannon entropy and similar quantities require, at their core, that you either have (assume) a known distribution of probabilities for your system under study (a Boltzmann distribution, say), or, if you can't justify such a distribution, that you estimate it, usually by observing the system for some time and keeping track of how often you see each symbol.
Let's say we have a system, X, that generates the following: 010101010101010101010101010101010101010101010101. That is, it generates two symbols, 0 and 1, in a periodic fashion. How surprised are you when you predict that the next symbol is going to be 0? Not very surprised? That's fair. But how surprised are you, exactly? Well, you should be about -P(0) * log2(P(0)) + -P(1) * log2(P(1)) = -0.5 * log2(0.5) + -0.5 * log2(0.5) = 1. 1 bit surprised. Or said differently, you are about 1 bit uncertain about the system. This makes sense: you only need 1 bit of information to gain full predictive power over the system. The system has 1 bit of memory.
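To make that concrete, here's a minimal sketch in Python of estimating that per-symbol surprise from observed counts (the helper name symbol_entropy is just something I made up):

```python
from collections import Counter
from math import log2

def symbol_entropy(seq):
    """Shannon entropy, in bits, of the empirical single-symbol distribution."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

x = "01" * 24  # the periodic system above
print(symbol_entropy(x))  # 1.0: 0 and 1 each occur half the time
```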
Excess entropy is a difference between entropy rates. An entropy rate is entropy per block size, also called an entropy density. Basically, you measure the entropy of a block; this is called the block entropy, and it's pretty well defined for 1D blocks of length L:
H(L) = - sum_{s^L} Pr(s^L) * log2 Pr(s^L)

where the sum runs over all blocks s^L of length L.
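As a sketch of how you'd estimate this from the toy series above (block_entropy is again an illustrative name, and the counting is a plain plug-in estimate over overlapping windows):

```python
from collections import Counter
from math import log2

def block_entropy(seq, L):
    """Estimate H(L): Shannon entropy, in bits, of the empirical
    distribution over (overlapping) length-L blocks of seq."""
    blocks = [seq[i:i + L] for i in range(len(seq) - L + 1)]
    n = len(blocks)
    return -sum((c / n) * log2(c / n) for c in Counter(blocks).values())

x = "01" * 24
for L in range(1, 6):
    print(L, round(block_entropy(x, L), 3))
# For this periodic series H(L) stays at roughly 1 bit for every L:
# the only blocks that ever occur are 0101... and 1010...
```

With only 48 symbols the estimate gets unreliable for large L, so don't read too much into long blocks here.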
What happens if we now let L grow large? Increasingly we get the entropy of the whole series. But if we divide the block entropy by the block length L, we get a density measurement:
h_mu = lim_{L->inf} H(L)/L
This is also called the entropy rate, the Shannon entropy rate, etc. It turns out we can do something clever with this: we can measure the density by looking at how much entropy is present at a site, conditioned on its neighboring sites. This makes sense, since we are in fact interested in a sort of localized measurement.
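That conditional view matches the standard fact that, for a stationary process, the entropy rate can also be written as lim_{L->inf} [H(L) - H(L-1)], i.e. the entropy of one more site given its L-1 neighbors. A rough sketch of both estimators on the toy series (block_entropy redefined here so the snippet stands on its own):

```python
from collections import Counter
from math import log2

def block_entropy(seq, L):
    """Shannon entropy, in bits, of the empirical distribution over
    overlapping length-L blocks of seq."""
    blocks = [seq[i:i + L] for i in range(len(seq) - L + 1)]
    n = len(blocks)
    return -sum((c / n) * log2(c / n) for c in Counter(blocks).values())

x = "01" * 24
L = 5
h_density = block_entropy(x, L) / L                            # H(L)/L
h_conditional = block_entropy(x, L) - block_entropy(x, L - 1)  # H(L) - H(L-1)
print(round(h_density, 3), round(h_conditional, 3))
# H(L)/L only falls off like 1/L, while the conditional form drops to ~0
# almost immediately: once you've seen one symbol of 0101..., there is no
# surprise left, so the entropy rate of this series is 0.
```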
According to [1], it can be shown that the excess entropy is equivalent to the mutual information between a length-L block of 'future' variables and a length-L block of 'past' variables, in the limit of large L. Lizier noted in a Twitter thread:
Lizier: If you meant 2D excess entropy + time, that's easier - can be done simply as MI between previous and next values of extended spatial templates (e.g. 2x2 blocks) 4/5
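For the 1D case in [1], a crude plug-in estimate of that past/future mutual information is easy to write down: for a stationary sequence, the joint of a past L-block and the adjacent future L-block is just a 2L-block, so I(past; future) = H(L) + H(L) - H(2L). A sketch under those assumptions (the function names are mine, and a serious estimate would want much more data plus some bias correction):

```python
from collections import Counter
from math import log2

def block_entropy(seq, L):
    """Shannon entropy, in bits, of the empirical distribution over
    overlapping length-L blocks of seq."""
    blocks = [seq[i:i + L] for i in range(len(seq) - L + 1)]
    n = len(blocks)
    return -sum((c / n) * log2(c / n) for c in Counter(blocks).values())

def excess_entropy_estimate(seq, L):
    """Finite-L mutual information between adjacent past and future
    length-L blocks: I(past; future) = 2*H(L) - H(2L)."""
    return 2 * block_entropy(seq, L) - block_entropy(seq, 2 * L)

x = "01" * 24
print(round(excess_entropy_estimate(x, 4), 3))
# ~1 bit: exactly the "1 bit of memory" of the periodic series above,
# even though its entropy rate is 0.
```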