I think this might be one of the most oversold math papers I've ever read. The core idea of the paper is a tiny little nugget of common sense buried under a mountain of notation. This not only makes the paper hard to read, but also camouflages the heavy-handed assumptions that are necessary to make the argument work.

### No Lunch At All

The learning scenario that Wolpert studies in the paper is one in which we want to guess the values of a function *f*: *X* → *Y* when evaluated on input values that did not occur in our training data. We do so by selecting an estimate *h*: *X* → *Y*, hoping that *h*(*x*) will agree with *f*(*x*) on the points *x* we haven't yet seen.

If there are no restrictions on how *f* can be chosen, and all of the exponentially many underlying functions are possible, then no such generalization is possible. Whatever *h* we choose, Wolpert can always choose *f* so as to disagree with every single choice that our *h* makes outside the training set, or (if he should so desire) so as to agree with all of them.

In other words, if everything is possible, then there is no statistical dependency between the past and the future. Obviously, this means that learning is impossible.
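This adversarial argument can be checked by brute force on a toy domain. Here is a minimal sketch (the three-point binary setup and all names are my own, not Wolpert's): averaged over every possible target *f*, any hypothesis *h* scores exactly chance on the unseen points.

```python
from itertools import product

# Toy setup (hypothetical): three unseen input points, binary outputs.
unseen = [0, 1, 2]
all_targets = list(product([0, 1], repeat=len(unseen)))  # all 2^3 possible f's

def off_training_accuracy(h):
    """Accuracy of h on the unseen points, averaged over every possible f."""
    hits = sum(sum(hx == fx for hx, fx in zip(h, f)) for f in all_targets)
    return hits / (len(all_targets) * len(unseen))

# Whatever h we pick, exactly half of the remaining f's disagree at each point.
for h in product([0, 1], repeat=len(unseen)):
    assert off_training_accuracy(h) == 0.5
```

For every guess at every point, half of the surviving targets agree and half disagree, which is exactly the "everything is possible" pathology described above.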

### Extremely Large and Incredibly Bold

Let me just elaborate this last reformulation a bit more:

Suppose your data consists of a binary sequence of length *m*, and your goal is to predict the next *n* binary digits in the sequence. If all of the 2^{n + m} possible sequences are in your hypothesis space, then the mutual information between the test and training set is

*I*(train; test) = *H*(test) – *H*(test | train) = log₂ 2^{n} – log₂(2^{n + m}/2^{m}) = *n* – *n* = 0.

In such a case, you might as well guess randomly. However, this only follows because the set of possible sequences has an entropy rate of 1, and because you only care about exactly correct guesses (otherwise the sequence 1/2, 1/2, 1/2, … might outperform random 0/1 guessing).
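The zero-information claim can be verified mechanically for small *m* and *n* by computing the mutual information from the joint distribution (the sizes below are my own illustrative choices):

```python
from collections import Counter
from itertools import product
from math import log2

m, n = 3, 2  # training length and test length (small illustrative values)
seqs = list(product([0, 1], repeat=m + n))  # all 2^(m+n) equally likely sequences
p = 1.0 / len(seqs)

joint = Counter((s[:m], s[m:]) for s in seqs)
p_train = Counter(s[:m] for s in seqs)
p_test = Counter(s[m:] for s in seqs)

# I(train; test) = sum over (a, b) of p(a,b) * log2( p(a,b) / (p(a) p(b)) )
I = sum(
    (c * p) * log2((c * p) / ((p_train[a] * p) * (p_test[b] * p)))
    for (a, b), c in joint.items()
)
print(round(I, 10))  # 0.0: the past carries no information about the future
```

Every training prefix is compatible with every test continuation, so the joint distribution factorizes exactly and the mutual information vanishes.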

### If It's Independent, It's Independent

To say this in a more Wolpert way, let's make the following definitions:

- A loss measure
*L* is "**homogenous**" if the sum of the loss probabilities given the correct answers, that is,
Σ_{y} Pr{*L* = *c* | *y*}, is independent of our choice of the estimate *h*.

- For finite |*X*| and |*Y*|, we can define a learning problem as **uniform** if all of the |*Y*|^{|X|} possible functions from *X* to *Y* are equally probable.
- When we use a homogenous loss function in a uniform learning problem, then
Pr{*L* = *c*} = Σ_{y} Pr{*L* = *c* | *y*} Pr{*y*} = 1/|*Y*| Σ_{y} Pr{*L* = *c* | *y*}

is independent of *h*, and therefore *E*[*L* | *h*] = *E*[*L*].

Combining homogeneity and uniformity is thus the same as assuming that the loss is independent of the learning algorithm.
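The homogeneity condition can be checked directly for the zero-one loss on a single point (a toy sketch; the three-element *Y* and the helper names are mine):

```python
# Per-point homogeneity check for the zero-one loss: for any guess g,
# the sum over correct answers y of Pr{L = c | y} does not depend on g.
Y = [0, 1, 2]

def homogeneity_sum(g, c):
    """Sum_y Pr{L = c | y} for the zero-one loss L = [g != y]."""
    return sum(int(int(g != y) == c) for y in Y)

for c in (0, 1):
    sums = {homogeneity_sum(g, c) for g in Y}
    assert len(sums) == 1  # the sum is independent of the guess g

# With a uniform prior Pr{y} = 1/|Y|, dividing this sum by |Y| gives
# Pr{L = c}, exactly as in the bullet point above.
```

For *c* = 0 the sum is always 1 (only *y* = *g* gives zero loss), and for *c* = 1 it is always |*Y*| − 1, whichever *g* we pick.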

### The Don't-Get-Cute Condition

The zero-one loss function (which charges you one point for each incorrect answer) is "homogenous" in this sense, but it is pretty much the only one. For most other loss functions, it will often make sense to use some central value as your guess for *f*(*x*), since extreme guesses are penalized more heavily under a uniform selection scheme.

As an example, consider the loss function *L*(*f*(*x*), *h*(*x*)) = |*f*(*x*) – *h*(*x*)|, and suppose we are trying to guess an unknown number in the set {1, 2, 3, 4, 5}. Then the probability that *L* = 4, for instance, depends heavily on your estimator *h*, since an estimate that never returns the values 1 or 5 can never achieve this loss. Numerical deviation is therefore not "homogenous," and neither is the ordinary squared deviation.
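A quick numerical check of this example (the helper function is my own):

```python
from collections import Counter

# The example above: f(x) is uniform over {1,...,5}, loss L = |f(x) - h(x)|.
# The probability Pr{L = 4} depends on the guess h(x), so the loss is not
# "homogenous" in Wolpert's sense.
Y = [1, 2, 3, 4, 5]

def loss_dist(guess):
    """Distribution of |f(x) - guess| when f(x) is uniform over Y."""
    return Counter(abs(y - guess) for y in Y)

print(loss_dist(1)[4] / len(Y))  # 0.2: guessing 1 can miss by 4 (when f(x) = 5)
print(loss_dist(3)[4] / len(Y))  # 0.0: the central guess 3 can never lose 4
```

The central guess also has the lowest expected absolute loss, which is precisely the "get cute" strategy that homogeneity rules out.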

### The Maximally Expensive Lunch

Naturally, Wolpert wants to discuss the relation between his own paper and the uniform law of large numbers, proven by Vapnik and Chervonenkis in the late '60s.

He therefore states that for a homogenous loss function in a uniform learning problem, the empirical loss on the training data is statistically independent of the loss on the test data. This should be obvious, since he is effectively assuming that the loss has a fixed distribution independent of the learning method.

Unfortunately, he then goes on to state that this has "strong implications for the uniform convergence formalism for studying supervised learning" (p. 1369), although he is in fact just giving a finite-sized example of just that theory. He elaborates:

> Assume that our learning algorithm has a very low VC dimension. Since [the empirical error] is low and [the sample size] is large, we might hope that our generalization error will be low, independent of any assumptions concerning the target. (This is one common way people try to interpret the VC theorems.)
>
> However, according to the results presented above, low [empirical error], large [sample size], and low VC dimension, by themselves, provide no such assurances concerning the [off-the-training-set] error (unless one can somehow a priori rule out a uniform [prior over the possible generalizations]—not to mention rule out any other prior having even more dire implications for generalization performance). (p. 1369)

What he fails to realize is that he has explicitly made sure that the underlying process producing the data has maximal capacity, so no learning rule with a low VC dimension could learn to predict the test data, on average, above chance level.

This is not because low empirical error doesn't carry over from training to test data, but rather because nobody can achieve a low expected empirical error in his learning framework. What he has verified is thus a specific example of the VC theorem: if the capacity of your theory is low, and it has a high error on the training data, then anything might happen.
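This last point can be illustrated with a minimal low-capacity class (the setup is my own, not Wolpert's): take the two constant classifiers, which have VC dimension 1, and fit them to uniformly random labels. Even the better of the two has an expected training error that climbs toward chance as the sample grows, so the "low empirical error" premise of the VC bound is unattainable on average in a uniform learning problem.

```python
from itertools import product
from statistics import mean

def expected_best_error(m):
    """E[min training error over {always-0, always-1}] on m uniform random bits."""
    return mean(
        min(sum(bits), m - sum(bits)) / m
        for bits in product([0, 1], repeat=m)
    )

# The best-in-class training error rises toward chance (0.5) as m grows,
# so the VC bound's premise fails on average under the uniform prior.
for m in (2, 4, 8, 12):
    print(m, round(expected_best_error(m), 3))
```

With such a high training error, the VC bound promises nothing about the test error, which is consistent with, rather than contrary to, Wolpert's result.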