Saturday, July 26, 2014

Wald: Sequential Analysis (1947)

I've always thought that Shannon's insights in the 1940s papers seemed like pure magic, but this book suggests that at parts of his way of thinking were already in the air at the time: From his own frequentist perspective, Wald comes oddly close to defining a version of information theory.

The central question of the book is:
How can we devise decision procedures that map observations into {Accept, Reject, Continue} in such a way that (1) the probability of wrongly choosing Accept or Reject is low, and (2) the expected number of Continue decisions is low?
Throughout the book, the answer that he proposes is to use likelihood ratio tests. This puts him strangely close to the Bayesian tradition, including for instance Chapter 5 of Jeffreys' Theory of Probability.

A sequential binomial test ending in rejection (p. 94)

In particular, the coin flipping example that Jeffreys considers in his Chapter 5.1 is very close in spirit to the sequential binomial test that Wald considers in Chapter 5 of Sequential Analysis. However, some differences are:
  • Wald compares two given parameter values p0 and p1, while Jeffreys compares a model with a free parameter to one with a fixed value for that parameter.
  • Jeffreys assigns prior probabilities to everything; but Wald only uses the likelihoods given the two parameters. From his perspective, the statistical test will thus have different characteristics depending on what the underlying situation is, and he performs no averaging over these values.
This last point also means that Wald is barred from actually inventing the notion of mutual information, although he comes very close. Since he cannot take a single average over the log-likelihood ratio, he cannot compute any single statistic, but always has to bet on two horses simultanously.

Thursday, July 24, 2014

Wolpert: "The Lack of A Priori Distinctions Between Learning Algorithms" (1996)

I think this might be one of the most oversold math papers I've ever read. The core idea of the paper is a tiny little nugget of common sense buried under a mountain of notation. This not only makes the paper hard to read, but also camouflages the heavy-handed assumptions that are necessary to make the argument work.

No Lunch At All

The learning scenario that Wolpert studies in the paper is one in which we want to guess the values of a function f: X → Y when evaluated on input values that did not occur in our training data. We do so by selecting an estimate hX → Y, hoping that f(x) will agree with f(x) on the points x we haven't yet seen.

If there no restrictions on how f can be chosen, and all of the exponentially many underlying functions are possible, then no such generalization is possible. Whatever h we choose, Wolpert can always choose f so as to disagree with every single choice that our h makes outside the training set, or (if he should so desire) so as to agree with all of them.

In other words, if everything is possible, then there is no statistical dependency between the past and the future. Obviously, this means that learning is impossible.

Extremely Large and Incredibly Bold

Let me just elaborate this last reformation a bit more:

Suppose your data consists of a binary sequence of length m, and your goal is to predict the next n binary digits in the sequence. If all of the 2n + m possible sequences are in your hypothesis space, then the mutual information between the test and training set is
I(train; test)  =  H(test)  –  H(test | train)  =  2n  –  2n + m/2m  =  0.
In such a case, you might as well guess randomly. However, this only follows because the set of possible sequences has an entropy rate of 1, and because you only care about exactly correct guesses (otherwise the sequence 1/2, 1/2, 1/2, … might outperform random 0/1 guessing).

If It's Independent, It's Independent

To say this in a more Wolpert way, let's make the following definitions:
  • A loss measure L is "homogenous" if our choice of the estimate h is independent of the sum of the loss probabilities given the correct answers, that is,
    Σy Pr{L = c | y}.
  • For finite |X| and |Y|, we can define a learning problem as uniform if all of the |Y||X| possible functions from X to Y are equally probable.
  • When we use a homogenous loss function in a uniform learning problem, then
    Σy Pr{L = c | y}  =  1/|Y| Σy Pr{L = c | y} Pr{y}  =  1/|Y| Pr{L = c}
    is independent of h, and therefore E[L | h] = E[L].
Combining homogeneity and uniformity is thus the same as assuming that the loss is independent of the learning algorithm.

The Don't-Get-Cute Condition

The zero-one loss function (which gives you one point for each correct answer) is "homogenous" in this sense, but it is pretty much the only one. For most other loss functions, it will often make sense to use some central value as your guess for f(x), since extreme guesses are panelized more heavily under a uniform selection scheme.

As an example, consider the loss function L(f(x), h(x)) = |f(x) – h(x)|, and suppose we are trying to guess an unknown number in the set {1, 2, 3, 4, 5}. Then the probability that L = 4, for instance, depends heavily on your estimator h, since an estimate that never returns the values 1 or 5 can never achieve this loss. Numerical deviation is therefore not "homogenous," and neither is ordinary squared deviations.

The Maximally Expensive Lunch

Naturally, Wolpert wants to discuss the relation between his own paper and the uniform law of large numbers, proven by Vapnik and Chervonenkis in the late '60s.

He therefore states that for a homogenous loss function in a uniform learning problem, the empirical loss on the training data is statistically independent of the loss on the test data. This should be obvious, since he is effectively assuming that the loss has a fixed distribution independent of the learning method.

Unfortunately, he then goes on to state that this has "strong implications for the uniform convergence formalism for studying supervised learning" (p. 1369), although he is in fact just giving a finite-sized example of just that theory. He elaborates:
Assume that our learning algorithm has a very low VC dimension. Since [the empirical error] is low and [the sample size] is large, we might hope that our generalization error will be low, independent of any assumptions concerning the target. (This is one common way people try to interpret the VC theorems.) 
However according to the results presented above, low [empirical error], large [sample size], and low VC dimension, by themselves, provide no such assurances concerning the [off-the-training-set] error (unless one can somehow a priori rule out a uniform [prior over the possible generlizations]—not to mention rule out any other prior having even more dire implications for generalization performance). (p. 1369)
What he fails to realize is that he has explicitly made sure that the underlying process producing the data has ceiling capacity — so no learning rule with a low VC dimension could therefore learn to predict the test data, on average, above chance level.

This is not because low empirical error doesn't carry over from training to test data, but rather nobody can achieve a low expected empirical error in his learning framework. What he has verified is thus a specific example of the VC theorem: If the capacity of your theory is low, and it has a high error on the training data, then anything might happen.

Tuesday, July 22, 2014

Jaffe and Quinn: "Theoretical Mathematics" (1993)

Shaking the moralistic finger at sloppy mathematics, this paper suggests that the field should be subjected to certain "prescriptions" in order to avoid errors, priority disputes, and confusions about the reliability of proofs.

As a means to this end, Jaffe and Quinn suggest that we understand the activity of writing mathematical proofs according to the analogy
theoretical physics : experimental physics = speculative math : "rigorous" math
This leads them to adopt the odd and misleading term "theoretical mathematics" for what is essentially mathematics without detailed proofs. They suggest that some time in the future, "theoretical mathematics" might branch off from rigorous mathematics and become a discipline of its own, like theoretical physics.

This seems to misconstrue the motivations have for writing heuristic or intuitive proofs in mathematics.

Most professional mathematicians prefer to give detailed and formal proofs whenever they can, and they only deviate from that norm when they absolutely have to. For a theoretical physicist, on the other hand, there is no shame in having no experimental expertise. The analogy thus seems quite misleading, and trying to drive a wedge in between the practice of conjecturing and the practice of proving mathematical claims consequently misconstrues the ethos of mathematics.

Another thing which is never dwelled upon in the paper (but mentioned in Jeremy Gray's response to the paper) is that even "rigorous" proofs can contain mistakes.

This can be the case even for computer-verified proofs, since there is a very realistic chance of bugs, typos, or even misinterpretations of the code (e.g., subtle differences between an axiom system and the geometric domain it is intended to formalize). When Jaffe and Quinn assume that we can recognize a "rigorous" proof when we see it, and that rigorous proofs are always correct, they are thus thinking in terms of about a perfect-world mathematics, not actual-world mathematics. As a consequence, they imagine that the whole enterprise of mathematics could be "corrected" into a steadily advancing literature of absolutely certain theorems piled on top of each other.

Of course, everybody wants to give the best arguments they can, and honesty about the quality of the argument would be laudable. But in a world where quality is gradable, it would seem that any step taken toward a realization of Jaffe and Quinn's dream would kill off the whole range of mutant forms in between complete guesswork and completely "rigorous" math (however we measure rigor). Since some of these mutants would have gone on to become the standard of future mathematics, this would almost certainly be a scientific loss, in addition to being very boring.

Thursday, June 12, 2014

Kullback and Leibler: "On Information and Sufficiency" (1951)

This paper is mostly known for introducing the concept of divergence into statistics. It's quite notation-heavy but actually quite straightforward if you've already solved exercise 3.9 in Cover and Thomas' textbook. Everything is about log-likelihood ratios that fly off into infinity.

The paper is concerned quite a lot with the question of sufficient statistics. An interesting illustration in the last paragraph (§6) concerns the problem of how to choose the best non-sufficient statistic when we must (for some extrinsic reason).

More specifically, the set up is the following: You are observing samples from a categorical distribution with four different categories, and you want to distinguish the two following hypotheses:
x 1 2 3 4
Pr(x | A)   .25   .25   .25   .25 
Pr(x | B)  .50   .30   .10   .10 

However, you are not actually allowed to see the original data set; instead, you have to lump the categories into two groups, and you will then be told which group the observation fell. (This corresponds to using an indicator function as your statistic.)

In order to solve this exercise, they quantify the informativeness of the statistic in terms of the symmetrized divergence between the two hypotheses:
J(A, B)  ==  D(A || B) + D(B || A).
I guess a more cautious measure would have been the minimum of the two, since this would provide a lower bound on the growth rate of the likelihood ratio whichever of the two hypotheses were true.

Using base 10 logarithms, the divergence from the uniform to the skewed distribution (A to B) is .1039, and the divergence from the skewed to the uniform (B to A) is .0947. The symmetrized divergence J(A, B) is thus about .1986, and the minimum is .0947.

Grouping the categories in various ways, we can also find the following divergences:

Grouping  D(A || B)  D(A || B)   J(A, B     min    
{}, {1, 2, 3, 4}   0   0   0 
{1}, {2, 3, 4} .5068 .0635 .1193.0635
{2}, {1, 3, 4} .0027 .0028 .0055 .0027
{3}, {2, 3, 4} .0401 .0315 .0716 .0315
{4}, {2, 3, 4} .0401 .0315 .0716 .0315
{1, 2}, {3, 4} .0969 .0837 .1806 .0837
{1, 3}, {2, 4} .0089 .0087 .0176 .0087
{1, 4}, {2, 3} .0089 .0087 .0176 .0087

In the paper, Kullback and Leibler report that the best grouping is {{1}, {2, 3, 4}}, which indicates that they might have used D(A || B) as their objective function. This is a bit strange, since everything they say in the text indicates that they are thinking exclusively in terms of the symemtrized divergence J. I don't know what to conclude from that.

Monday, June 2, 2014

Fisher: Statistical Methods and Scientific Inference (1956), chapter 5.7

In chapter V of his book on statistical inference, Fisher compares various likelihood-based approaches to coming up with a predictive distribution. Some of the details are quite obscure, and I have problems following his argument in several places.

Section 2 is dedicated to to the Bayesian solution for a series of coin flips. It contains, among other things, a strange reference to coefficients other than the binomials, suggesting that these can be replaced freely with any other suitable polynomial (p. 112). I am not quite sure what he means or how he imagines this should be justified.

Section 7 is dedicated to a particular type of frequentist prediction. The set-up is that we have observed the counts a and b (heads and tails) and that we would like to find the likelihood of a continuation having counts c and d.

In order to find this, he suggests that we compute the likelihood ratio
Pr(a, b | f) Pr(c, d | f) / Pr(a + c, b + d | f).
The logic behind this computation is that a/b is close to c/d, then the joint probability of those two ratios is approximately the same as the probability of the pooled ratio (a + c)/(b + d). If, on the other hand, they both deviate highly from the maximum likelihood estimate (in different directions), then the joint probability will be lower than the pooled reference value. However, the actual value of the parameter of the coin flip cancels out from the fraction and thus plays no direct role in the computation.

Fisher goes through an example in which a = 3, b = 16, c = 14, d = 7. In order to compute the many factorials for this example, he uses the (very rough) approximation
x! ≈ xx.
Or at least that's what I think he does — he comments that this xx is "among others having the same margins," whatever that is supposed to mean (p. 129).

N! (green) and the approximation NN (red).

At any rate, we can redo his computation using both his own approximate method and the exact formula, getting a likelihood of about .004 (Fisher's result) or .001 (the exact result).

As he suggests on page 130, we can also compile a table of likelihoods for various other values of c and d adding to 21. We can collect all of these results in a table like the following:


As the table shows, the exact likelihoods coincide with the Bayes estimates if we normalize them. I think this is only the case because the numbers are large enough because of an issue with the normalizing constants in a Dirichlet distribution, but I don't have time to check the details now.

Friday, May 30, 2014

Fisher: "On the Mathematical Foundations of Theoretical Statistics" (1921)

I haven't had time to study this paper in detail yet, but based on a quick skim, it seems that Fisher
I'll read the whole thing later. But for now, a few quotes.

First, a centerpiece in Bayes' original paper was the postulate that the uncertainty about the bias of a coin should be represented by means of a uniform distribution. Fisher comments:
The postulate would, if true, be of great importance in bringing an immense variety of questions within the domain of probability. It is, however, evidently extremely arbitrary. Apart from evolving a vitally important piece of knowledge, that of the exact form of the distribution of values of p, out of an assumption of complete ignorance, it is not even a unique solution. (p. 325)
Second Bayesian topic is ratio tests: That is, assigning probabilities to two exclusive and exhaustive hypotheses X and Y based on the ratio between how well they explain the data set A, that is,
Fisher in 1931; image from the National Portrait Gallery.
Pr(A | X) / Pr(A | Y).
Fisher comments:
This amounts to assuming that before A was observed, it was known that our universe had been selected at random for [= from] an infinite population in which X was true in one half, and Y true in the other half. Clearly such an assumption is entirely arbitrary, nor has any method been put forward by which such assumptions can be made even with consistent uniqueness. (p. 326)
The introduction of the likelihood concept:
There would be no need to emphasise the baseless character of the assumptions made under the titles of inverse probability and BAYES' Theorem in view of the decisive criticism to which they have been exposed at the hands of BOOLE, VENN, and CHRYSTAL, were it not for the fact that the older writers, such as LAPLACE and POISSON, who accepted these assumptions, also laid the foundations of the modern theory of statistics, and have introduced into their discussions of this subject ideas of a similar character. I must indeed plead guilty in my original statement of the Method of the Maximum Likelihood (9) to having based my argument upon the principle of inverse probability; in the same paper, it is true, I emphasised the fact that such inverse probabilities were relative only. That is to say, that while we might speak of one value of p as having an inverse probability three times that of another value of p, we might on no account introduce the differential element dp, so as to be able to say that it was three times as probable that p should lie in one rather than the other of two equal elements. Upon consideration, therefore, I perceive that the word probability is wrongly used in such a connection: probability is a ratio of frequencies, and about the frequencies of such values we can know nothing whatever. We must return to the actual fact that one value of p, of the frequency of which we know nothing, would yield the observed result three times as frequently as would another value of p. If we need a word to characterise this relative property of different values of p, I suggest that we may speak without confusion of the likelihood of one value of p being thrice the likelihood of another, bearing always in mind that likelihood is not here used loosely as a synonym of probability, but simply to express the relative frequencies with which such values of the hypothetical quantity p would in fact yield the observed sample. (p. 326)
In the conclusion, he says that likelihood and probability are "two radically distinct concepts, both of importance in influencing our judgment," but "confused under the single name of probability" (p. 367) Note that these concepts are "influencing our judgment" — that is, they are not just computational methods for making a decision, but rather a kind of model of a rational mind.

Wednesday, May 28, 2014

Chomsky: "Three Models for the Description of Language" (1956)

This is my favorite Chomsky text, perhaps after Syntactic Structures. It contains a comparison of finite-state, phrase-structure, and context-sensitive languages; it also suggests that a transformational theory is the most illuminating generative story for English sentences.

A structural ambiguity; p. 118.

Garden Varieties

Among other things, the paper contains the following examples of formal languages (p. 115):
  • "Mirror-image" sentences: aa, bb, abba, baab, aabbaa, …
  • Echo sentences: aa, bb, abab, baba, aabaab, …
  • Counting sentences: ab, aabb, aaabbb, …
The counting language is also used to show that the set of phrase-structure languages is a proper subset of the set of context-sensitive languages (p. 119).

A Markov model hidden states (and thus arbitrary dependence lengths); p. 116.

Irrelevant to Grammar

The paper also contains the familiar jump from a rejection of Markov models to a rejection of statistical models at large:
Whatever the other interest of statistical approximation in this sense may be, it is clear that it can shed no light on the problems of grammar. There is no general relation between the frequency of a string (or its component parts) and its grammaticalness. We can see this most clearly by considering such strings as
(14) colorless green ideas sleep furiously
which is a grammatical sentence, even though it is fair to assume that no pair of its words may ever have occurred together in the past. (p. 116)
there is no significant correlation between order of approximation and grammaticalness. If we order the strings of a given length in terms of order of approximation to English, we shall find both grammatical and ungrammatical strings scattered throughout the list, from top to bottom. Hence the notion of statistical approximation appears to be irrelevant to grammar. (p. 116)
I suppose "order of approximation" here means "probability" rather literally the "order of the Markov model" (otherwise the this assertion doesn't make much sense).