Tuesday, July 22, 2014

Jaffe and Quinn: "Theoretical Mathematics" (1993)

Shaking the moralistic finger at sloppy mathematics, this paper suggests that the field should be subjected to certain "prescriptions" in order to avoid errors, priority disputes, and confusions about the reliability of proofs.

As a means to this end, Jaffe and Quinn suggest that we understand the activity of writing mathematical proofs according to the analogy
theoretical physics : experimental physics = speculative math : "rigorous" math
This leads them to adopt the odd and misleading term "theoretical mathematics" for what is essentially mathematics without detailed proofs. They suggest that some time in the future, "theoretical mathematics" might branch off from rigorous mathematics and become a discipline of its own, like theoretical physics.

This seems to misconstrue the motivations mathematicians have for writing heuristic or intuitive proofs.

Most professional mathematicians prefer to give detailed and formal proofs whenever they can, and they only deviate from that norm when they absolutely have to. For a theoretical physicist, on the other hand, there is no shame in having no experimental expertise. The analogy thus seems quite misleading, and trying to drive a wedge between the practice of conjecturing and the practice of proving mathematical claims consequently misconstrues the ethos of mathematics.

Another thing which is never dwelled upon in the paper (but mentioned in Jeremy Gray's response to the paper) is that even "rigorous" proofs can contain mistakes.

This can be the case even for computer-verified proofs, since there is a very realistic chance of bugs, typos, or even misinterpretations of the code (e.g., subtle differences between an axiom system and the geometric domain it is intended to formalize). When Jaffe and Quinn assume that we can recognize a "rigorous" proof when we see it, and that rigorous proofs are always correct, they are thus thinking in terms of a perfect-world mathematics, not actual-world mathematics. As a consequence, they imagine that the whole enterprise of mathematics could be "corrected" into a steadily advancing literature of absolutely certain theorems piled on top of each other.

Of course, everybody wants to give the best arguments they can, and honesty about the quality of the argument would be laudable. But in a world where quality is gradable, it would seem that any step taken toward a realization of Jaffe and Quinn's dream would kill off the whole range of mutant forms in between complete guesswork and completely "rigorous" math (however we measure rigor). Since some of these mutants would have gone on to become the standard of future mathematics, this would almost certainly be a scientific loss, in addition to being very boring.

Thursday, June 12, 2014

Kullback and Leibler: "On Information and Sufficiency" (1951)

This paper is mostly known for introducing the concept of divergence into statistics. It's quite notation-heavy but actually quite straightforward if you've already solved exercise 3.9 in Cover and Thomas' textbook. Everything is about log-likelihood ratios that fly off into infinity.

The paper is concerned quite a lot with the question of sufficient statistics. An interesting illustration in the last paragraph (§6) concerns the problem of how to choose the best non-sufficient statistic when we must (for some extrinsic reason).

More specifically, the set up is the following: You are observing samples from a categorical distribution with four different categories, and you want to distinguish the two following hypotheses:
x            1     2     3     4
Pr(x | A)   .25   .25   .25   .25
Pr(x | B)   .50   .30   .10   .10

However, you are not actually allowed to see the original data set; instead, you have to lump the categories into two groups, and you will then be told which group the observation fell into. (This corresponds to using an indicator function as your statistic.)

In order to solve this exercise, they quantify the informativeness of the statistic in terms of the symmetrized divergence between the two hypotheses:
J(A, B) = D(A || B) + D(B || A).
I guess a more cautious measure would have been the minimum of the two, since this would provide a lower bound on the growth rate of the likelihood ratio whichever of the two hypotheses were true.

Using base 10 logarithms, the divergence from the uniform to the skewed distribution (A to B) is .1039, and the divergence from the skewed to the uniform (B to A) is .0947. The symmetrized divergence J(A, B) is thus about .1986, and the minimum is .0947.
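For concreteness, here is a small Python sketch (mine, not from the paper) that reproduces these base-10 divergences for the two distributions tabulated above:

```python
import math

A = [0.25, 0.25, 0.25, 0.25]
B = [0.50, 0.30, 0.10, 0.10]

def kl(p, q):
    """Kullback-Leibler divergence D(p || q), using base-10 logarithms."""
    return sum(pi * math.log10(pi / qi) for pi, qi in zip(p, q) if pi > 0)

d_ab, d_ba = kl(A, B), kl(B, A)
print(d_ab, d_ba, d_ab + d_ba)   # roughly .1039, .0947, and J(A, B) = .1986
```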

Grouping the categories in various ways, we can also find the following divergences:

Grouping            D(A || B)   D(B || A)   J(A, B)   min
{}, {1, 2, 3, 4}      0           0           0         0
{1}, {2, 3, 4}        .5068       .0635       .1193     .0635
{2}, {1, 3, 4}        .0027       .0028       .0055     .0027
{3}, {1, 2, 4}        .0401       .0315       .0716     .0315
{4}, {1, 2, 3}        .0401       .0315       .0716     .0315
{1, 2}, {3, 4}        .0969       .0837       .1806     .0837
{1, 3}, {2, 4}        .0089       .0087       .0176     .0087
{1, 4}, {2, 3}        .0089       .0087       .0176     .0087
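As a sanity check on one row (my own sketch, not from the paper), the snippet below applies the indicator statistic by summing the probabilities within each group and then computes the same quantities for the grouping {1, 2}, {3, 4}:

```python
import math

A = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
B = {1: 0.50, 2: 0.30, 3: 0.10, 4: 0.10}

def kl(p, q):
    """D(p || q) with base-10 logarithms."""
    return sum(pi * math.log10(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def lump(dist, groups):
    """The grouped statistic only reports which group the observation fell into."""
    return [sum(dist[x] for x in g) for g in groups]

groups = [{1, 2}, {3, 4}]
pA, pB = lump(A, groups), lump(B, groups)
d_ab, d_ba = kl(pA, pB), kl(pB, pA)
print(d_ab, d_ba, d_ab + d_ba, min(d_ab, d_ba))
# roughly .0969, .0837, .1806, .0837 -- the {1, 2}, {3, 4} row above
```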

In the paper, Kullback and Leibler report that the best grouping is {{1}, {2, 3, 4}}, which indicates that they might have used D(A || B) as their objective function. This is a bit strange, since everything they say in the text indicates that they are thinking exclusively in terms of the symmetrized divergence J. I don't know what to conclude from that.

Monday, June 2, 2014

Fisher: Statistical Methods and Scientific Inference (1956), chapter 5.7

In chapter V of his book on statistical inference, Fisher compares various likelihood-based approaches to coming up with a predictive distribution. Some of the details are quite obscure, and I have problems following his argument in several places.

Section 2 is dedicated to the Bayesian solution for a series of coin flips. It contains, among other things, a strange reference to coefficients other than the binomials, suggesting that these can be replaced freely with any other suitable polynomial (p. 112). I am not quite sure what he means or how he imagines this should be justified.

Section 7 is dedicated to a particular type of frequentist prediction. The set-up is that we have observed the counts a and b (heads and tails) and that we would like to find the likelihood of a continuation having counts c and d.

In order to find this, he suggests that we compute the likelihood ratio
Pr(a, b | f) Pr(c, d | f) / Pr(a + c, b + d | f).
The logic behind this computation is that if a/b is close to c/d, then the joint probability of those two ratios is approximately the same as the probability of the pooled ratio (a + c)/(b + d). If, on the other hand, they both deviate strongly from the maximum likelihood estimate (in different directions), then the joint probability will be lower than the pooled reference value. However, the actual value of the parameter of the coin flip cancels out of the fraction and thus plays no direct role in the computation.
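To make that cancellation explicit (this step is mine, not spelled out by Fisher), write out the binomial probabilities for a coin with success probability f:

$$\frac{\Pr(a,b \mid f)\,\Pr(c,d \mid f)}{\Pr(a+c,\,b+d \mid f)} = \frac{\binom{a+b}{a} f^{a}(1-f)^{b}\cdot\binom{c+d}{c} f^{c}(1-f)^{d}}{\binom{a+b+c+d}{a+c} f^{a+c}(1-f)^{b+d}} = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{a+b+c+d}{a+c}},$$

so the ratio depends on the counts alone.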

Fisher goes through an example in which a = 3, b = 16, c = 14, d = 7. In order to compute the many factorials for this example, he uses the (very rough) approximation
x! ≈ x^x.
Or at least that's what I think he does — he comments that this x^x is "among others having the same margins," whatever that is supposed to mean (p. 129).

N! (green) and the approximation N^N (red).

At any rate, we can redo his computation using both his own approximate method and the exact formula, getting a likelihood of about .004 (Fisher's result) or .001 (the exact result).
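Here is a quick Python sketch (mine) of both computations for a = 3, b = 16, c = 14, d = 7; the exact version uses the binomial-coefficient form of the ratio, and the approximate one replaces every factorial x! by x^x, as described above:

```python
from math import comb

a, b, c, d = 3, 16, 14, 7

# Exact: the coin's parameter cancels, leaving a ratio of binomial coefficients.
exact = comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

# Fisher's rough device: replace every factorial x! by x**x (note 0**0 == 1 == 0!).
def approx_fact(x):
    return x ** x

def approx_binom(n, k):
    return approx_fact(n) / (approx_fact(k) * approx_fact(n - k))

approx = approx_binom(a + b, a) * approx_binom(c + d, c) / approx_binom(a + b + c + d, a + c)

print(round(exact, 3), round(approx, 3))   # roughly .001 (exact) and .004 (Fisher)
```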

As he suggests on page 130, we can also compile a table of likelihoods for various other values of c and d adding to 21. We can collect all of these results in a table like the following:

Count (c)   Fisher   Exact   Bayes
0           .094     .098    .048
1           .499     .223    .109
2           .836     .309    .151
3           .991     .336    .164
4           .964     .311    .152
5           .817     .256    .125
6           .621     .192    .094
7           .432     .133    .065
8           .277     .085    .042
9           .164     .051    .025
10          .090     .028    .014
11          .046     .015    .007
12          .022     .007    .003
13          .009     .003    .002
14          .004     .001    .001
15          .001     .000    .000
16          .000     .000    .000
Sum         5.867    2.050   1.000

As the table shows, the exact likelihoods coincide with the Bayes estimates once we normalize them. I think this is only the case because of the way the normalizing constants in a Dirichlet distribution work out, but I don't have time to check the details now.
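The coincidence is easy to check numerically. The sketch below is my own; it assumes the Bayes column is the posterior predictive under a uniform prior on the coin's bias, computes the exact likelihoods for every continuation with c + d = 21, normalizes them, and compares the two:

```python
from math import comb, factorial

a, b, n = 3, 16, 21   # observed 3 heads and 16 tails; continuation of 21 flips

def beta(x, y):
    """Beta function B(x, y) for positive integer arguments."""
    return factorial(x - 1) * factorial(y - 1) / factorial(x + y - 1)

# Fisher's exact likelihoods: Pr(a,b|f) Pr(c,d|f) / Pr(a+c,b+d|f) with f cancelled.
exact = [comb(a + b, a) * comb(n, c) / comb(a + b + n, a + c) for c in range(n + 1)]
total = sum(exact)

# Posterior predictive under a uniform (Bayes/Laplace) prior on the bias.
bayes = [comb(n, c) * beta(a + c + 1, b + n - c + 1) / beta(a + 1, b + 1)
         for c in range(n + 1)]

assert all(abs(e / total - p) < 1e-9 for e, p in zip(exact, bayes))
print("normalized exact likelihoods match the uniform-prior predictive")
```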

Friday, May 30, 2014

Fisher: "On the Mathematical Foundations of Theoretical Statistics" (1921)

I haven't had time to study this paper in detail yet, and I'll read the whole thing properly later. But for now, a few quotes based on a quick skim.

First, a centerpiece in Bayes' original paper was the postulate that the uncertainty about the bias of a coin should be represented by means of a uniform distribution. Fisher comments:
The postulate would, if true, be of great importance in bringing an immense variety of questions within the domain of probability. It is, however, evidently extremely arbitrary. Apart from evolving a vitally important piece of knowledge, that of the exact form of the distribution of values of p, out of an assumption of complete ignorance, it is not even a unique solution. (p. 325)
Fisher in 1931; image from the National Portrait Gallery.

The second Bayesian topic is ratio tests: that is, assigning probabilities to two exclusive and exhaustive hypotheses X and Y based on the ratio between how well they explain the data set A, that is,
Pr(A | X) / Pr(A | Y).
Fisher comments:
This amounts to assuming that before A was observed, it was known that our universe had been selected at random for [= from] an infinite population in which X was true in one half, and Y true in the other half. Clearly such an assumption is entirely arbitrary, nor has any method been put forward by which such assumptions can be made even with consistent uniqueness. (p. 326)
The introduction of the likelihood concept:
There would be no need to emphasise the baseless character of the assumptions made under the titles of inverse probability and BAYES' Theorem in view of the decisive criticism to which they have been exposed at the hands of BOOLE, VENN, and CHRYSTAL, were it not for the fact that the older writers, such as LAPLACE and POISSON, who accepted these assumptions, also laid the foundations of the modern theory of statistics, and have introduced into their discussions of this subject ideas of a similar character. I must indeed plead guilty in my original statement of the Method of the Maximum Likelihood (9) to having based my argument upon the principle of inverse probability; in the same paper, it is true, I emphasised the fact that such inverse probabilities were relative only. That is to say, that while we might speak of one value of p as having an inverse probability three times that of another value of p, we might on no account introduce the differential element dp, so as to be able to say that it was three times as probable that p should lie in one rather than the other of two equal elements. Upon consideration, therefore, I perceive that the word probability is wrongly used in such a connection: probability is a ratio of frequencies, and about the frequencies of such values we can know nothing whatever. We must return to the actual fact that one value of p, of the frequency of which we know nothing, would yield the observed result three times as frequently as would another value of p. If we need a word to characterise this relative property of different values of p, I suggest that we may speak without confusion of the likelihood of one value of p being thrice the likelihood of another, bearing always in mind that likelihood is not here used loosely as a synonym of probability, but simply to express the relative frequencies with which such values of the hypothetical quantity p would in fact yield the observed sample. (p. 326)
In the conclusion, he says that likelihood and probability are "two radically distinct concepts, both of importance in influencing our judgment," but "confused under the single name of probability" (p. 367). Note that these concepts are "influencing our judgment" — that is, they are not just computational methods for making a decision, but rather a kind of model of a rational mind.

Wednesday, May 28, 2014

Chomsky: "Three Models for the Description of Language" (1956)

This is my favorite Chomsky text, perhaps after Syntactic Structures. It contains a comparison of finite-state, phrase-structure, and context-sensitive languages; it also suggests that a transformational theory is the most illuminating generative story for English sentences.

A structural ambiguity; p. 118.

Garden Varieties

Among other things, the paper contains the following examples of formal languages (p. 115):
  • "Mirror-image" sentences: aa, bb, abba, baab, aabbaa, …
  • Echo sentences: aa, bb, abab, baba, aabaab, …
  • Counting sentences: ab, aabb, aaabbb, …
The counting language is also used to show that the finite-state languages are a proper subset of the phrase-structure languages (p. 119).
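As a toy illustration (mine, not Chomsky's), here are recognizers for the three languages over the alphabet {a, b}:

```python
def is_mirror(s):        # w followed by its reverse: aa, bb, abba, baab, ...
    return len(s) % 2 == 0 and s == s[::-1]

def is_echo(s):          # w repeated twice: aa, bb, abab, baba, aabaab, ...
    half = len(s) // 2
    return len(s) % 2 == 0 and s[:half] == s[half:]

def is_counting(s):      # a^n b^n: ab, aabb, aaabbb, ...
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

print([w for w in ["ab", "aabb", "abab", "abba", "aab"] if is_counting(w)])
# ['ab', 'aabb']
```

None of the three languages can be recognized by a finite-state device, which is exactly Chomsky's point on p. 115.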

A Markov model with hidden states (and thus arbitrary dependence lengths); p. 116.

Irrelevant to Grammar

The paper also contains the familiar jump from a rejection of Markov models to a rejection of statistical models at large:
Whatever the other interest of statistical approximation in this sense may be, it is clear that it can shed no light on the problems of grammar. There is no general relation between the frequency of a string (or its component parts) and its grammaticalness. We can see this most clearly by considering such strings as
(14) colorless green ideas sleep furiously
which is a grammatical sentence, even though it is fair to assume that no pair of its words may ever have occurred together in the past. (p. 116)
Thus,
there is no significant correlation between order of approximation and grammaticalness. If we order the strings of a given length in terms of order of approximation to English, we shall find both grammatical and ungrammatical strings scattered throughout the list, from top to bottom. Hence the notion of statistical approximation appears to be irrelevant to grammar. (p. 116)
I suppose "order of approximation" here means "probability" rather literally the "order of the Markov model" (otherwise the this assertion doesn't make much sense).

Drum: "Change, Meaning, and Information" (1957)

This 1957 article is one of those countless papers by people in linguistics trying helplessly to say something non-trivial about the relationship between information theory and linguistics. The central claim of the paper is that there are several kinds of predictability or redundancy at play in language, and that "meaning" should be seen as one of them.

More precisely, certain sentences are predictable on account of their content, so that
the "meaning"—the "what we know about it"–in a message has some direct effect upon the amount of information transmitted. (p. 166)
However, this notion is dangerously confused and conflated with the idea that more predictability equals more meaning at several points in the paper.

At any rate, Drum places semantics among several "levels" of redundancy, including a historically interesting division between "formal" and "contextual" syntax:
In the sentence, "The chair is __________," many words could be correctly used in terms of grammar; but in terms of the context of the sentence, only a very few apply. "The chair is clock" is correct grammatically, though it is nonsense in even the broadest contexts. (p. 167)
Further,
A person knowing English grammar but not word meanings, might very well write "The chair is clock," but would never do so if he knew the meanings involved. (The relationship of syntactic to contextual redundancy is much that of validity in logic to truth, incidentally). (p. 168)
A strange counterfactual, but quite revealing.

Herdan: The Advanced Theory of Language as Choice and Chance (1966)

This really odd book is based to some extent on the earlier Language as Choice and Chance (1956), but contains, according to the author, a lot of new material. It discusses a variety of topics in language statistics, often in a rather unsystematic and bizarrely directionless way.

The part about "linguistic duality" (Part IV) is particularly confusing and strange, and I'm not quite sure what to make of it. Herdan seems to want to make some great cosmic connection between quantum physics, natural language semantics, and propositional logic.

But leaving that aside, I really wanted to quote it because he so clearly expresses the deterministic philosophy of language — that speech isn't a random phenomenon, but rather has deep roots in free will.

"Human Willfulness"

He thus explains that language has one region which is outside the control of the speaker, but that once we are familiar with this set of restrictions, we can express ourselves within its bounds:
It leaves the individual free to exercise his choice in the remaining features of language, and insofar language is free. The determination of the extent to which the speaker is bound by the linguistic code he uses, and conversely, the extent to which he is free, and can be original, this is the essence of what I call quantitative linguistics. (pp. 5–6)
Similarly, he comments on a study comparing the letter distribution in two texts as follows:
There can be no doubt about the possibility of two distributions of this kind, in any language, being significantly different, if only for the simple reason that the laws of language are always subject, to some extent at least, to human willfulness or choice. By deliberately using words in one text, which happen to be of rather singular morphological structure, it may well be possible to achieve a significant difference of letter frequencies in the two texts. (p. 59)
And finally, in the chapter on the statistical analysis of style, he goes on record completely:
The deterministic view of language regards language as the deliberate choice of such linguistic units as are required for expressing the idea one has in mind. This may be said to be a definition in accordance with current views. Introspective analysis of linguistic expression would seem to show it is a deterministic process, no part of which is left to chance. A possible exception seems to be that we often use a word or expression because it 'happened' to come into our mind. But the fact that memory may have features of accidental happenings does not mean that our use of linguistic forms is of the same character. Supposing a word just happened to come into our mind while we are in the process of writing, we are still free to use it or not, and we shall do one or the other according to what is needed for expressing what we have in mind. It would seem that the cause and effect principle of physical nature has its parallel in the 'reason and consequence' or 'motive and action' principle of psychological nature, part of which is the linguistic process of giving expression to thought. Our motive for using a particular expression is that it is suited better than any other we could think of for expressing our thought. (p. 70)
He then goes on to say that "style" is a matter of self-imposed constraints, so that we are in fact a bit less free to choose our words than it appears, although we ourselves come up with the restrictions.

"Grammar Load"

In Chapter 3.3, he expands these ideas about determinism and free will in language by suggesting that we can quantify the weight of the grammatical restrictions of a language in terms of a "grammar load" statistic. He suggests that this grammar load can be assessed by counting the number of word forms per token in a sample of text from the language (p. 49). He does discuss entropy later in the book (Part III(C)), but doesn't make the connection to redundancy here.

Inspecting an English corpus of 78,633 tokens, he thus finds 53,102 different forms and concludes that English has a "grammar load" of
53,102 / 78,633 = 67.53%.
Implicit in this computation is the idea that the number of new word forms grows linearly with the number of tokens you inspect. This excludes sublinear spawn rates such as logarithmic or power-law (Heaps'-style) growth.
In fact, vocabulary size does seem to grow sublinearly as a function of corpus size. The spawn rate for new word forms is thus not constant in English text.

Word forms in the first N tokens of the Brown corpus, for N = 0 … 100,000.

However, neither of the growth functions listed above fits the data very well (as far as I can see).
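The growth curve in the figure can be reproduced with a few lines of Python, assuming NLTK and its copy of the Brown corpus are available (the tokenization and case handling here are my choices, not Herdan's):

```python
# Type/token growth ("grammar load") over the first 100,000 tokens of the Brown corpus.
# Assumes: pip install nltk, then nltk.download('brown').
from nltk.corpus import brown

tokens = [w.lower() for w in brown.words()][:100_000]
seen = set()
for n, token in enumerate(tokens, start=1):
    seen.add(token)
    if n % 10_000 == 0:
        print(f"{n:>7} tokens: {len(seen):>6} forms  (load = {len(seen) / n:.2%})")
```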