Tuesday, April 15, 2014

Fisher: "Statistical Methods and Scientific Induction" (1955)

Ronald Fisher; image from Wikimedia Commons.
In this brief paper, Sir Ronald Fisher militates against what he sees as wrong and absurd interpretations of the notion of a statistical test.

The Ideology of Statistics

The core of his argument is that a test only gives positive information when it yields a significant difference and thus warrants the rejection of a hypothesis; an absence of a significant difference does not mean "accept." He contends that
… this difference in point of view originated when Neyman, thinking that he was correcting and improving my own early work on tests of significance, … in fact reinterpreted them in terms of that technological and commercial apparatus which is known as an acceptance procedure. (p. 69)
And although acceptance procedures might be good enough for commerce, they have no place in science:
I am casting no contempt on acceptance procedures, and I am thankful, whenever I travel by air, that the high level of precision and reliability required can really be achieved by such means. But the logical differences between such an operation and the work of scientific discovery by physical or biological experimentation seem to me so wide that the analogy between them is not helpful, and the identification of the two sorts of operation is decidedly misleading. (pp. 69–70)
Then comes the juicy part:
I shall hope to bring out some of the logical differences more distinctly, but there is also, I fancy, in the background an ideological difference. Russians are made familiar with the ideal that research in pure science can and should be geared to technological performance, in the comprehensive organized effort of a five-year plan for the nation. How far, within such a system, personal and individual inferences from observed facts are permissible we do not know, but it may be safer, and even, in such a political atmosphere, more agreeable, to regard one's scientific work simply as a contributary element in a great machine, and to conceal rather than to advertise the selfish and perhaps heretical aim of understanding for oneself the scientific situation. In the U.S. also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions, with those aimed rather at, let us say, speeding production, or saving money. There is therefore something to be gained by at least being able to think of our scientific problems in a language distinct from that of technological efficiency. (p. 70)
So there you have it: In the technological regime of either of the two Cold War superpowers, "learning," "inference," and private, inner thought are taboo, according to Fisher. Presumably we are to contrast this with the aims of British science going back to Newton.

The Three Issues

Fisher singles out three phrases that he finds particularly offensive in scientific statistics:
  1. "Repeated sampling from the same distribution"
  2. Errors of the "second kind"
  3. "Inductive behaviour"
I'll discuss these one by one.

1. "Repeated sampling from the same distribution"

The issue with the first one is not completely clear to me, but here is what I make of his discussion (pp. 71–72): Suppose you are performing a test to see whether the mean of some population has a specific value; suppose further that the standard deviation of that population is unknown, but that you have estimated it based on the available sample.

The problem then is, if I understand Fisher correctly, that the test depends on the standard deviation being constant and known, but in reality, it is an unknown quantity that you have estimated by a maximum likelihood method. This is, strictly speaking, illegitimate, since any estimate should be based on a numerous and representative sample; but since the standard deviation is a property of samples of size N, you should really have M samples, each of size N, in order to have some data to estimate from. But clearly, this sets far too high a standard for the amount of data required.

It's a convoluted argument, but I think it makes sense from a rigorously frequentist standpoint: If parameters are consistently interpreted as frequencies, then the only legitimate statistical procedure for learning about an unknown quantity t is to obtain a large number of samples dependent on t and then wait for the law of large numbers to kick in.

Strictly speaking, this means that the number of data points you need in order to estimate all the parameters in a model will grow exponentially in the number of parameters. That sounds sort of crazy, but if you do not allow yourself to have any model in the absence of data, you really have to wait for the data to overwhelm your initial ignorance before you can say that you have a model of the situation. That takes time.

2. Errors of the "Second Kind"

Errors of the first kind are false rejections: Cases in which, for instance, a population in fact has mean m, but nevertheless exhibits a sample average so far away from m that the hypothesis is rejected. Such errors have a frequentist interpretation, because the likelihoods given m are well-defined even in the absence of a prior distribution over m.

Errors of the second kind are false acceptances: Some other mean m' different from m produces a sample average so close to m that the false hypothesis of a mean of m is confirmed. This kind of error has no frequentist interpretation, because it requires the alternative hypotheses m' to have prior probabilities, and because it requires that there be a loss function associated with accepting the hypothesis of m when the true mean m' is close to m.


Jerzy Neyman in the classroom, 1973; image from Wikimedia Commons.

Fisher is not willing to assume either of those two instruments. He writes:
It was only when the relation between a test of significance and its corresponding null hypothesis was confused with an acceptance procedure that it seemed suitable to distinguish errors in which the hypothesis is rejected wrongly, from errors in which it is "accepted wrongly" as the phrase does. (p. 73)
Such language is not just scientifically irresponsible, he thinks — it also misunderstands the private states of mind present in the head of a scientist:
The fashion of speaking of a null hypothesis as "accepted when false", whenever a test of significance gives us no strong reason for rejecting it, and when in fact it is in some way imperfect, shows real ignorance of the research worker's attitude, by suggesting that in such a case he has come to an irreversible decision. (p. 73; Fisher's emphasis)
Of course, neither positive nor negative decisions are immune to revision as more data comes in (cf. p. 76), so Fisher prefers to depict the scientist's attitude as one of cautious learning in the face of data. This contrasts with the forced-choice nature of acceptance procedures:
In an acceptance procedure, on the other hand, acceptance is irreversible, whether the evidence for it was strong or weak. It is the result of applying mechanically rules laid down in advance; no thought is given to the particular case, and the tester's state of mind, or his capacity for learning, is inoperative.
By contrast, conclusions drawn by a scientific worker from a test of significance are provisional, and involve an intelligent attempt to understand the experimental situation. (pp. 73–74; Fisher's emphasis).
Note again the insistence on private states of mind as the hallmark of scientific rationality.

3. "Inductive Behaviour"

The last issue Fisher has with Neyman's brand of statistics is shelved under the heading above, but it is really about an issue of linguistics: Neyman contends (according to Fisher's summary; there is no direct reference) that statements like
There is 5% probability that the sample average deviates strongly from the mean
have a meaningful and well-defined interpretation (in terms of likelihood). On the other hand,
There is 5% probability that the mean deviates strongly from the sample average
is meaningless, because the mean is not a random variable.

Fisher disagrees, not because he is a fan of prior probability distributions on the parameters, but because he thinks that such statements could only ever refer to likelihoods. To make this point vivid, he considers (I am changing the example a bit here) a statement of the form
Pr(m < x) = 5%,
where m is a parameter and x is an observation, and he contrasts this with
Pr(m < 17) = 5%.
If one of these statements has a meaning, he says, clearly the other one must have a meaning too, unless we want to "deny the syllogistic process of making a substitution" (p. 75). But Neyman contends that the probability of a statement of the second kind should be "necessarily either 0 or 1" (p. 75), so that only the former probability (the likelihood given the mean) is well-defined.

Fisher comments:
The paradox is rather childish, for it requires that we should wilfully misinterpret the probability statement so as to pretend that the population to which it refers is not defined by our observations and their precision, but is absolutely independent of them. (p. 75)
By this he means that the reference class (the "population") is defined arbitrarily by our experimental set-up. And as he says about populations earlier in the paper, "no one of them has objective reality, all being products of the statistician's imagination" (p. 71).

An Englishman's Duty

In the conclusion, Fisher comes back to the ethical standards of statistics:
As an act of construction the hypothesis is not altogether impersonal, for the scientist's personal capacity for theorizing comes into it; moreover, the criteria by which it is approved require a certain honesty, or integrity, in their application. (p. 75)
Again, he explains that decision-theoretic methods (such as Bayesian statistics) have no business in scientific inference, since the goal is not optimal decisions, but the attainment of truth:
Finally, in inductive inference we introduce no cost functions for faulty judgments … In fact, scientific research is not geared to maximize the profits of any particular organization, but is rather an attempt to improve public knowledge undertaken as an act of faith to the effect that, as more becomes known, or more surely known, the intelligent pursuit of a great variety of aims, by a great variety of men, and groups of men, will be facilitated. We make no attempt to evaluate these consequences, and do not assume that they are capable of evaluation in any sort of currency.
… We aim, in fact, at methods of inference which should be equally convincing to all rational minds, irrespective of any intentions they may have in utilizing the knowledge inferred.
We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions. (p. 77)
We could hardly have it more explicit: The difference in statistical paradigm is one of ethics.

Wednesday, April 2, 2014

Ramsey: "Truth and Probability" (1926)

Ramsey; from Wikipedia.
F. P. Ramsey is a curious figure. Reading Keynes, Russell, and Whitehead at the age of 19, he was obviously a prodigy and seems to have attracted attention from the Cambridge community of the 1920s, including Wittgenstein, Moore, and others. He died in 1930 from complications following a stomach operation, aged 26.

This posthumous book is a collection of some of his writings, both published and unpublished. I'm reading it mostly for the unpublished essay "Truth and Probability" from 1926, and Ramsey's own 1928 comments on that paper.

Degrees of Gamblability

In that essay, Ramsey embraces an epistemic interpretation of probability although he realizes that there might not be a single person with coherent beliefs in the sense of probability theory.

In his own commentary on the paper, the choice of this philosophy is compressed to a quick dismissal of any "crude frequency theory" and the conclusion:
(2) Hence chances must be defined by degrees of belief; but they do not correspond to anyone's actual degrees of belief; the chances of 1,000 heads, and of 999 heads followed by a tail, are equal, but everyone expects the former more than the latter.
(3) Chances are degrees of belief within a certain system of beliefs and degrees of belief; not those of any actual person, but in a simplified system to which those of actual people, especially the speaker, in part approximate. (p. 206)
In part 3 of the essay (pp. 166–184), he goes on to suggest that degrees of belief might be elicited through observed gambling behavior, and he states one direction of a Dutch book theorem (the easy one). He even calls the betting system a "book" (p. 182).
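To make the betting idea concrete, here is a minimal sketch of how incoherent betting prices expose an agent to a sure loss. It is my own illustration, not Ramsey's example: the function name `exploit` and the restriction to a single proposition A and its negation are assumptions made for brevity.

```python
from fractions import Fraction

def exploit(price_a, price_not_a):
    """Return the agent's net payoff in each outcome after a bookie exploits the prices.

    price_a / price_not_a: what the agent regards as a fair price for a ticket
    paying 1 unit if A is true / if A is false (hypothetical set-up).
    """
    total = price_a + price_not_a
    if total == 1:
        return None  # coherent on this pair: these two tickets give no sure loss
    # Prices too high: the agent buys both tickets. Too low: the agent sells both.
    side = 1 if total > 1 else -1
    payoffs = {}
    for a_is_true in (True, False):
        ticket_a = (1 if a_is_true else 0) - price_a
        ticket_not_a = (0 if a_is_true else 1) - price_not_a
        payoffs[a_is_true] = side * (ticket_a + ticket_not_a)
    return payoffs

# Prices of 3/5 on both A and not-A add up to more than 1.
print(exploit(Fraction(3, 5), Fraction(3, 5)))
```

With prices of 3/5 on both sides, the agent loses 1/5 whether A turns out true or false; with prices summing to exactly 1, this particular book cannot be made.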

"Impossible to Say"

Ramsey also denies that there can be a question of rationality in initial assumptions, that is, in the prior distributions assumed in the absence of data.

He imagines reconstructing his own initial beliefs on the basis of his present beliefs and all of the data that he has ever observed (although, I should add, Bayesian updates can't be uniquely reversed like that). Assuming the update mechanism is sound, we then have:
My present degrees of belief can then be considered logically justified if the corresponding initial degrees of belief are justified. But to ask what initial degrees of belief are justified, or in Mr Keynes' system what are the absolutely a priori probabilities, seems to me a meaningless question; and even if it had a meaning I do not see how it could be answered.
If we actually applied this process to a human being, found out, that is to say, on what a priori probabilities his present opinions could be based, we should obviously find them to be ones determined by natural selection, with a general tendency to give a higher probability to the simpler alternatives. But, as I say, I cannot see what could be meant by asking whether these degrees of belief were logically justified. (pp. 192–93)

Unconditional distributions cannot be uniquely reconstructed
from the conditional distributions and the conditions.
 
In the commentary, he makes a related point about independence and relevance assumptions:
E.g., the death-rate for men of 60 is 1/10, but all the 20 red-haired 60-year-old men I've known have lived till 70. What should I expect of a new red-haired man of 60? I can but put the evidence before me, and let it act on my mind. There is a conflict of two 'usually's' which must work itself out in my mind; one is not the really reasonable, the other the really unreasonable. (p. 202)
He concludes that "to introduce the idea of 'reasonable' is really a mistake" (p. 203). So to the question "What ought we to take as relevant?" the answer is that "it is impossible to say" (p. 203).

Jeffreys: Scientific Inference, third edition (1973)

This book, first published in 1931, covers much of the same ground as Jeffreys' Theory of Probability (1939), but it's shorter and easier to read.

It touches on a number of extremely interesting topics, including
  • the asymptotic equivalence of Bayesian inference and maximum likelihood inference (§4.3, pp. 73–74)
  • a Bayesian notion of "statistical test" (§3.5, pp. 54–56)
  • priors based on theory complexity (§2.5, especially pp. 34–39)
  • the convergence of predictive distributions to the truth under (very) nice conditions (§2.5)
  • an inductive justification of the principle of induction through hierarchical models (§3.7, pp. 58–60)
  • a critique of the frequency theory of probability (§9.21, pp. 193–197)
  • a number of other philosophical issues surrounding induction and probability (§9)
I might write about some of these issues later, but now I want to focus on a specific little detail that I liked. It's a combinatorial argument for Laplace's rule, which I have otherwise only seen justified through the use of Euler integrals.


Laplace's rule: Add a "virtual count" to each bucket before the parameter estimation.

The Generative Set-Up

Suppose that you have a population of n swans, r of which are white. We'll assume that r is uniformly distributed on 0, 1, 2, …, n. You now inspect a sample of m < n swans and find s white swans among them.

It then turns out that the probability that the next swan is white is completely independent of n: Whatever the size of the population is, the probability of seeing one more white swan turns out to be (s + 1)/(m + 2) when we integrate out the effect of r.

A population of n swans contains r white ones;
in a sample of m swans, s are white.

Let me go into a little more detail. Given n, m, and r, the probability of finding s white swans in the sample follows a hypergeometric distribution; that is,
Pr( s | n, m, r )  ==  C(r, s) C(n − r, m − s) / C(n, m),
where C(a, b) is my one-dimensional notation for the binomial coefficient "a choose b." The argument for this formula is that
  • C(r, s) is the number of ways of choosing s white swans out of a total of r white swans.
  • C(n − r, m − s) is the number of ways of choosing the remaining m − s swans in the sample from the remaining n − r swans in the population.
  • C(n, m) is the total number of ways to sample m swans from a population of n.
The numerator thus counts the number of ways to select the sample so that it respects the constraint set by the number s, while the denominator counts the number of samples with or without this constraint.
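As a quick sanity check on the counting argument, the formula can be written out with exact fractions. This is only a sketch; the function name `hypergeom_pmf` and the example numbers (10 swans, 4 white, sample of 3) are my own choices.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(s, n, m, r):
    """Pr(s | n, m, r): s white swans in a sample of m, drawn from n swans of which r are white."""
    if s < 0 or s > m or s > r or m - s > n - r:
        return Fraction(0)  # impossible sample composition
    return Fraction(comb(r, s) * comb(n - r, m - s), comb(n, m))

# Hypothetical example: 10 swans, 4 of them white, sample of 3.
probs = [hypergeom_pmf(s, 10, 3, 4) for s in range(4)]
print(probs)
print(sum(probs))  # the probabilities over all possible s add up to 1
```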

Inverse Hypergeometric Probabilities

In general, binomial coefficients have the following two properties:
  • C(a, b) (a − b)  ==  C(a, b + 1) (b + 1)
  • C(a + 1, b + 1)  ==  C(a, b) (a + 1)/(b + 1)
We'll need both of these facts below. They can be shown directly by cancelling out factors in the factorial expression for the binomial coefficients.
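For readers who would rather check than cancel factorials, the two identities can be verified numerically over a small range. This is a sketch, not a proof; `identities_hold` is a name I made up, and the second identity is cross-multiplied by (b + 1) so that everything stays in integers.

```python
from math import comb

def identities_hold(limit=12):
    """Check both binomial identities for all 0 <= b <= a < limit."""
    for a in range(limit):
        for b in range(a + 1):
            # C(a, b) (a - b) == C(a, b + 1) (b + 1)
            if comb(a, b) * (a - b) != comb(a, b + 1) * (b + 1):
                return False
            # C(a + 1, b + 1) (b + 1) == C(a, b) (a + 1)
            if comb(a + 1, b + 1) * (b + 1) != comb(a, b) * (a + 1):
                return False
    return True

print(identities_hold())
```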

One consequence is that Bayes' rule takes on a particularly simple form in the hypergeometric case:
  • Pr( r | n, m, s )  ==  Pr( s | n, m, r ) (m + 1)/(n + 1)
  • Pr( s | n, m, r )  ==  Pr( r | n, m, s ) (n + 1)/(m + 1)
  • Pr( s + 1 | n, m + 1, r )  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)/(m + 2)
The reason is that, under the uniform prior Pr( r | n ) = 1/(n + 1), the marginal probability Pr( s | n, m ) equals 1/(m + 1) for every s. These equalities are, of course, saying the same thing (the third is the second with m + 1 and s + 1 in place of m and s), but I state all three forms because they will all come up.

By using the first of the two rules for binomial coefficients, we can also show that
Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( s + 1 | n, m + 1, r ) (s + 1)/(m + 1)
According to the last fact about the inverse hypergeometric probabilities, this also means that
Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( r | n, m + 1, s + 1 ) (s + 1)(n + 1)/((m + 1)(m + 2))
I will use this fact below.

Expanding the Predictive Probability

By assumption, we have inspected s out of the r white swans, so there are r − s white swans left. We have further inspected m out of the n swans, so there is a total of n − m swans left. The probability that the next swan will be white is thus (r − s)/(n − m).

If we call this event q, then we have, by the sum rule of probability,
Pr( q | n, m, s )  ==  Σ_r Pr( q, r | n, m, s )
By the chain rule of probabilities, we further have
Pr( q | n, m, s )  ==  Σ_r Pr( q | n, m, s, r ) Pr( r | n, m, s )
As argued above, we have
  • Pr( q | n, m, s, r ) = (r − s)/(n − m)
  • Pr( r | n, m, s ) = Pr( s | n, m, r ) (m + 1)/(n + 1)
  • Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( r | n, m + 1, s + 1 ) (s + 1)(n + 1)/((m + 1)(m + 2))
Putting these facts together and cancelling the factors (m + 1) and (n + 1), we get
Pr( q | n, m, s )  ==  (s + 1)/(m + 2) Σ_r Pr( r | n, m + 1, s + 1 )
I have pulled the constant factors out of the summation here. Notice further that the summation is a sum of probabilities for the possible values of r. It must consequently sum to 1. We thus have
Pr( q | n, m, s )  ==  (s + 1)/(m + 2)
as we wanted to prove.
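The whole argument can also be checked by brute force: put a uniform prior on r, weight each r by its hypergeometric likelihood, and average the chance (r − s)/(n − m) that the next swan is white. The sketch below does this with exact fractions and compares the answer with Laplace's rule of succession, (s + 1)/(m + 2). The function name `predictive_white` and the tested ranges are my own choices.

```python
from fractions import Fraction
from math import comb

def predictive_white(n, m, s):
    """Pr(next swan is white | n, m, s) under a uniform prior on r = 0, ..., n."""
    # Unnormalised posterior weights; the uniform prior and 1/C(n, m) cancel out.
    # Only values of r compatible with the sample get a non-zero weight.
    weights = {r: comb(r, s) * comb(n - r, m - s)
               for r in range(s, n - (m - s) + 1)}
    total = sum(weights.values())
    return sum(Fraction(w, total) * Fraction(r - s, n - m)
               for r, w in weights.items())

# The answer does not depend on the population size n.
for n in (5, 10, 50):
    print(n, predictive_white(n, 4, 3))
```

In every case the population size drops out and the result depends only on s and m.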

Carnap and Jeffrey: Studies in Inductive Logic and Probability, Vol. 1 (1971)

Carnap at his desk; from carnap.org.
This is an anthology edited by Rudolf Carnap and philosopher Richard C. Jeffrey (not to be confused with physicist Harold Jeffreys).

The majority of the book is dedicated to two essays on probability which Carnap intended to be a substitute for the (never realized) second volume of the Logical Foundations of Probability (1950). Carnap's idea is that rational belief should be understood as the result of probabilistic conditioning on a special kind of "nice" prior.

An Inconsistent Axiomatization of Rationality

In order to demarcate the realm of rational belief, Carnap has to specify the set of permitted starting states of the system and its update rules. He does so by means of the following four "rationality assumptions":
  1. Coherence — You must conform to the axioms of probability; or in terms of gambling, you may not assign positive utility to any gamble that guarantees a strict loss.
  2. Strict Coherence — You may not assign an a priori probability of 0 to any event; or equivalently, you may not assign positive utility to a gamble that renders a strict loss possible and a weak loss necessary.
  3. Belief revision depends only on the evidence — Your beliefs at any time must be determined completely by your prior beliefs and your evidence (nothing else). Assuming axiom 1 is met, this comes down to producing new beliefs by conditioning.
  4. Symmetry — You must assign the same probability to propositions of the same logical form, i.e., F(x) and F(y).
These axioms are inconsistent in a number of cases, and Carnap does not seem to realize it. The problems are that
  • Many infinite sets cannot be equipped with a proper, regular, and symmetric distribution. For instance, there is no "uniform distribution on the integers";
  • There may be interdependent propositional functions in the language, and a prior that renders one symmetric might render another asymmetric. Consider for instance F(x) = "the box has a side-length between x and x + 1" and G(x) = "the box has a volume between x and x + 1".
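The box example can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the box is a cube whose side-length is uniformly distributed between 0 and 2; the function names are hypothetical. Under that prior F(0) and F(1) are equally probable, but G(0) and G(1) are not.

```python
def prob_side_between(lo, hi, max_side=2.0):
    """P(lo <= side < hi) when the side-length is Uniform(0, max_side)."""
    lo, hi = max(lo, 0.0), min(hi, max_side)
    return max(hi - lo, 0.0) / max_side

def prob_volume_between(lo, hi, max_side=2.0):
    """P(lo <= volume < hi) for a cube, induced by the same prior on the side-length."""
    return prob_side_between(lo ** (1 / 3), hi ** (1 / 3), max_side)

# Symmetric in the side-length description ...
print(prob_side_between(0, 1), prob_side_between(1, 2))
# ... but asymmetric in the volume description.
print(prob_volume_between(0, 1), prob_volume_between(1, 2))
```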
Maybe Carnap had a vague idea about the first problem — at least he seems to assume that the sample space is finite throughout the first essay ("Inductive Logic and Rational Decisions," cf. pp. 7 and 14).

In the second essay, however, he explicitly says that there are countably many individuals in the language, so it would seem that he owes us a proper, coherent, and regular distribution on the integers ("A Basic System of Inductive Logic, Part I," ch. 9, p. 117).

Both Jaynes and Jeffreys made attempts at tackling the second problem by choosing priors that would decrease the tension between two descriptions. Jeffreys, for instance, showed that a probability density function of the form f(t) = 1/t (restricted to some positive interval) makes it irrelevant whether a normal distribution is described in terms of its variance or its precision parameter. Jaynes, by an essentially identical argument, "solved" Bertrand's paradox by choosing a prior that minimizes the discrepancy between a side-length description and a volume-description.
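The invariance claim for the 1/t density can be spelled out with a change of variables. This is my own short derivation, not a quotation from Jeffreys; the symbols v (variance) and τ (precision) and the omission of normalizing constants are assumptions of the sketch.

```latex
% Variance v > 0, precision \tau = 1/v, so v = 1/\tau and |dv/d\tau| = 1/\tau^2.
% Start from a density proportional to 1/v and change variables:
\[
  f(v) \propto \frac{1}{v}
  \quad\Longrightarrow\quad
  g(\tau) \;=\; f\!\left(\tfrac{1}{\tau}\right)\left|\frac{dv}{d\tau}\right|
  \;\propto\; \tau \cdot \frac{1}{\tau^{2}}
  \;=\; \frac{1}{\tau}.
\]
```

So the density has the same form 1/t in both descriptions, which is the sense in which the choice between variance and precision becomes irrelevant.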

What is a Rationality Assumption?

Carnap knows that probability theory has to be founded on something other than probability theory to make sense and explains that "the reasons for our choice of the axioms are not purely logical." (p. 26; his emphasis).

Rather, they are game-theoretic: In order to argue against the use of some a priori probability measure (or "M-function"), Carnap must show why somebody starting from this prior
…, in a certain possible knowledge situation, would be led to an unreasonable decision. Thus, in order to give my reasons for the axiom, I move from pure logic to the context of decision theory and speak about beliefs, actions, possible losses, and the like. (p. 26)
That sounds circular, but the rest of his discussion seems to indicate that he is thinking about worst-case (or minimax) decision theory, which makes sense.

"Reduced to one"

What does not make sense, however, is his unfounded faith that there are always reasons to prefer one M-function over another:
Even on the basis of all axioms that I would accept at the present time, the number of admissible M-functions, i.e., those that satisfy all accepted axioms, is still infinite; but their class is immensely smaller than that of all coherent M-functions [i.e., all probability measures]. There will presumably be further axioms, justified in the same way by considerations of rationality. We do not know today whether in this future development the number of admissible M-functions will always remain infinite or will become finite and possibly even be reduced to one. Therefore, at the present time I do not assert that there is only one rational Cr0-function [= initial credence = credence at time 0]. (p. 27)
But clearly, he hopes so.

Carnap the Moralist

Interestingly, Carnap makes a very direct connection between moral character and epistemic habits. This comes out most clearly in a passage in which he explains that rationality is a matter of belief revision rather than belief:
When we wish to judge the morality of a person, we do not simply look at some of his acts; we study rather his character, the system of his moral values, which is part of his utility function. Observations of single acts without knowledge of motives give little basis for judgment. Similarly if we wish to judge the rationality of a person's beliefs, we should not look simply at his present beliefs. Information on his beliefs without knowledge of the evidence out of which they arose tells us little. We must rather study the way in which the person forms his beliefs on the basis of evidence. In other words, we should study his credibility function, not simply his present credence function. (p. 22)
The "Reasonable Man" (to use the 18th century terminology) is thus the man who updates his beliefs in a responsible, careful, and modest fashion. Lack of reason is the stubborn rejection of norms of evidence, a refusal to surrender to the "truth cure."

As an illustration of what he has in mind, Carnap considers an urn example in which a person X observes a majority of black balls being drawn (E), and Y observes a majority of white balls (E'). He continues:
Let H be the prediction that the next ball drawn will be white. Suppose that for both X and Y the credence of H is 2/3. Then we would judge this same credence value 2/3 of the proposition H as unreasonable for X, but reasonable for Y. We would condemn a credibility function Cred as nonrational if Cred(H | E) = 2/3; while the result Cred(H | E') = 2/3 would be no ground for condemnation. (p. 22)
So although he elsewhere argues that rationality is a matter of risk minimization, he nevertheless falls right into the moralistic language of "grounds for condemnation."

Do the Robot

A similar formulation appears earlier, as he discusses the axiom that belief revision is based on evidence only. For a person satisfying this criterion, Carnap explains,
… changes in his credence function are influenced only by his observational results, but not by any other factors, e.g., feelings like his hopes or fears concerning a possible future event H, feelings that in fact often influence the beliefs of all actual human beings. (pp. 15–16)
Like Jaynes, he defends this idealization by reference to a hypothetical design problem:
Thinking about the design of a robot might help us in finding rules of rationality. Once found, these rules can be applied not only in the construction of a robot but also in advising human beings in their effort to make their decisions as rational as their limited abilities permit. (p. 17)
Another way of saying the same thing is that we should first describe the machine that we would want to do the job, and then tell people how to become more like that machine.