Tuesday, April 15, 2014

Fisher: "Statistical Methods and Scientific Induction" (1955)

Ronald Fisher; image from Wikimedia Commons.
In this brief paper, Sir Ronald Fisher militates against what he sees as wrong and absurd interpretations of the notion of a statistical test.

The Ideology of Statistics

The core of his argument is that a test only gives positive information when it yields a significant difference and thus warrants the rejection of a hypothesis — an absence of a significant difference does not mean "accept." He contends that
… this difference in point of view originated when Neyman, thinking that he was correcting and improving my own early work on tests of significance, … in fact reinterpreted them in terms of that technological and commercial apparatus which is known as an acceptance procedure. (p. 69)
And although acceptance procedures might be good enough for commerce, they have no place in science:
I am casting no contempt on acceptance procedures, and I am thankful, whenever I travel by air, that the high level of precision and reliability required can really be achieved by such means. But the logical differences between such an operation and the work of scientific discovery by physical or biological experimentation seem to me so wide that the analogy between them is not helpful, and the identification of the two sorts of operation is decidedly misleading. (pp. 69–70)
Then comes the juicy part:
I shall hope to bring out some of the logical differences more distinctly, but there is also, I fancy, in the background an ideological difference. Russians are made familiar with the ideal that research in pure science can and should be geared to technological performance, in the comprehensive organized effort of a five-year plan for the nation. How far, within such a system, personal and individual inferences from observed facts are permissible we do not know, but it may be safer, and even, in such a political atmosphere, more agreeable, to regard one's scientific work simply as a contributary element in a great machine, and to conceal rather than to advertise the selfish and perhaps heretical aim of understanding for oneself the scientific situation. In the U.S. also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions, with those aimed rather at, let us say, speeding production, or saving money. There is therefore something to be gained by at least being able to think of our scientific problems in a language distinct from that of technological efficiency. (p. 70)
So there you have it: In the technological regime of either of the two Cold War superpowers, "learning," "inference," and private, inner thought are taboo, according to Fisher. Presumably we are to contrast this with the aims of British science going back to Newton.

The Three Issues

Fisher singles out three phrases that he finds particularly offensive in scientific statistics:
  1. "Repeated sampling from the same distribution"
  2. Errors of the "second kind"
  3. "Inductive behaviour"
I'll discuss these one by one.

1. "Repeated sampling from the same distribution"

The issue with the first one is not completely clear to me, but here is what I make of his discussion (pp. 71–72): Suppose you are performing a test to see whether the mean of some population has a specific value; suppose further that the standard deviation of that population is unknown, but that you have estimated it based on the available sample.

The problem then is, if I understand Fisher correctly, that the test depends on the standard deviation being constant and known, but in reality, it is an unknown quantity that you have estimated by a maximum likelihood method. This is, strictly speaking, illegitimate, since any estimate should be based on a numerous and representative sample; and since the standard deviation here is a property of samples of size N, you would really need M independent samples, each of size N, in order to have some data to estimate from. But clearly, this sets a far too high standard for the amount of data required.

It's a convoluted argument, but I think it makes sense from a rigorously frequentist standpoint: If parameters are consistently interpreted as frequencies, then the only legitimate statistical procedure for learning about an unknown quantity t is to obtain a large number of samples dependent on t and then wait for the law of large numbers to kick in.

Strictly speaking, this means that the amount of data points you need in order to estimate all the parameters in a model will grow exponentially in the number of parameters. That sounds sort of crazy, but if you do not allow yourself to have any model in the absence of data, you really have to wait for the data to overwhelm your initial ignorance before you can say that you have a model of the situation. That takes time.

2. Errors of the "Second Kind"

Errors of the first kind are false positives: Cases in which, for instance, a population in fact has mean m, but nevertheless exhibits a sample average so far away from m that the hypothesis is rejected. Such errors have a frequentist interpretation, because the likelihoods given m are well-defined even in the absence of a prior distribution over m.

Errors of the second kind are false negatives: Some other mean m' different from m produces a sample average so close to m that the false hypothesis of a mean of m is confirmed. This kind of error has no frequentist interpretation, because it requires the alternative hypotheses m' to have prior probabilities, and because it requires that there be a loss function associated with accepting the hypothesis of m when the true mean m' is close to m.
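To make the two kinds of error concrete, here is a small simulation (my illustration with made-up numbers, not from Fisher's paper): a two-sided test of the hypothesis that a normal population with known standard deviation 1 has mean 0, using a sample of 25 and a 5% cutoff.

```python
import random
import statistics

random.seed(0)

def rejects(true_mean, n=25, cutoff=1.96, trials=10_000):
    """Fraction of samples whose standardized average falls beyond the cutoff."""
    count = 0
    for _ in range(trials):
        xbar = statistics.fmean(random.gauss(true_mean, 1) for _ in range(n))
        if abs(xbar) * n ** 0.5 > cutoff:
            count += 1
    return count / trials

type_1 = rejects(true_mean=0.0)      # a true hypothesis wrongly rejected
type_2 = 1 - rejects(true_mean=0.3)  # a false hypothesis not rejected
print(f"Type I rate is roughly {type_1:.3f}")   # close to 0.05 by construction
print(f"Type II rate is roughly {type_2:.3f}")  # depends on the alternative 0.3
```

Note that the Type I rate is fixed by the cutoff alone, while the Type II rate only becomes a definite number once a particular alternative mean (here 0.3) is singled out — which is exactly the extra structure Fisher objects to.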


Jerzy Neyman in the classroom, 1973; image from Wikimedia Commons.

Fisher is not willing to assume either of those two instruments. He writes:
It was only when the relation between a test of significance and its corresponding null hypothesis was confused with an acceptance procedure that it seemed suitable to distinguish errors in which the hypothesis is rejected wrongly, from errors in which it is "accepted wrongly" as the phrase does. (p. 73)
Such language is not just scientifically irresponsible, he thinks — it also misunderstands the private states of mind present in the head of a scientist:
The fashion of speaking of a null hypothesis as "accepted when false", whenever a test of significance gives us no strong reason for rejecting it, and when in fact it is in some way imperfect, shows real ignorance of the research worker's attitude, by suggesting that in such a case he has come to an irreversible decision. (p. 73; Fisher's emphasis)
Of course, neither positive nor negative decisions are immune to revision as more data comes in (cf. p. 76), so Fisher prefers to depict the scientist's attitude as one of cautious learning in the face of data. This contrasts with the forced-choice nature of acceptance procedures:
In an acceptance procedure, on the other hand, acceptance is irreversible, whether the evidence for it was strong or weak. It is the result of applying mechanically rules laid down in advance; no thought is given to the particular case, and the tester's state of mind, or his capacity for learning, is inoperative.
By contrast, conclusions drawn by a scientific worker from a test of significance are provisional, and involve an intelligent attempt to understand the experimental situation. (pp. 73–74; Fisher's emphasis).
Note again the insistence on private states of mind as the hallmark of scientific rationality.

3. "Inductive Behaviour"

The last issue Fisher has with Neyman's brand of statistics is shelved under the heading above, but it is really a linguistic issue: Neyman contends (according to Fisher's summary — there is no direct reference) that statements like
There is 5% probability that the sample average deviates strongly from the mean
have a meaningful and well-defined interpretation (in terms of likelihood). On the other hand,
There is 5% probability that the mean deviates strongly from the sample average
is meaningless, because the mean is not a random variable.

Fisher disagrees, not because he is a fan of prior probability distributions on the parameters, but because he thinks that such statements could only ever refer to likelihoods. To make this point vivid, he considers (I am changing the example a bit here) a statement of the form
Pr(m < x) = 5%,
where m is a parameter and x is an observation, and he contrasts this with
Pr(m < 17) = 5%.
If one of these statements has a meaning, he says, clearly the other one must have a meaning too, unless we want to "deny the syllogistic process of making a substitution" (p. 75). But Neyman contends that the probability of a statement of the second kind should be "necessarily either 0 or 1" (p. 75), so that only the former probability (the likelihood given the mean) is well-defined.

Fisher comments:
The paradox is rather childish, for it requires that we should wilfully misinterpret the probability statement so as to pretend that the population to which it refers is not defined by our observations and their precision, but is absolutely independent of them. (p. 75)
By this he means that the reference class (the "population") is defined arbitrarily by our experimental set-up. And as he says about populations earlier in the paper, "no one of them has objective reality, all being products of the statistician's imagination" (p. 71).

An Englishman's Duty

In the conclusion, Fisher comes back to the ethical standards of statistics:
As an act of construction the hypothesis is not altogether impersonal, for the scientist's personal capacity for theorizing comes into it; moreover, the criteria by which it is approved require a certain honesty, or integrity, in their application. (p. 75)
Again, he explains that decision-theoretic methods (such as Bayesian statistics) have no business in scientific inference, since the goal is not optimal decisions, but the attainment of truth:
Finally, in inductive inference we introduce no cost functions for faulty judgments … In fact, scientific research is not geared to maximize the profits of any particular organization, but is rather an attempt to improve public knowledge undertaken as an act of faith to the effect that, as more becomes known, or more surely known, the intelligent pursuit of a great variety of aims, by a great variety of men, and groups of men, will be facilitated. We make no attempt to evaluate these consequences, and do not assume that they are capable of evaluation in any sort of currency.
… We aim, in fact, at methods of inference which should be equally convincing to all rational minds, irrespective of any intentions they may have in utilizing the knowledge inferred.
We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions. (p. 77)
We could hardly have it more explicit: The difference in statistical paradigm is one of ethics.

Wednesday, April 2, 2014

Ramsey: "Truth and Probability" (1926)

Ramsey; from Wikipedia.
F. P. Ramsey is a curious figure. Reading Keynes, Russell, and Whitehead at the age of 19, he was obviously a prodigy, and he seems to have attracted the attention of the Cambridge community of the 1920s, including Wittgenstein, Moore, and others. He died in 1930 from complications following a stomach operation, aged 26.

This posthumous book is a collection of some of his writings, both published and unpublished. I'm reading it mostly for the unpublished essay "Truth and Probability" from 1926, and Ramsey's own 1928 comments on that paper.

Degrees of Gamblability

In that essay, Ramsey embraces an epistemic interpretation of probability although he realizes that there might not be a single person with coherent beliefs in the sense of probability theory.

In his own commentary on the paper, the choice of this philosophy is compressed to a quick dismissal of any "crude frequency theory" and the conclusion:
(2) Hence chances must be defined by degrees of belief; but they do not correspond to anyone's actual degrees of belief; the chances of 1,000 heads, and of 999 heads followed by a tail, are equal, but everyone expects the former more than the latter.
(3) Chances are degrees of belief within a certain system of beliefs and degrees of belief; not those of any actual person, but in a simplified system to which those of actual people, especially the speaker, in part approximate. (p. 206)
In part 3 of the essay (pp. 166–184), he goes on to suggest that degrees of belief might be elicited through observed gambling behavior, and he states one direction of a Dutch book theorem (the easy one). He even calls the betting system a "book" (p. 182).
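The easy direction says that betting quotients violating the probability axioms expose the agent to a guaranteed loss. Here is a minimal sketch of such a book (my own illustration, not Ramsey's example):

```python
# An agent's betting quotient p for a proposition means she will buy,
# at price p, a ticket paying 1 if the proposition turns out true.
beliefs = {"H": 0.6, "not-H": 0.6}  # incoherent: the quotients sum to 1.2

# The bookie sells the agent one ticket on each proposition.
price_paid = sum(beliefs.values())
nets = []
for h_true in (True, False):
    payout = (1 if h_true else 0) + (0 if h_true else 1)  # exactly one ticket pays
    nets.append(payout - price_paid)
print(nets)  # the agent loses 0.2 no matter which way H comes out
```

Since exactly one of the two tickets pays off in every possible world, the agent's net is 1 − 1.2 = −0.2 whatever happens: a "book" has been made against her.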

"Impossible to Say"

Ramsey also denies that there can be a question of rationality in initial assumptions, that is, in the prior distributions assumed in the absence of data.

He imagines reconstructing his own initial beliefs on the basis of his present beliefs and all of the data that he has ever observed (although, I should add, Bayesian updates can't be uniquely reversed like that). Assuming the update mechanism is sound, we then have:
My present degrees of belief can then be considered logically justified if the corresponding initial degrees of belief are justified. But to ask what initial degrees of belief are justified, or in Mr Keynes' system what are the absolutely a priori probabilities, seems to me a meaningless question; and even if it had a meaning I do not see how it could be answered.
If we actually applied this process to a human being, found out, that is to say, on what a priori probabilities his present opinions could be based, we should obviously find them to be ones determined by natural selection, with a general tendency to give a higher probability to the simpler alternatives. But, as I say, I cannot see what could be meant by asking whether these degrees of belief were logically justified. (p. 192–93)

Unconditional distributions cannot be uniquely reconstructed
from the conditional distributions and the conditions.
 
In the commentary, he makes a related point about independence and relevance assumptions:
E.g., the death-rate for men of 60 is 1/10, but all the 20 red-haired 60-year-old men I've known have lived till 70. What should I expect of a new red-haired man of 60? I can but put the evidence before me, and let it act on my mind. There is a conflict of two 'usually's' which must work itself out in my mind; one is not the really reasonable, the other the really unreasonable. (p. 202)
He concludes that "to introduce the idea of 'reasonable' is really a mistake" (p. 203). So to the question "What ought we to take as relevant?" the answer is that "it is impossible to say" (p. 203).

Jeffreys: Scientific Inference, third edition (1973)

This book, first published 1931, covers much of the same ground as Jeffreys' Theory of Probability (1939), but it's shorter and easier to read.

It touches on a number of extremely interesting topics, including
  • the asymptotic equivalence of Bayesian inference and maximum likelihood inference (§4.3, pp. 73–74)
  • a Bayesian notion of "statistical test" (§3.5, pp. 54–56)
  • priors based on theory complexity (§2.5, especially pp. 34–39)
  • the convergence of predictive distributions to the truth under (very) nice conditions (§2.5)
  • an inductive justification of the principle of induction through hierarchical models (§3.7, pp. 58–60)
  • a critique of the frequency theory of probability (§9.21, pp. 193–197)
  • a number of other philosophical issues surrounding induction and probability (§9)
I might write about some of these issues later, but now I want to focus on a specific little detail that I liked. It's a combinatorial argument for Laplace's rule, which I have otherwise only seen justified through the use of Euler integrals.


Laplace's rule: Add a "virtual count" to each bucket before the parameter estimation.

The Generative Set-Up

Suppose that you have a population of n swans, r of which are white. We'll assume that r is uniformly distributed on 0, 1, 2, …, n. You now inspect a sample of m < n swans and find s white swans among them.

It then turns out that the probability that the next swan is white is completely independent of n: Whatever the size of the population is, the probability of seeing one more white swan turns out to be (s + 1)/(m + 2) when we integrate out the effect of r.

A population of n swans contains r white ones;
in a sample of m swans, s are white.

Let me go into a little more detail. Given n, m, and r, the probability of finding s white swans in the sample follows a hypergeometric distribution; that is,
Pr( s | n, m, r )  ==  C(r, s) C(n − r, m − s) / C(n, m),
where C(a, b) is my one-dimensional notation for the binomial coefficient "a choose b." The argument for this formula is that
  • C(r, s) is the number of ways of choosing s white swans out of a total of r white swans.
  • C(n − r, m − s) is the number of ways of choosing the remaining m − s swans in the sample from the remaining n − r swans in the population.
  • C(n, m) is the total number of ways to sample m swans from a population of n.
The numerator thus counts the number of ways to select the sample so that it respects the constraint set by the number s, while the denominator counts the number of samples with or without this constraint.
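As a quick sanity check of this formula (my own, not from Jeffreys' book), we can compare it against brute-force enumeration of all possible samples:

```python
from itertools import combinations
from math import comb

def hypergeom(s, n, m, r):
    """Pr(s | n, m, r): s white swans in a sample of m, population n with r white."""
    return comb(r, s) * comb(n - r, m - s) / comb(n, m)

# Enumerate every sample of size m = 3 from a population of n = 7 swans,
# r = 4 of them white, and compare the empirical counts with the formula.
n, r, m = 7, 4, 3
population = ["white"] * r + ["dark"] * (n - r)
for s in range(m + 1):
    count = sum(1 for sample in combinations(population, m)
                if sample.count("white") == s)
    assert abs(hypergeom(s, n, m, r) - count / comb(n, m)) < 1e-12
```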

Inverse Hypergeometric Probabilities

In general, binomial coefficients have the following two properties:
  • C(a, b) (a − b)  ==  C(a, b + 1) (b + 1)
  • C(a + 1, b + 1)  ==  C(a, b) (a + 1)/(b + 1)
We'll need both of these facts below. They can be shown directly by cancelling out factors in the factorial expression for the binomial coefficients.
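Both identities are also easy to confirm numerically; here is a small check (my own) using exact integer arithmetic:

```python
from math import comb

# Check the two binomial-coefficient identities for all 0 <= b <= a <= 12.
for a in range(13):
    for b in range(a + 1):
        assert comb(a, b) * (a - b) == comb(a, b + 1) * (b + 1)
        assert comb(a + 1, b + 1) * (b + 1) == comb(a, b) * (a + 1)
```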

One consequence is that, under the uniform prior on r, Bayes' rule takes on a particularly simple form in the hypergeometric case:
  • Pr( r | n, m, s )  ==  Pr( s | n, m, r ) (m + 1)/(n + 1)
  • Pr( s | n, m, r )  ==  Pr( r | n, m, s ) (n + 1)/(m + 1)
  • Pr( s + 1 | n, m + 1, r )  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)/(m + 2)
The first two equalities are, of course, saying the same thing, and the third is the same statement applied to a sample of size m + 1. I state all three forms because they will all come up.

By using the first of the two rules for binomial coefficients, we can also show that
Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( s + 1 | n, m + 1, r ) (s + 1)/(m + 1)
According to the last fact about the inverse hypergeometric probabilities, this also means that
Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)(s + 1) / ((m + 1)(m + 2))
I have substituted the third of the inverse-probability identities to arrive at this expression. I will use this fact below.
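This kind of factor-chasing is easy to get wrong, so here is an exact check (my own sketch, in exact rational arithmetic; the combined factor (n + 1)(s + 1)/((m + 1)(m + 2)) is my bookkeeping):

```python
from fractions import Fraction
from math import comb

def like(s, n, m, r):
    """Pr(s | n, m, r), the hypergeometric likelihood, as an exact fraction."""
    return Fraction(comb(r, s) * comb(n - r, m - s), comb(n, m))

def post(r, n, m, s):
    """Pr(r | n, m, s) under a uniform prior on r."""
    return like(s, n, m, r) * Fraction(m + 1, n + 1)

n, m = 9, 4
for r in range(n + 1):
    for s in range(min(r, m) + 1):
        lhs = like(s, n, m, r) * Fraction(r - s, n - m)
        rhs = post(r, n, m + 1, s + 1) * Fraction((n + 1) * (s + 1),
                                                  (m + 1) * (m + 2))
        assert lhs == rhs
```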

Expanding the Predictive Probability

By assumption, we have inspected s out of the r white swans, so there are r − s white swans left. We have further inspected m out of the n swans, so there is a total of n − m swans left. The probability that the next swan will be white is thus (r − s)/(n − m).

If we call this event q, then we have, by the sum rule of probability,
Pr( q | n, m, s )  ==   Σr Pr( q, r | n, m, s )
By the chain rule of probabilities, we further have
Pr( q | n, m, s )  ==   Σr Pr( q | n, m, s, r ) Pr( r | n, m, s )
As argued above, we have
  • Pr( q | n, m, s, r ) = (r − s)/(n − m)
  • Pr( r | n, m, s ) = Pr( s | n, m, r ) (m + 1)/(n + 1)
  • Pr( s | n, m, r ) (r − s)/(n − m)  ==  Pr( r | n, m + 1, s + 1 ) (n + 1)(s + 1) / ((m + 1)(m + 2))
Putting these facts together and cancelling, we get
Pr( q | n, m, s )  ==   (s + 1)/(m + 2) Σr Pr( r | n, m + 1, s + 1 )
I have pulled the constant factors out of the summation here; the factors (m + 1)/(n + 1) from the posterior and (n + 1)(s + 1)/((m + 1)(m + 2)) from the identity above multiply out to (s + 1)/(m + 2). Notice further that the summation is a sum of probabilities for the possible values of r. It must consequently sum to 1. We thus have
Pr( q | n, m, s )  ==   (s + 1) / (m + 2)
as we wanted to prove.
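As a check on the bookkeeping, here is an exact computation (my own) of the predictive probability for several population sizes; with a uniform prior on r it comes out to the rule-of-succession value (s + 1)/(m + 2), independently of n:

```python
from fractions import Fraction
from math import comb

def pr_next_white(n, m, s):
    """Pr(next swan is white | s white in a sample of m), uniform prior on r."""
    # Unnormalized posterior over r; the uniform prior 1/(n + 1) cancels.
    weights = {r: Fraction(comb(r, s) * comb(n - r, m - s), comb(n, m))
               for r in range(n + 1)}
    total = sum(weights.values())
    return sum(w * Fraction(r - s, n - m) for r, w in weights.items()) / total

for n in (10, 25, 100):
    assert pr_next_white(n, m=6, s=4) == Fraction(5, 8)  # (s + 1)/(m + 2)
```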

Carnap and Jeffrey: Studies in Inductive Logic and Probability, Vol. 1 (1971)

Carnap at his desk; from carnap.org.
This is an anthology edited by Rudolf Carnap and philosopher Richard C. Jeffrey (not to be confused with physicist Harold Jeffreys).

The majority of the book is dedicated to two essays on probability which Carnap intended to be a substitute for the (never realized) second volume of the Logical Foundations of Probability (1950). Carnap's idea is that rational belief should be understood as the result of probabilistic conditioning on a special kind of "nice" prior.

An Inconsistent Axiomatization of Rationality

In order to demarcate the realm of rational belief, Carnap has to specify the set of permitted starting states of the system and its update rules. He does so by means of the following four "rationality assumptions":
  1. Coherence — You must conform to the axioms of probability; or in terms of gambling, you may not assign positive utility to any gamble that guarantees a strict loss.
  2. Strict Coherence — You may not assign an a priori probability of 0 to any event; or equivalently, you may not assign positive utility to a gamble that renders a strict loss possible and a weak loss necessary.
  3. Belief revision depends only on the evidence — Your beliefs at any time must be determined completely by your prior beliefs and your evidence (nothing else). Assuming axiom 1 is met, this comes down to producing new beliefs by conditioning.
  4. Symmetry — You must assign the same probability to propositions of the same logical form, i.e., F(x) and F(y).
These axioms are inconsistent in a number of cases, and Carnap does not seem to realize it. The problems are that
  • Many infinite sets cannot be equipped with a proper, regular, and symmetric distribution. For instance, there is no "uniform distribution on the integers";
  • There may be interdependent propositional functions in the language, and a prior that renders one symmetric might render another asymmetric. Consider for instance F(x) = "the box has a side-length between x and x + 1" and G(x) = "the box has a volume between x and x + 1".
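The second problem is easy to see numerically. In this sketch (my own construction, with made-up numbers), a prior that is uniform over side-length bins of a box induces decidedly unequal probabilities over the corresponding volume bins:

```python
import random

# Side length uniform on [0, 2]: the propositions F(0) and F(1) -- "side
# length in [x, x + 1)" -- get probability 1/2 each.  The induced
# probabilities of G(x) -- "volume in [x, x + 1)" -- are then far from equal.
random.seed(1)
volumes = [random.uniform(0, 2) ** 3 for _ in range(100_000)]
for x in range(8):  # the volume ranges over [0, 8)
    p = sum(x <= v < x + 1 for v in volumes) / len(volumes)
    print(f"Pr(G({x})) is roughly {p:.3f}")
```

The exact values are (∛(x + 1) − ∛x)/2, so Pr(G(0)) = 1/2 while Pr(G(7)) is only about 4%: symmetry in one description is asymmetry in the other.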
Maybe Carnap had a vague idea about the first problem — at least he seems to assume that the sample space is finite throughout the first essay ("Inductive Logic and Rational Decisions," cf. pp. 7 and 14).

In the second essay, however, he explicitly says that there are countably many individuals in the language, so it would seem that he owes us a proper, coherent, and regular distribution on the integers ("A Basic System of Inductive Logic, Part I," ch. 9, p. 117).

Both Jaynes and Jeffreys made attempts at tackling the second problem by choosing priors that would decrease the tension between two descriptions. Jeffreys, for instance, showed that a probability density function of the form f(t) = 1/t (restricted to some positive interval) makes it irrelevant whether a normal distribution is described in terms of its variance or its precision parameter. Jaynes, by an essentially identical argument, "solved" Bertrand's paradox by choosing a prior that minimizes the discrepancy between a side-length description and a volume-description.
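Jeffreys' invariance claim can be checked directly. In this sketch (my construction; the support [a, b] and the interval [c, d] are made-up numbers), the normalized 1/t density assigns the same probability to "variance in [c, d]" as the correspondingly restricted 1/t density assigns to "precision in [1/d, 1/c]":

```python
from math import log

def prob(lo, hi, lower, upper):
    """Mass of [lo, hi] under the normalized 1/t density on [lower, upper]."""
    return (log(hi) - log(lo)) / (log(upper) - log(lower))

a, b = 0.5, 4.0   # support for the variance t
c, d = 1.0, 2.0   # the interval whose probability we compute
p_variance = prob(c, d, a, b)                    # variance t in [c, d]
p_precision = prob(1 / d, 1 / c, 1 / b, 1 / a)   # precision 1/t in [1/d, 1/c]
assert abs(p_variance - p_precision) < 1e-12     # both equal log(2)/log(8)
```

The invariance comes from the logarithm: substituting u = 1/t turns the 1/t density into a 1/u density, so it makes no difference which parameterization we describe the distribution in.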

What is a Rationality Assumption?

Carnap knows that probability theory has to be founded on something other than probability theory to make sense and explains that "the reasons for our choice of the axioms are not purely logical." (p. 26; his emphasis).

Rather, they are game-theoretic: In order to argue against the use of some a priori probability measure (or "M-function"), Carnap must show why somebody starting from this prior
…, in a certain possible knowledge situation, would be led to an unreasonable decision. Thus, in order to give my reasons for the axiom, I move from pure logic to the context of decision theory and speak about beliefs, actions, possible losses, and the like. (p. 26)
That sounds circular, but the rest of his discussion seems to indicate that he is thinking about worst-case (or minimax) decision theory, which makes sense.

"Reduced to one"

What does not make sense, however, is his unfounded faith that there are always reasons to prefer one M-function over another:
Even on the basis of all axioms that I would accept at the present time, the number of admissible M-functions, i.e., those that satisfy all accepted axioms, is still infinite; but their class is immensely smaller than that of all coherent M-functions [i.e., all probability measures]. There will presumably be further axioms, justified in the same way by considerations of rationality. We do not know today whether in this future development the number of admissible M-functions will always remain infinite or will become finite and possibly even be reduced to one. Therefore, at the present time I do not assert that there is only one rational Cr0-function [= initial credence = credence at time 0]. (p. 27)
But clearly, he hopes so.

Carnap the Moralist

Interestingly, Carnap makes a very direct connection between moral character and epistemic habits. This comes out most clearly in a passage in which he explains that rationality is a matter of belief revision rather than belief:
When we wish to judge the morality of a person, we do not simply look at some of his acts; we study rather his character, the system of his moral values, which is part of his utility function. Observations of single acts without knowledge of motives give little basis for judgment. Similarly if we wish to judge the rationality of a person's beliefs, we should not look simply at his present beliefs. Information on his beliefs without knowledge of the evidence out of which they arose tells us little. We must rather study the way in which the person forms his beliefs on the basis of evidence. In other words, we should study his credibility function, not simply his present credence function. (p. 22)
The "Reasonable Man" (to use the 18th century terminology) is thus the man who updates his beliefs in a responsible, careful, and modest fashion. Lack of reason is the stubborn rejection of norms of evidence, a refusal to surrender to the "truth cure."

As an illustration of what he has in mind, Carnap considers an urn example in which a person X observes a majority of black balls being drawn (E), and Y observes a majority of white balls (E'). He continues:
Let H be the prediction that the next ball drawn will be white. Suppose that for both X and Y the credence of H is 2/3. Then we would judge this same credence value 2/3 of the proposition H as unreasonable for X, but reasonable for Y. We would condemn a credibility function Cred as nonrational if Cred(H | E) = 2/3; while the result Cred(H | E') = 2/3 would be no ground for condemnation. (p. 22)
So although he elsewhere argues that rationality is a matter of risk minimization, he nevertheless falls right into the moralistic language of "grounds for condemnation."

Do the Robot

A similar formulation appears earlier, as he discusses the axiom that belief revision is based on evidence only. For a person satisfying this criterion, Carnap explains,
… changes in his credence function are influenced only by his observational results, but not by any other factors, e.g., feelings like his hopes or fears concerning a possible future event H, feelings that in fact often influence the beliefs of all actual human beings. (pp. 15–16)
 Like Jaynes, he defends this idealization by reference to a hypothetical design problem:
Thinking about the design of a robot might help us in finding rules of rationality. Once found, these rules can be applied not only in the construction of a robot but also in advising human beings in their effort to make their decisions as rational as their limited abilities permit. (p. 17)
Another way of saying the same thing is that we should first describe the machine that we would want to do the job, and then tell people how to become more like that machine.

Thursday, March 27, 2014

Walters: An Introduction to Ergodic Theory (1982), p. 26

Book cover; from Booktopia.
Section 1.4 of Peter Walters' textbook on ergodic theory contains a proof of Poincaré's recurrence theorem. I found it a little difficult to read, so I'll try to paraphrase the proof here using a vocabulary that might be a bit more intuitive.

The Possible is the Necessarily Possible

The theorem states the following: If
  1. X is a probability space
  2. T: X → X is a measure-preserving transformation of X,
  3. E is an event with positive probability,
  4. and x a point in E,
then the series
x, Tx, T²x, T³x, T⁴x, …
will almost certainly pass through E infinitely often. Or: if it happens once, it will happen again.

The idea behind the proof is to describe the set R of points that visit E infinitely often as the superior limit of a series of sets. This description can then be used to show that ER has the same measure as E. This will imply that almost all points in E revisit E infinitely often.

Statement and proof of the theorem; scan from page 26 in Walters' book.

I'll try to spell this proof out in more detail below. My proof is much, much longer than Walters', but hopefully this means that it's much, much more readable.

Late Visitors

Let Ri be the set of points in X that visit E after i or more applications of T. We can then make two observations about the series R0, R1, R2, R3, …:

First, if j > i and you visit E at time j or later, then you also visit E at time i or later. The Ri's are consequently nested inside each other:
R0 ⊃ R1 ⊃ R2 ⊃ R3 ⊃ …
Let's use the name R for the limit of this series (that is, the intersection of the sets). R then consists of all the points in X that visit E infinitely often.

The series of sets is downward converging.

Second, Ri contains the points that visit E at time i or later, and the transformation T–1 takes us one step back in time. The set T–1Ri consequently contains the points in X that visit E at time i + 1 or later. Thus
T–1Ri = Ri + 1.
But since we have assumed that T is measure-preserving, this implies that
m(Ri) = m(T–1Ri) = m(Ri + 1).
By induction, every set in the series thus has the same measure:
m(R0) = m(R1) = m(R2) = m(R3) = …
Or to put it differently, the discarded parts R0\R1, R1\R2, R2\R3, etc., are all null sets.

Intersection by Intersection

So we have that
  1. in set-theoretic terms, the Ri's converge to a limit R from above;
  2. but all the Ri's have the same measure.
Let's use these facts to show that m(ER) = m(E), that is, that we only throw away a null set by intersecting E with R.

The event E and the set R of points that visit E infinitely often.

To prove this, notice first that every point in E visits E after zero applications of T. Thus E ⊂ R0, or in other words, ER0 = E. Consequently,
m(ER0) = m(E).
We now need to extend this base case by a limit argument to show that
m(ER) = m(E).
But, as we have seen above, the difference between R0 and R1 is a null set. Hence, the difference between ER0 and ER1 is also a null set, so
m(ER0) = m(ER1).
This argument holds for any i and i + 1. By induction, we thus get
m(ER0) = m(ER1) = m(ER2) = m(ER3) = …

A visit to E before but never after time i has probability 0.

Since measures respect limits, this implies that
m(ER0) = m(ER).
And since we have already seen that m(ER0) = m(E), it follows that
m(ER) = m(E).

The Wherefore

An informal explanation of what's going on in this proof might be the following:

We are interested in the conditional probability of visiting E infinitely often given that we have visited it once, that is, Pr(R | E). In order to compute this probability, we divide the sample space up into an infinite number of cases and discover that, given E, all but one of them have probability 0.

If you imagine yourself walking along a sample path, your route will fall in one of the following categories:
  • you never visit E;
  • you visit E for the last time at time i = 0;
  • you visit E for the last time at time i = 1;
  • you visit E for the last time at time i = 2;
  • you visit E for the last time at time i = 3;
  • there is no last time — i.e., you visit E infinitely often.
When we condition on E, the first of these cases has probability 0.

In general, the fact that T is measure-preserving guarantees that the probability of visiting E exactly i + 1 times, that is, of a last visit at time i, is 0; consequently, all of the intermediate cases also have probability 0.

We thus have to conclude that the last option — infinitely many visits to E — has the same probability as visiting E at least once, and thus a conditional probability of 1.
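This informal picture can be checked by simulation. The sketch below is my own construction (a stationary two-state Markov chain with invented parameters stands in for the measure-preserving system, and E is the event of being in state 1); among many finite sample paths, essentially none has its last visit to E in the first half of the path:

```python
import random
random.seed(1)

N, trials = 200, 10_000
early_last_visit = 0   # paths whose last visit to E falls in the first half

for _ in range(trials):
    # Stationary two-state chain: uniform start, symmetric flip probability.
    x = random.randint(0, 1)
    last = -1
    for t in range(N):
        if x == 1:
            last = t           # record the most recent visit to E = {1}
        if random.random() < 0.3:
            x = 1 - x
    if 0 <= last < N // 2:
        early_last_visit += 1

print(f"fraction of paths with an early last visit to E: "
      f"{early_last_visit / trials:.4f}")
```

A path whose last visit is early would have to avoid E for at least 100 consecutive steps, an event of probability on the order of 0.7^100, which is why the observed fraction is 0.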

Wednesday, March 26, 2014

von Mises: Probability, Statistics and Truth (1951), ch. 1

Richard von Mises; from Wikipedia.
Richard von Mises was an important proponent of the frequentist philosophy of probability.

In his book Probability, Statistics and Truth, he militates against the use of the word "probability" for anything other than indefinitely repeatable experiments with converging relative frequencies (pp. 10–12). He also compares probabilities to physical constants like the velocity of a molecule (p. 21) and asserts that the law of large numbers is an empirical generalization comparable to physical laws like the conservation of energy (pp. 16, 22, and 26).

Reference Class Relativism

A consequence of this frequentist notion of probability is that specific events do not have probabilities. Only infinite classes of comparable events can have probabilities.

For instance, when your coin comes up heads at 10 o'clock, that's a different event from the coin coming up heads at 11 o'clock in infinitely many ways. Only by choosing which properties of the situation to attend to can you identify the two events as equivalent.

As a kind of argument for this reference class relativism, von Mises asserts that a specific person has a different probability of dying depending on the reference class (e.g., people over 40, men over 40, male smokers over 40, etc.). We thus have to explicitly select the reference class before we can talk about "the" probability.

He comments:
One might suggest that a correct value of the probability of death for Mr. X may be obtained by restricting the collective to which he belongs as far as possible, by taking into consideration more and more of his individual characteristics. There is, however, no end to this process, and if we go further and further into the selection of the members of the collective, we shall be left finally with this individual alone. (p. 18)
Even from a frequentist perspective, I'm not sure this makes sense. The fact that we have narrowed down our reference class so much that there is only a single real person left in it should not change the fact that we still have an intensional definition of the class. Insofar as we do, we should be able to apply that definition to any sequence of candidates, like an infinite sequence of people or experiments. In reality, it is only data sparsity that keeps us from going "further and further."

So I think von Mises has a theoretical choice to make: either he must require that reference classes be actually infinite, or he must merely require that they be potentially infinite.

"Randomness" and Insensitivity to Subsequence Selection

Von Mises spends a large part of the lecture elaborating a notion of "randomness" which is intended to capture the difference between asymptotically i.i.d. sequences and non-i.i.d. sequences with the same limiting frequencies. He does so by adding the requirement that the limiting frequencies be independent of subsequence selection.

A possibly more intuitive way of stating that definition would be in terms of a Topsøe-style game: A structure-finding player is tasked with picking infinitely many places in a sequence based on past data and is rewarded when the empirical frequencies at those places fail to converge to a given distribution; a structure-hiding player is tasked with selecting the sequence and is rewarded when the frequencies do converge to the given distribution.

If the structure-hider then introduces any systematic dependence between the experiments, the structure-finder can exploit these regularities to outgamble the structure-hider. Thus, only asymptotically i.i.d. sequences are part of an equilibrium.
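That exploit is easy to simulate. In the sketch below (my own construction; the chain and its parameters are invented), a sticky two-state Markov chain has an overall limiting frequency of 1/2, but a place-selection rule that uses only past outcomes, namely betting on the outcome right after every observed 1, converges to a very different frequency:

```python
import random
random.seed(0)

def markov_sequence(n, p_stay=0.9):
    """A two-state chain with uniform stationary distribution but
    strong dependence between consecutive outcomes."""
    x = random.random() < 0.5
    seq = []
    for _ in range(n):
        seq.append(int(x))
        if random.random() > p_stay:
            x = not x
    return seq

seq = markov_sequence(100_000)

# Overall relative frequency of 1s: close to 1/2.
overall = sum(seq) / len(seq)

# Place-selection rule based only on past data: select the outcome
# immediately following every observed 1.
selected = [seq[i + 1] for i in range(len(seq) - 1) if seq[i] == 1]
conditional = sum(selected) / len(selected)

print(f"overall frequency:   {overall:.3f}")
print(f"after-a-1 frequency: {conditional:.3f}")
```

The selected subsequence converges to the stickiness parameter (about 0.9), not to 1/2, so this sequence fails von Mises's randomness requirement even though its overall limiting frequency is the "right" one.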

I haven't checked the details, but this game seems to be the same as that suggested by Shafer and Vovk, although (if I remember correctly), they only consider fair (that is, maximum-entropy) i.i.d. coins, not arbitrary biases. But at any rate, coin flipping is, like distributions on a finite set, one of the cases in which there is a maximum entropy distribution even in the absence of an externally given mean.

Tuesday, March 25, 2014

Frankfurt: "Inadvertence and Responsibility" (2008)

In his Amherst lecture, Harry Frankfurt defends the unspectacular assertion that we are only morally responsible for things we do on purpose. He does so by distinguishing causal and moral credit:
We are responsible for [things we do inadvertently] as their cause, even though we do not intend them. They accrue to our credit or to our blame, though not to our moral credit or moral blame. (p. 14)
Much of his discussion is spurred by a set of silly thought examples by Thomas Nagel – a person pulling a trigger without intending to fire the gun, etc.

Frankfurt giving the lecture; from amherstlecture.org.

In the course of making his point, Frankfurt presents a rather disconcerting thought example of his own:
Let us suppose, then, that a person is the carrier of a highly contagious and dreadful disease. […] I suppose that the person would naturally be horrified, would feel helplessly discouraged by the evident impossibility of keeping from doing wholesale harm, and might well conclude – even while acknowledging no moral responsibility at all for being so toxic – that the world would be better off without him. The toxicity is by no means his fault; but he certainly cannot pretend that it has nothing to do with him. However he may wish that this were not the case, he is a poisonous creature, who cannot avoid doing dreadful harm. (p. 13; emphases in the original)
Something seems really out of key here.