Tuesday, August 26, 2014

Cardano: The Book on Games of Chance (1961/1663)

Cardano; from Wikimedia.
My library has refused to buy the new translation of Jacob Bernoulli's Ars Conjectandi (1713), possibly because the price tag is on the order of $3000. Consequently, I don't have an English translation of the complete book, arguably the most influential one in the history of probability (perhaps with the exception of Kolmogorov's little pamphlet).

Meanwhile, they hold two copies of Cardano's weird, chatty, and uneven book on gambling. I checked out both of them, only to find out that one (the 1961 version) was a reprint of the other, the translation by Øystein Ore appended to his own 1953 commentary on the book (which, incidentally, also included an interpretation of Cardano's erroneous calculus of probabilities).

I'll deal with Ore's book later. For now, a few comments on and quotes from Cardano.


Cardano's book can roughly be divided up into sections as follows. If it looks jumbled up, it's because it is.
  • Chapters 1–8: Preliminaries, survey of games, moralistic preaching, etc.
  • Chapters 9–15: Combinatorics of dice games.
  • Chapters 16–19: On card games (mostly non-mathematical discussions of cheating, moral issues, and the like).
  • Chapters 20–21: On luck.
  • Chapters 22–25: More game taxonomies and definitions.
  • Chapter 26: On theoretical and practical knowledge.
  • Chapter 27: More on luck.
  • Chapter 28: Recommendations on playing styles in backgammon.
  • Chapter 29: On why you shouldn't play against hotheads.
  • Chapters 30–31: Games in ancient Greece.
  • Chapter 32: Computing expectations for a die.
From the perspective of the history of mathematics proper, the most directly relevant parts are those on combinatorics and those on luck.

"If Fortune Be Equal"

One of the more remarkable features of Cardano's book is the prominent place it gives to "luck." As Lorraine Daston has noted, this concept seemed to fill any gap his calculus didn't account for, including the gap between frequencies and probabilities, or between expectations and actual outcomes.

Men playing dice; illustration from a 1531 print of Cicero's On Duties.

The first time the word occurs is during Cardano's discussion of fair bets. Apparently, the most common dice game at the time was a bet on whether a specific throw (e.g., double six) would show up or not show up within n throws of a number of dice.

Cardano was therefore interested in computing the number n for which this bet would be fair — or in his words, for which "there is equality." He explains in Chapter 11:
This gives eighteen casts for equality for a throw (1,1); for in that number of casts this throw can appear and not appear with equal probability; and similarly for the throws (2,2) and (3,3).
But the throw (1,2) can turn up in two ways, so that for it there is equality in nine casts; and if it turns up more frequently or more rarely, that is a matter of luck. (p. 11; my emphasis)
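Cardano's "equality" figures are easy to check against the modern criterion, i.e., the smallest n for which the event is more likely than not to appear at least once. A quick sketch (the helper name is mine):

```python
def casts_for_equality(p):
    """Smallest number of casts n for which an event with per-cast
    probability p appears at least once with probability >= 1/2."""
    n = 1
    while 1 - (1 - p) ** n < 0.5:
        n += 1
    return n

# Cardano halves the circuit: 36/2 = 18 casts for the throw (1,1),
# which has probability 1/36 per cast. The modern answer is larger.
print(casts_for_equality(1 / 36))  # 25, not 18
print(casts_for_equality(2 / 36))  # 13, not 9 (for the throw (1,2))
```

So his "equality" numbers systematically understate how long one has to wait.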
Later, in Chapter 14:
If, therefore, someone should say, "I want an ace, a deuce, or a trey," you know that there are 27 favorable throws, and since the [size of the] circuit [= sample space] is 36, the rest of the throws in which these points will not turn up will be 9; the odds will therefore be 3 to 1.
  • Therefore in 4 throws, if fortune be equal, an ace, deuce, or trey will turn up 3 times and only one throw will be without any of them;
  • if therefore, the player who wants an ace, deuce, or trey were to wager three ducats and the other player one, then the former would win three times and would gain three ducats; and the other once and would win three ducats;
  • therefore in the circuit [= observation] of 4 throws they would always be equal. (p. 16; my emphasis; my itemization)
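The 27 favorable throws can be confirmed by brute-force enumeration (a small sketch):

```python
from itertools import product

# Throws of two dice in which at least one die shows an ace, deuce, or trey.
favorable = [t for t in product(range(1, 7), repeat=2)
             if any(d in (1, 2, 3) for d in t)]
print(len(favorable))  # 27 out of 36, i.e. odds of 3 to 1
```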
His language use here suggests that, on some level, he believes that the expected values must somehow be realized — they are what the game should pay off if it weren't for all these kinks and imperfections in the universe.

"The Length of Time … shows Forth All Forms"

Apparently, he is completely serious about this. In a later chapter on skill (ch. 27), he lists two "methods" by which fortunes can change, the second being the more occult one:
But of the other method there is also some secret principle. To these matters belong amulets, witchcraft, and the like, and just as in each case (as they say) the sword fits its own sheath and the foot its own shoe, so the hour, the day, the year, and the place must fit; so also in this question, what will make one man happy will make another wretched. (p. 44)
Austrian 16th century woodcut of two soldiers playing dice.

Cardano does seem to think, however, that time tends to cancel out luck. In the chapter on sequential successes, he computes the probability of observing an unbroken string of 20 repetitions of a probability 1/2 event, wrongly getting the answer to be 1/8000. He comments:
Yet it can scarcely be believed that in 7,999 throws a player can cast an even number twenty times in a row. But we have demonstrated this; because, although in 8,000 performed only once, this reasoning can deceive us, still in an infinite number of throws, it is almost necessary for it to happen; for the magnitude of the circuit is the length of time which shows forth all forms. (pp. 19–20).
Here, "luck" almost seems to be synonymous with "noise."

Squares and Cubes

Since I'm talking about this error anyway, and since this is essentially the sole remaining mathematical component of the book, let me just quickly summarize what Cardano seems to be doing in the rest of Chapters 14 and 15.

At the end of Chapter 14, he claims that if the odds in favor of a single success are p : (1 – p), then the odds in favor of k successes in a row are p^k : (1 – p)^k. This is not correct, since (1 – p)^k is not in general equal to 1 – p^k. He recognizes this in the opening of Chapter 15, noting that if it were really true, any run of consecutive probability-1/2 events would also have probability 1/2:
But this reasoning seems to be false, even in the case of equality, as, for example, the chance of getting one of any three chosen faces in one cast of one die is equal to the chance of getting one of the other three, but according to this reasoning there would be an even chance of getting a chosen face each time in two casts, and thus in three, and four, which is most absurd. For if a player with two dice can with equal chances throw an even and an odd number, it does not follow that he can with equal fortune throw an even number in each of three successive casts. (p. 19)
 k   Cardano 1   Cardano 2   Correct 
 1     1 : 1       1 : 1      1 : 1
 2     1 : 1       3 : 1      3 : 1
 3     1 : 1       8 : 1      7 : 1
 4     1 : 1      15 : 1     15 : 1
 5     1 : 1      24 : 1     31 : 1
 6     1 : 1      35 : 1     63 : 1
 7     1 : 1      48 : 1    127 : 1
So now Cardano owes us a different argument. He therefore goes on to claim that the correct answer for p = 1/2 is in fact k^2 – 1 : 1. This coincides with the correct odds (2^k – 1 : 1) for a couple of small values, but then diverges exponentially from them. This leads him to make the "infinity" remark quoted above.
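Both rules, and the point about squares and cubes, are easy to tabulate (a sketch):

```python
# Cardano's corrected rule (k^2 - 1 : 1) versus the true odds (2^k - 1 : 1)
# against k successes in a row for a probability-1/2 event.
for k in range(1, 8):
    cardano = k ** 2 - 1
    correct = 2 ** k - 1
    print(f"k = {k}: Cardano {cardano} : 1, correct {correct} : 1")

# For the string of 20 repetitions, neither rule yields his circuit of 8,000:
print(2 ** 20)   # 1048576, the correct circuit
print(20 ** 3)   # 8000, which suggests he cubed rather than squared
```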

Parenthetically, I'm not sure why he would cube rather than square the number 20 in that example. Perhaps Ore has something intelligent to say about this.

Late 15th century book illustration showing a dice game.

How To Gamble If You Must

In Chapter 20, Cardano tells a little autobiographical anecdote as an illustration of his points about fortune, luck, etc. This story is not strictly relevant to my topic here, but it is simply too bizarre not to quote. Hence, Ladies and Gentlemen, Cardano without filter:
Yet I have decided to submit to the judgment of my readers what happened to me in the year 1526 in the company of Thomas Lezius, the patrician of Venice, leaving it to each reader to form his own opinion. I had just duly resigned from the office of rector of the scholars in the University of Padua on the third of August, and now I was journeying with Hieronymus Rivola, a scholar from Bergamo, on a certain night of the same month toward Venice. We were playing a game (called Bassette) and I won all the money he had. Then he asked me to play with him on credit, if I am not mistaken up to two or three aurei, and I won again. Then, finally, he wanted to carry it on endlessly, but I refused. He promised to pay what he owed me within three days; he did not come.
Then he chanced to meet me and said that he would come to pay the money on Saturday (which was the day of the Nativity of the Virgin) and promised to take me to a beautiful prostitute. At that time I was just completing my twenty-fifth year, but I was impotent. Nevertheless, I accepted the condition; there was not a word about the game. He came on the day agreed; and in that year the festival of the Blessed Virgin was on Saturday. He took me to the home of Thomas Lezius; there was no Thais there, but a bearded man with a young servant. No money was paid but we played with marked cards. I lost to him all the money which he owed me, and he reckoned it as part of his debts just as though he had given it to me. I lost about twenty-five aurei or even a few more which I had, and played on, giving my clothes and my rings as security.
I returned home in sadness (as was natural), especially since there was no hope of getting money from home because uprisings and plots were raging at Milan. And so (and now I tell the truth, there being no reason why I should lie), I contrived for myself a certain art; I do not now remember what it was, since thirty-eight years have passed, but I think it took its rise in geomancy, by which I kept in mind on up to twenty-four plays all the numbers whereby I should win and all those whereby I should lose; by chance the former were far more numerous than the latter, even in the proportion (if I am not mistaken) of seven to one; and I do not recall now in what order these were against me.
But when I saw that I could not safely hold more numbers in my memory, I admonished my young servant, whose name was Jacob, that when he saw I had won back my clothes and my money he was to call me. I threatened that if he did not do it I would beat him severely. He promised and we went. As the game went on I won and lost in all the plays just as I had foreseen and after the third play I realized that there was no trickery or deceit about it. They laid down money freely and I accepted the wagers, but he was delighted by the example of the previous day and also on account of the marked cards (as I have said).
Thus his thoughts were inflamed by his youthful ardor; but the result was otherwise, for, on those plays in which I saw (as it were, without any real foreknowledge) that I would win, I did not reject any amount of money and made large bets of my own, and in the other cases, where I knew he would win, I refused if he was the first to wager, and wagered very meagerly myself: thus the result was that within twenty plays I regained my clothes, my rings, and money and also what he had added besides. As for the clothes, the rings, and a collar for the boy, I sent them home piecemeal. Out of the total number there remained four deals; I played and won, and also came out victor in a few deals which were not contained in the number.
He was already perturbed and full of admiration, since he saw that in all the plays in which we played for high stakes I came out the victor, and in those in which he won I myself wagered little and when he wished to wager a great deal I refused. So (he said) I believe some demon is advising you, or that you know by some enchantment what is going to happen. What happened after that I remember that I have narrated elsewhere. (pp. 32–34)
He also gives a bit of extra detail in Chapter 17, the chapter on fraud in card games:
There are also some who smear the cards with soap so that they may slide easily and slip past one another. This was the trick practiced upon me by the well-known Thomas Lezius of Venice, patrician, when in my youth I was addicted to gambling. (p. 27)
Extraordinary, isn't it?

Wednesday, August 20, 2014

Cacciari: "Crossing the Sense in Metaphorical Language" (2008)

Cristina Cacciari's contribution to the Cambridge Handbook of Metaphor and Thought contains two relatively independent parts: First, a discussion of various attempts to describe and explain the phenomenon of cross-modal metaphor in language; and second, a fascinating summary of some of the things we know about (clinical) synaesthesia as of now.

She also provides some useful pointers to pre-1980 literature on metaphor and thought.
Something that might also be worth looking into is the anthology Cognition and Figurative Language (1980).

Sunday, August 10, 2014

Zeder: "The Origins of Agriculture in the Near East" (2011)

Traditionally, archaeologists have taken the hallmark of domestication to be genetic changes: A plant or animal counted as domesticated once it differed from the wild land species it was cultivated from.

In this interesting review paper, Melinda A. Zeder of the Smithsonian Museum mentions a number of reasons that the genetic criterion might mislead us into dating domestication unduly late:
  • Ancient threshing methods might have involved beating cereal straws into a basket to get the grains out, a practice that would disproportionately select grains from weak-stemmed straws. This would counteract the effect of crop selection, where the plants are usually selected to have stronger stems. (p. S226)
  • When harvests failed, as must have happened once in a while, the crop may have been restocked from the wild land species, wiping out the genetic record of domestication. (p. S226)
  • Occasionally, the morphological changes we are looking for are in fact quite modern phenomena. Zeder mentions the large size of modern broad beans, a trait that did not appear until well into the Middle Ages. (p. S225)
In addition to these rather technical reasons, a more important issue is that a too rigid criterion can lead us to ignore or misunderstand practices somewhere on the continuum between opportunistic gathering and habitual farming. Such hybrid relationships between land and people, Zeder writes,
… not only makes it impossible to identify any threshold moments when wild became domestic or hunting and gathering became agriculture but also shows that drawing such distinctions actually impedes rather than improves our understanding of this process. Instead of continuing to try to pigeonhole these concepts into tidy definitional categories, a more productive approach would be to embrace the ambiguity of this middle ground and continue to develop tools that allow us to watch unfolding developments within this neither-nor territory. (p. S231)
I second that.

Monday, July 28, 2014

Baker, Saxe, and Tenenbaum: "Bayesian Theory of Mind" (2011)

I was referred to this paper by one of the reviewers of my own paper on multi-agent statistics, since both seemed to be about reasoning about other people's beliefs. Their paper is concerned with inferring one other person's belief-desire state from observed behavior, not with reasoning of arbitrarily high order (like I know that you know that I know…). This means that there are some issues in general multi-agent statistics that they don't have to worry about.

At any rate, the set-up they consider is the following:
  • The subject observes a little stylized scene comparable to a simple video game interface.
  • This scene features a little cartoonish character (a circle with two eyes).
  • Depending on where the character is standing, different parts of the scene will be blocked from view.
  • The scene contains two parking lots in which food trucks can park.
  • There are three different kinds of food truck that may or may not be present in the scene.
  • In each condition, the subject sees the little character move around in a certain predetermined way.
After watching this scene, the subjects were then asked to provide assessments of the little character's beliefs and food preferences. For instance, if the character could initially see a truck with Korean food, but still walked up to check the other parking lot, this must be because the person was wondering whether a more preferred kind of food was available.

The character sees the Korean truck, yet walks up and checks what else is available.

In the model, these videos were discretized and treated as a hidden Markov model with the desires (food preference ordering) and beliefs (about which trucks there might be in the parking lots) as hidden states.

At each time step, previous actions inform new states of the world.

Since the space of possible routes and plans was actually quite small, the model could in fact have been simplified immensely in this case, although at the expense of generalizability.
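Stripped of the HMM machinery, the core inference is just Bayes' rule over the hidden desire state. A toy sketch of the Korean-truck episode, where all the probability values are my own illustrative assumptions, not the paper's:

```python
# Hypotheses about the character's food preference, with a flat prior.
priors = {"prefers Korean": 0.5, "prefers something else": 0.5}

# Assumed likelihoods of the observed action ("walks past the visible
# Korean truck to check the other parking lot") under each hypothesis.
likelihoods = {"prefers Korean": 0.1, "prefers something else": 0.9}

# Bayes' rule: posterior proportional to prior times likelihood.
evidence = sum(priors[h] * likelihoods[h] for h in priors)
posterior = {h: priors[h] * likelihoods[h] / evidence for h in priors}
print(posterior)  # the "prefers something else" hypothesis dominates
```

The actual model chains such updates over time steps, with the action at each step informed by the agent's (partially observable) beliefs about the world.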

Saturday, July 26, 2014

Wald: Sequential Analysis (1947)

I've always thought that Shannon's insights in the 1940s papers seemed like pure magic, but this book suggests that parts of his way of thinking were already in the air at the time: From his own frequentist perspective, Wald comes oddly close to defining a version of information theory.

The central question of the book is:
How can we devise decision procedures that map observations into {Accept, Reject, Continue} in such a way that (1) the probability of wrongly choosing Accept or Reject is low, and (2) the expected number of Continue decisions is low?
Throughout the book, the answer that he proposes is to use likelihood ratio tests. This puts him strangely close to the Bayesian tradition, including for instance Chapter 5 of Jeffreys' Theory of Probability.
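The procedure in question, Wald's sequential probability ratio test, is easy to sketch for Bernoulli data; the parameter values below are my own illustrative choices:

```python
from math import log

def sprt(observations, p0=0.5, p1=0.7, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for Bernoulli data:
    accumulate the log-likelihood ratio and stop at the boundaries."""
    a = log((1 - beta) / alpha)   # boundary for rejecting H0 (p = p0)
    b = log(beta / (1 - alpha))   # boundary for accepting H0
    llr = 0.0
    for n, x in enumerate(observations, 1):
        llr += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
        if llr >= a:
            return "Reject", n
        if llr <= b:
            return "Accept", n
    return "Continue", len(observations)

print(sprt([1] * 20))  # a run of successes triggers rejection early
print(sprt([0] * 20))  # a run of failures triggers acceptance early
```

The boundaries a and b are Wald's approximations; the test stops as soon as the accumulated evidence is strong enough either way, which is what keeps the expected number of Continue decisions low.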

A sequential binomial test ending in rejection (p. 94)

In particular, the coin flipping example that Jeffreys considers in his Chapter 5.1 is very close in spirit to the sequential binomial test that Wald considers in Chapter 5 of Sequential Analysis. However, some differences are:
  • Wald compares two given parameter values p0 and p1, while Jeffreys compares a model with a free parameter to one with a fixed value for that parameter.
  • Jeffreys assigns prior probabilities to everything; but Wald only uses the likelihoods given the two parameters. From his perspective, the statistical test will thus have different characteristics depending on what the underlying situation is, and he performs no averaging over these values.
This last point also means that Wald is barred from actually inventing the notion of mutual information, although he comes very close. Since he cannot take a single average over the log-likelihood ratio, he cannot compute any single statistic, but always has to bet on two horses simultaneously.

Thursday, July 24, 2014

Wolpert: "The Lack of A Priori Distinctions Between Learning Algorithms" (1996)

I think this might be one of the most oversold math papers I've ever read. The core idea of the paper is a tiny little nugget of common sense buried under a mountain of notation. This not only makes the paper hard to read, but also camouflages the heavy-handed assumptions that are necessary to make the argument work.

No Lunch At All

The learning scenario that Wolpert studies in the paper is one in which we want to guess the values of a function f: X → Y when evaluated on input values that did not occur in our training data. We do so by selecting an estimate h: X → Y, hoping that h(x) will agree with f(x) on the points x we haven't yet seen.

If there are no restrictions on how f can be chosen, and all of the exponentially many underlying functions are possible, then no such generalization is possible. Whatever h we choose, Wolpert can always choose f so as to disagree with every single choice that our h makes outside the training set, or (if he should so desire) so as to agree with all of them.

In other words, if everything is possible, then there is no statistical dependency between the past and the future. Obviously, this means that learning is impossible.

Extremely Large and Incredibly Bold

Let me just elaborate this last reformulation a bit more:

Suppose your data consists of a binary sequence of length m, and your goal is to predict the next n binary digits in the sequence. If all of the 2^(n+m) possible sequences are in your hypothesis space, then the mutual information between the test and training set is
I(train; test)  =  H(test)  –  H(test | train)  =  log₂ 2^n  –  log₂(2^(n+m)/2^m)  =  0.
In such a case, you might as well guess randomly. However, this only follows because the set of possible sequences has an entropy rate of 1, and because you only care about exactly correct guesses (otherwise the sequence 1/2, 1/2, 1/2, … might outperform random 0/1 guessing).
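With tiny values of m and n, the arithmetic behind this can be checked directly (a sketch):

```python
from math import log2

m, n = 3, 2  # training and test lengths (tiny, just to check the arithmetic)

# If all 2**(n+m) sequences are equally likely:
H_test = log2(2 ** n)                              # n bits
H_test_given_train = log2(2 ** (n + m) / 2 ** m)   # still n bits
print(H_test - H_test_given_train)                 # 0.0: zero mutual information
```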

If It's Independent, It's Independent

To say this in a more Wolpert way, let's make the following definitions:
  • A loss measure L is "homogenous" if the sum of the loss probabilities over the possible correct answers, that is,
    Σy Pr{L = c | y},
    is independent of our choice of the estimate h.
  • For finite |X| and |Y|, we can define a learning problem as uniform if all of the |Y|^|X| possible functions from X to Y are equally probable.
  • When we use a homogenous loss function in a uniform learning problem, then
    Pr{L = c}  =  Σy Pr{L = c | y} Pr{y}  =  1/|Y| Σy Pr{L = c | y}
    is independent of h, and therefore E[L | h] = E[L].
Combining homogeneity and uniformity is thus the same as assuming that the loss is independent of the learning algorithm.

The Don't-Get-Cute Condition

The zero-one loss function (which gives you one point for each correct answer) is "homogenous" in this sense, but it is pretty much the only one. For most other loss functions, it will often make sense to use some central value as your guess for f(x), since extreme guesses are penalized more heavily under a uniform selection scheme.

As an example, consider the loss function L(f(x), h(x)) = |f(x) – h(x)|, and suppose we are trying to guess an unknown number in the set {1, 2, 3, 4, 5}. Then the probability that L = 4, for instance, depends heavily on your estimator h, since an estimate that never returns the values 1 or 5 can never achieve this loss. Numerical deviation is therefore not "homogenous," and neither is ordinary squared deviation.
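A quick enumeration confirms this; the helper function is mine, using the five-element example above:

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5]  # the unknown number f is uniform on this set

def pr_loss_equals(c, h):
    """Probability that the loss |f - h| equals c when f is uniform on values."""
    hits = sum(1 for f in values if abs(f - h) == c)
    return Fraction(hits, len(values))

print(pr_loss_equals(4, 1))  # guessing 1: loss 4 occurs when f = 5, so 1/5
print(pr_loss_equals(4, 3))  # guessing 3: loss 4 is impossible, so 0
```

Since the distribution of L depends on h, absolute deviation fails Wolpert's homogeneity condition.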

The Maximally Expensive Lunch

Naturally, Wolpert wants to discuss the relation between his own paper and the uniform law of large numbers, proven by Vapnik and Chervonenkis in the late '60s.

He thus states that if we are using a homogenous loss function in a uniform learning problem, the empirical loss on the training data is statistically independent of the loss on the test data. This should be obvious, since he is effectively assuming that the loss has a fixed distribution independent of the learning method.

Unfortunately, he goes on to state that this has "strong implications for the uniform convergence formalism for studying supervised learning" (p. 1369), although he is in fact just giving a finite-sized example of just that theory. He elaborates:
Assume that our learning algorithm has a very low VC dimension. Since [the empirical error] is low and [the sample size] is large, we might hope that our generalization error will be low, independent of any assumptions concerning the target. (This is one common way people try to interpret the VC theorems.) 
However according to the results presented above, low [empirical error], large [sample size], and low VC dimension, by themselves, provide no such assurances concerning the [off-the-training-set] error (unless one can somehow a priori rule out a uniform [prior over the possible generalizations]—not to mention rule out any other prior having even more dire implications for generalization performance). (p. 1369)
But Wolpert is confusing two criteria of success here. There is a difference between identifying a long-run frequency — which is the goal in both Bernoulli's theorem and the VC theorem — and predicting the exact outcomes of a series of experiments.

The various laws of large numbers give worst-case guarantees against misidentifying a random process, but they don't say that you'll eventually be able to predict it. Even if you know with certainty that some process is a series of independent coin flips, you have no chance of guessing the next outcome above chance level.

This is not because low empirical error on the frequency estimation mischaracterizes your error on the test data. Your empirical error is just going to be uniformly high when the input sequence is a series of coin flips (assuming that all your hypotheses model the data-generating process as i.i.d.). So unless you introduce unwarranted dependence assumptions, the data will quickly reveal that there is no structure to be discovered, and thus that no hypothesis will ever do well on exact prediction.

Tuesday, July 22, 2014

Jaffe and Quinn: "Theoretical Mathematics" (1993)

Shaking the moralistic finger at sloppy mathematics, this paper suggests that the field should be subjected to certain "prescriptions" in order to avoid errors, priority disputes, and confusions about the reliability of proofs.

As a means to this end, Jaffe and Quinn suggest that we understand the activity of writing mathematical proofs according to the analogy
theoretical physics : experimental physics = speculative math : "rigorous" math
This leads them to adopt the odd and misleading term "theoretical mathematics" for what is essentially mathematics without detailed proofs. They suggest that some time in the future, "theoretical mathematics" might branch off from rigorous mathematics and become a discipline of its own, like theoretical physics.

This seems to misconstrue the motivations people have for writing heuristic or intuitive proofs in mathematics.

Most professional mathematicians prefer to give detailed and formal proofs whenever they can, and they only deviate from that norm when they absolutely have to. For a theoretical physicist, on the other hand, there is no shame in having no experimental expertise. The analogy thus seems quite misleading, and trying to drive a wedge in between the practice of conjecturing and the practice of proving mathematical claims consequently misconstrues the ethos of mathematics.

Another thing which is never dwelled upon in the paper (but mentioned in Jeremy Gray's response to the paper) is that even "rigorous" proofs can contain mistakes.

This can be the case even for computer-verified proofs, since there is a very realistic chance of bugs, typos, or even misinterpretations of the code (e.g., subtle differences between an axiom system and the geometric domain it is intended to formalize).

When Jaffe and Quinn assume that we can recognize a "rigorous" proof when we see it, and that rigorous proofs are always correct, they are thus thinking in terms of a perfect-world mathematics, not actual-world mathematics. As a consequence, they imagine that the whole enterprise of mathematics could be "corrected" into a steadily advancing literature of absolutely certain theorems piled on top of each other.

Of course, everybody wants to give the best arguments they can, and honesty about the quality of the argument would be laudable. But in a world where quality is gradable, it would seem that any step taken toward a realization of Jaffe and Quinn's dream would kill off the whole range of mutant forms in between complete guesswork and completely "rigorous" math (however we measure rigor). Since some of these mutants would have gone on to become the standard of future mathematics, this would almost certainly be a scientific loss, in addition to being very boring.