I think this might be one of the most oversold math papers I've ever read. The core idea of the paper is a tiny little nugget of common sense buried under a mountain of notation. This not only makes the paper hard to read, but also camouflages the heavy-handed assumptions that are necessary to make the argument work.

### No Lunch At All

The learning scenario that Wolpert studies in the paper is one in which we want to guess the values of a function *f*: *X* → *Y* when evaluated on input values that did not occur in our training data. We do so by selecting an estimate *h*: *X* → *Y*, hoping that *h*(*x*) will agree with *f*(*x*) on the points *x* we haven't yet seen.

If there are no restrictions on how *f* can be chosen, and all of the exponentially many underlying functions are possible, then no such generalization is possible. Whatever *h* we choose, Wolpert can always choose *f* so as to disagree with every single choice that our *h* makes outside the training set, or (if he should so desire) so as to agree with all of them.

In other words, if everything is possible, then there is no statistical dependency between the past and the future. Obviously, this means that learning is impossible.
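This adversarial argument can be checked by brute force on a toy domain. Here is a minimal sketch (the three-point binary setup and all names are my own, not Wolpert's): averaged over every possible target *f*, any hypothesis *h* scores exactly chance on the unseen points.

```python
from itertools import product

# Toy setup (hypothetical): three unseen input points, binary outputs.
unseen = [0, 1, 2]
all_targets = list(product([0, 1], repeat=len(unseen)))  # all 2^3 possible f's

def off_training_accuracy(h):
    """Accuracy of h on the unseen points, averaged over every possible f."""
    hits = sum(sum(hx == fx for hx, fx in zip(h, f)) for f in all_targets)
    return hits / (len(all_targets) * len(unseen))

# Whatever h we pick, exactly half of the remaining f's disagree at each point.
for h in product([0, 1], repeat=len(unseen)):
    assert off_training_accuracy(h) == 0.5
```

For every guess at every point, half of the surviving targets agree and half disagree, which is exactly the "everything is possible" pathology described above.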

### Extremely Large and Incredibly Bold

Let me just elaborate this last reformulation a bit more:

Suppose your data consists of a binary sequence of length *m*, and your goal is to predict the next *n* binary digits in the sequence. If all of the 2^{n + m} possible sequences are in your hypothesis space, then the mutual information between the test and training set is

*I*(train; test) = *H*(test) – *H*(test | train) = log₂ 2^{n} – log₂(2^{n + m}/2^{m}) = *n* – *n* = 0.

In such a case, you might as well guess randomly. However, this only follows because the set of possible sequences has an entropy rate of 1, and because you only care about exactly correct guesses (otherwise the sequence 1/2, 1/2, 1/2, … might outperform random 0/1 guessing).
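The zero-information claim can be verified mechanically for small *m* and *n* by computing the mutual information from the joint distribution (the sizes below are my own illustrative choices):

```python
from collections import Counter
from itertools import product
from math import log2

m, n = 3, 2  # training length and test length (small illustrative values)
seqs = list(product([0, 1], repeat=m + n))  # all 2^(m+n) equally likely sequences
p = 1.0 / len(seqs)

joint = Counter((s[:m], s[m:]) for s in seqs)
p_train = Counter(s[:m] for s in seqs)
p_test = Counter(s[m:] for s in seqs)

# I(train; test) = sum over (a, b) of p(a,b) * log2( p(a,b) / (p(a) p(b)) )
I = sum(
    (c * p) * log2((c * p) / ((p_train[a] * p) * (p_test[b] * p)))
    for (a, b), c in joint.items()
)
print(round(I, 10))  # 0.0: the past carries no information about the future
```

Every training prefix is compatible with every test continuation, so the joint distribution factorizes exactly and the mutual information vanishes.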

### If It's Independent, It's Independent

To say this in a more Wolpert way, let's make the following definitions:

- A loss measure
*L* is "**homogenous**" if the sum of the loss probabilities given the correct answers, that is,
Σ_{y} Pr{*L* = *c* | *y*}, is independent of our choice of the estimate *h*.

- For finite |*X*| and |*Y*|, we can define a learning problem as **uniform** if all of the |*Y*|^{|X|} possible functions from *X* to *Y* are equally probable.
- When we use a homogenous loss function in a uniform learning problem, then
Pr{*L* = *c*} = Σ_{y} Pr{*L* = *c* | *y*} Pr{*y*} = 1/|*Y*| Σ_{y} Pr{*L* = *c* | *y*}

is independent of *h*, and therefore *E*[*L* | *h*] = *E*[*L*].

Combining homogeneity and uniformity is thus the same as assuming that the loss is independent of the learning algorithm.
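The homogeneity condition can be checked directly for the zero-one loss on a single point (a toy sketch; the three-element *Y* and the helper names are mine):

```python
# Per-point homogeneity check for the zero-one loss: for any guess g,
# the sum over correct answers y of Pr{L = c | y} does not depend on g.
Y = [0, 1, 2]

def homogeneity_sum(g, c):
    """Sum_y Pr{L = c | y} for the zero-one loss L = [g != y]."""
    return sum(int(int(g != y) == c) for y in Y)

for c in (0, 1):
    sums = {homogeneity_sum(g, c) for g in Y}
    assert len(sums) == 1  # the sum is independent of the guess g

# With a uniform prior Pr{y} = 1/|Y|, dividing this sum by |Y| gives
# Pr{L = c}, exactly as in the bullet point above.
```

For *c* = 0 the sum is always 1 (only *y* = *g* gives zero loss), and for *c* = 1 it is always |*Y*| − 1, whichever *g* we pick.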

### The Don't-Get-Cute Condition

The zero-one loss function (which charges you one point for each incorrect answer) is "homogenous" in this sense, but it is pretty much the only one. For most other loss functions, it will often make sense to use some central value as your guess for *f*(*x*), since extreme guesses are penalized more heavily under a uniform selection scheme.

As an example, consider the loss function *L*(*f*(*x*), *h*(*x*)) = |*f*(*x*) – *h*(*x*)|, and suppose we are trying to guess an unknown number in the set {1, 2, 3, 4, 5}. Then the probability that *L* = 4, for instance, depends heavily on your estimator *h*, since an estimate that never returns the values 1 or 5 can never achieve this loss. Numerical deviation is therefore not "homogenous," and neither is the ordinary squared deviation.
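A quick numerical check of this example (the helper function is my own):

```python
from collections import Counter

# The example above: f(x) is uniform over {1,...,5}, loss L = |f(x) - h(x)|.
# The probability Pr{L = 4} depends on the guess h(x), so the loss is not
# "homogenous" in Wolpert's sense.
Y = [1, 2, 3, 4, 5]

def loss_dist(guess):
    """Distribution of |f(x) - guess| when f(x) is uniform over Y."""
    return Counter(abs(y - guess) for y in Y)

print(loss_dist(1)[4] / len(Y))  # 0.2: guessing 1 can miss by 4 (when f(x) = 5)
print(loss_dist(3)[4] / len(Y))  # 0.0: the central guess 3 can never lose 4
```

The central guess also has the lowest expected absolute loss, which is precisely the "get cute" strategy that homogeneity rules out.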

### The Maximally Expensive Lunch

Naturally, Wolpert wants to discuss the relation between his own paper and the uniform law of large numbers, proven by Vapnik and Chervonenkis in the late '60s.

He therefore states that for a homogenous loss function in a uniform learning problem, the empirical loss on the training data is statistically independent of the loss on the test data. This should be obvious, since he is effectively assuming that the loss has a fixed distribution independent of the learning method.

Unfortunately, he then goes on to state that this has "strong implications for the uniform convergence formalism for studying supervised learning" (p. 1369), although he is in fact just giving a finite-sized example of just that theory. He elaborates:

> Assume that our learning algorithm has a very low VC dimension. Since [the empirical error] is low and [the sample size] is large, we might hope that our generalization error will be low, independent of any assumptions concerning the target. (This is one common way people try to interpret the VC theorems.)
>
> However, according to the results presented above, low [empirical error], large [sample size], and low VC dimension, by themselves, provide no such assurances concerning the [off-the-training-set] error (unless one can somehow a priori rule out a uniform [prior over the possible generalizations]—not to mention rule out any other prior having even more dire implications for generalization performance). (p. 1369)

What he fails to realize is that he has explicitly made sure that the underlying process producing the data has maximal capacity, so no learning rule with a low VC dimension could learn to predict the test data, on average, above chance level.

This is not because low empirical error doesn't carry over from training to test data, but rather because nobody can achieve a low expected empirical error in his learning framework. What he has verified is thus a specific example of the VC theorem: if the capacity of your theory is low, and it has a high error on the training data, then anything might happen.
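This last point can be illustrated with a minimal low-capacity class (the setup is my own, not Wolpert's): take the two constant classifiers, which have VC dimension 1, and fit them to uniformly random labels. Even the better of the two has an expected training error that climbs toward chance as the sample grows, so the "low empirical error" premise of the VC bound is unattainable on average in a uniform learning problem.

```python
from itertools import product
from statistics import mean

def expected_best_error(m):
    """E[min training error over {always-0, always-1}] on m uniform random bits."""
    return mean(
        min(sum(bits), m - sum(bits)) / m
        for bits in product([0, 1], repeat=m)
    )

# The best-in-class training error rises toward chance (0.5) as m grows,
# so the VC bound's premise fails on average under the uniform prior.
for m in (2, 4, 8, 12):
    print(m, round(expected_best_error(m), 3))
```

With such a high training error, the VC bound promises nothing about the test error, which is consistent with, rather than contrary to, Wolpert's result.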