
## Bibliographical Remarks

In this chapter we presented several frameworks for studying the
generalization error of specific learning algorithms. We
started
with a framework that seemingly combines the best of both worlds: by studying
Bayesian algorithms we have the power of incorporating prior knowledge into
the learning task via an explicit prior $\mathbf{P}_H$ while we can
still give PAC guarantees for the generalization error of Bayesian
classification strategies. This framework, also known as the PAC-Bayesian
framework, was first introduced in
Shawe-Taylor and Williamson (1997, p. 4), where the authors cast a
Bayesian algorithm in the luckiness framework. Remarkably, they concede that
"... a Bayesian might say that luckiness is just a complicated way of
encoding a prior. The sole justification for our particular way of encoding
is that it allows us to get the PAC like results we sought". In contrast
to their results, which hold for single classifiers drawn according to the
posterior measure, McAllester (1998) considered classification
*strategies*, which allowed him to tighten the results and simplify their
proofs. Theorems 5.1, 5.2 and 5.6 can be found in this paper; the more
general result given in equation (5.9), together with some remarks on how to
generalize the framework to arbitrary loss functions, can be found in
McAllester (1999).
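For orientation, a commonly quoted form of these PAC-Bayesian bounds reads
as follows (a sketch only; the constants differ slightly between the
published versions): with probability at least $1-\delta$ over the random
draw of the training sample $z$ of size $m$, simultaneously for all
posterior measures $Q$,

$$R[\mathrm{Gibbs}_Q] \leq \hat{R}_z[\mathrm{Gibbs}_Q] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,\mathbf{P}_H) + \ln\frac{m}{\delta} + 2}{2m-1}},$$

where $\hat{R}_z$ denotes the training error and $\mathrm{KL}$ the
Kullback-Leibler divergence between the posterior $Q$ and the prior
$\mathbf{P}_H$.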
The simple relationship between the expected risk of the Gibbs and the Bayes
classification strategies (Theorem 5.7) is taken from Herbrich
and Graepel (2001b).
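In its usual form, this relationship says that the Bayes classification
strategy, which takes a majority vote over the posterior, errs at most twice
as often as the Gibbs classification strategy, which draws a single
classifier at random from the posterior:

$$R[\mathrm{Bayes}_z] \leq 2\, R[\mathrm{Gibbs}_z].$$

The reason is that whenever the majority vote is wrong, at least half of the
posterior mass of classifiers must be wrong as well.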
The full power of the bound for the Bayesian classifier can be exploited by
making use of the fact that for "benign" hypothesis spaces the expected risk
of one classifier can be expressed as the generalization error of a subset
of classifiers. This analysis, together with the final PAC-Bayesian margin
bound (Theorem 5.10), can also be found in Herbrich
and Graepel (2001b). Recently, it has been shown that not only the
evidence can be justified in a distribution-free framework, but also the
estimated posterior probability $\mathbf{P}_{H|Z^m=z}(H(x) = y)$
leads to a decrease in expected risk
when used as a rejection criterion (see
Freund et al. (2000)). In contrast to the bounds in the PAC-Bayesian
framework, this paper studies only the generalization error of their
(pseudo)-Bayesian prediction method, which results in remarkably tight
bounds. A work preceding
Shawe-Taylor and Williamson (1997) is by
Haussler et al. (1994), where it was assumed that $\mathbf{P}_H$ is
known to the learning algorithm and corresponds to the probability of target
concepts. Rather than studying the performance of Bayesian classification
strategies for a fixed, but unknown, data distribution $\mathbf{P}_Z$,
it was assumed that the prior belief $\mathbf{P}_H$ is used to govern
$\mathbf{P}_{Y|X=x}$. It was shown that the *average*
generalization error of classification strategies over $\mathbf{P}_H$
can be arbitrarily bad unless the learning algorithm is assumed to use the
same $\mathbf{P}_H$. It should be noted, however, that this quantity
does not satisfy the PAC desideratum of not knowing the data distribution.

In the following
section we introduced the notion of compression schemes. One of the
earliest works in that area is by
Littlestone and Warmuth (1986), which was summarized and extended to
on-line learning algorithms in
Floyd and
Warmuth (1995). Theorem 5.17 is taken from this paper; the lossy
compression scheme bound (Theorem 5.18) was proven in
Graepel et al. (2000); see also
Marchand and Shawe-Taylor (2001) for a result that avoids the exponent
two at the deviation $\varepsilon$.
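For reference, the classical consistent-case compression bound takes
roughly the following form (a sketch in our notation; the exact statement
of Theorem 5.17 contains slightly different constants): if the hypothesis
learned from a training sample $z$ of size $m$ can always be reconstructed
from a subsample of size $d$ and is consistent with all of $z$, then with
probability at least $1-\delta$ its expected risk is bounded by

$$\frac{1}{m-d}\left(\ln\binom{m}{d} + \ln\frac{1}{\delta}\right).$$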
Interestingly, all these results can be extended further by allowing the
learning algorithm to save an additional $b$ bits, which would incur only an
additional summand of $b/m$ in the resulting generalization error bound.
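The reason is a simple union bound: the additional $b$ bits can take at most
$2^b$ distinct values, so the confidence $\delta$ must be shared among $2^b$
cases, which adds a summand of

$$\frac{\ln 2^b}{m-d} = \frac{b \ln 2}{m-d} \approx \frac{b}{m}$$

to a bound of the above form whenever $d \ll m$.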
An interesting combination of large margins of the linear classifier learned
by an algorithm and sparsity w.r.t. the expansion coefficients is presented
in Herbrich et al. (2000). The
subsection on the combination of compression schemes and mistake bounds
for mistake-driven on-line learning algorithms is taken from
Floyd and
Warmuth (1995). Example 5.24 is discussed at greater length in
Graepel et al. (2001). The notion of on-line learning algorithms is due
to Littlestone (1988).
This paper also introduced the halving algorithm together with its mistake
bound (see Example 5.25).
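As an illustration, the following is a minimal sketch of the halving
algorithm for a finite hypothesis space; the function and variable names are
ours, not the book's. Since every mistake eliminates at least the (wrong)
majority of the remaining hypotheses, at most $\lfloor \log_2 |H| \rfloor$
mistakes can occur.

```python
def halving(hypotheses, stream):
    """Run the halving algorithm on a stream of labeled examples.

    `hypotheses` is a finite list of functions h: x -> {-1, +1};
    `stream` yields (x, y) pairs labeled by some target hypothesis
    contained in `hypotheses`. Returns the number of mistakes made,
    which is at most floor(log2(len(hypotheses))).
    """
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        # Predict by majority vote over all hypotheses that are
        # still consistent with everything seen so far.
        votes = sum(h(x) for h in version_space)
        prediction = 1 if votes >= 0 else -1
        if prediction != y:
            mistakes += 1
        # Discard every hypothesis inconsistent with this example.
        # On a mistake this removes at least half of the version
        # space, because the (wrong) majority is thrown away.
        version_space = [h for h in version_space if h(x) == y]
    return mistakes
```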
An interesting question emerging from the analysis in the compression
framework is the following: Given a learning algorithm which maps into a
hypothesis space of VC dimension $\vartheta_H$, is it always possible to
find a compression scheme of size not more than $\vartheta_H$ that is
consistent, provided some target concept from $H$ is used to label the
data? This is still an open problem; for first attempts to solve it the
interested reader is referred to Floyd and Warmuth (1995).