*Learning Kernel Classifiers: Theory and Algorithms* Herbrich, Ralf (2002). (2nd edition ) The MIT Press.

A PAC-Bayesian Margin Bound for Linear Classifiers. Herbrich, Ralf; Graepel, Thore in *IEEE Transactions on Information Theory* (2002). **48**(12) 3140–3150.

We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training sample. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound, which was developed in the luckiness framework, and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and â€'' for maximum margins â€'' to a vanishing complexity term. In contrast to previous results, however, the new bound does depend on the dimensionality of feature space. The analysis shows that the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the fraction of hypothesis space consistent with the training sample. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal with respect to the new bound only if the feature vectors in the training sample are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only. Numerical simulations support this recommendation and demonstrate that the new error bound can be used for the purpose of model selection.

Learning and Generalization: Theoretical Bounds. Herbrich, Ralf; Williamson, Robert C in *Handbook of Brain Theory and Neural Networks* (2002). (2nd edition ) 619–623.

Algorithmic Luckiness. Herbrich, Ralf; Williamson, Robert C in *Journal of Machine Learning Research* (2002). **3** 175–212.

Classical statistical learning theory studies the generalisation performance of machine learning algorithms rather indirectly. One of the main detours is that algorithms are studied in terms of the hypothesis class that they draw their hypotheses from. In this paper, motivated by the luckiness framework of Shawe-Taylor et al. (1998), we study learning algorithms more directly and in a way that allows us to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits the margin, the sparsity of the resultant weight vector, and the degree of clustering of the training data in feature space.

Average Precision and the Problem of Generalisation. Hill, Simon; Zaragoza, Hugo; Herbrich, Ralf; Rayner, Peter (2002).

In this paper we study the problem of generalisation in information retrieval. In particular we study precision-recall curves and the average precision value. We provide two types of bounds: large-deviation bounds of the average precision and maximum deviation bounds with respect to a given point of the precision recall curve. The first type of bounds are useful to answer the question: how far can true average precision be from the value observed on a test collection? The second is useful for obtaining bounds on average precision when tight bounds on a particular point of the curve can be established, as is the case when training SVMs or Perceptrons for document categorisation.

Fast Sparse Gaussian Process Methods: The Informative Vector Machine. Lawrence, Neil; Seeger, Matthias; Herbrich, Ralf (2002). 609–616.

We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretical principles, previously suggested for active learning. In contrast to most previous work on sparse GPs, our goal is not only to learn sparse predictors (which can be evaluated in \($O(d)$\) rather than \($O(n)$\), \($d \ll n$\), \($n$\) the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most \($O(n \cdot d^2)$\), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet it requires only a fraction of the training time. In contrast to the SVM, our approximation produces estimates of predictive probabilities ("error bars"), allows for Bayesian model selection and is less complex in implementation.

The Perceptron Algorithm with Uneven Margins. Li, Yaoyong; Zaragoza, Hugo; Herbrich, Ralf; Shawe-Taylor, John; Kandola, Jasvinder (2002). 379–386.

The perceptron algorithm with margins is a simple, fast and effective learning algorithm for linear classifiers; it produces decision hyperplanes within some constant ratio of the maximal margin. In this paper we study this algorithm and a new variant: the perceptron algorithm with uneven margins, tailored for document categorisation problems (i.e. problems where classes are highly unbalanced and performance depends on the ranking of patterns). We discuss the interest of these algorithms from a theoretical point of view, provide a generalisation of Novikoff's theorem for uneven margins, give a geometrically description of these algorithms and show experimentally that both algorithms yield equal or better performances than support vector machines, while reducing training time and sparsity, in classification (USPS) and document categorisation (Reuters) problems.

Microsoft Cambridge at TREC 2002: Filtering Track. Robertson, Stephen E.; Walker, Stephen; Zaragoza, Hugo; Herbrich, Ralf (2002). 361–368.

Six runs were submitted for the Adaptive Filtering track, four on the adaptive filtering task (ok11af??), and two on the routing task (msPUM?). The adaptive filtering system has been somewhat modified from the one used for TREC-10, largely for efficiently and flexibility reasons; the basic filtering algorithms remain similar to those used in recent TRECs. For the routing task, a completely new system based on perceptrons with uneven margins was used.