Learning on Graphs in the Game of Go. Graepel, Thore; Goutrié, Mike; Krüger, Marco; Herbrich, Ralf (2001). 347–352.
We consider the game of Go from the point of view of machine learning and as a well-defined domain for learning on graph representations. We discuss the representation of both board positions and candidate moves and introduce the common fate graph (CFG) as an adequate representation of board positions for learning. Single candidate moves are represented as feature vectors with features given by subgraphs relative to the given move in the CFG. Using this representation we train a support vector machine (SVM) and a kernel perceptron to discriminate good moves from bad moves on a collection of life-and-death problems and on 9x9 game records. We thus obtain kernel machines that solve Go problems and play 9x9 Go.
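As an illustration of the representation described in this abstract, the following is a minimal sketch (not the authors' code) of how a common fate graph might be constructed: stones in one connected chain of a single colour are captured together, i.e. share a common fate, so each chain collapses to a single CFG node, while every empty point remains a node of its own. The function name `build_cfg` and the board encoding are assumptions for illustration only.

```python
# Minimal sketch of building a common fate graph (CFG) from a Go
# position: same-coloured chains collapse to single nodes, empty
# points stay as singleton nodes, edges connect adjacent nodes.
from collections import deque

def build_cfg(board):
    """board: dict mapping (row, col) -> 'B', 'W' or '.' (empty)."""
    def neighbours(p):
        r, c = p
        for q in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if q in board:
                yield q

    # Flood-fill same-coloured chains into a single node id each.
    node_of, labels, next_id = {}, {}, 0
    for p in board:
        if p in node_of:
            continue
        node_of[p] = next_id
        labels[next_id] = board[p]
        if board[p] != '.':
            queue = deque([p])
            while queue:
                q = queue.popleft()
                for n in neighbours(q):
                    if n not in node_of and board[n] == board[p]:
                        node_of[n] = next_id
                        queue.append(n)
        next_id += 1

    # CFG edges connect distinct nodes containing adjacent points.
    edges = set()
    for p in board:
        for q in neighbours(p):
            a, b = node_of[p], node_of[q]
            if a != b:
                edges.add((min(a, b), max(a, b)))
    return labels, edges

# Toy 3x3 position: one two-stone black chain, one white stone.
board = {(r, c): '.' for r in range(3) for c in range(3)}
board[(0, 0)] = board[(0, 1)] = 'B'
board[(2, 2)] = 'W'
labels, edges = build_cfg(board)
print(labels)
print(sorted(edges))
```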
Support Vector Regression for Black-Box System Identification. Gretton, Arthur; Doucet, Arnaud; Herbrich, Ralf; Rayner, Peter; Schölkopf, Bernhard (2001). 341–344.
In this paper, we demonstrate the use of support vector regression (SVR) techniques for black-box system identification. These methods derive from statistical learning theory, and are of great theoretical and practical interest. We briefly describe the theory underpinning SVR, and compare support vector methods with other approaches using radial basis networks. Finally, we apply SVR to modelling the behaviour of a hydraulic robot arm, and show that SVR improves on previously published results.
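A toy illustration of the setting, under stated assumptions: the sketch below uses scikit-learn's SVR (not the authors' implementation) to fit a NARX-style one-step-ahead predictor on a simulated nonlinear system rather than the hydraulic-arm data, and the hyper-parameter values are arbitrary.

```python
# Illustrative sketch only: SVR for black-box system identification,
# predicting y[t] from lagged outputs and inputs of a simulated system.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Simulate a simple nonlinear dynamical system driven by random input.
T = 600
u = rng.uniform(-1, 1, T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = (0.6 * y[t - 1] - 0.2 * y[t - 2]
            + 0.5 * np.tanh(u[t - 1])
            + 0.01 * rng.standard_normal())

# Regression pairs: x[t] = (y[t-1], y[t-2], u[t-1], u[t-2]).
X = np.column_stack([y[1:-1], y[:-2], u[1:-1], u[:-2]])
target = y[2:]

split = 400
model = SVR(kernel='rbf', C=10.0, epsilon=0.01)  # assumed settings
model.fit(X[:split], target[:split])

pred = model.predict(X[split:])
rmse = np.sqrt(np.mean((pred - target[split:]) ** 2))
print(f"one-step-ahead test RMSE: {rmse:.4f}")
```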
Bayes Point Machines. Herbrich, Ralf; Graepel, Thore; Campbell, Colin in Journal of Machine Learning Research (2001). 1, 245–279.
Kernel classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well-known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft boundaries in order to admit training errors. Experimentally, we find that, for the zero training error case, Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.
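A minimal sketch of the perceptron-based sampler mentioned in the abstract, under simplifying assumptions: each run of the classical perceptron on a random permutation of a linearly separable sample ends at some consistent classifier in version space, and averaging the normalised solutions approximates the centre of mass. Linear features are used for brevity where the paper works with kernels; all function names here are illustrative.

```python
# Sketch: approximating the Bayes point as the mean of normalised
# perceptron solutions obtained from random permutations of the data.
import numpy as np

def perceptron(X, y, rng, max_epochs=100):
    """One perceptron run on a random permutation; returns a weight
    vector consistent with the (assumed separable) sample."""
    w = np.zeros(X.shape[1])
    order = rng.permutation(len(y))
    for _ in range(max_epochs):
        mistakes = 0
        for i in order:
            if y[i] * (X[i] @ w) <= 0:
                w += y[i] * X[i]
                mistakes += 1
        if mistakes == 0:
            return w
    raise RuntimeError("data may not be linearly separable")

def bayes_point(X, y, n_samples=50, seed=0):
    rng = np.random.default_rng(seed)
    ws = [perceptron(X, y, rng) for _ in range(n_samples)]
    ws = [w / np.linalg.norm(w) for w in ws]
    return np.mean(ws, axis=0)

# Toy separable problem, with a bias feature appended.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
Xb = np.hstack([X, np.ones((40, 1))])
print("approximate Bayes point:", bayes_point(Xb, y))
```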
Algorithmic Luckiness. Herbrich, Ralf; Williamson, Robert C (2001). 391–397.
In contrast to standard statistical learning theory, which studies uniform bounds on the expected error, we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [Shawe-Taylor et al., 1998], we are also able to exploit the serendipity of the training sample. The main difference from previous approaches lies in the complexity measure: rather than covering all hypotheses in a given hypothesis space, it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers, which results in a bound that exploits both the margin and the distribution of the data in feature space.
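Schematically, and not as the paper's exact statement, bounds of this kind take the following shape: the uniform capacity term of classical bounds is replaced by a complexity measure that depends on the algorithm and the observed sample.

```latex
% Schematic form only, not the paper's precise bound. With probability
% at least 1 - \delta over the draw of an m-sample z, for the classifier
% A(z) produced by the fixed learning algorithm A,
\[
  R[A(z)] \;\le\; \hat{R}[A(z)]
  \;+\; \sqrt{\frac{\mathcal{C}_A(z) + \ln\tfrac{1}{\delta}}{m}},
\]
% where R is the expected error, \hat{R} the training error, and
% \mathcal{C}_A(z) covers only the functions A could have learned on
% samples like z; for the maximum margin algorithm this term shrinks
% as the observed margin grows.
```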
A Generalized Representer Theorem. Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alexander (2001). 416–426.
Wahba's classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite-dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independently of the (potentially infinite) dimensionality of the feature space.
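For concreteness, the statement being generalized can be written in the notation standard for kernel methods (with a reproducing kernel Hilbert space H induced by a kernel k):

```latex
% Generalized representer theorem, stated schematically: for an
% arbitrary loss c and a strictly monotonically increasing
% regularizer g, any minimizer
\[
  f^{*} \in \arg\min_{f \in H}\;
  c\bigl((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\bigr)
  + g\bigl(\lVert f \rVert_H\bigr)
\]
% admits an expansion over the training inputs alone:
\[
  f^{*}(\cdot) \;=\; \sum_{i=1}^{m} \alpha_i\, k(\cdot, x_i),
  \qquad \alpha_i \in \mathbb{R}.
\]
```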