References
Aizerman, M. A., E. M. Braverman, and L. I. Rozonoer (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, pp. 821—837.
Allwein, E. L., R. E. Schapire, and Y. Singer (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. In P. Langley (Ed.), Proceedings of the International Conference on Machine Learning, San Francisco, California, pp. 9—16. Morgan Kaufmann Publishers.
Alon, N., S. Ben-David, N. Cesa-Bianchi, and D. Haussler (1997). Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM 44(4), pp. 615—631.
Alon, N., J. H. Spencer, and P. Erdős (1991). The Probabilistic Method. John Wiley and Sons.
Amari, S. (1985). Differential-Geometrical Methods in Statistics. Berlin: Springer.
Anlauf, J. K. and M. Biehl (1989). The AdaTron: An adaptive perceptron algorithm. Europhysics Letters 10, pp. 687—692.
Anthony, M. (1997). Probabilistic analysis of learning in artificial neural networks: The PAC model and its variants. Neural Computing Surveys 1, pp. 1—47.
Anthony, M. and P. Bartlett (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
Baldi, P. and S. Brunak (1998). Bioinformatics: The Machine Learning Approach. MIT Press.
Barber, D. and C. K. I. Williams (1997). Gaussian processes for Bayesian classification via Hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, and T. Petsche (Eds.), Advances in Neural Information Processing Systems 9, pp. 340—346. MIT Press.
Barner, M. and F. Flohr (1989). Analysis. de Gruyter.
Bartlett, P., P. Long, and R. C. Williamson (1996). Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences 52(3), pp. 434—452.
Bartlett, P. and J. Shawe-Taylor (1998). Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods—Support Vector Learning, pp. 43—54. MIT Press.
Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory 44(2), pp. 525—536.
Bartlett, P. L. and J. Shawe-Taylor (1999). Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, Cambridge, MA, pp. 43—54. MIT Press.
Baudat, G. and F. Anouar (2000). Generalized discriminant analysis using a kernel approach. Neural Computation 12, pp. 2385—2404.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society 53, pp. 370—418.
Bellman, R. E. (1961). Adaptive Control Processes. Princeton, NJ: Princeton University Press.
Bennett, G. (1962). Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association 57, pp. 33—45.
Bennett, K. (1998). Combining support vector and mathematical programming methods for classification. In Advances in Kernel Methods—Support Vector Learning, pp. 307—326. MIT Press.
Berger, J. (1985). The frequentist viewpoint and conditioning. In Proceedings of the Berkeley Symposium, pp. 15—44.
Bernardo, J. and A. Smith (1994). Bayesian Theory. Chichester: John Wiley and Sons.
Bernstein, S. (1946). The Theory of Probabilities. Moscow: Gastehizdat Publishing House.
Biehl, M. and M. Opper (1995). Perceptron learning: The largest version space. In Proceedings of Workshop: Theory of Neural Networks: The Statistical Mechanics Perspective.
Billingsley, P. (1968). Convergence of Probability Measures. John Wiley and Sons.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
Bishop, C. M. and M. E. Tipping (2000). Variational relevance vector machines. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence UAI'2000, pp. 46—53.
Block, H. D. (1962). The perceptron: A model for brain functioning. Reviews of Modern Physics 34, pp. 123—135. Reprinted in Neurocomputing by Anderson and Rosenfeld.
Blumer, A., A. Ehrenfeucht, D. Haussler, and M. Warmuth (1989). Learnability and the Vapnik-Chervonenkis Dimension. Journal of the ACM 36(4), pp. 929—965.
Bois, G. P. (1961). Tables of Indefinite Integrals. Dover Publications.
Boser, B. E., I. M. Guyon, and V. N. Vapnik (1992, July). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the Annual Conference on Computational Learning Theory, Pittsburgh, PA, pp. 144—152. ACM Press.
Bousquet, O. and A. Elisseeff (2000). Stability and generalization. Technical report, Centre de Mathématiques Appliquées.
Bousquet, O. and A. Elisseeff (2001). Algorithmic stability and generalization performance. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, pp. 196—202. MIT Press.
Box, G. E. P. and G. C. Tiao (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley.
Brown, M. P. S., W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S. Furey, M. Ares, and D. Haussler (2000). Knowledge-based analysis of microarray gene expression data using support vector machines. Proceedings of the National Academy of Sciences 97(1), pp. 262—267.
Brownie, C. and J. Kiefer (1977). The ideas of conditional confidence in the simplest setting. Communications in Statistics—Theory and Methods 6(8), pp. 691—751.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), pp. 121—167.
Cantelli, F. (1933). Sulla probabilità come limite della frequenza. Rend. Accad. Lincei 26(1), p. 39.
Carl, B. and I. Stephani (1990). Entropy, Compactness, and the Approximation of Operators. Cambridge, UK: Cambridge University Press.
Casella, G. (1988). Conditionally acceptable frequentist solutions. In Statistical Decision Theory, Volume 1, pp. 73—84.
Cauchy, A. (1821). Cours d'analyse de l'École Royale Polytechnique: Analyse algébrique. Paris: Debure frères.
Chernoff, H. (1952). A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics 23, pp. 493—507.
Cortes, C. (1995). Prediction of Generalization Ability in Learning Machines. Ph.D. thesis, Department of Computer Science, University of Rochester.
Cortes, C. and V. Vapnik (1995). Support vector networks. Machine Learning 20, pp. 273—297.
Cox, R. (1946). Probability, frequency, and reasonable expectations. American Journal of Physics 14, pp. 1—13.
CPLEX Optimization Inc. (1994). Using the CPLEX callable library. Manual.
Cristianini, N. and J. Shawe-Taylor (1999). Bayesian voting schemes and large margin classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, Cambridge, MA, pp. 55—68. MIT Press.
Cristianini, N. and J. Shawe-Taylor (2000). An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press.
Debnath, L. and P. Mikusinski (1998). Hilbert Spaces with Applications. Academic Press.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39(1), pp. 1—22.
Devroye, L., L. Györfi, and G. Lugosi (1996). A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. New York: Springer.
Devroye, L. and G. Lugosi (2001). Combinatorial Methods in Density Estimation. Springer.
Devroye, L. P. and T. J. Wagner (1979). Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory 25(5), pp. 202—207.
Duda, R. O. and P. E. Hart (1973). Pattern Classification and Scene Analysis. New York: John Wiley and Sons.
Duda, R. O., P. E. Hart, and D. G. Stork (2001). Pattern Classification (Second Edition). New York: John Wiley and Sons.
Feller, W. (1950). An Introduction to Probability Theory and Its Applications, Volume 1. New York: John Wiley and Sons.
Feller, W. (1966). An Introduction to Probability Theory and Its Applications, Volume 2. New York: John Wiley and Sons.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, pp. 179—188.
Floyd, S. and M. Warmuth (1995). Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning 27, pp. 1—36.
Freund, Y. (1998). Self bounding learning algorithms. In Proceedings of the Annual Conference on Computational Learning Theory, Madison, Wisconsin, pp. 247—258.
Freund, Y., Y. Mansour, and R. E. Schapire (2000). Analysis of a pseudo-Bayesian prediction method. In Proceedings of the Conference on Information Science and Systems.
Gardner, E. (1988). The space of interactions in neural networks. Journal of Physics A 21, pp. 257—270.
Gardner, E. and B. Derrida (1988). Optimal storage properties of neural network models. Journal of Physics A 21, pp. 271—284.
Gentile, C. and M. K. Warmuth (1999). Linear hinge loss and average margin. In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems 11, Cambridge, MA, pp. 225—231. MIT Press.
Gibbs, M. and D. J. C. MacKay (1997). Efficient implementation of Gaussian processes. Technical report, Cavendish Laboratory, Cambridge, UK.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation 10(6), pp. 1455—1480.
Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari 4, p. 92.
Golub, G. H. and C. F. van Loan (1989). Matrix Computations. Johns Hopkins University Press.
Graepel, T., R. Herbrich, and J. Shawe-Taylor (2000). Generalisation error bounds for sparse linear classifiers. In Proceedings of the Annual Conference on Computational Learning Theory, pp. 298—303.
Graepel, T., R. Herbrich, and R. C. Williamson (2001). From margin to sparsity. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, Cambridge, MA, pp. 210—216. MIT Press.
Guermeur, Y., A. Elisseeff, and H. Paugam-Moisy (2000). A new multi-class SVM based on a uniform convergence result. In Proceedings of IJCNN 2000.
Gurvits, L. (1997). A note on a scale-sensitive dimension of linear bounded functionals in Banach spaces. In M. Li and A. Maruoka (Eds.), Proceedings of the International Conference on Algorithmic Learning Theory, LNAI-1316, Berlin, pp. 352—363. Springer.
Guyon, I. and D. Storck (2000). Linear discriminant and support vector classifiers. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances in Large Margin Classifiers, Cambridge, MA, pp. 147—169. MIT Press.
Hadamard, J. (1902). Sur les problèmes aux dérivées partielles et leur signification physique. Bulletin Princeton University 13, pp. 49—52.
Hadley, G. (1962). Linear Programming. London: Addison-Wesley.
Hadley, G. (1964). Nonlinear and Dynamic Programming. London: Addison-Wesley.
Harville, D. A. (1997). Matrix Algebra From a Statistician's Perspective. Springer.
Hastie, T. and R. Tibshirani (1998). Classification by pairwise coupling. In M. I. Jordan, M. J. Kearns, and S. A. Solla (Eds.), Advances in Neural Information Processing Systems 10, Cambridge, MA, pp. 507—513. MIT Press.
Haussler, D. (1999). Convolutional kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz.
Haussler, D., M. Kearns, and R. Schapire (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning 14, pp. 88—113.
Herbrich, R. (2000). Learning Linear Classifiers—Theory and Algorithms. Ph.D. thesis, Technische Universität Berlin.
Herbrich, R. and T. Graepel (2001a). Large scale Bayes point machines. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, Cambridge, MA, pp. 528—534. MIT Press.
Herbrich, R. and T. Graepel (2001b). A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, Cambridge, MA, pp. 224—230.
Herbrich, R., T. Graepel, and C. Campbell (2001). Bayes point machines. Journal of Machine Learning Research 1, pp. 245—279.
Herbrich, R., T. Graepel, and J. Shawe-Taylor (2000). Sparsity vs. large margins for linear classifiers. In Proceedings of the Annual Conference on Computational Learning Theory, pp. 304—308.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, pp. 13—30.
Jaakkola, T., M. Meila, and T. Jebara (2000). Maximum entropy discrimination. In S. A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12, Cambridge, MA, pp. 470—476. MIT Press.
Jaakkola, T. S., M. Diekhans, and D. Haussler (1999). Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pp. 149—158. AAAI Press.
Jaakkola, T. S. and D. Haussler (1999a). Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems 11, Cambridge, MA, pp. 487—493. MIT Press.
Jaakkola, T. S. and D. Haussler (1999b). Probabilistic kernel regression models. In Proceedings of the 1999 Conference on AI and Statistics.
Jaynes, E. T. (1968, September). Prior probabilities. IEEE Transactions on Systems Science and Cybernetics SSC-4(3), pp. 227—241.
Jebara, T. and T. Jaakkola (2000). Feature selection and dualities in maximum entropy discrimination. In Uncertainty in Artificial Intelligence.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society A 186, pp. 453—461.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, Berlin, pp. 137—142. Springer.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, Cambridge, MA, pp. 169—184. MIT Press.
Johnson, N. L., S. Kotz, and N. Balakrishnan (1994). Continuous Univariate Distributions, Volume 1 (Second Edition). John Wiley and Sons.
Kahane, J. P. (1968). Some Random Series of Functions. Cambridge University Press.
Karchin, R. (2000). Classifying G-protein coupled receptors with support vector machines. Master's thesis, University of California.
Kearns, M. and D. Ron (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation 11(6), pp. 1427—1453.
Kearns, M. J. and R. E. Schapire (1994). Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences 48(3), pp. 464—497.
Kearns, M. J., R. E. Schapire, and L. M. Sellie (1992). Toward efficient agnostic learning (extended abstract). In Proceedings of the Annual Conference on Computational Learning Theory, Pittsburgh, Pennsylvania, pp. 341—352. ACM Press.
Kearns, M. J. and U. V. Vazirani (1994). An Introduction to Computational Learning Theory. Cambridge, Massachusetts: MIT Press.
Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy (1999a). Improvements to Platt's SMO algorithm for SVM classifier design. Technical Report CD-99-14, Dept. of Mechanical and Production Engineering, Nat. Univ. Singapore, Singapore.
Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy (1999b). A fast iterative nearest point algorithm for support vector machine classifier design. Technical Report TR-ISL-99-03, Indian Institute of Science, Bangalore.
Kiefer, J. (1977). Conditional confidence statements and confidence estimators. Journal of the American Statistical Association 72, pp. 789—807.
Kimeldorf, G. S. and G. Wahba (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics 41, pp. 495—502.
Kivinen, J., M. K. Warmuth, and P. Auer (1997). The perceptron learning algorithm vs. Winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence 97(1—2), pp. 325—343.
Kockelkorn, U. (2000). Lineare statistische Methoden. Oldenburg-Verlag.
Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari 4, p. 33.
Kolmogorov, A. N. and S. V. Fomin (1957). Functional Analysis. Graylock Press.
Kolmogorov, A. N. and V. M. Tihomirov (1961). ε-entropy and ε-capacity of sets in functional spaces. American Mathematical Society Translations, Series 2 17(2), pp. 277—364.
König, H. (1986). Eigenvalue Distribution of Compact Operators. Basel: Birkhäuser.
Krauth, W. and M. Mézard (1987). Learning algorithms with optimal stability in neural networks. Journal of Physics A 20, pp. 745—752.
Lambert, P. F. (1969). Designing pattern categorizers with extremal paradigm information. In S. Watanabe (Ed.), Methodologies of Pattern Recognition, New York, pp. 359—391. Academic Press.
Lauritzen, S. L. (1981). Time series analysis in 1880, a discussion of contributions made by T. N. Thiele. ISI Review 49, pp. 319—333.
Lee, W. S., P. L. Bartlett, and R. C. Williamson (1998). The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory 44(5), pp. 1974—1980.
Levin, R. D. and M. Tribus (1978). The maximum entropy formalism. In Proceedings of the Maximum Entropy Formalism Conference. MIT Press.
Lindsey, J. K. (1996). Parametric Statistical Inference. Clarendon Press.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2, pp. 285—318.
Littlestone, N. and M. Warmuth (1986). Relating data compression and learnability. Technical report, University of California Santa Cruz.
Lodhi, H., J. Shawe-Taylor, N. Cristianini, and C. Watkins (2001). Text classification using kernels. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, Cambridge, MA, pp. 563—569. MIT Press.
Lunts, A. and V. Brailovsky (1969). On estimation of characters obtained in statistical procedure of recognition (in Russian). Technicheskaya Kibernetica 3.
Lütkepohl, H. (1996). Handbook of Matrices. Chichester: John Wiley and Sons.
MacKay, D. (1994). Bayesian non-linear modelling for the energy prediction competition. ASHRAE Transactions 4, pp. 448—472.
MacKay, D. J. (1999). Information theory, probability and neural networks.
MacKay, D. J. C. (1991). Bayesian Methods for Adaptive Models. Ph.D. thesis, Computation and Neural Systems, California Institute of Technology, Pasadena, CA.
MacKay, D. J. C. (1992). The evidence framework applied to classification networks. Neural Computation 4(5), pp. 720—736.
MacKay, D. J. C. (1998). Introduction to Gaussian processes. In C. M. Bishop (Ed.), Neural Networks and Machine Learning, pp. 133—165. Berlin: Springer.
Magnus, J. R. and H. Neudecker (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics (Revised Edition). John Wiley and Sons.
Marchand, M. and J. Shawe-Taylor (2001). Learning with the set covering machine. In Proceedings of the International Conference on Machine Learning, San Francisco, California, pp. 345—352. Morgan Kaufmann Publishers.
Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis. Academic Press.
Markov, A. A. (1912). Wahrscheinlichkeitsrechnung. Leipzig: B.G. Teubner Verlag.
Matheron, G. (1963). Principles of geostatistics. Economic Geology 58, pp. 1246—1266.
McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proceedings of the Annual Conference on Computational Learning Theory, Madison, Wisconsin, pp. 230—234. ACM Press.
McAllester, D. A. (1999). PAC-Bayesian model averaging. In Proceedings of the Annual Conference on Computational Learning Theory, Santa Cruz, USA, pp. 164—170.
McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, pp. 148—188. Cambridge University Press.
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London A 209, pp. 415—446.
Mika, S., G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller (1999). Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas (Eds.), Neural Networks for Signal Processing IX, pp. 41—48. IEEE.
Minka, T. (2001). Expectation Propagation for approximate Bayesian inference. Ph.D. thesis, MIT Media Lab, Cambridge, USA.
Minsky, M. and S. Papert (1969). Perceptrons: An Introduction to Computational Geometry. Cambridge, MA: MIT Press.
Mitchell, T. M. (1977). Version spaces: A candidate elimination approach to rule learning. In Proceedings of the International Joint Conference on Artificial Intelligence, Cambridge, Massachusetts, pp. 305—310. IJCAI.
Mitchell, T. M. (1982). Generalization as search. Artificial Intelligence 18(2), pp. 202—226.
Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill.
Murtagh, B. A. and M. A. Saunders (1993). MINOS 5.4 user's guide. Technical Report SOL 83.20, Stanford University.
Neal, R. (1996). Bayesian Learning for Neural Networks. Springer.
Neal, R. M. (1997a). Markov chain Monte Carlo method based on 'slicing' the density function. Technical report, Department of Statistics, University of Toronto. TR-9722.
Neal, R. M. (1997b). Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report 9702, Dept. of Statistics.
Neal, R. M. (1998). Assessing relevance determination methods using DELVE. In Neural Networks and Machine Learning, pp. 97—129. Springer.
Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, Volume 12, pp. 615—622. Polytechnic Institute of Brooklyn.
Okamoto, M. (1958). Some inequalities relating to the partial sum of binomial probabilities. Annals of the Institute of Statistical Mathematics 10, pp. 29—35.
Opper, M. and D. Haussler (1991). Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Physical Review Letters 66, p. 2677.
Opper, M. and W. Kinzel (1995). Statistical Mechanics of Generalisation, pp. 151. Springer.
Opper, M., W. Kinzel, J. Kleinz, and R. Nehl (1990). On the ability of the optimal perceptron to generalize. Journal of Physics A 23, pp. 581—586.
Opper, M. and O. Winther (2000). Gaussian processes for classification: Mean field algorithms. Neural Computation 12(11), pp. 2655—2684.
Osuna, E., R. Freund, and F. Girosi (1997). An improved training algorithm for support vector machines. In J. Principe, L. Gile, N. Morgan, and E. Wilson (Eds.), Neural Networks for Signal Processing VII—Proceedings of the 1997 IEEE Workshop, New York, pp. 276—285. IEEE.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, Cambridge, MA, pp. 185—208. MIT Press.
Platt, J. C., N. Cristianini, and J. Shawe-Taylor (2000). Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12, Cambridge, MA, pp. 547—553. MIT Press.
Poggio, T. (1975). On optimal nonlinear associative recall. Biological Cybernetics 19, pp. 201—209.
Pollard, D. (1984). Convergence of Stochastic Processes. New York: Springer.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery (1992). Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge University Press. ISBN 0-521-43108-5.
Robert, C. P. (1994). The Bayesian Choice: A Decision-Theoretic Motivation. New York: Springer.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65(6), pp. 386—408.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington D.C.: Spartan-Books.
Roth, V. and V. Steinhage (2000). Nonlinear discriminant analysis using kernel functions. In S. A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12, Cambridge, MA, pp. 568—574. MIT Press.
Rujan, P. (1993). A fast method for calculating the perceptron with maximal stability. Journal de Physique I France 3, pp. 277—290.
Rujan, P. (1997). Playing billiards in version space. Neural Computation 9, pp. 99—122.
Rujan, P. and M. Marchand (2000). Computing the Bayes kernel classifier. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances in Large Margin Classifiers, Cambridge, MA, pp. 329—347. MIT Press.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Parallel Distributed Processing. Cambridge, MA: MIT Press.
Rychetsky, M., J. Shawe-Taylor, and M. Glesner (2000). Direct Bayes point machines. In Proceedings of the International Conference on Machine Learning.
Salton, G. (1968). Automatic Information Organization and Retrieval. New York: McGraw-Hill.
Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory 13, pp. 145—147.
Scheffe, H. (1947). A useful convergence theorem for probability distributions. Annals of Mathematical Statistics 18, pp. 434—438.
Schölkopf, B., C. Burges, and V. Vapnik (1995). Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery & Data Mining, Menlo Park. AAAI Press.
Schölkopf, B., C. J. C. Burges, and A. J. Smola (1998). Advances in Kernel Methods. MIT Press.
Schölkopf, B., R. Herbrich, and A. J. Smola (2001). A generalized representer theorem. In Proceedings of the Annual Conference on Computational Learning Theory.
Schölkopf, B., J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (1999). Kernel-dependent support vector error bounds. In Ninth International Conference on Artificial Neural Networks, Conference Publications No. 470, London, pp. 103—108. IEE.
Schölkopf, B., A. Smola, R. C. Williamson, and P. L. Bartlett (2000). New support vector algorithms. Neural Computation 12, pp. 1207—1245.
Shawe-Taylor, J., P. L. Bartlett, R. C. Williamson, and M. Anthony (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory 44(5), pp. 1926—1940.
Shawe-Taylor, J. and N. Cristianini (1998). Robust bounds on generalization from the margin distribution. NeuroCOLT Technical Report NC-TR-1998-029, ESPRIT NeuroCOLT2 Working Group.
Shawe-Taylor, J. and N. Cristianini (2000). Margin distribution and soft margin. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances in Large Margin Classifiers, Cambridge, MA, pp. 349—358. MIT Press.
Shawe-Taylor, J. and R. C. Williamson (1997). A PAC analysis of a Bayesian estimator. Technical report, Royal Holloway, University of London. NC2-TR-1997-013.
Shawe-Taylor, J. and R. C. Williamson (1999). Generalization performance of classifiers in terms of observed covering numbers. In P. Fischer and H. U. Simon (Eds.), Proceedings of the European Conference on Computational Learning Theory, Volume 1572 of LNAI, Berlin, pp. 285—300. Springer.
Shelah, S. (1972). A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics 41, pp. 247—261.
Shevade, S. K., S. S. Keerthi, C. Bhattacharyya, and K. R. K. Murthy (1999). Improvements to SMO algorithm for SVM regression. Technical Report CD-99-16, Dept. of Mechanical and Production Engineering, Nat. Univ. Singapore, Singapore.
Smola, A. and B. Schölkopf (1998). From regularization operators to support vector kernels. In M. I. Jordan, M. J. Kearns, and S. A. Solla (Eds.), Advances in Neural Information Processing Systems 10, Cambridge, MA, pp. 343—349. MIT Press.
Smola, A. and B. Schölkopf (2001). A tutorial on support vector regression. Statistics and Computing. Forthcoming.
Smola, A., B. Schölkopf, and K.-R. Müller (1998). The connection between regularization operators and support vector kernels. Neural Networks 11, pp. 637—649.
Smola, A. J. (1996). Regression estimation with support vector learning machines. Diplomarbeit, Technische Universität München.
bulletSmola, A. J. (1998). Learning with Kernels. Ph.D. thesis, Technische Universität Berlin. GMD Research Series No. 25.
bulletSmola, A. J. and P. L. Bartlett (2001). Sparse greedy Gaussian process regression. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, pp. 619—625. MIT Press.
bullet Smola, A. J., P. L. Bartlett, B. Schölkopf, and D. Schuurmans (2000). Advances in Large Margin Classifiers. Cambridge, MA. MIT Press.
bulletSmola, A. J. and B. Schölkopf (1998). On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica 22, pp. 211—231.
bullet Smola, A. J., J. Shawe-Taylor, B. Schölkopf, and R. C. Williamson (2000). The entropy regularization information criterion. In S. A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12, Cambridge, MA, pp. 342—348. MIT Press.
bulletSollich, P. (2000). Probabilistic methods for support vector machines. In S. A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12, Cambridge, MA, pp. 349—355. MIT Press.
bulletSontag, E. D. (1998). VC dimension of neural networks. In C. M. Bishop (Ed.), Neural Networks and Machine Learning, pp. 69—94. Berlin. Springer.
bulletSutton, R. S. and A. G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press.
bulletTalagrand, M. (1987). The Glivenko-Cantelli problem. Annals of Probability 15, pp. 837—870.
bulletTalagrand, M. (1996). A new look at independence. Annals of Probability 24, pp. 1—34.
bullet Tikhonov, A. N. and V. Y. Arsenin (1977). Solutions of Ill-Posed Problems. V. H. Winston and Sons.
bullet Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, pp. 211–244.
bullet Tipping, M. E. (2000). The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12, Cambridge, MA, pp. 652–658. MIT Press.
bullet Trecate, G. F., C. K. I. Williams, and M. Opper (1999). Finite-dimensional approximation of Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems 11, Cambridge, MA, pp. 218–224. MIT Press.
bullet Trybulec, W. A. (1990). Pigeon hole principle. Journal of Formalized Mathematics 2.
bullet Tschebyscheff, P. L. (1936). Wahrscheinlichkeitsrechnung (in Russian). Moscow. Akademie Verlag.
bullet Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM 27(11), pp. 1134–1142.
bullet van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes. Springer.
bullet Vanderbei, R. J. (1994). LOQO: An interior point code for quadratic programming. Technical Report SOR-94-15, Statistics and Operations Research, Princeton University, NJ.
bullet Vanderbei, R. J. (1997). Linear Programming: Foundations and Extensions. Hingham. Kluwer Academic.
bullet Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York. Springer.
bullet Vapnik, V. (1998). Statistical Learning Theory. New York. John Wiley and Sons.
bullet Vapnik, V. and A. Chervonenkis (1974). Theory of Pattern Recognition (in Russian). Moscow. Nauka. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).
bullet Vapnik, V. and A. Lerner (1963). Pattern recognition using generalized portrait method. Automation and Remote Control 24, pp. 774–780.
bullet Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer.
bullet Vapnik, V. N. and A. Y. Chervonenkis (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16(2), pp. 264–281.
bullet Vapnik, V. N. and A. Y. Chervonenkis (1991). The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis 1(3), pp. 283–305.
bullet Vapnik, V. N. and S. Mukherjee (2000). Support vector method for multivariate density estimation. In S. A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12, Cambridge, MA, pp. 659–665. MIT Press.
bullet Veropoulos, K., C. Campbell, and N. Cristianini (1999). Controlling the sensitivity of support vector machines. In Proceedings of the IJCAI Workshop on Support Vector Machines, pp. 55–60.
bullet Vidyasagar, M. (1997). A Theory of Learning and Generalization. New York. Springer.
bullet Wahba, G. (1990). Spline Models for Observational Data, Volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia. SIAM.
bullet Wahba, G. (1999). Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, Cambridge, MA, pp. 69–88. MIT Press.
bullet Watkin, T. (1993). Optimal learning with a neural network. Europhysics Letters 21, p. 871.
bullet Watkins, C. (1998). Dynamic alignment kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of London.
bullet Watkins, C. (2000). Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances in Large Margin Classifiers, Cambridge, MA, pp. 39–50. MIT Press.
bullet Weston, J., A. Gammerman, M. Stitson, V. Vapnik, V. Vovk, and C. Watkins (1999). Support vector density estimation. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, Cambridge, MA, pp. 293–306. MIT Press.
bullet Weston, J. and R. Herbrich (2000). Adaptive margin support vector machines. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances in Large Margin Classifiers, Cambridge, MA, pp. 281–295. MIT Press.
bullet Weston, J. and C. Watkins (1998). Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, Egham, TW20 0EX, UK.
bullet Williams, C. K. I. (1998). Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan (Ed.), Learning in Graphical Models. Kluwer Academic.
bullet Williams, C. K. I. and D. Barber (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(12), pp. 1342–1351.
bullet Williams, C. K. I. and M. Seeger (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, Cambridge, MA, pp. 682–688. MIT Press.
bullet Williamson, R. C., A. J. Smola, and B. Schölkopf (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi and S. Goldman (Eds.), Proceedings of the Annual Conference on Computational Learning Theory, San Francisco, pp. 309–319. Morgan Kaufmann Publishers.
bullet Wolpert, D. H. (1995). The Mathematics of Generalization. Addison-Wesley.