 |
Aizerman, M. A., E. M. Braverman, and L. I. Rozonoer (1964).
Theoretical foundations of the potential function method in pattern
recognition learning. Automation and Remote Control 25, pp. 821—837.
|
 |
Allwein, E. L., R. E. Schapire, and Y. Singer (2000).
Reducing multiclass to binary: A unifying approach for margin classifiers. In P.
Langley (Ed.), Proceedings of the International Conference on Machine
Learning, San Francisco, California, pp. 9—16.
Morgan Kaufmann Publishers. |
 |
Alon, N., S. Ben-David, N. Cesa-Bianchi, and D. Haussler (1997).
Scale-sensitive dimensions, uniform convergence, and learnability.
Journal of the ACM 44(4), pp. 615—631.
|
 | Alon, N., J. H.
Spencer, and P. Erdös (1991).
The Probabilistic Method. John
Wiley and Sons. |
 | Amari, S. (1985).
Differential-Geometrical Methods in Statistics. Berlin. Springer. |
 | Anlauf, J. K. and M. Biehl
(1989). The AdaTron: An adaptive perceptron algorithm. Europhysics
Letters 10, pp. 687—692. |
 | Anthony, M. (1997).
Probabilistic
analysis of learning in artificial neural networks: The PAC model and its
variants. Neural Computing Surveys 1, pp. 1—47.
|
 | Anthony, M. and P.
Bartlett (1999). A Theory of Learning in Artificial Neural Networks.
Cambridge University Press. |
 | Baldi, P. and S. Brunak
(1998).
Bioinformatics: The Machine Learning Approach. MIT
Press. |
 | Barber, D. and C. K.
I. Williams (1997).
Gaussian processes for Bayesian classification via
Hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, and T. Petsche (Eds.),
Advances in Neural Information Processing Systems 9, pp. 340—346.
MIT Press. |
 | Barner, M. and F. Flohr
(1989). Analysis. de Gruyter. |
 | Bartlett,
P., P. Long, and R. C. Williamson (1996).
Fat-shattering and the learnability of real-valued functions. Journal of Computer and System
Sciences 52(3), pp. 434—452. |
 | Bartlett, P. and J.
Shawe-Taylor (1998).
Generalization performance of support vector
machines and other pattern classifiers. In Advances in Kernel Methods—Support
Vector Learning, pp. 43—54. MIT
Press. |
 | Bartlett, P. L. (1998).
The
sample complexity of pattern classification with neural networks: The size
of the weights is more important than the size of the network. IEEE
Transactions on Information Theory 44(2), pp. 525—536.
|
 | Bartlett, P. L.
and J. Shawe-Taylor (1999).
Generalization performance of support
vector machines and other pattern classifiers. In B. Schölkopf, C. J. C.
Burges, and A. J. Smola (Eds.), Advances in Kernel Methods—Support
Vector Learning, Cambridge, MA, pp. 43—54.
MIT Press. |
 | Baudat, G. and F. Anouar
(2000).
Generalized discriminant analysis using a kernel approach.
Neural Computation 12, pp. 2385—2404.
|
 | Bayes, T. (1763). An essay towards
solving a problem in the doctrine of chances. Philosophical
Transactions of the Royal Society 53, pp. 370–418.
|
 | Bellman, R. E. (1961). Adaptive
Control Processes. Princeton, NJ: Princeton University Press. |
 | Bennett, G. (1962). Probability
inequalities for the sum of independent random variables. Journal of
the American Statistical Association 57, pp. 33—45.
|
 | Bennett, K. (1998).
Combining support
vector and mathematical programming methods for classification. In
Advances in Kernel Methods—Support Vector
Learning, pp. 307—326. MIT Press.
|
 | Berger, J. (1985). The frequentist
viewpoint and conditioning. In Proceedings of the Berkeley Symposium,
pp. 15—44. |
 | Bernardo, J. and A. Smith
(1994). Bayesian Theory. Chichester: John Wiley and Sons.
|
 | Bernstein, S. (1946). The Theory
of Probabilities. Moscow: Gostehizdat Publishing House. |
 | Biehl, M. and M. Opper (1995).
Perceptron learning: The largest version space. In Proceedings of
Workshop: Theory of Neural Networks: The Statistical Mechanics Perspective.
|
 | Billingsley, P. (1968).
Convergence of Probability Measures. John Wiley and Sons. |
 | Bishop, C. M. (1995).
Neural
Networks for Pattern Recognition. Oxford: Clarendon Press. |
 | Bishop, C. M. and M.
E. Tipping (2000).
Variational relevance vector machines. In
Proceedings of 16th Conference on Uncertainty in Artificial Intelligence
UAI'2000, pp. 46—53. |
 | Block, H. D. (1962). The perceptron:
A model for brain functioning. Reviews of Modern Physics 34, pp.
123—135. Reprinted in Neurocomputing by Anderson and Rosenfeld.
|
 |
Blumer, A., A. Ehrenfeucht, D. Haussler, and M. Warmuth (1989).
Learnability and the Vapnik-Chervonenkis Dimension. Journal of the ACM
36(4), pp. 929—965. |
 | Bois, G. P. (1961). Tables of
Indefinite Integrals. Dover Publications. |
 |
Boser, B. E., I. M. Guyon, and V. N. Vapnik (1992, July).
A training
algorithm for optimal margin classifiers. In D. Haussler (Ed.),
Proceedings of the Annual Conference on Computational Learning Theory,
Pittsburgh, PA, pp. 144—152. ACM Press. |
 | Bousquet, O. and A.
Elisseeff (2000).
Stability and generalization. Technical report,
Centre de Mathématiques Appliquées. |
 | Bousquet, O. and A.
Elisseeff (2001).
Algorithmic stability and generalization
performance. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.),
Advances in Neural Information Processing Systems 13, pp. 196—202. MIT
Press. |
 | Box, G. E. P. and G. C.
Tiao (1973). Bayesian Inference in Statistical Analysis.
Addison-Wesley. |
 |
Brown, M. P. S., W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S.
Furey, M. Ares, and D. Haussler (2000).
Knowledge-based analysis of microarray gene expression data using support vector machines.
Proceedings of the National Academy of Sciences 97(1), pp. 262—267.
|
 | Brownie, C. and J. Kiefer
(1977). The ideas of conditional confidence in the simplest setting.
Communications in Statistics—Theory and Methods 6(8), pp. 691—751.
|
 | Burges, C. J. C. (1998).
A
tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery 2(2), pp. 121—167. |
 | Cantelli, F. (1933). Sulla
probabilità come limite della frequenza. Rend. Accad. Lincei 26(1),
p. 39. |
 | Carl, B. and I. Stephani
(1990). Entropy, compactness, and the approximation of operators.
Cambridge, UK: Cambridge University Press. |
 | Casella, G. (1988). Conditionally
acceptable frequentist solutions. In Statistical Decision Theory,
Volume 1, pp. 73—84. |
 | Cauchy, A. (1821). Cours d'analyse
de l'École Royale Polytechnique: Analyse algébrique. Paris: Debure
frères. |
 | Chernoff, H. (1952). A measure of
asymptotic efficiency of tests of a hypothesis based on the sum of
observations. Annals of Mathematical Statistics 23, pp. 493—507.
|
 | Cortes, C. (1995).
Prediction of
Generalization Ability in Learning Machines. Ph.D. thesis, Department
of Computer Science, University of Rochester. |
 | Cortes, C. and V. Vapnik
(1995).
Support vector networks. Machine Learning 20, pp. 273–297.
|
 | Cox, R. (1946). Probability, frequency,
and reasonable expectations. American Journal of Physics 14, pp.
1—13. |
 |
CPLEX Optimization Inc.
(1994). Using the CPLEX callable library. Manual. |
 | Cristianini, N.
and J. Shawe-Taylor (1999).
Bayesian voting schemes and large margin
classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.),
Advances in Kernel Methods—Support Vector Learning, Cambridge, MA, pp.
55—68. MIT Press. |
 | Cristianini, N.
and J. Shawe-Taylor (2000).
An Introduction to Support Vector
Machines. Cambridge, UK: Cambridge University Press. |
 | Debnath, L. and P.
Mikusinski (1998). Hilbert Spaces with Applications. Academic
Press. |
 |
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum
Likelihood from Incomplete Data via the EM Algorithm. Journal of the
Royal Statistical Society B 39(1), pp. 1—22. |
 | Devroye, L., L.
Györfi, and G. Lugosi (1996).
A Probabilistic Theory of Pattern
Recognition. Number 31 in Applications of Mathematics. New York:
Springer. |
 | Devroye, L. and G. Lugosi
(2001). Combinatorial Methods in Density Estimation. Springer.
|
 | Devroye, L. P. and T.
J. Wagner (1979). Distribution-free performance bounds for potential
function rules. IEEE Transactions on Information Theory 25(5), pp.
202—207. |
 | Dietrich,
R., M. Opper, and H. Sompolinsky (2000).
Support vectors and
statistical mechanics. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and
D. Schuurmans (Eds.), Advances in Large Margin Classifiers,
Cambridge, MA, pp. 359—367. MIT Press. |
 | Duda, R. O. and P. E. Hart
(1973). Pattern Classification and Scene Analysis. New York:
John Wiley and Sons. |
 | Duda, R. O.,
P. E. Hart, and D. G. Stork (2001).
Pattern Classification and
Scene Analysis. New York: John Wiley and Sons. Second edition.
|
 | Feller, W. (1950). An Introduction
to Probability Theory and Its Applications, Volume 1. New York: John
Wiley and Sons. |
 | Feller, W. (1966). An Introduction
to Probability Theory and Its Applications, Volume 2. New York: John
Wiley and Sons. |
 | Fisher, R. A. (1936). The use of
multiple measurements in taxonomic problems. Annals of Eugenics 7,
pp. 179—188. |
 | Floyd, S. and M. Warmuth
(1995). Sample compression, learnability, and the Vapnik Chervonenkis
dimension. Machine Learning 27, pp. 1—36. |
 | Freund, Y. (1998).
Self bounding
learning algorithms. In Proceedings of the Annual Conference on
Computational Learning Theory, Madison, Wisconsin, pp. 247—258. |
 | Freund,
Y., Y. Mansour, and R. E. Schapire (2000).
Analysis of a
pseudo-Bayesian prediction method. In Proceedings of the Conference on
Information Science and Systems. |
 | Gardner, E. (1988). The space of
interactions in neural networks. Journal of Physics A 21, pp.
257—270. |
 | Gardner, E. and B. Derrida
(1988). Optimal storage properties of neural network models.
Journal of Physics A 21, pp. 271—284. |
 | Gentile, C. and M. K.
Warmuth (1999).
Linear hinge loss and average margin. In M. S. Kearns,
S. A. Solla, and D. A. Cohn (Eds.), Advances in Neural Information
Processing Systems 11, Cambridge, MA, pp. 225—231. MIT Press. |
 | Gibbs, M. and D. J. C.
MacKay (1997).
Efficient implementation of Gaussian processes.
Technical report, Cavendish Laboratory, Cambridge, UK. |
 | Girosi, F. (1998).
An equivalence
between sparse approximation and support vector machines. Neural
Computation 10(6), pp. 1455—1480. |
 | Glivenko, V. (1933). Sulla
determinazione empirica delle leggi di probabilità. Giornale
dell'Istituto Italiano degli Attuari 4, p. 92. |
 | Golub, G. H. and C.
F. van Loan (1989). Matrix Computations. Johns Hopkins
University Press. |
 |
Graepel, T., R. Herbrich, and J. Shawe-Taylor (2000).
Generalisation
error bounds for sparse linear classifiers. In Proceedings of the
Annual Conference on Computational Learning Theory, pp. 298—303. |
 |
Graepel, T., R. Herbrich, and R. C. Williamson (2001).
From margin to
sparsity. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.),
Advances in Neural Information Processing Systems 13, Cambridge, MA,
pp. 210—216. MIT Press. |
 |
Guermeur, Y., A. Elisseeff, and H. Paugam-Moisy (2000).
A new
multi-class SVM based on a uniform convergence result. In Proceedings
of IJCNN 2000. |
 | Gurvits, L. (1997). A note on a
scale-sensitive dimension of linear bounded functionals in Banach spaces.
In M. Li and A. Maruoka (Eds.), Proceedings of the International
Conference on Algorithmic Learning Theory, LNAI-1316, Berlin, pp.
352—363. Springer. |
 | Guyon, I. and D. Stork
(2000). Linear discriminant and support vector classifiers. In A. J.
Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances
in Large Margin Classifiers, Cambridge, MA, pp. 147—169. MIT Press.
|
 | Hadamard, J. (1902). Sur les
problèmes aux dérivées partielles et leur signification physique.
Bulletin Princeton University 13, pp. 49–52. |
 | Hadley, G. (1962). Linear
Programming. London: Addison-Wesley. |
 | Hadley, G. (1964). Nonlinear and
Dynamic Programming. London: Addison-Wesley. |
 | Harville, D. A. (1997). Matrix
Algebra From a Statistician's Perspective. Springer. |
 | Hastie, T. and R.
Tibshirani (1998).
Classification by pairwise coupling. In M. I.
Jordan, M. J. Kearns, and S. A. Solla (Eds.), Advances in Neural
Information Processing Systems 10, Cambridge, MA, pp. 507—513. MIT
Press. |
 | Haussler, D. (1999).
Convolutional
kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer
Science Department, University of California at Santa Cruz. |
 | Haussler,
D., M. Kearns, and R. Schapire (1994).
Bounds on the sample complexity
of Bayesian learning using information theory and the VC dimension.
Machine Learning 14, pp. 88—113. |
 | Herbrich, R. (2000).
Learning
Linear Classifiers—Theory and Algorithms. Ph.D. thesis, Technische
Universität Berlin. |
 | Herbrich, R. and T.
Graepel (2001a).
Large scale Bayes point machines. In T. K. Leen, T.
G. Dietterich, and V. Tresp (Eds.), Advances in Neural Information
Processing Systems 13, Cambridge, MA, pp. 528—534. MIT Press. |
 | Herbrich, R. and T.
Graepel (2001b).
A PAC-Bayesian margin bound for linear classifiers:
Why SVMs work. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.),
Advances in Neural Information Processing Systems 13, Cambridge, MA,
pp. 224—230. |
 | Herbrich,
R., T. Graepel, and C. Campbell (2001).
Bayes point machines.
Journal of Machine Learning Research 1, pp. 245—279. |
 |
Herbrich, R., T. Graepel, and J. Shawe-Taylor (2000).
Sparsity vs.
large margins for linear classifiers. In Proceedings of the Annual
Conference on Computational Learning Theory, pp. 304—308. |
 | Hoeffding, W. (1963). Probability
inequalities for sums of bounded random variables. Journal of the
American Statistical Association 58, pp. 13—30. |
 | Jaakkola, T.,
M. Meila, and T. Jebara (2000).
Maximum entropy discrimination. In S.
A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances in Neural
Information Processing Systems 12, Cambridge, MA, pp. 470—476. MIT
Press. |
 |
Jaakkola, T. S., M. Diekhans, and D. Haussler (1999).
Using the Fisher
kernel method to detect remote protein homologies. In Proceedings of
the International Conference on Intelligent Systems for Molecular Biology,
pp. 149—158. AAAI Press. |
 | Jaakkola, T. S. and
D. Haussler (1999a).
Exploiting generative models in discriminative
classifiers. In M. S. Kearns, S. A. Solla, and D. A. Cohn (Eds.),
Advances in Neural Information Processing Systems 11, Cambridge, MA,
pp. 487—493. MIT Press. |
 | Jaakkola, T. S. and
D. Haussler (1999b).
Probabilistic kernel regression models. In
Proceedings of the 1999 Conference on AI and Statistics. |
 | Jaynes, E. T. (1968,
September). Prior probabilities. IEEE Transactions on Systems
Science and Cybernetics SSC-4(3), pp. 227—241. |
 | Jebara, T. and T. Jaakkola
(2000).
Feature selection and dualities in maximum entropy
discrimination. In Uncertainty In Artificial Intelligence. |
 | Jeffreys, H. (1946). An invariant
form for the prior probability in estimation problems. Proceedings of
the Royal Society A 186, pp. 453–461. |
 | Joachims, T. (1998).
Text
categorization with support vector machines: Learning with many relevant
features. In Proceedings of the European Conference on Machine Learning,
Berlin, pp. 137—142. Springer. |
 | Joachims, T. (1999).
Making
large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and
A. J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning,
Cambridge, MA, pp. 169—184. MIT Press. |
 | Johnson,
N. L., S. Kotz, and N. Balakrishnan (1994). Continuous Univariate
Distributions. Volume 1 (Second Edition). John Wiley and Sons. |
 | Kahane, J. P. (1968). Some
Random Series of Functions. Cambridge University Press. |
 | Karchin, R. (2000).
Classifying
G-protein coupled receptors with support vector machines. Master's thesis,
University of California. |
 | Kearns, M. and D. Ron (1999).
Algorithmic stability and sanity-check bounds for leave-one-out
cross-validation. Neural Computation 11(6), pp. 1427—1453. |
 | Kearns, M. J. and R.
E. Schapire (1994).
Efficient distribution-free learning of
probabilistic concepts. Journal of Computer and System Sciences
48(3), pp. 464—497. |
 |
Kearns, M. J., R. E. Schapire, and L. M. Sellie (1992).
Toward
efficient agnostic learning (extended abstract). In Proceedings of the
Annual Conference on Computational Learning Theory, Pittsburgh,
Pennsylvania, pp. 341—352. ACM Press. |
 | Kearns, M. J. and U.
V. Vazirani (1994).
An Introduction to Computational Learning
Theory. Cambridge, Massachusetts. MIT Press. |
 |
Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy
(1999b). A fast iterative nearest point algorithm for support vector
machine classifier design. Technical Report TR-ISL-99-03,
Indian Institute of Science, Bangalore. |
 |
Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy
(1999a). Improvements to Platt's SMO algorithm for SVM classifier
design. Technical Report CD-99-14, Dept. of Mechanical and Production
Engineering, Nat. Univ. Singapore, Singapore. |
 | Kiefer, J. (1977). Conditional
confidence statements and confidence estimators. Journal of the
American Statistical Association 72, pp. 789–807. |
 | Kimeldorf, G. S. and G.
Wahba (1970). A correspondence between Bayesian estimation on
stochastic processes and smoothing by splines. Annals of Mathematical
Statistics 41, pp. 495—502. |
 | Kivinen, J.,
M. K. Warmuth, and P. Auer (1997).
The perceptron learning algorithm
vs. winnow: Linear vs. logarithmic mistake bounds when few input variables
are relevant. Artificial Intelligence 97(1—2), pp. 325—343.
|
 | Kockelkorn, U. (2000).
Lineare
statistische Methoden. Oldenburg-Verlag. |
 | Kolmogorov, A. (1933). Sulla
determinazione empirica di una legge di distribuzione. Giornale
dell'Istituto Italiano degli Attuari 4, p. 33. |
 | Kolmogorov, A. N.
and S. V. Fomin (1957). Functional Analysis. Graylock Press.
|
 | Kolmogorov, A.
N. and V. M. Tihomirov (1961). ε-entropy
and ε-capacity of sets in functional
spaces. American Mathematical Society Translations, Series 2 17(2),
pp. 277—364. |
 | König, H. (1986). Eigenvalue
Distribution of Compact Operators. Basel. Birkhäuser. |
 | Krauth, W. and M. Mezard
(1987). Learning algorithms with optimal stability in neural networks.
Journal of Physics A 20, pp. 745—752. |
 | Lambert, P. F. (1969). Designing
pattern categorizers with extremal paradigm information. In S. Watanabe
(Ed.), Methodologies of Pattern Recognition, New York, pp. 359—391.
Academic Press. |
 | Lauritzen, S. L. (1981). Time
series analysis in 1880, a discussion of contributions made by T. N.
Thiele. ISI Review 49, pp. 319—333. |
 | Lee,
W. S., P. L. Bartlett, and R. C. Williamson (1998).
The importance of
convexity in learning with squared loss. IEEE Transactions on
Information Theory 44(5), pp. 1974—1980. |
 | Levin, R. D. and M. Tribus
(1978). The maximum entropy formalism. In Proceedings of the
Maximum Entropy Formalism Conference. MIT Press. |
 | Lindsey, J. K. (1996).
Parametric Statistical Inference. Clarendon Press.
|
 | Littlestone, N. (1988). Learning
quickly when irrelevant attributes abound: A new linear-threshold
algorithm. Machine Learning 2, pp. 285—318. |
 | Littlestone, N. and M.
Warmuth (1986).
Relating data compression and learnability. Technical
report, University of California Santa Cruz. |
 |
Lodhi, H., J. Shawe-Taylor, N. Cristianini, and C. Watkins (2001).
Text classification using kernels. In T. K. Leen, T. G. Dietterich, and V.
Tresp (Eds.), Advances in Neural Information Processing Systems 13,
Cambridge, MA, pp. 563—569. MIT Press. |
 | Lunts, A. and V.
Brailovsky (1969). On estimation of characters obtained in statistical
procedure of recognition (in Russian). Technicheskaya Kibernetika 3.
|
 | Lütkepohl, H. (1996). Handbook
of Matrices. Chichester: John Wiley and Sons. |
 | MacKay, D. (1994).
Bayesian non-linear
modelling for the energy prediction competition. ASHRAE Transactions 4,
pp. 448—472. |
 | MacKay, D. J. (1999).
Information
theory, probability and neural networks. |
 | MacKay, D. J. C. (1991).
Bayesian Methods for Adaptive Models. Ph.D. thesis, Computation and
Neural Systems, California Institute of Technology, Pasadena, CA. |
 | MacKay, D. J. C. (1992).
The
evidence framework applied to classification networks. Neural
Computation 4(5), pp. 720—736. |
 | MacKay, D. J. C. (1998).
Introduction to Gaussian processes. In C. M. Bishop (Ed.), Neural
Networks and Machine Learning, pp. 133—165. Berlin. Springer. |
 | Magnus, J. R. and H.
Neudecker (1999). Matrix Differential Calculus with Applications in
Statistics and Econometrics (Revised Edition). John Wiley and Sons.
|
 | Marchand, M. and J.
Shawe-Taylor (2001).
Learning with the set covering machine. In
Proceedings of the International Conference on Machine Learning, San
Francisco, California, pp. 345—352. Morgan Kaufmann Publishers. |
 | Mardia, K.
V., J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis.
Academic Press. |
 | Markov, A. A. (1912).
Wahrscheinlichkeitsrechnung. Leipzig: B.G. Teubner Verlag. |
 | Matheron, G. (1963). Principles of
geostatistics. Economic Geology 58, pp. 1246—1266. |
 | McAllester, D. A. (1998).
Some
PAC Bayesian theorems. In Proceedings of the Annual Conference on
Computational Learning Theory, Madison, Wisconsin, pp. 230—234. ACM
Press. |
 | McAllester, D. A. (1999).
PAC-Bayesian model averaging. In Proceedings of the Annual Conference
on Computational Learning Theory, Santa Cruz, USA, pp. 164—170. |
 | McDiarmid, C. (1989). On the method
of bounded differences. In Surveys in Combinatorics, pp. 148–188.
Cambridge University Press. |
 | Mercer, J. (1909). Functions of
positive and negative type and their connection with the theory of
integral equations. Philosophical Transactions of the Royal Society,
London A 209, pp. 415—446. |
 |
Mika, S., G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller (1999).
Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E.
Wilson, and S. Douglas (Eds.), Neural Networks for Signal Processing IX,
pp. 41—48. IEEE. |
 | Minka, T. (2001).
Expectation
Propagation for approximative Bayesian inference. Ph.D. thesis, MIT
Media Labs, Cambridge, USA. |
 | Minsky, M. and S. Papert
(1969). Perceptrons: An Introduction To Computational Geometry.
Cambridge, MA. MIT Press. |
 | Mitchell, T. M. (1977). Version
spaces: a candidate elimination approach to rule learning. In
Proceedings of the International Joint Conference on Artificial
Intelligence, Cambridge, Massachusetts, pp. 305—310. IJCAI. |
 | Mitchell, T. M. (1982).
Generalization as search. Artificial Intelligence 18(2), pp.
202—226. |
 | Mitchell, T. M. (1997).
Machine Learning. New York. McGraw-Hill.
|
 | Murtagh, B. A. and
M. A. Saunders (1993).
MINOS 5.4 user's guide. Technical Report SOL
83.20, Stanford University. |
 | Neal, R. (1996).
Bayesian Learning in
Neural Networks. Springer. |
 | Neal, R. M. (1997a).
Markov chain
Monte Carlo method based on 'slicing' the density function. Technical
report, Department of Statistics, University of Toronto. TR-9722. |
 | Neal, R. M. (1997b).
Monte Carlo
implementation of Gaussian process models for Bayesian regression and
classification. Technical Report 9702, Dept. of
Statistics. |
 | Neal, R. M. (1998).
Assessing
relevance determination methods using DELVE. In Neural Networks and
Machine Learning, pp. 97—129. Springer. |
 | Novikoff, A. B. J. (1962). On
convergence proofs on perceptrons. In Proceedings of the Symposium on
the Mathematical Theory of Automata, Volume 12, pp. 615—622.
Polytechnic Institute of Brooklyn. |
 | Okamoto, M. (1958). Some inequalities
relating to the partial sum of binomial probabilities. Annals of the
Institute of Statistical Mathematics 10, pp. 29–35. |
 | Opper, M. and D. Haussler
(1991). Generalization performance of Bayes optimal classification
algorithms for learning a perceptron. Physical Review Letters 66,
p. 2677. |
 | Opper, M. and W. Kinzel
(1995). Statistical Mechanics of Generalisation, pp. 151.
Springer. |
 | Opper,
M., W. Kinzel, J. Kleinz, and R. Nehl (1990). On the ability of the
optimal perceptron to generalize. Journal of Physics A 23, pp.
581—586. |
 | Opper, M. and O. Winther
(2000).
Gaussian processes for classification: Mean field algorithms.
Neural Computation 12(11), pp. 2655—2684. |
 | Osuna, E., R.
Freund, and F. Girosi (1997).
An improved training algorithm for
support vector machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson
(Eds.), Neural Networks for Signal Processing VII – Proceedings of the
1997 IEEE Workshop, New York, pp. 276—285. IEEE. |
 | Platt, J. (1999).
Fast training of
support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel
Methods—Support Vector Learning, Cambridge, MA, pp. 185—208. MIT
Press. |
 |
Platt, J. C., N. Cristianini, and J. Shawe-Taylor (2000).
Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R.
Müller (Eds.), Advances in Neural Information Processing Systems 12,
Cambridge, MA, pp. 547—553. MIT Press. |
 | Poggio, T. (1975). On optimal
nonlinear associative recall. Biological Cybernetics 19, pp.
201—209. |
 | Pollard, D. (1984). Convergence of
Stochastic Processes. New York. Springer. |
 |
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery (1992).
Numerical Recipes in C: The Art of Scientific Computing (2nd ed.).
Cambridge University Press. ISBN 0-521-43108-5. |
 | Robert, C. P. (1994).
The
Bayesian choice: A decision theoretic motivation. New York. Springer.
|
 | Rosenblatt, F. (1958). The
perceptron: A probabilistic model for information storage and organization
in the brain. Psychological Review 65(6), pp. 386—408. |
 | Rosenblatt, F. (1962).
Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.
Washington D.C.: Spartan Books. |
 | Roth, V. and V. Steinhage
(2000). Nonlinear discriminant analysis using kernel functions. In S.
A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances in Neural
Information Processing Systems 12, Cambridge, MA, pp. 568—574. MIT
Press. |
 | Rujan, P. (1993).
A fast method for
calculating the perceptron with maximal stability. Journal de Physique
I France 3, pp. 277—290. |
 | Rujan, P. (1997).
Playing billiards in
version space. Neural Computation 9, pp. 99—122. |
 | Rujan, P. and M. Marchand
(2000).
Computing the Bayes kernel classifier. In A. J. Smola, P. L.
Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances in Large
Margin Classifiers, Cambridge, MA, pp. 329—347. MIT Press. |
 |
Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Parallel
Distributed Processing. Cambridge, MA. MIT Press. |
 |
Rychetsky, M., J. Shawe-Taylor, and M. Glesner (2000). Direct Bayes
point machines. In Proceedings of the International Conference on
Machine Learning. |
 | Salton, G. (1968). Automatic
Information Organization and Retrieval. New York. McGraw-Hill. |
 | Sauer, N. (1972). On the density of
families of sets. Journal of Combinatorial Theory 13, pp. 145—147.
|
 | Scheffe, H. (1947). A useful
convergence theorem for probability distributions. Annals of
Mathematical Statistics 18, pp. 434—438. |
 | Schölkopf,
B., C. Burges, and V. Vapnik (1995).
Extracting support data for a
given task. In U. M. Fayyad and R. Uthurusamy (Eds.), Proceedings,
First International Conference on Knowledge Discovery & Data Mining,
Menlo Park. AAAI Press. |
 |
Schölkopf, B., C. J. C. Burges, and A. J. Smola (1998).
Advances in
Kernel Methods. MIT Press. |
 |
Schölkopf, B., R. Herbrich, and A. J. Smola (2001).
A generalized representer theorem. In Proceedings of the Annual Conference on
Computational Learning Theory. |
 |
Schölkopf, B., J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (1999).
Kernel-dependent support vector error bounds. In Ninth International
Conference on Artificial Neural Networks, Conference Publications No.
470, London, pp. 103—108. IEE. |
 |
Schölkopf, B., A. Smola, R. C. Williamson, and P. L. Bartlett (2000).
New support vector algorithms. Neural Computation 12, pp.
1207—1245. |
 |
Shawe-Taylor, J., P. L. Bartlett, R. C. Williamson, and M. Anthony (1998).
Structural risk minimization over data-dependent hierarchies. IEEE
Transactions on Information Theory 44(5), pp. 1926—1940. |
 | Shawe-Taylor, J.
and N. Cristianini (1998).
Robust bounds on generalization from the
margin distribution. NeuroCOLT Technical Report NC-TR-1998-029, ESPRIT
NeuroCOLT2 Working Group. |
 | Shawe-Taylor, J.
and N. Cristianini (2000).
Margin distribution and soft margin. In A.
J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.),
Advances in Large Margin Classifiers, Cambridge, MA, pp. 349—358. MIT
Press. |
 | Shawe-Taylor,
J. and R. C. Williamson (1997).
A PAC analysis of a Bayesian
estimator. Technical report, Royal Holloway, University of London.
NC2-TR-1997-013. |
 | Shawe-Taylor,
J. and R. C. Williamson (1999). Generalization performance of
classifiers in terms of observed covering numbers. In P. Fischer and H. U.
Simon (Eds.), Proceedings of the European Conference on Computational
Learning Theory, Volume 1572 of LNAI, Berlin, pp. 285—300.
Springer. |
 | Shelah, S. (1972). A combinatorial
problem; stability and order for models and theories in infinitary
languages. Pacific Journal of Mathematics 41, pp. 247—261. |
 |
Shevade, S. K., S. S. Keerthi, C. Bhattacharyya, and K. R. K. Murthy
(1999).
Improvements to SMO algorithm for SVM regression. Technical
Report CD-99-16, Dept. of Mechanical and Production Engineering, Nat.
Univ. Singapore, Singapore. |
 | Smola, A. and B. Schölkopf
(1998).
From regularization operators to support vector kernels. In M.
I. Jordan, M. J. Kearns, and S. A. Solla (Eds.), Advances in Neural
Information Processing Systems 10, Cambridge, MA, pp. 343—349. MIT
Press. |
 | Smola, A. and B. Schölkopf
(2001).
A tutorial on support vector regression. Statistics and
Computing. Forthcoming. |
 | Smola, A.,
B. Schölkopf, and K.-R. Müller (1998). The connection between
regularization operators and support vector kernels. Neural Networks 11,
pp. 637—649. |
 | Smola, A. J. (1996).
Regression
estimation with support vector learning machines. Diplomarbeit, Technische
Universität München. |
 | Smola, A. J. (1998).
Learning
with Kernels. Ph.D. thesis, Technische Universität Berlin. GMD
Research Series No. 25. |
 | Smola, A. J. and P.
L. Bartlett (2001).
Sparse greedy Gaussian process regression. In T.
K. Leen, T. G. Dietterich, and V. Tresp (Eds.), Advances in Neural
Information Processing Systems 13, pp. 619—625. MIT Press. |
 |
Smola, A. J., P. L. Bartlett, B. Schölkopf, and D. Schuurmans (2000).
Advances in Large Margin Classifiers. Cambridge, MA. MIT Press.
|
 | Smola, A. J. and B.
Schölkopf (1998).
On a kernel-based method for pattern recognition,
regression, approximation and operator inversion. Algorithmica 22,
pp. 211—231. |
 |
Smola, A. J., J. Shawe-Taylor, B. Schölkopf, and R. C. Williamson (2000).
The entropy regularization information criterion. In S. A. Solla, T. K.
Leen, and K.-R. Müller (Eds.), Advances in Neural Information
Processing Systems 12, Cambridge, MA, pp. 342—348. MIT Press. |
 | Sollich, P. (2000).
Probabilistic
methods for support vector machines. In S. A. Solla, T. K. Leen, and K.-R.
Müller (Eds.), Advances in Neural Information Processing Systems 12,
Cambridge, MA, pp. 349—355. MIT Press. |
 | Sontag, E. D. (1998).
VC dimension
of neural networks. In C. M. Bishop (Ed.), Neural Networks and Machine
Learning, pp. 69—94. Berlin. Springer. |
 | Sutton, R. S. and A. G.
Barto (1998).
Reinforcement Learning: An Introduction. MIT
Press. |
 | Talagrand, M. (1987). The
Glivenko-Cantelli problem. Annals of Probability 15, pp. 837—870.
|
 | Talagrand, M. (1996).
A new look at
independence. Annals of Probability 24, pp. 1—34. |
 | Tikhonov, A. N. and
V. Y. Arsenin (1977). Solutions of Ill-Posed Problems. V.H.
Winston and Sons. |
 | Tipping, M. (2001).
Sparse Bayesian
learning and the relevance vector machine. Journal of Machine Learning
Research 1, pp. 211—244. |
 | Tipping, M. E. (2000).
The
relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller
(Eds.), Advances in Neural Information Processing Systems 12,
Cambridge, MA, pp. 652—658. MIT Press. |
 | Trecate,
G. F., C. K. I. Williams, and M. Opper (1999).
Finite-dimensional
approximation of Gaussian processes. In M. S. Kearns, S. A. Solla, and D.
A. Cohn (Eds.), Advances in Neural Information Processing Systems 11,
Cambridge, MA, pp. 218—224. MIT Press. |
 | Trybulec, W. A. (1990).
Pigeon
hole principle. Journal of Formalized Mathematics 2. |
 | Tschebyscheff, P. L. (1936).
Wahrscheinlichkeitsrechnung (in Russian). Moskau. Akademie Verlag.
|
 | Valiant, L. G. (1984). A theory of
the learnable. Communications of the ACM 27(11), pp. 1134—1142.
|
 | van der Vaart,
A. W. and J. A. Wellner (1996).
Weak Convergence and Empirical
Processes. Springer. |
 | Vanderbei, R. J. (1994).
LOQO:
An interior point code for quadratic programming. TR SOR-94-15, Statistics
and Operations Research, Princeton Univ., NJ. |
 | Vanderbei, R. J. (1997).
Linear Programming: Foundations and Extensions. Hingham. Kluwer
Academic. |
 | Vapnik, V. (1995). The Nature of
Statistical Learning Theory. New York. Springer. |
 | Vapnik, V. (1998). Statistical
Learning Theory. New York. John Wiley and Sons. |
 | Vapnik, V. and A.
Chervonenkis (1974). Theory of Pattern Recognition (in Russian).
Moscow. Nauka. (German translation: W. Wapnik & A. Tscherwonenkis,
Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979). |
 | Vapnik, V. and A. Lerner
(1963). Pattern recognition using generalized portrait method.
Automation and Remote Control 24, pp. 774—780. |
 | Vapnik, V. N. (1982). Estimation
of Dependences Based on Empirical Data. Springer. |
 | Vapnik, V. N.
and A. Y. Chervonenkis (1971). On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and
its Applications 16(2), pp. 264—281. |
 | Vapnik, V. N.
and A. Y. Chervonenkis (1991). The necessary and sufficient conditions
for consistency in the empirical risk minimization method. Pattern
Recognition and Image Analysis 1(3), pp. 283—305. |
 | Vapnik, V. N. and S.
Mukherjee (2000).
Support vector method for multivariate density
estimation. In S. A. Solla, T. K. Leen, and K.-R. Müller (Eds.),
Advances in Neural Information Processing Systems 12, Cambridge, MA,
pp. 659—665. MIT Press. |
 |
Veropoulos, K., C. Campbell, and N. Cristianini (1999).
Controlling
the sensitivity of support vector machines. In Proceedings of IJCAI
Workshop Support Vector Machines, pp. 55—60. |
 | Vidyasagar, M. (1997). A Theory
of Learning and Generalization. New York. Springer. |
 | Wahba, G. (1990).
Spline Models for
Observational Data, Volume 59 of CBMS-NSF Regional Conference
Series in Applied Mathematics. Philadelphia. SIAM. |
 | Wahba, G. (1999).
Support vector
machines, reproducing kernel Hilbert spaces and the randomized GACV. In B.
Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel
Methods—Support Vector Learning, Cambridge, MA, pp. 69—88. MIT Press.
|
 | Watkin, T. (1993). Optimal learning
with a neural network. Europhysics Letters 21, p. 871. |
 | Watkins, C. (1998).
Dynamic alignment
kernels. Technical report, Royal Holloway, University of London.
CSD-TR-98-11. |
 | Watkins, C. (2000). Dynamic alignment
kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans
(Eds.), Advances in Large Margin Classifiers, Cambridge, MA, pp.
39—50. MIT Press. |
 |
Weston, J., A. Gammerman, M. Stitson, V. Vapnik, V. Vovk, and C. Watkins
(1999).
Support vector density estimation. In B. Schölkopf, C. J. C.
Burges, and A. J. Smola (Eds.), Advances in Kernel Methods—Support
Vector Learning, Cambridge, MA, pp. 293—306. MIT Press. |
 | Weston, J. and R. Herbrich
(2000).
Adaptive margin support vector machines. In A. J. Smola, P. L.
Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances in Large
Margin Classifiers, Cambridge, MA, pp. 281—295. MIT Press. |
 | Weston, J. and C. Watkins
(1998).
Multi-class support vector machines. Technical Report
CSD-TR-98-04, Department of Computer Science, Royal Holloway, University
of London, Egham, TW20 0EX, UK. |
 | Williams, C. K. I. (1998).
Prediction with Gaussian processes: From linear regression to linear
prediction and beyond. In M. I. Jordan (Ed.), Learning and Inference in
Graphical Models. Kluwer Academic. |
 | Williams, C. K. I.
and D. Barber (1998).
Bayesian classification with Gaussian processes.
IEEE Transactions on Pattern Analysis and Machine Intelligence 20(12),
pp. 1342—1351. |
 | Williams, C. K. I.
and M. Seeger (2001).
Using the Nyström method to speed up kernel
machines. In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.),
Advances in Neural Information Processing Systems 13, Cambridge, MA,
pp. 682—688. MIT Press. |
 |
Williamson, R. C., A. J. Smola, and B. Schölkopf (2000).
Entropy
numbers of linear function classes. In N. Cesa-Bianchi and S. Goldman
(Eds.), Proceedings of the Annual Conference on Computational Learning
Theory, San Francisco, pp. 309—319. Morgan Kaufmann Publishers. |
 | Wolpert, D. H. (1995).
The
Mathematics of Generalization. Addison-Wesley. |