Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension

Authors: Moritz Haas, David Holzmüller, Ulrike von Luxburg, Ingo Steinwart

Abstract: The success of over-parameterized neural networks trained to near-zero training error has caused great interest in the phenomenon of benign overfitting, where estimators are statistically consistent even though they interpolate noisy training data. While benign overfitting in fixed dimension has been established for some learning methods, current literature suggests that for regression with typical kernel methods and wide neural networks, benign overfitting requires a high-dimensional setting where the dimension grows with the sample size. In this paper, we show that the smoothness of the estimators, and not the dimension, is the key: benign overfitting is possible if and only if the estimator's derivatives are large enough. We generalize existing inconsistency results to non-interpolating models and more kernels to show that benign overfitting with moderate derivatives is impossible in fixed dimension. Conversely, we show that rate-optimal benign overfitting is possible for regression with a sequence of spiky-smooth kernels with large derivatives. Using neural tangent kernels, we translate our results to wide neural networks. We prove that while infinite-width networks do not overfit benignly with the ReLU activation, this can be fixed by adding small high-frequency fluctuations to the activation function. Our experiments verify that such neural networks, while overfitting, can indeed generalize well even on low-dimensional data sets.

Submitted 26 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: We provide Python code to reproduce all of our experimental results at https://github.com/moritzhaas/mind-the-spikes

arXiv:2212.12474 [pdf, other]

Physics-Informed Gaussian Process Regression Generalizes Linear PDE Solvers

Authors: Marvin Pförtner, Ingo Steinwart, Philipp Hennig, Jonathan Wenger

Abstract: Linear partial differential equations (PDEs) are an important, widely applied class of mechanistic models, describing physical processes such as heat transfer, electromagnetism, and wave propagation. In practice, specialized numerical methods based on discretization are used to solve PDEs. They generally use an estimate of the unknown model parameters and, if available, physical measurements for initialization. Such solvers are often embedded into larger scientific models with a downstream application and thus error quantification plays a key role. However, by ignoring parameter and measurement uncertainty, classical PDE solvers may fail to produce consistent estimates of their inherent approximation error. In this work, we approach this problem in a principled fashion by interpreting solving linear PDEs as physics-informed Gaussian process (GP) regression. Our framework is based on a key generalization of the Gaussian process inference theorem to observations made via an arbitrary bounded linear operator. Crucially, this probabilistic viewpoint allows to (1) quantify the inherent discretization error; (2) propagate uncertainty about the model parameters to the solution; and (3) condition on noisy measurements. Demonstrating the strength of this formulation, we prove that it strictly generalizes methods of weighted residuals, a central class of PDE solvers including collocation, finite volume, pseudospectral, and (generalized) Galerkin methods such as finite element and spectral methods. This class can thus be directly equipped with a structured error estimate. In summary, our results enable the seamless integration of mechanistic models as modular building blocks into probabilistic models by blurring the boundaries between numerical analysis and Bayesian inference.

Submitted 28 April, 2024; v1 submitted 23 December, 2022; originally announced December 2022.

arXiv:2206.11517 [pdf, other]

Utilizing Expert Features for Contrastive Learning of Time-Series Representations

Authors: Manuel Nonnenmacher, Lukas Oldenburg, Ingo Steinwart, David Reeb

Abstract: We present an approach that incorporates expert knowledge for time-series representation learning. Our method employs expert features to replace the commonly used data transformations in previous contrastive learning approaches. We do this since time-series data frequently stems from the industrial or medical field where expert features are often available from domain experts, while transformations are generally elusive for time-series data. We start by proposing two properties that useful time-series representations should fulfill and show that current representation learning approaches do not ensure these properties. We therefore devise ExpCLR, a novel contrastive learning approach built on an objective that utilizes expert features to encourage both properties for the learned representation. Finally, we demonstrate on three real-world time-series datasets that ExpCLR surpasses several state-of-the-art methods for both unsupervised and semi-supervised representation learning.

Submitted 23 June, 2022; originally announced June 2022.

Journal ref: Proceedings of the 39th International Conference on Machine Learning (ICML), PMLR 162:16969-16989, 2022

arXiv:2203.09410 [pdf, other]

A Framework and Benchmark for Deep Batch Active Learning for Regression

Authors: David Holzmüller, Viktor Zaverkin, Johannes Kästner, Ingo Steinwart

Abstract: The acquisition of labels for supervised learning can be expensive. To improve the sample efficiency of neural network regression, we study active learning methods that adaptively select batches of unlabeled data for labeling. We present a framework for constructing such methods out of (network-dependent) base kernels, kernel transformations, and selection methods. Our framework encompasses many existing Bayesian methods based on Gaussian process approximations of neural networks as well as non-Bayesian methods. Additionally, we propose to replace the commonly used last-layer features with sketched finite-width neural tangent kernels and to combine them with a novel clustering method. To evaluate different methods, we introduce an open-source benchmark consisting of 15 large tabular regression data sets. Our proposed method outperforms the state-of-the-art on our benchmark, scales to large data sets, and works out-of-the-box without adjusting the network architecture or training code. We provide open-source code that includes efficient implementations of all kernels, kernel transformations, and selection methods, and can be used for reproducing our results.

Submitted 1 August, 2023; v1 submitted 17 March, 2022; originally announced March 2022.

Comments: Published at the Journal of Machine Learning Research (JMLR). Changes in v4: Improvements in writing and other minor changes. Accompanying code can be found at https://github.com/dholzmueller/bmdal_reg

Journal ref: Journal of Machine Learning Research, 24(164):1-81, 2023

arXiv:2110.11395 [pdf, other]

SOSP: Efficiently Capturing Global Correlations by Second-Order Structured Pruning

Authors: Manuel Nonnenmacher, Thomas Pfeil, Ingo Steinwart, David Reeb

Abstract: Pruning neural networks reduces inference time and memory costs. On standard hardware, these benefits will be especially prominent if coarse-grained structures, like feature maps, are pruned. We devise two novel saliency-based methods for second-order structured pruning (SOSP) which include correlations among all structures and layers. Our main method SOSP-H employs an innovative second-order approximation, which enables saliency evaluations by fast Hessian-vector products. SOSP-H thereby scales like a first-order method despite taking into account the full Hessian. We validate SOSP-H by comparing it to our second method SOSP-I that uses a well-established Hessian approximation, and to numerous state-of-the-art methods. While SOSP-H performs on par or better in terms of accuracy, it has clear advantages in terms of scalability and efficiency. This allowed us to scale SOSP-H to large-scale vision tasks, even though it captures correlations across all layers of the network. To underscore the global nature of our pruning methods, we evaluate their performance not only by removing structures from a pretrained network, but also by detecting architectural bottlenecks. We show that our algorithms allow to systematically reveal architectural bottlenecks, which we then remove to further increase the accuracy of the networks.

Submitted 30 June, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Journal ref: International Conference on Learning Representations (ICLR) 2022

arXiv:2109.09569 [pdf, other]

doi 10.1021/acs.jctc.1c00527

Fast and Sample-Efficient Interatomic Neural Network Potentials for Molecules and Materials Based on Gaussian Moments

Authors: Viktor Zaverkin, David Holzmüller, Ingo Steinwart, Johannes Kästner

Abstract: Artificial neural networks (NNs) are one of the most frequently used machine learning approaches to construct interatomic potentials and enable efficient large-scale atomistic simulations with almost ab initio accuracy. However, the simultaneous training of NNs on energies and forces, which are a prerequisite for, e.g., molecular dynamics simulations, can be demanding. In this work, we present an improved NN architecture based on the previous GM-NN model [V. Zaverkin and J. Kästner, J. Chem. Theory Comput. 16, 5410-5421 (2020)], which shows an improved prediction accuracy and considerably reduced training times. Moreover, we extend the applicability of Gaussian moment-based interatomic potentials to periodic systems and demonstrate the overall excellent transferability and robustness of the respective models. The fast training by the improved methodology is a pre-requisite for training-heavy workflows such as active learning or learning-on-the-fly.

Submitted 20 September, 2021; originally announced September 2021.

Comments: Manuscript accepted for publication in J. Chem. Theory Comput.; Code published at https://gitlab.com/zaverkin_v/gmnn

arXiv:2011.02408 [pdf, other]

doi 10.1007/978-3-030-86523-8_6

Which Minimizer Does My Neural Network Converge To?

Authors: Manuel Nonnenmacher, David Reeb, Ingo Steinwart

Abstract: The loss surface of an overparameterized neural network (NN) possesses many global minima of zero training error. We explain how common variants of the standard NN training procedure change the minimizer obtained. First, we make explicit how the size of the initialization of a strongly overparameterized NN affects the minimizer and can deteriorate its final test performance. We propose a strategy to limit this effect. Then, we demonstrate that for adaptive optimization such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer. This adaptive minimizer is changed further by stochastic mini-batch training, even though in the non-adaptive case, GD and stochastic GD result in essentially the same minimizer. Lastly, we explain that these effects remain relevant for less overparameterized NNs. While overparameterization has its benefits, our work highlights that it induces sources of error absent from underparameterized models.

Submitted 30 June, 2022; v1 submitted 4 November, 2020; originally announced November 2020.

Journal ref: ECML PKDD 2021. Machine Learning and Knowledge Discovery in Databases. Research Track

arXiv:2002.04861 [pdf, other]

Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent

Authors: David Holzmüller, Ingo Steinwart

Abstract: We prove that two-layer (Leaky)ReLU networks initialized by e.g. the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the found network essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations, for some multi-dimensional distributions and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.

Submitted 8 June, 2022; v1 submitted 12 February, 2020; originally announced February 2020.

Comments: To appear in Journal of Machine Learning Research (JMLR). Changes in v3: Added new Section 10 with extensive experimental evaluation. Code available at https://github.com/dholzmueller/nn_inconsistency

arXiv:1905.11028 [pdf, other]

Best-scored Random Forest Classification

Authors: Hanyuan Hang, Xiaoyu Liu, Ingo Steinwart

Abstract: We propose an algorithm named best-scored random forest for binary classification problems. The terminology "best-scored" means to select the one with the best empirical performance out of a certain number of purely random tree candidates as each single tree in the forest. In this way, the resulting forest can be more accurate than the original purely random forest. From the theoretical perspective, within the framework of regularized empirical risk minimization penalized on the number of splits, we establish almost optimal convergence rates for the proposed best-scored random trees under certain conditions which can be extended to the best-scored random forest. In addition, we present a counterexample to illustrate that in order to ensure the consistency of the forest, every dimension must have the chance to be split. In the numerical experiments, for the sake of efficiency, we employ an adaptive random splitting criterion. Comparative experiments with other state-of-art classification methods demonstrate the accuracy of our best-scored random forest.

Submitted 27 May, 2019; originally announced May 2019.

arXiv:1905.10686 [pdf, other]

Empirical Risk Minimization in the Interpolating Regime with Application to Neural Network Learning

Authors: Nicole Mücke, Ingo Steinwart

Abstract: A common strategy to train deep neural networks (DNNs) is to use very large architectures and to train them until they (almost) achieve zero training error. Empirically observed good generalization performance on test data, even in the presence of lots of label noise, corroborate such a procedure. On the other hand, in statistical learning theory it is known that over-fitting models may lead to poor generalization properties, occurring in e.g. empirical risk minimization (ERM) over too large hypotheses classes. Inspired by this contradictory behavior, so-called interpolation methods have recently received much attention, leading to consistent and optimally learning methods for some local averaging schemes with zero training error. However, there is no theoretical analysis of interpolating ERM-like methods so far. We take a step in this direction by showing that for certain, large hypotheses classes, some interpolating ERMs enjoy very good statistical guarantees while others fail in the worst sense. Moreover, we show that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.

Submitted 23 July, 2021; v1 submitted 25 May, 2019; originally announced May 2019.

arXiv:1903.11482 [pdf, other]

A Sober Look at Neural Network Initializations

Authors: Ingo Steinwart

Abstract: Initializing the weights and the biases is a key part of the training process of a neural network. Unlike the subsequent optimization phase, however, the initialization phase has gained only limited attention in the literature. In this paper we discuss some consequences of commonly used initialization strategies for vanilla DNNs with ReLU activations. Based on these insights we then develop an alternative initialization strategy. Finally, we present some large scale experiments assessing the quality of the new initialization strategy.

Submitted 4 September, 2019; v1 submitted 27 March, 2019; originally announced March 2019.

arXiv:1810.02321 [pdf, ps, other]

Optimal Learning with Anisotropic Gaussian SVMs

Authors: Hanyuan Hang, Ingo Steinwart

Abstract: This paper investigates the nonparametric regression problem using SVMs with anisotropic Gaussian RBF kernels. Under the assumption that the target functions reside in certain anisotropic Besov spaces, we establish the almost optimal learning rates, more precisely, optimal up to some logarithmic factor, presented by the effective smoothness. By taking the effective smoothness into consideration, our almost optimal learning rates are faster than those obtained with the underlying RKHSs being certain anisotropic Sobolev spaces. Moreover, if the target function depends only on fewer dimensions, faster learning rates can be further achieved.

Submitted 4 October, 2018; originally announced October 2018.

arXiv:1712.05279 [pdf, ps, other]

Strictly proper kernel scores and characteristic kernels on compact spaces

Authors: Ingo Steinwart, Johanna F. Ziegel

Abstract: Strictly proper kernel scores are a well-known tool in probabilistic forecasting, while characteristic kernels have been extensively investigated in the machine learning literature. We first show that both notions coincide, so that insights from one part of the literature can be used in the other. We then show that the metric induced by a characteristic kernel cannot reliably distinguish between distributions that are far apart in the total variation norm as soon as the underlying space of measures is infinite dimensional. In addition, we provide a characterization of characteristic kernels in terms of eigenvalues and -functions and apply this characterization to the case of continuous kernels on (locally) compact spaces. In the compact case we further show that characteristic kernels exist if and only if the space is metrizable. As special cases of our general theory we investigate translation-invariant kernels on compact Abelian groups and isotropic kernels on spheres. The latter are of particular interest for forecast evaluation of probabilistic predictions on spherical domains as frequently encountered in meteorology and climatology.

Submitted 14 December, 2017; originally announced December 2017.

arXiv:1708.05254 [pdf, other]

Adaptive Clustering Using Kernel Density Estimators

Authors: Ingo Steinwart, Bharath K. Sriperumbudur, Philipp Thomann

Abstract: We derive and analyze a generic, recursive algorithm for estimating all splits in a finite cluster tree as well as the corresponding clusters. We further investigate statistical properties of this generic clustering algorithm when it receives level set estimates from a kernel density estimator. In particular, we derive finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. For these results we do not need continuity assumptions on the density such as Hölder continuity, but only require intuitive geometric assumptions of non-parametric nature.

Submitted 1 November, 2021; v1 submitted 17 August, 2017; originally announced August 2017.

arXiv:1702.07552 [pdf, ps, other]

Learning Rates for Kernel-Based Expectile Regression

Authors: Muhammad Farooq, Ingo Steinwart

Abstract: Conditional expectiles are becoming an increasingly important tool in finance as well as in other areas of applications. We analyse a support vector machine type approach for estimating conditional expectiles and establish learning rates that are minimax optimal modulo a logarithmic factor if Gaussian RBF kernels are used and the desired expectile is smooth in a Besov sense. As a special case, our learning rates improve the best known rates for kernel-based least squares regression in this scenario. Key ingredients of our statistical analysis are a general calibration inequality for the asymmetric least squares loss, a corresponding variance bound as well as an improved entropy number bound for Gaussian RBF kernels.

Submitted 27 February, 2017; v1 submitted 24 February, 2017; originally announced February 2017.

arXiv:1702.07254 [pdf, ps, other]

Sobolev Norm Learning Rates for Regularized Least-Squares Algorithm

Authors: Simon Fischer, Ingo Steinwart

Abstract: Learning rates for least-squares regression are typically expressed in terms of $L_2$-norms. In this paper we extend these rates to norms stronger than the $L_2$-norm without requiring the regression function to be contained in the hypothesis space. In the special case of Sobolev reproducing kernel Hilbert spaces used as hypotheses spaces, these stronger norms coincide with fractional Sobolev norms between the used Sobolev space and $L_2$. As a consequence, not only the target function but also some of its derivatives can be estimated without changing the algorithm. From a technical point of view, we combine the well-known integral operator techniques with an embedding property, which so far has only been used in combination with empirical process arguments. This combination results in new finite sample bounds with respect to the stronger norms. From these finite sample bounds our rates easily follow. Finally, we prove the asymptotic optimality of our results in many cases.

Submitted 8 October, 2020; v1 submitted 23 February, 2017; originally announced February 2017.

Comments: accepted manuscript in J. Mach. Learn. Res

Journal ref: J. Mach. Learn. Res. 21 (2020) 1-38

arXiv:1702.06899 [pdf, ps, other]

liquidSVM: A Fast and Versatile SVM package

Authors: Ingo Steinwart, Philipp Thomann

Abstract: liquidSVM is a package written in C++ that provides SVM-type solvers for various classification and regression tasks. Because of a fully integrated hyper-parameter selection, very carefully implemented solvers, multi-threading and GPU support, and several built-in data decomposition strategies it provides unprecedented speed for small training sizes as well as for data sets of tens of millions of samples. Besides the C++ API and a command line interface, bindings to R, MATLAB, Java, Python, and Spark are available. We present a brief description of the package and report experimental comparisons to other SVM packages.

Submitted 22 February, 2017; originally announced February 2017.

arXiv:1612.00824 [pdf, other]

Learning with Hierarchical Gaussian Kernels

Authors: Ingo Steinwart, Philipp Thomann, Nico Schmid

Abstract: We investigate iterated compositions of weighted sums of Gaussian kernels and provide an interpretation of the construction that shows some similarities with the architectures of deep neural networks. On the theoretical side, we show that these kernels are universal and that SVMs using these kernels are universally consistent. We further describe a parameter optimization method for the kernel parameters and empirically compare this method to SVMs, random forests, a multiple kernel learning approach, and to some deep neural networks.

Submitted 2 December, 2016; originally announced December 2016.

arXiv:1612.00374 [pdf, other]

Spatial Decompositions for Large Scale SVMs

Authors: Philipp Thomann, Ingrid Blaschzyk, Mona Meister, Ingo Steinwart

Abstract: Although support vector machines (SVMs) are theoretically well understood, their underlying optimization problem becomes very expensive if, for example, hundreds of thousands of samples and a non-linear kernel are considered. Several approaches have been proposed in the past to address this serious limitation. In this work we investigate a decomposition strategy that learns on small, spatially defined data chunks. Our contributions are twofold: On the theoretical side we establish an oracle inequality for the overall learning method using the hinge loss, and show that the resulting rates match those known for SVMs solving the complete optimization problem with Gaussian kernels. On the practical side we compare our approach to learning SVMs on small, randomly chosen chunks. Here it turns out that for comparable training times our approach is significantly faster during testing and also reduces the test error in most cases significantly. Furthermore, we show that our approach easily scales up to 10 million training samples: including hyper-parameter selection using cross validation, the entire training only takes a few hours on a single machine. Finally, we report an experiment on 32 million training samples. All experiments used liquidSVM (Steinwart and Thomann, 2017).

Submitted 8 February, 2018; v1 submitted 1 December, 2016; originally announced December 2016.

Journal ref: Proceedings of Machine Learning Research Volume 54: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics 2017 (A. Singh and J. Zhu, eds.), pp. 1329-1337, 2017

arXiv:1607.03792 [pdf, ps, other]

Kernel Density Estimation for Dynamical Systems

Authors: Hanyuan Hang, Ingo Steinwart, Yunlong Feng, Johan A. K. Suykens

Abstract: We study the density estimation problem with observations generated by certain dynamical systems that admit a unique underlying invariant Lebesgue density. Observations drawn from dynamical systems are not independent and moreover, usual mixing concepts may not be appropriate for measuring the dependence among these observations. By employing the $\mathcal{C}$-mixing concept to measure the dependence, we conduct statistical analysis on the consistency and convergence of the kernel density estimator. Our main results are as follows: First, we show that with properly chosen bandwidth, the kernel density estimator is universally consistent under $L_1$-norm; second, we establish convergence rates for the estimator with respect to several classes of dynamical systems under $L_1$-norm. In the analysis, the density function $f$ is only assumed to be Hölder continuous, which is a weak assumption in the literature of nonparametric density estimation and also more realistic in the dynamical system context. Last but not least, we prove that the same convergence rates of the estimator under $L_\infty$-norm and $L_1$-norm can be achieved when the density function is Hölder continuous, compactly supported and bounded. The bandwidth selection problem of the kernel density estimator for dynamical systems is also discussed in our study via numerical simulations.

Submitted 13 July, 2016; originally announced July 2016.

arXiv:1605.02887 [pdf, ps, other]

Learning theory estimates with observations from general stationary stochastic processes

Authors: Hanyuan Hang, Yunlong Feng, Ingo Steinwart, Johan A. K. Suykens

Abstract: This paper investigates the supervised learning problem with observations drawn from certain general stationary stochastic processes. Here by general, we mean that many stationary stochastic processes can be included. We show that when the stochastic processes satisfy a generalized Bernstein-type inequality, a unified treatment on analyzing the learning schemes with various mixing processes can be conducted and a sharp oracle inequality for generic regularized empirical risk minimization schemes can be established. The obtained oracle inequality is then applied to derive convergence rates for several learning schemes such as empirical risk minimization (ERM), least squares support vector machines (LS-SVMs) using given generic kernels, and SVMs using Gaussian kernels for both least squares and quantile regression. It turns out that for i.i.d. processes, our learning rates for ERM recover the optimal rates. On the other hand, for non-i.i.d. processes including geometrically $α$-mixing Markov processes, geometrically $α$-mixing processes with restricted decay, $φ$-mixing processes, and (time-reversed) geometrically $\mathcal{C}$-mixing processes, our learning rates for SVMs with Gaussian kernels match, up to some arbitrarily small extra term in the exponent, the optimal rates. For the remaining cases, our rates are at least close to the optimal rates. As a by-product, the assumed generalized Bernstein-type inequality also provides an interpretation of the so-called "effective number of observations" for various mixing processes.

Submitted 10 May, 2016; originally announced May 2016.

Comments: arXiv admin note: text overlap with arXiv:1501.03059

arXiv:1508.05249 [pdf, ps, other]

Representation of Quasi-Monotone Functionals by Families of Separating Hyperplanes

Authors: Ingo Steinwart

Abstract: We characterize when the level sets of a continuous quasi-monotone functional defined on a suitable convex subset of a normed space can be uniquely represented by a family of bounded continuous functionals. Furthermore, we investigate how regularly these functionals depend on the parameterizing level. Finally, we show how this question relates to the recent problem of property elicitation that simultaneously attracted interest in machine learning, statistical evaluation of forecasts, and finance.

Submitted 21 August, 2015; originally announced August 2015.

Comments: 23 pages

arXiv:1508.03712 [pdf, ps, other]

Towards an Axiomatic Approach to Hierarchical Clustering of Measures

Authors: Philipp Thomann, Ingo Steinwart, Nico Schmid

Abstract: We propose some axioms for hierarchical clustering of probability measures and investigate their ramifications. The basic idea is to let the user stipulate the clusters for some elementary measures. This is done without the need of any notion of metric, similarity or dissimilarity. Our main results then show that for each suitable choice of user-defined clustering on elementary measures we obtain a unique notion of clustering on a large set of distributions satisfying a set of additivity and continuity axioms. We illustrate the developed theory by numerous examples including some with and some without a density.

Submitted 15 August, 2015; originally announced August 2015.

MSC Class: Primary 62H30; Secondary 91C20; 62G07

Journal ref: Journal of Machine Learning Research. 16(Sep):1949-2002, 2015

arXiv:1507.06615 [pdf, ps, other]

Optimal Learning Rates for Localized SVMs

Authors: Mona Eberts, Ingo Steinwart

Abstract: One of the limiting factors of using support vector machines (SVMs) in large scale applications is their super-linear computational requirements in terms of the number of training samples. To address this issue, several approaches that train SVMs on many small chunks of large data sets separately have been proposed in the literature. So far, however, almost all these approaches have only been empirically investigated. In addition, their motivation was always based on computational requirements. In this work, we consider a localized SVM approach based upon a partition of the input space. For this local SVM, we derive a general oracle inequality. Then we apply this oracle inequality to least squares regression using Gaussian kernels and deduce local learning rates that are essentially minimax optimal under some standard smoothness assumptions on the regression function. This gives the first motivation for using local SVMs that is not based on computational requirements but on theoretical predictions on the generalization performance. We further introduce a data-dependent parameter selection method for our local SVM approach and show that this method achieves the same learning rates as before. Finally, we present some larger scale experiments for our localized SVM showing that it achieves essentially the same test performance as a global SVM for a fraction of the computational requirements. In addition, it turns out that the computational requirements for the local SVMs are similar to those of a vanilla random chunk approach, while the achieved test errors are significantly better.

Submitted 23 July, 2015; originally announced July 2015.

Comments: 68 pages, 20 figures, and 11 tables

arXiv:1507.03887 [pdf, other]

An SVM-like Approach for Expectile Regression

Authors: Muhammad Farooq, Ingo Steinwart

Abstract: Expectile regression is a nice tool for investigating conditional distributions beyond the conditional mean. It is well-known that expectiles can be described with the help of the asymmetric least square loss function, and this link makes it possible to estimate expectiles in a non-parametric framework by a support vector machine like approach. In this work we develop an efficient sequential-minimal-optimization-based solver for the underlying optimization problem. The behavior of the solver is investigated by conducting various experiments and the results are compared with the recent R-package ER-Boost.

Submitted 14 July, 2015; originally announced July 2015.

arXiv:1409.8437 [pdf, ps, other]

doi 10.1214/15-AOS1331

Fully adaptive density-based clustering

Authors: Ingo Steinwart

Abstract: The clusters of a distribution are often defined by the connected components of a density level set. However, this definition depends on the user-specified level. We address this issue by proposing a simple, generic algorithm, which uses an almost arbitrary level set estimator to estimate the smallest level at which there is more than one connected component. In the case where this algorithm is fed with histogram-based level set estimates, we provide a finite sample analysis, which is then used to show that the algorithm consistently estimates both the smallest level and the corresponding connected components. We further establish rates of convergence for the two estimation problems, and last but not least, we present a simple, yet adaptive strategy for determining the width-parameter of the involved density estimator in a data-dependent way.

Submitted 28 October, 2015; v1 submitted 30 September, 2014; originally announced September 2014.

Comments: Published at http://dx.doi.org/10.1214/15-AOS1331 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1331

Journal ref: Annals of Statistics 2015, Vol. 43, No. 5, 2132-2167

arXiv:1302.6427 [pdf, other]

Hypothesis Testing for Validation and Certification

Authors: Clint Scovel, Ingo Steinwart

Abstract: We develop a hypothesis testing framework for the formulation of the problems of 1) the validation of a simulation model and 2) using modeling to certify the performance of a physical system. These results are used to solve the extrapolative validation and certification problems, namely problems where the regime of interest is different than the regime for which we have experimental data. We use concentration of measure theory to develop the tests and analyze their errors. This work was stimulated by the work of Lucas, Owhadi, and Ortiz where a rigorous method of validation and certification is described and tested. In a remark we describe the connection between the two approaches. Moreover, as mentioned in that work these results have important implications in the Quantification of Margins and Uncertainties (QMU) framework. In particular, in a remark we describe how it provides a rigorous interpretation of the notion of confidence and new notions of margins and uncertainties which allow this interpretation. Since certain concentration parameters used in the above tests may be unknown, we furthermore show, in the last half of the paper, how to derive equally powerful tests which estimate them from sample data, thus replacing the assumption of the values of the concentration parameters with weaker assumptions. This paper is an essentially exact copy of one dated April 10, 2010.

Submitted 26 February, 2013; originally announced February 2013.

Report number: LA-UR-10-02355 MSC Class: 60

arXiv:1205.3845 [pdf, other]

Forecasting with Historical Data or Process Knowledge under Misspecification: A Comparison

Authors: Luke Bornn, Marian Anghel, Ingo Steinwart

Abstract: When faced with the task of forecasting a dynamic system, practitioners often have available historical data, knowledge of the system, or a combination of both. While intuition dictates that perfect knowledge of the system should in theory yield perfect forecasting, often knowledge of the system is only partially known, known up to parameters, or known incorrectly. In contrast, forecasting using previous data without any process knowledge might result in accurate prediction for simple systems, but will fail for highly nonlinear and chaotic systems. In this paper, the authors demonstrate how even in chaotic systems, forecasting with historical data is preferable to using process knowledge if this knowledge exhibits certain forms of misspecification. Through an extensive simulation study, a range of misspecification and forecasting scenarios are examined with the goal of gaining an improved understanding of the circumstances under which forecasting from historical data is to be preferred over using process knowledge.

Submitted 17 May, 2012; originally announced May 2012.

arXiv:0708.1838 [pdf, ps, other]

doi 10.1214/009053606000001226

Fast rates for support vector machines using Gaussian kernels

Authors: Ingo Steinwart, Clint Scovel

Abstract: For binary classification we establish learning rates up to the order of $n^{-1}$ for support vector machines (SVMs) with hinge loss and Gaussian RBF kernels. These rates are in terms of two assumptions on the considered distributions: Tsybakov's noise assumption to establish a small estimation error, and a new geometric noise condition which is used to bound the approximation error. Unlike previously proposed concepts for bounding the approximation error, the geometric noise assumption does not employ any smoothness assumption.

Submitted 14 August, 2007; originally announced August 2007.

Comments: Published at http://dx.doi.org/10.1214/009053606000001226 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS0237 MSC Class: 68Q32 (Primary); 62G20; 62G99; 68T05; 68T10; 41A46; 41A99 (Secondary)

Journal ref: Annals of Statistics 2007, Vol. 35, No. 2, 575-607

arXiv:0707.0322 [pdf, ps, other]

doi 10.1214/07-AOS562

Consistency of support vector machines for forecasting the evolution of an unknown ergodic dynamical system from observations with unknown noise

Authors: Ingo Steinwart, Marian Anghel

Abstract: We consider the problem of forecasting the next (observable) state of an unknown ergodic dynamical system from a noisy observation of the present state. Our main result shows, for example, that support vector machines (SVMs) using Gaussian RBF kernels can learn the best forecaster from a sequence of noisy observations if (a) the unknown observational noise process is bounded and has a summable $α$-mixing rate and (b) the unknown ergodic dynamical system is defined by a Lipschitz continuous function on some compact subset of $\mathbb{R}^d$ and has a summable decay of correlations for Lipschitz continuous functions. In order to prove this result we first establish a general consistency result for SVMs and all stochastic processes that satisfy a mixing notion that is substantially weaker than $α$-mixing.

Submitted 7 April, 2009; v1 submitted 2 July, 2007; originally announced July 2007.

Comments: Published at http://dx.doi.org/10.1214/07-AOS562 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS562 MSC Class: 62M20 (Primary) 37D25; 37C99; 37M10; 60K99; 62M10; 62M45; 68Q32; 68T05 (Secondary)

Journal ref: Annals of Statistics 2009, Vol. 37, No. 2, 841-875

arXiv:0707.0303 [pdf, ps, other]

Learning from dependent observations

Authors: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: In most papers establishing consistency for learning algorithms it is assumed that the observations used for training are realizations of an i.i.d. process. In this paper we go far beyond this classical framework by showing that support vector machines (SVMs) essentially only require that the data-generating process satisfies a certain law of large numbers. We then consider the learnability of SVMs for $α$-mixing (not necessarily stationary) processes for both classification and regression, where for the latter we explicitly allow unbounded noise.

Submitted 2 July, 2007; originally announced July 2007.

Comments: submitted to Journal of Multivariate Analysis

Report number: Los Alamos National Laboratory Technical Report LA-UR-06-3507

Showing 1–31 of 31 results for author: Steinwart, I