-
Stochastic Modified Flows for Riemannian Stochastic Gradient Descent
Authors:
Benjamin Gess,
Sebastian Kassing,
Nimit Rana
Abstract:
We give quantitative estimates for the rate of convergence of Riemannian stochastic gradient descent (RSGD) to Riemannian gradient flow and to a diffusion process, the so-called Riemannian stochastic modified flow (RSMF). Using tools from stochastic differential geometry we show that, in the small learning rate regime, RSGD can be approximated by the solution to the RSMF driven by an infinite-dime…
▽ More
We give quantitative estimates for the rate of convergence of Riemannian stochastic gradient descent (RSGD) to Riemannian gradient flow and to a diffusion process, the so-called Riemannian stochastic modified flow (RSMF). Using tools from stochastic differential geometry we show that, in the small learning rate regime, RSGD can be approximated by the solution to the RSMF driven by an infinite-dimensional Wiener process. The RSMF accounts for the random fluctuations of RSGD and, thereby, increases the order of approximation compared to the deterministic Riemannian gradient flow. The RSGD is build using the concept of a retraction map, that is, a cost efficient approximation of the exponential map, and we prove quantitative bounds for the weak error of the diffusion approximation under assumptions on the retraction map, the geometry of the manifold, and the random estimators of the gradient.
△ Less
Submitted 2 February, 2024;
originally announced February 2024.
-
On the existence of optimal shallow feedforward networks with ReLU activation
Authors:
Steffen Dereich,
Sebastian Kassing
Abstract:
We prove existence of global minima in the loss landscape for the approximation of continuous target functions using shallow feedforward artificial neural networks with ReLU activation. This property is one of the fundamental artifacts separating ReLU from other commonly used activation functions. We propose a kind of closure of the search space so that in the extended space minimizers exist. In a…
▽ More
We prove existence of global minima in the loss landscape for the approximation of continuous target functions using shallow feedforward artificial neural networks with ReLU activation. This property is one of the fundamental artifacts separating ReLU from other commonly used activation functions. We propose a kind of closure of the search space so that in the extended space minimizers exist. In a second step, we show under mild assumptions that the newly added functions in the extension perform worse than appropriate representable ReLU networks. This then implies that the optimal response in the extended target space is indeed the response of a ReLU network.
△ Less
Submitted 6 March, 2023;
originally announced March 2023.
-
On the existence of minimizers in shallow residual ReLU neural network optimization landscapes
Authors:
Steffen Dereich,
Arnulf Jentzen,
Sebastian Kassing
Abstract:
Many mathematical convergence results for gradient descent (GD) based algorithms employ the assumption that the GD process is (almost surely) bounded and, also in concrete numerical simulations, divergence of the GD process may slow down, or even completely rule out, convergence of the error function. In practical relevant learning problems, it thus seems to be advisable to design the ANN architec…
▽ More
Many mathematical convergence results for gradient descent (GD) based algorithms employ the assumption that the GD process is (almost surely) bounded and, also in concrete numerical simulations, divergence of the GD process may slow down, or even completely rule out, convergence of the error function. In practical relevant learning problems, it thus seems to be advisable to design the ANN architectures in a way so that GD optimization processes remain bounded. The property of the boundedness of GD processes for a given learning problem seems, however, to be closely related to the existence of minimizers in the optimization landscape and, in particular, GD trajectories may escape to infinity if the infimum of the error function (objective function) is not attained in the optimization landscape. This naturally raises the question of the existence of minimizers in the optimization landscape and, in the situation of shallow residual ANNs with multi-dimensional input layers and multi-dimensional hidden layers with the ReLU activation, the main result of this work answers this question affirmatively for a general class of loss functions and all continuous target functions. In our proof of this statement, we propose a kind of closure of the search space, where the limits are called generalized responses, and, thereafter, we provide sufficient criteria for the loss function and the underlying probability distribution which ensure that all additional artificial generalized responses are suboptimal which finally allows us to conclude the existence of minimizers in the optimization landscape.
△ Less
Submitted 28 February, 2023;
originally announced February 2023.
-
Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent
Authors:
Benjamin Gess,
Sebastian Kassing,
Vitalii Konarovskyi
Abstract:
We propose new limiting dynamics for stochastic gradient descent in the small learning rate regime called stochastic modified flows. These SDEs are driven by a cylindrical Brownian motion and improve the so-called stochastic modified equations by having regular diffusion coefficients and by matching the multi-point statistics. As a second contribution, we introduce distribution dependent stochasti…
▽ More
We propose new limiting dynamics for stochastic gradient descent in the small learning rate regime called stochastic modified flows. These SDEs are driven by a cylindrical Brownian motion and improve the so-called stochastic modified equations by having regular diffusion coefficients and by matching the multi-point statistics. As a second contribution, we introduce distribution dependent stochastic modified flows which we prove to describe the fluctuating limiting dynamics of stochastic gradient descent in the small learning rate - infinite width scaling regime.
△ Less
Submitted 14 February, 2023;
originally announced February 2023.
-
Convergence rates for momentum stochastic gradient descent with noise of machine learning type
Authors:
Benjamin Gess,
Sebastian Kassing
Abstract:
We consider the momentum stochastic gradient descent scheme (MSGD) and its continuous-in-time counterpart in the context of non-convex optimization. We show almost sure exponential convergence of the objective function value for target functions that are Lipschitz continuous and satisfy the Polyak-Lojasiewicz inequality on the relevant domain, and under assumptions on the stochastic noise that are…
▽ More
We consider the momentum stochastic gradient descent scheme (MSGD) and its continuous-in-time counterpart in the context of non-convex optimization. We show almost sure exponential convergence of the objective function value for target functions that are Lipschitz continuous and satisfy the Polyak-Lojasiewicz inequality on the relevant domain, and under assumptions on the stochastic noise that are motivated by overparameterized supervised learning applications. Moreover, we optimize the convergence rate over the set of friction parameters and show that the MSGD process almost surely converges.
△ Less
Submitted 7 February, 2023;
originally announced February 2023.
-
Resource Allocation in Serverless Query Processing
Authors:
Simon Kassing,
Ingo Müller,
Gustavo Alonso
Abstract:
Data lakes hold a growing amount of cold data that is infrequently accessed, yet require interactive response times. Serverless functions are seen as a way to address this use case since they offer an appealing alternative to maintaining (and paying for) a fixed infrastructure. Recent research has analyzed the potential of serverless for data processing. In this paper, we expand on such work by lo…
▽ More
Data lakes hold a growing amount of cold data that is infrequently accessed, yet require interactive response times. Serverless functions are seen as a way to address this use case since they offer an appealing alternative to maintaining (and paying for) a fixed infrastructure. Recent research has analyzed the potential of serverless for data processing. In this paper, we expand on such work by looking into the question of serverless resource allocation to data processing tasks (number and size of the functions). We formulate a general model to roughly estimate completion time and financial cost, which we apply to augment an existing serverless data processing system with an advisory tool that automatically identifies configurations striking a good balance -- which we define as being close to the "knee" of their Pareto frontier. The model takes into account key aspects of serverless: start-up, computation, network transfers, and overhead as a function of the input sizes and intermediate result exchanges. Using (micro)benchmarks and parts of TPC-H, we show that this advisor is capable of pinpointing configurations desirable to the user. Moreover, we identify and discuss several aspects of data processing on serverless affecting efficiency. By using an automated tool to configure the resources, the barrier to using serverless for data processing is lowered and the narrow window where it is cost effective can be expanded by using a more optimal allocation instead of having to over-provision the design.
△ Less
Submitted 19 August, 2022;
originally announced August 2022.
-
New primitives for bounded degradation in network service
Authors:
Simon Kassing,
Vojislav Dukic,
Ce Zhang,
Ankit Singla
Abstract:
Certain new ascendant data center workloads can absorb some degradation in network service, not needing fully reliable data transport and/or their fair-share of network bandwidth. This opens up opportunities for superior network and infrastructure multiplexing by having this flexible traffic cede capacity under congestion to regular traffic with stricter needs. We posit there is opportunity in net…
▽ More
Certain new ascendant data center workloads can absorb some degradation in network service, not needing fully reliable data transport and/or their fair-share of network bandwidth. This opens up opportunities for superior network and infrastructure multiplexing by having this flexible traffic cede capacity under congestion to regular traffic with stricter needs. We posit there is opportunity in network service primitives which permit degradation within certain bounds, such that flexible traffic still receives an acceptable level of service, while benefiting from its weaker requirements. We propose two primitives, namely guaranteed partial delivery and bounded deprioritization. We design a budgeting algorithm to provide guarantees relative to their fair share, which is measured via probing. The requirement of budgeting and probing limits the algorithm's applicability to large flexible flows.
We evaluate our algorithm with large flexible flows and for three workloads of regular flows of small size, large size and a distribution of sizes. Across the workloads, our algorithm achieves less speed-up of regular flows than fixed prioritization, especially for the small flows workload (1.25x vs. 6.82 in the 99th %-tile). Our algorithm provides better guarantees in the workload with large regular flows (with 14.5% vs. 32.5% of flexible flows being slowed down beyond their guarantee). However, it provides not much better or even slightly worse guarantees for the other two workloads. The ability to enforce guarantees is influenced by flow fair share interdependence, measurement inaccuracies and dependency on convergence. We observe that priority changes to probe or to deprioritize causes queue shifts which deteriorate guarantees and limit possible speed-up, especially of small flows. We find that mechanisms to both prioritize traffic and track guarantees should be as non-disruptive as possible.
△ Less
Submitted 17 August, 2022;
originally announced August 2022.
-
On minimal representations of shallow ReLU networks
Authors:
S. Dereich,
S. Kassing
Abstract:
The realization function of a shallow ReLU network is a continuous and piecewise affine function $f:\mathbb R^d\to \mathbb R$, where the domain $\mathbb R^{d}$ is partitioned by a set of $n$ hyperplanes into cells on which $f$ is affine. We show that the minimal representation for $f$ uses either $n$, $n+1$ or $n+2$ neurons and we characterize each of the three cases. In the particular case, where…
▽ More
The realization function of a shallow ReLU network is a continuous and piecewise affine function $f:\mathbb R^d\to \mathbb R$, where the domain $\mathbb R^{d}$ is partitioned by a set of $n$ hyperplanes into cells on which $f$ is affine. We show that the minimal representation for $f$ uses either $n$, $n+1$ or $n+2$ neurons and we characterize each of the three cases. In the particular case, where the input layer is one-dimensional, minimal representations always use at most $n+1$ neurons but in all higher dimensional settings there are functions for which $n+2$ neurons are needed. Then we show that the set of minimal networks representing $f$ forms a $C^\infty$-submanifold $M$ and we derive the dimension and the number of connected components of $M$. Additionally, we give a criterion for the hyperplanes that guarantees that all continuous, piecewise affine functions are realization functions of appropriate ReLU networks.
△ Less
Submitted 12 August, 2021;
originally announced August 2021.
-
Cooling down stochastic differential equations: almost sure convergence
Authors:
S. Dereich,
S. Kassing
Abstract:
We consider almost sure convergence of the SDE $dX_t=α_t d t + β_t d W_t$ under the existence of a $C^2$-Lyapunov function $F:\mathbb R^d \to \mathbb R$. More explicitly, we show that on the event that the process stays local we have almost sure convergence in the Lyapunov function $(F(X_t))$ as well as $\nabla F(X_t)\to 0$, if $|β_t|=\mathcal O( t^{-β})$ for a $β>1/2$. If, additionally, one assum…
▽ More
We consider almost sure convergence of the SDE $dX_t=α_t d t + β_t d W_t$ under the existence of a $C^2$-Lyapunov function $F:\mathbb R^d \to \mathbb R$. More explicitly, we show that on the event that the process stays local we have almost sure convergence in the Lyapunov function $(F(X_t))$ as well as $\nabla F(X_t)\to 0$, if $|β_t|=\mathcal O( t^{-β})$ for a $β>1/2$. If, additionally, one assumes that $F$ is a Lojasiewicz function, we get almost sure convergence of the process itself, given that $|β_t|=\mathcal O(t^{-β})$ for a $β>1$. The assumptions are shown to be optimal in the sense that there is a divergent counterexample where $|β_t|$ is of order $t^{-1}$.
△ Less
Submitted 15 July, 2022; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Order P4-66: Characterizing and mitigating surreptitious programmable network device exploitation
Authors:
Simon Kassing,
Hussain Abbas,
Laurent Vanbever,
Ankit Singla
Abstract:
Substantial efforts are invested in improving network security, but the threat landscape is rapidly evolving, particularly with the recent interest in programmable network hardware. We explore a new security threat, from an attacker who has gained control of such devices. While it should be obvious that such attackers can trivially cause substantial damage, the challenge and novelty are in doing s…
▽ More
Substantial efforts are invested in improving network security, but the threat landscape is rapidly evolving, particularly with the recent interest in programmable network hardware. We explore a new security threat, from an attacker who has gained control of such devices. While it should be obvious that such attackers can trivially cause substantial damage, the challenge and novelty are in doing so while preventing quick diagnosis by the operator.
We find that compromised programmable devices can easily degrade networked applications by orders of magnitude, while evading diagnosis by even the most sophisticated network diagnosis methods in deployment. Two key observations yield this result: (a) targeting a small number of packets is often enough to cause disproportionate performance degradation; and (b) new programmable hardware is an effective enabler of careful, selective targeting of packets. Our results also point to recommendations for minimizing the damage from such attacks, ranging from known, easy to implement techniques like encryption and redundant requests, to more complex considerations that would potentially limit some intended uses of programmable hardware. For data center contexts, we also discuss application-aware monitoring and response as a potential mitigation.
△ Less
Submitted 27 May, 2021; v1 submitted 30 March, 2021;
originally announced March 2021.
-
Convergence of stochastic gradient descent schemes for Lojasiewicz-landscapes
Authors:
Steffen Dereich,
Sebastian Kassing
Abstract:
In this article, we consider convergence of stochastic gradient descent schemes (SGD), including momentum stochastic gradient descent (MSGD), under weak assumptions on the underlying landscape. More explicitly, we show that on the event that the SGD stays bounded we have convergence of the SGD if there is only a countable number of critical points or if the objective function satisfies Lojasiewicz…
▽ More
In this article, we consider convergence of stochastic gradient descent schemes (SGD), including momentum stochastic gradient descent (MSGD), under weak assumptions on the underlying landscape. More explicitly, we show that on the event that the SGD stays bounded we have convergence of the SGD if there is only a countable number of critical points or if the objective function satisfies Lojasiewicz-inequalities around all critical levels as all analytic functions do. In particular, we show that for neural networks with analytic activation function such as softplus, sigmoid and the hyperbolic tangent, SGD converges on the event of staying bounded, if the random variables modelling the signal and response in the training are compactly supported.
△ Less
Submitted 9 January, 2024; v1 submitted 16 February, 2021;
originally announced February 2021.
-
Central limit theorems for stochastic gradient descent with averaging for stable manifolds
Authors:
Steffen Dereich,
Sebastian Kassing
Abstract:
In this article we establish new central limit theorems for Ruppert-Polyak averaged stochastic gradient descent schemes. Compared to previous work we do not assume that convergence occurs to an isolated attractor but instead allow convergence to a stable manifold. On the stable manifold the target function is constant and the oscillations in the tangential direction may be significantly larger tha…
▽ More
In this article we establish new central limit theorems for Ruppert-Polyak averaged stochastic gradient descent schemes. Compared to previous work we do not assume that convergence occurs to an isolated attractor but instead allow convergence to a stable manifold. On the stable manifold the target function is constant and the oscillations in the tangential direction may be significantly larger than the ones in the normal direction. As we show, one still recovers a central limit theorem with the same rates as in the case of isolated attractors. Here we consider step-sizes $γ_n=n^{-γ}$ with $γ\in(\frac34,1)$, typically.
△ Less
Submitted 19 December, 2019;
originally announced December 2019.
-
Distributed Learning over Unreliable Networks
Authors:
Chen Yu,
Hanlin Tang,
Cedric Renggli,
Simon Kassing,
Ankit Singla,
Dan Alistarh,
Ce Zhang,
Ji Liu
Abstract:
Most of today's distributed machine learning systems assume {\em reliable networks}: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent work exhibits the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization. In this paper, w…
▽ More
Most of today's distributed machine learning systems assume {\em reliable networks}: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent work exhibits the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization. In this paper, we connect these two trends, and consider the following question: {\em Can we design machine learning systems that are tolerant to network unreliability during training?} With this motivation, we focus on a theoretical problem of independent interest---given a standard distributed parameter server architecture, if every communication between the worker and the server has a non-zero probability $p$ of being dropped, does there exist an algorithm that still converges, and at what speed? The technical contribution of this paper is a novel theoretical analysis proving that distributed learning over unreliable network can achieve comparable convergence rate to centralized or distributed learning over reliable networks. Further, we prove that the influence of the packet drop rate diminishes with the growth of the number of \textcolor{black}{parameter servers}. We map this theoretical result onto a real-world scenario, training deep neural networks over an unreliable network layer, and conduct network simulation to validate the system improvement by allowing the networks to be unreliable.
△ Less
Submitted 15 May, 2019; v1 submitted 17 October, 2018;
originally announced October 2018.