Search | arXiv e-print repository

Optimal estimation in spatially distributed systems: how far to share measurements from?

Authors: Juncal Arbelaiz, Bassam Bamieh, Anette E. Hosoi, Ali Jadbabaie

Abstract: We consider the centralized optimal estimation problem in spatially distributed systems. We use the setting of spatially invariant systems as an idealization for which concrete and detailed results are given. Such estimators are known to have a degree of spatial localization in the sense that the estimator gains decay in space, with the spatial decay rates serving as a proxy for how far measurements need to be shared in an optimal distributed estimator. In particular, we examine the dependence of spatial decay rates on problem specifications such as system dynamics, measurement and process noise variances, as well as their spatial autocorrelations. We propose non-dimensional parameters that characterize the decay rates as a function of problem specifications. In particular, we find an interesting matching condition between the characteristic lengthscale of the dynamics and the measurement noise correlation lengthscale for which the optimal centralized estimator is completely decentralized. A new technique - termed the branch point locus - is introduced to quantify spatial decay rates in terms of analyticity regions in the complex spatial frequency plane. Our results are illustrated through two case studies of systems with dynamics modeled by diffusion and the Swift-Hohenberg equation, respectively.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.02997 [pdf, other]

Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs

Authors: Michael Scholkemper, Xinyi Wu, Ali Jadbabaie, Michael T. Schaub

Abstract: Residual connections and normalization layers have become standard design choices for graph neural networks (GNNs), and were proposed as solutions to mitigate the oversmoothing problem in GNNs. However, how exactly these methods help alleviate the oversmoothing problem from a theoretical perspective is not well understood. In this work, we provide a formal and precise characterization of (linearized) GNNs with residual connections and normalization layers. We establish that (a) for residual connections, the incorporation of the initial features at each layer can prevent the signal from becoming too smooth, and determines the subspace of possible node representations; (b) batch normalization prevents a complete collapse of the output embedding space to a one-dimensional subspace through the individual rescaling of each column of the feature matrix. This results in the convergence of node representations to the top-$k$ eigenspace of the message-passing operator; (c) moreover, we show that the centering step of a normalization layer -- which can be understood as a projection -- alters the graph signal in message-passing in such a way that relevant information can become harder to extract. We therefore introduce a novel, principled normalization layer called GraphNormv2 in which the centering step is learned such that it does not distort the original graph signal in an undesirable way. Experimental results confirm the effectiveness of our method.

Submitted 12 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

arXiv:2405.18781 [pdf, other]

On the Role of Attention Masks and LayerNorm in Transformers

Authors: Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie

Abstract: Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth increases, limiting model expressivity and further utilization of model depth. The existing literature on rank collapse, however, has mostly overlooked other critical components in transformers that may alleviate the rank collapse issue. In this paper, we provide a general analysis of rank collapse under self-attention, taking into account the effects of attention masks and layer normalization (LayerNorm). In particular, we find that although pure masked attention still suffers from exponential collapse to a rank one subspace, local masked attention can provably slow down the collapse rate. In the case of self-attention with LayerNorm, we first show that for certain classes of value matrices, collapse to a rank one subspace still happens exponentially. However, through construction of nontrivial counterexamples, we then establish that with proper choice of value matrices, a general class of sequences may not converge to a rank one subspace, and the self-attention dynamics with LayerNorm can simultaneously possess a rich set of equilibria with any possible rank between one and full.
Our result refutes the previous hypothesis that LayerNorm plays no role in the rank collapse of self-attention and suggests that self-attention with LayerNorm constitutes a much more expressive, versatile nonlinear dynamical system than what was originally thought.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2404.08120 [pdf, other]

A least-square method for non-asymptotic identification in linear switching control

Authors: Haoyuan Sun, Ali Jadbabaie

Abstract: The focus of this paper is on linear system identification in the setting where it is known that the underlying partially-observed linear dynamical system lies within a finite collection of known candidate models. We first consider the problem of identification from a given trajectory, which in this setting reduces to identifying the index of the true model with high probability. We characterize the finite-time sample complexity of this problem by leveraging recent advances in the non-asymptotic analysis of linear least-square methods in the literature. In comparison to the earlier results that assume no prior knowledge of the system, our approach takes advantage of the smaller hypothesis class and leads to the design of a learner with a dimension-free sample complexity bound. Next, we consider the switching control of linear systems, where there is a candidate controller for each of the candidate models and data is collected through interaction of the system with a collection of potentially destabilizing controllers. We develop a dimension-dependent criterion that can detect those destabilizing controllers in finite time. By leveraging these results, we propose a data-driven switching strategy that identifies the unknown parameters of the underlying system. We then provide a non-asymptotic analysis of its performance and discuss its implications on the classical method of estimator-based supervisory control.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2403.17174 [pdf, ps, other]

Belief Samples Are All You Need For Social Learning

Authors: Mahyar JafariNodeh, Amir Ajorlou, Ali Jadbabaie

Abstract: In this paper, we consider the problem of social learning, where a group of agents embedded in a social network are interested in learning an underlying state of the world. Agents have incomplete, noisy, and heterogeneous sources of information, providing them with recurring private observations of the underlying state of the world. Agents can share their learning experience with their peers by taking actions observable to them, with values from a finite feasible set of states. Actions can be interpreted as samples from the beliefs which agents may form and update on what the true state of the world is. Sharing samples, in place of full beliefs, is motivated by the limited communication, cognitive, and information-processing resources available to agents especially in large populations. Previous work (Salhab et al.) poses the question as to whether learning with probability one is still achievable if agents are only allowed to communicate samples from their beliefs. We provide a definite positive answer to this question, assuming a strongly connected network and a ``collective distinguishability'' assumption, which are both required for learning even in full-belief-sharing settings. In our proposed belief update mechanism, each agent's belief is a normalized weighted geometric interpolation between a fully Bayesian private belief -- aggregating information from the private source -- and an ensemble of empirical distributions of the samples shared by her neighbors over time.
By carefully constructing asymptotic almost-sure lower/upper bounds on the frequency of shared samples matching the true state/or not, we rigorously prove the convergence of all the beliefs to the true state, with probability one.

Submitted 25 March, 2024; originally announced March 2024.

Comments: 6 pages

arXiv:2310.17171 [pdf, other]

Estimating True Beliefs in Opinion Dynamics with Social Pressure

Authors: Jennifer Tang, Aviv Adler, Amir Ajorlou, Ali Jadbabaie

Abstract: Social networks often exert social pressure, causing individuals to adapt their expressed opinions to conform to their peers. An agent in such systems can be modeled as having a (true and unchanging) inherent belief while broadcasting a declared opinion at each time step based on her inherent belief and the past declared opinions of her neighbors. An important question in this setting is parameter estimation: how to disentangle the effects of social pressure to estimate inherent beliefs from declared opinions. This is useful for forecasting when agents' declared opinions are influenced by social pressure while real-world behavior only depends on their inherent beliefs. To address this, Jadbabaie et al. formulated the Interacting Pólya Urn model of opinion dynamics under social pressure and studied it on complete-graph social networks using an aggregate estimator, and found that their estimator converges to the inherent beliefs unless majority pressure pushes the network to consensus. In this work, we study this model on arbitrary networks, providing an estimator which converges to the inherent beliefs even in consensus situations. Finally, we bound the convergence rate of our estimator in both consensus and non-consensus scenarios; to get the bound for consensus scenarios (which converge slower than non-consensus) we additionally found how quickly the system converges to consensus.

Submitted 26 June, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.01082 [pdf, other]

Linear attention is (maybe) all you need (to understand transformer optimization)

Authors: Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra

Abstract: Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.

Submitted 13 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: Published at ICLR 2024

arXiv:2308.09275 [pdf, other]

Stochastic Opinion Dynamics under Social Pressure in Arbitrary Networks

Authors: Jennifer Tang, Aviv Adler, Amir Ajorlou, Ali Jadbabaie

Abstract: Social pressure is a key factor affecting the evolution of opinions on networks in many types of settings, pushing people to conform to their neighbors' opinions. To study this, the interacting Polya urn model was introduced by Jadbabaie et al., in which each agent has two kinds of opinion: inherent beliefs, which are hidden from the other agents and fixed; and declared opinions, which are randomly sampled at each step from a distribution which depends on the agent's inherent belief and her neighbors' past declared opinions (the social pressure component), and which is then communicated to their neighbors. Each agent also has a bias parameter denoting her level of resistance to social pressure. At every step, the agents simultaneously update their declared opinions according to their neighbors' aggregate past declared opinions, their inherent beliefs, and their bias parameters. We study the asymptotic behavior of this opinion dynamics model and show that agents' declaration probabilities converge almost surely in the limit using Lyapunov theory and stochastic approximation techniques. We also derive necessary and sufficient conditions for the agents to approach consensus on their declared opinions. Our work provides further insight into the difficulty of inferring the inherent beliefs of agents when they are under social pressure.

Submitted 25 October, 2023; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: fixed typos

arXiv:2307.14619 [pdf, other]

Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior

Authors: Adam Block, Ali Jadbabaie, Daniel Pfrommer, Max Simchowitz, Russ Tedrake

Abstract: We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling. Our framework invokes low-level controllers - either learned or implicit in position-command control - to stabilize imitation around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a powerful enough generative model as our imitation learner, pure supervised behavior cloning can generate trajectories matching the per-time step distribution of essentially arbitrary expert trajectories in an optimal transport cost. Our analysis relies on a stochastic continuity property of the learned policy we call "total variation continuity" (TVC). We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations, and discussing implications for future research directions for better behavior cloning with generative modeling.

Submitted 24 October, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

Comments: updated figures, minor notational change for readability

arXiv:2307.05858 [pdf, other]

doi 10.1103/PhysRevLett.131.193602

Quantum-Enhanced Metrology for Molecular Symmetry Violation using Decoherence-Free Subspaces

Authors: Chi Zhang, Phelan Yu, Arian Jadbabaie, Nicholas R. Hutzler

Abstract: We propose a method to measure time-reversal symmetry violation in molecules that overcomes the standard quantum limit while leveraging decoherence-free subspaces to mitigate sensitivity to classical noise. The protocol does not require an external electric field, and the entangled states have no first-order sensitivity to static electromagnetic fields as they involve superpositions with zero average lab-frame projection of spins and dipoles. This protocol can be applied with trapped neutral or ionic species, and can be implemented using methods which have been demonstrated experimentally.

Submitted 11 July, 2023; originally announced July 2023.

Comments: 7+11 pages, 3+3 figures

Journal ref: Phys. Rev. Lett. 131, 193602 (2023)

arXiv:2306.01914 [pdf, other]

Smooth Model Predictive Control with Applications to Statistical Learning

Authors: Kwangjun Ahn, Daniel Pfrommer, Jack Umenberger, Tobia Marcucci, Zak Mhammedi, Ali Jadbabaie

Abstract: Statistical learning theory and high dimensional statistics have had a tremendous impact on Machine Learning theory and have impacted a variety of domains including systems and control theory. Over the past few years we have witnessed a variety of applications of such theoretical tools to help answer questions such as: how many state-action pairs are needed to learn a static control policy to a given accuracy? Recent results have shown that continuously differentiable and stabilizing control policies can be well-approximated using neural networks with hard guarantees on performance, yet often even the simplest constrained control problems are not smooth. To address this void, in this paper we study smooth approximations of linear Model Predictive Control (MPC) policies, in which hard constraints are replaced by barrier functions, a.k.a. barrier MPC. In particular, we show that barrier MPC inherits the exponential stability properties of the original non-smooth MPC policy. Using a careful analysis of the proposed barrier MPC, we show that its smoothness constant can be carefully controlled, thereby paving the way for new sample complexity results for approximating MPC policies from sampled state-action pairs.

Submitted 2 June, 2023; originally announced June 2023.

Comments: 15 pages, 1 figure

arXiv:2306.01264 [pdf, ps, other]

Convex and Non-convex Optimization Under Generalized Smoothness

Authors: Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, Ali Jadbabaie

Abstract: Classical analysis of convex and non-convex optimization methods often requires the Lipschitzness of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.

Submitted 3 November, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: 37 pages

arXiv:2305.16102 [pdf, other]

Demystifying Oversmoothing in Attention-Based Graph Neural Networks

Authors: Xinyi Wu, Amir Ajorlou, Zihui Wu, Ali Jadbabaie

Abstract: Oversmoothing in Graph Neural Networks (GNNs) refers to the phenomenon where increasing network depth leads to homogeneous node representations. While previous work has established that Graph Convolutional Networks (GCNs) exponentially lose expressive power, it remains controversial whether the graph attention mechanism can mitigate oversmoothing. In this work, we provide a definitive answer to this question through a rigorous mathematical analysis, by viewing attention-based GNNs as nonlinear time-varying dynamical systems and incorporating tools and techniques from the theory of products of inhomogeneous matrices and the joint spectral radius. We establish that, contrary to popular belief, the graph attention mechanism cannot prevent oversmoothing and loses expressive power exponentially. The proposed framework extends the existing results on oversmoothing for symmetric GCNs to a significantly broader class of GNN models, including random walk GCNs, Graph Attention Networks (GATs) and (graph) transformers. In particular, our analysis accounts for asymmetric, state-dependent and time-varying aggregation operators and a wide range of common nonlinear activation functions, such as ReLU, LeakyReLU, GELU and SiLU.

Submitted 3 June, 2024; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023 spotlight. Fixed an error in the previous version; new results and remarks added

arXiv:2305.15659 [pdf, other]

How to escape sharp minima with random perturbations

Authors: Kwangjun Ahn, Ali Jadbabaie, Suvrit Sra

Abstract: Modern machine learning applications have witnessed the remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this design choice, we undertake a formal study that (i) formulates the notion of flat minima, and (ii) studies the complexity of finding them. Specifically, we adopt the trace of the Hessian of the cost function as a measure of flatness, and use it to formally define the notion of approximate flat minima. Under this notion, we then analyze algorithms that find approximate flat minima efficiently. For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently. The main component of the algorithm is to use gradients computed from randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization, supporting its success in practice.

Submitted 25 May, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted at ICML 2024

arXiv:2304.14548 [pdf, other]

doi 10.1103/PhysRevA.108.012813

Optical cycling in polyatomic molecules with complex hyperfine structure

Authors: Yi Zeng, Arian Jadbabaie, Ashay N. Patel, Phelan Yu, Timothy C. Steimle, Nicholas R. Hutzler

Abstract: We have developed and demonstrated a scheme to achieve rotationally-closed photon cycling in polyatomic molecules with complex hyperfine structure and sensitivity to hadronic symmetry violation, specifically $^{171}$YbOH and $^{173}$YbOH. We calculate rotational branching ratios for spontaneous decay and identify repumping schemes which use electro-optical modulators (EOMs) to address the hyperfine structure. We demonstrate our scheme by cycling photons in a molecular beam and verify that we have achieved rotationally-closed cycling by measuring optical pumping into unaddressed vibrational states. Our work makes progress along the path toward utilizing photon cycling for state preparation, readout, and laser cooling in precision measurements of polyatomic molecules with complex hyperfine structure.

Submitted 27 April, 2023; originally announced April 2023.

Comments: 10 pages, 7 figures

Journal ref: Phys. Rev. A 108, 012813 (2023)

arXiv:2304.13972 [pdf, ps, other]

Convergence of Adam Under Relaxed Assumptions

Authors: Haochuan Li, Alexander Rakhlin, Ali Jadbabaie

Abstract: In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimate (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show the convergence to stationary points. In this paper, we show that Adam provably converges to $ε$-stationary points with ${O}(ε^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of ${O}(ε^{-3})$.

Submitted 6 November, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

Comments: 35 pages

arXiv:2304.13817 [pdf, other]

doi 10.1103/PhysRevLett.131.183003

Engineering field-insensitive molecular clock transitions for symmetry violation searches

Authors: Yuiki Takahashi, Chi Zhang, Arian Jadbabaie, Nicholas R. Hutzler

Abstract: Molecules are a powerful platform to probe fundamental symmetry violations beyond the Standard Model, as they offer both large amplification factors and robustness against systematic errors. As experimental sensitivities improve, it is important to develop new methods to suppress sensitivity to external electromagnetic fields, as limits on the ability to control these fields are a major experimental concern. Here we show that sensitivity to both external magnetic and electric fields can be simultaneously suppressed using engineered radio frequency, microwave, or two-photon transitions that maintain large amplification of CP-violating effects. By performing a clock measurement on these transitions, CP-violating observables including the electron electric dipole moment, nuclear Schiff moment, and magnetic quadrupole moment can be measured with suppression of external field sensitivity of $\gtrsim$100 generically, and even more in many cases. Furthermore, the method is compatible with traditional Ramsey measurements, offers internal co-magnetometry, and is useful for systems with large angular momentum commonly present in molecular searches for nuclear CP-violation.

Submitted 3 October, 2023; v1 submitted 26 April, 2023; originally announced April 2023.

Journal ref: Phys. Rev. Lett. 131, 183003 (2023)

arXiv:2303.03233 [pdf, other]

doi 10.1103/PhysRevA.107.062805

Direct measurement of high-lying vibrational repumping transitions for molecular laser cooling

Authors: Nickolas H. Pilgram, Arian Jadbabaie, Chandler J. Conn, Nicholas R. Hutzler

Abstract: Molecular laser cooling and trapping requires addressing all spontaneous decays to excited vibrational states that occur at the $\gtrsim 10^{-4} - 10^{-5}$ level, which is accomplished by driving repumping transitions out of these states. However, the transitions must first be identified spectroscopically at high-resolution. A typical approach is to prepare molecules in excited vibrational states via optical cycling and pumping, which requires multiple high-power lasers. Here, we demonstrate a general method to perform this spectroscopy without the need for optical cycling. We produce molecules in excited vibrational states by using optically-driven chemical reactions in a cryogenic buffer gas cell, and implement frequency-modulated absorption to perform direct, sensitive, high-resolution spectroscopy. We demonstrate this technique by measuring the spectrum of the $\tilde{A}^2Π_{1/2}(1,0,0)-\tilde{X}^2Σ^+(3,0,0)$ band in $^{174}$YbOH. We identify the specific vibrational repump transitions needed for photon cycling, and combine our data with previous measurements of the $\tilde{A}^2Π_{1/2}(1,0,0)-\tilde{X}^2Σ^+(0,0,0)$ band to determine all of the relevant spectral constants of the $\tilde{X}^2Σ^+(3,0,0)$ state. This technique achieves high signal-to-noise, can be further improved to measure increasingly high-lying vibrational states, and is applicable to other molecular species favorable for laser cooling.

Submitted 6 March, 2023; originally announced March 2023.

Comments: 14 pages, 5 figures

Journal ref: Phys. Rev. A 107, 062805 (2023)

arXiv:2303.00883 [pdf, other]

Variance-reduced Clipping for Non-convex Optimization

Authors: Amirhossein Reisizadeh, Haochuan Li, Subhro Das, Ali Jadbabaie

Abstract: Gradient clipping is a standard training technique used in deep learning applications such as large-scale language modeling to mitigate exploding gradients. Recent experimental studies have demonstrated a fairly special behavior in the smoothness of the training objective along its trajectory when trained with gradient clipping. That is, the smoothness grows with the gradient norm. This is in clear contrast to the well-established assumption in folklore non-convex optimization, a.k.a. $L$--smoothness, where the smoothness is assumed to be bounded by a constant $L$ globally. The recently introduced $(L_0,L_1)$--smoothness is a more relaxed notion that captures such behavior in non-convex optimization. In particular, it has been shown that under this relaxed smoothness assumption, SGD with clipping requires $O(ε^{-4})$ stochastic gradient computations to find an $ε$--stationary solution. In this paper, we employ a variance reduction technique, namely SPIDER, and demonstrate that for a carefully designed learning rate, this complexity is improved to $O(ε^{-3})$ which is order-optimal. Our designed learning rate comprises the clipping technique to mitigate the growing smoothness. Moreover, when the objective function is the average of $n$ components, we improve the existing $O(nε^{-2})$ bound on the stochastic gradient complexity to $O(\sqrt{n} ε^{-2} + n)$, which is order-optimal as well. In addition to being theoretically optimal, SPIDER with our designed parameters demonstrates comparable empirical performance against variance-reduced methods such as SVRG and SARAH in several vision tasks.

Submitted 2 June, 2023; v1 submitted 1 March, 2023; originally announced March 2023.
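
The gradient clipping operation that the abstract above builds on can be sketched as a rescaling of the gradient whenever its norm exceeds a threshold. The block below is a minimal sketch of plain clipped gradient descent only; it does not implement SPIDER or the paper's learning-rate design, and the names `clip_by_norm` and `clipped_gd`, the quartic test objective, and the parameter values are assumptions for illustration.

```python
import math

def clip_by_norm(grad, max_norm):
    """Return grad rescaled so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def clipped_gd(grad_fn, x0, lr, max_norm, steps):
    """Gradient descent where every gradient is clipped before the update."""
    x = list(x0)
    for _ in range(steps):
        g = clip_by_norm(grad_fn(x), max_norm)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical objective f(x) = x^4, whose smoothness grows with the gradient norm.
# Unclipped gradient descent with lr = 0.1 from x = 3 would overshoot badly.
x_end = clipped_gd(lambda x: [4.0 * x[0] ** 3], [3.0], lr=0.1, max_norm=1.0, steps=100)
```

Clipping caps the per-step movement at `lr * max_norm`, which is why the iterate approaches the minimizer instead of diverging on this example.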

arXiv:2301.08656 [pdf, other]

doi 10.1126/science.adg8155

Quantum Control of Trapped Polyatomic Molecules for eEDM Searches

Authors: Loïc Anderegg, Nathaniel B. Vilas, Christian Hallas, Paige Robichaud, Arian Jadbabaie, John M. Doyle, Nicholas R. Hutzler

Abstract: Ultracold polyatomic molecules are promising candidates for experiments in quantum science, quantum sensing, ultracold chemistry, and precision measurements of physics beyond the Standard Model. A key, yet unrealized, requirement of these experiments is the ability to achieve full quantum control over the complex internal structure of the molecules. Here, we establish coherent control of individual quantum states in a polyatomic molecule, calcium monohydroxide (CaOH), and use these techniques to demonstrate a method for searching for the electron electric dipole moment (eEDM). Optically trapped, ultracold CaOH molecules are prepared in a single quantum state, polarized in an electric field, and coherently transferred into an eEDM sensitive state where an electron spin precession measurement is performed. To extend the coherence time of the measurement, we utilize eEDM sensitive states with tunable, near-zero magnetic field sensitivity. The spin precession coherence time is limited by AC Stark shifts and uncontrolled magnetic fields. These results establish a path for eEDM searches with trapped polyatomic molecules, towards orders-of-magnitude improved experimental sensitivity to time-reversal-violating physics.

Submitted 20 January, 2023; originally announced January 2023.

Journal ref: Science 382, 665 (2023)

arXiv:2301.04124 [pdf, other]

doi 10.1088/1367-2630/ace471

Characterizing the Fundamental Bending Vibration of a Linear Polyatomic Molecule for Symmetry Violation Searches

Authors: Arian Jadbabaie, Yuiki Takahashi, Nickolas H. Pilgram, Chandler J. Conn, Yi Zeng, Chi Zhang, Nicholas R. Hutzler

Abstract: Polyatomic molecules have been identified as sensitive probes of charge-parity violating and parity-violating physics beyond the Standard Model (BSM). For example, many linear triatomic molecules are both laser-coolable and have parity doublets in the ground electronic $\tilde{X} {}^2Σ^+ (010)$ state arising from the bending vibration, both features that can greatly aid BSM searches. Understanding the $\tilde{X} {}^2Σ^+ (010)$ state is a crucial prerequisite to precision measurements with linear polyatomic molecules. Here, we characterize fundamental bending vibration of ${}^{174}$YbOH using high-resolution optical spectroscopy on the nominally forbidden $\tilde{X} {}^2Σ^+ (010) \rightarrow \tilde{A} {}^2Π_{1/2} (000)$ transition at 588 nm. We assign 39 transitions originating from the lowest rotational levels of the $\tilde{X} {}^2Σ^+ (010)$ state, and accurately model the state's structure with an effective Hamiltonian using best-fit parameters. Additionally, we perform Stark and Zeeman spectroscopy on the $\tilde{X} {}^2Σ^+ (010)$ state and fit the molecule-frame dipole moment to $D_\mathrm{mol}=2.16(1)$ D and the effective electron $g$-factor to $g_S=2.07(2)$. Further, we use an empirical model to explain observed anomalous line intensities in terms of interference from spin-orbit and vibronic perturbations in the excited $\tilde{A} {}^2Π_{1/2} (000)$ state. Our work is an essential step toward searches for BSM physics in YbOH and other linear polyatomic molecules.

Submitted 10 January, 2023; originally announced January 2023.

Comments: 26 pages, 7 figures

Journal ref: New J. Phys. 25, 073014 (2023)

arXiv:2212.10701 [pdf, other]

A Non-Asymptotic Analysis of Oversmoothing in Graph Neural Networks

Authors: Xinyi Wu, Zhengdao Chen, William Wang, Ali Jadbabaie

Abstract: Oversmoothing is a central challenge of building more powerful Graph Neural Networks (GNNs). While previous works have only demonstrated that oversmoothing is inevitable when the number of graph convolutions tends to infinity, in this paper, we precisely characterize the mechanism behind the phenomenon via a non-asymptotic analysis. Specifically, we distinguish between two different effects when applying graph convolutions -- an undesirable mixing effect that homogenizes node representations in different classes, and a desirable denoising effect that homogenizes node representations in the same class. By quantifying these two effects on random graphs sampled from the Contextual Stochastic Block Model (CSBM), we show that oversmoothing happens once the mixing effect starts to dominate the denoising effect, and the number of layers required for this transition is $O(\log N/\log (\log N))$ for sufficiently dense graphs with $N$ nodes. We also extend our analysis to study the effects of Personalized PageRank (PPR), or equivalently, the effects of initial residual connections on oversmoothing. Our results suggest that while PPR mitigates oversmoothing at deeper layers, PPR-based architectures still achieve their best performance at a shallow depth and are outperformed by the graph convolution approach on certain graphs. Finally, we support our theoretical results with numerical experiments, which further suggest that the oversmoothing phenomenon observed in practice can be magnified by the difficulty of optimizing deep GNN models.

Submitted 28 February, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: Accepted by the 11th International Conference on Learning Representations (ICLR 2023)

arXiv:2210.09206 [pdf, other]

Model Predictive Control via On-Policy Imitation Learning

Authors: Kwangjun Ahn, Zakaria Mhammedi, Horia Mania, Zhang-Wei Hong, Ali Jadbabaie

Abstract: In this paper, we leverage the rapid advances in imitation learning, a topic of intense recent focus in the Reinforcement Learning (RL) literature, to develop new sample complexity results and performance guarantees for data-driven Model Predictive Control (MPC) for constrained linear systems. In its simplest form, imitation learning is an approach that tries to learn an expert policy by querying samples from an expert. Recent approaches to data-driven MPC have used the simplest form of imitation learning known as behavior cloning to learn controllers that mimic the performance of MPC by online sampling of the trajectories of the closed-loop MPC system. Behavior cloning, however, is a method that is known to be data inefficient and suffer from distribution shifts. As an alternative, we develop a variant of the forward training algorithm which is an on-policy imitation learning method proposed by Ross et al. (2010). Our algorithm uses the structure of constrained linear MPC, and our analysis uses the properties of the explicit MPC solution to theoretically bound the number of online MPC trajectories needed to achieve optimal performance. We validate our results through simulations and show that the forward training algorithm is indeed superior to behavior cloning when applied to MPC.

Submitted 17 October, 2022; originally announced October 2022.

Comments: 26 pages

arXiv:2210.01849 [pdf, other]

Link Partitioning on Simplicial Complexes Using Higher-Order Laplacians

Authors: Xinyi Wu, Arnab Sarker, Ali Jadbabaie

Abstract: Link partitioning is a popular approach in network science used for discovering overlapping communities by identifying clusters of strongly connected links. Current link partitioning methods are specifically designed for networks modelled by graphs representing pairwise relationships. Therefore, these methods are incapable of utilizing higher-order information about group interactions in network data which is increasingly available. Simplicial complexes extend the dyadic model of graphs and can model polyadic relationships which are ubiquitous and crucial in many complex social and technological systems. In this paper, we introduce a link partitioning method that leverages higher-order (i.e. triadic and higher) information in simplicial complexes for better community detection. Our method utilizes a novel random walk on links of simplicial complexes defined by the higher-order Laplacian--a generalization of the graph Laplacian that incorporates polyadic relationships of the network. We transform this random walk into a graph-based random walk on a lifted line graph--a dual graph in which links are nodes while nodes and higher-order connections are links--and optimize for the standard notion of modularity. We show that our method is guaranteed to provide interpretable link partitioning results under mild assumptions. We also offer new theoretical results on the spectral properties of simplicial complexes by studying the spectrum of the link random walk. Experiment results on real-world community detection tasks show that our higher-order approach significantly outperforms existing graph-based link partitioning methods.

Submitted 10 October, 2022; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: Accepted to 22nd IEEE International Conference on Data Mining (ICDM 2022). Fixed some typos in v1

arXiv:2207.11335 [pdf, other]

Generalizing Homophily to Simplicial Complexes

Authors: Arnab Sarker, Natalie Northrup, Ali Jadbabaie

Abstract: Group interactions occur frequently in social settings, yet their properties beyond pairwise relationships in network models remain unexplored. In this work, we study homophily, the nearly ubiquitous phenomena wherein similar individuals are more likely than random to form connections with one another, and define it on simplicial complexes, a generalization of network models that goes beyond dyadic interactions. While some group homophily definitions have been proposed in the literature, we provide theoretical and empirical evidence that prior definitions mostly inherit properties of homophily in pairwise interactions rather than capture the homophily of group dynamics. Hence, we propose a new measure, $k$-simplicial homophily, which properly identifies homophily in group dynamics. Across 16 empirical networks, $k$-simplicial homophily provides information uncorrelated with homophily measures on pairwise interactions. Moreover, we show the empirical value of $k$-simplicial homophily in identifying when metadata on nodes is useful for predicting group interactions, whereas previous measures are uninformative.

Submitted 22 July, 2022; originally announced July 2022.

Comments: Preprint submitted to International Conference on Complex Networks and their Applications

arXiv:2207.00957 [pdf, other]

On Convergence of Gradient Descent Ascent: A Tight Local Analysis

Authors: Haochuan Li, Farzan Farnia, Subhro Das, Ali Jadbabaie

Abstract: Gradient Descent Ascent (GDA) methods are the mainstream algorithms for minimax optimization in generative adversarial networks (GANs). Convergence properties of GDA have drawn significant interest in the recent literature. Specifically, for $\min_{\mathbf{x}} \max_{\mathbf{y}} f(\mathbf{x};\mathbf{y})$ where $f$ is strongly-concave in $\mathbf{y}$ and possibly nonconvex in $\mathbf{x}$, (Lin et al., 2020) proved the convergence of GDA with a stepsize ratio $η_{\mathbf{y}}/η_{\mathbf{x}}=Θ(κ^2)$ where $η_{\mathbf{x}}$ and $η_{\mathbf{y}}$ are the stepsizes for $\mathbf{x}$ and $\mathbf{y}$ and $κ$ is the condition number for $\mathbf{y}$. While this stepsize ratio suggests a slow training of the min player, practical GAN algorithms typically adopt similar stepsizes for both variables, indicating a wide gap between theoretical and empirical results. In this paper, we aim to bridge this gap by analyzing the \emph{local convergence} of general \emph{nonconvex-nonconcave} minimax problems. We demonstrate that a stepsize ratio of $Θ(κ)$ is necessary and sufficient for local convergence of GDA to a Stackelberg Equilibrium, where $κ$ is the local condition number for $\mathbf{y}$. We prove a nearly tight convergence rate with a matching lower bound. We further extend the convergence guarantees to stochastic GDA and extra-gradient methods (EG). Finally, we conduct several numerical experiments to support our theoretical findings.

Submitted 3 July, 2022; originally announced July 2022.

Comments: Accepted by ICML 2022
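
The two-stepsize GDA iteration described in the abstract above can be sketched on a toy problem. This is a minimal sketch of simultaneous GDA with separate stepsizes for the min and max players; the function name `gda`, the scalar objective $f(x, y) = xy - y^2/2$, and the stepsize values are assumptions chosen for illustration and are not taken from the paper.

```python
def gda(grad_x, grad_y, x0, y0, eta_x, eta_y, steps):
    """Simultaneous gradient descent (on x) and ascent (on y) with separate stepsizes."""
    x, y = x0, y0
    for _ in range(steps):
        # Both gradients are evaluated at the current iterate before either update.
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - eta_x * gx, y + eta_y * gy
    return x, y

# Hypothetical objective f(x, y) = x*y - 0.5*y**2, strongly concave in y.
# Its best response is y*(x) = x, so max_y f(x, y) = 0.5*x**2, minimized at x = 0.
x_star, y_star = gda(
    grad_x=lambda x, y: y,       # df/dx
    grad_y=lambda x, y: x - y,   # df/dy
    x0=1.0, y0=1.0, eta_x=0.05, eta_y=0.5, steps=500,
)
```

Here the stepsize ratio is $η_y/η_x = 10$, and the iterates contract toward the equilibrium at the origin. The toy problem has condition number 1, so it illustrates only the mechanics of the iteration, not the $Θ(κ)$ threshold itself.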

arXiv:2206.09945 [pdf, ps, other]

Sparse Representations of Dynamical Networks: A Coprime Factorization Approach

Authors: Şerban Sabău, Andrei Sperilă, Cristian Oară, Ali Jadbabaie

Abstract: We study a class of dynamical networks modeled by linear and time-invariant systems which are described by state-space realizations. For these networks, we investigate the relations between various types of factorizations which preserve the structure of their component subsystems' interconnection. In doing so, we provide tractable means of shifting between different types of sparsity-preserving representations and we show how to employ these factorizations to obtain distributed implementations for stabilizing and possibly stable controllers. By formulating all these results for both discrete- and continuous-time systems, we develop specialized distributed implementations that, up to this point, were only available for networks modeled as discrete-time systems.

Submitted 13 February, 2024; v1 submitted 20 June, 2022; originally announced June 2022.

Comments: 35 pages, 5 figures

MSC Class: 93A14; 93B99; 93C05

arXiv:2206.08257 [pdf, other]

Gradient Descent for Low-Rank Functions

Authors: Romain Cosson, Ali Jadbabaie, Anuran Makur, Amirhossein Reisizadeh, Devavrat Shah

Abstract: Several recent empirical studies demonstrate that important machine learning tasks, e.g., training deep neural networks, exhibit low-rank structure, where the loss function varies significantly in only a few directions of the input space. In this paper, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (GD). Our proposed \emph{Low-Rank Gradient Descent} (LRGD) algorithm finds an $ε$-approximate stationary point of a $p$-dimensional function by first identifying $r \leq p$ significant directions, and then estimating the true $p$-dimensional gradient at every iteration by computing directional derivatives only along those $r$ directions. We establish that the "directional oracle complexities" of LRGD for strongly convex and non-convex objective functions are $\mathcal{O}(r \log(1/ε) + rp)$ and $\mathcal{O}(r/ε^2 + rp)$, respectively. When $r \ll p$, these complexities are smaller than the known complexities of $\mathcal{O}(p \log(1/ε))$ and $\mathcal{O}(p/ε^2)$ of GD in the strongly convex and non-convex settings, respectively. Thus, LRGD significantly reduces the computational cost of gradient-based methods for sufficiently low-rank functions. In the course of our analysis, we also formally define and characterize the classes of exact and approximately low-rank functions.

Submitted 16 June, 2022; originally announced June 2022.

Comments: 26 pages, 2 figures

arXiv:2206.02468 [pdf, ps, other]

doi 10.1109/JSAIT.2022.3182355

An Optimal Transport Approach to Personalized Federated Learning

Authors: Farzan Farnia, Amirhossein Reisizadeh, Ramtin Pedarsani, Ali Jadbabaie

Abstract: Federated learning is a distributed machine learning paradigm, which aims to train a model using the local data of many distributed clients. A key challenge in federated learning is that the data samples across the clients may not be identically distributed. To address this challenge, personalized federated learning with the goal of tailoring the learned model to the data distribution of every individual client has been proposed. In this paper, we focus on this problem and propose a novel personalized Federated Learning scheme based on Optimal Transport (FedOT) as a learning algorithm that learns the optimal transport maps for transferring data points to a common distribution as well as the prediction model under the applied transport map. To formulate the FedOT problem, we extend the standard optimal transport task between two probability distributions to multi-marginal optimal transport problems with the goal of transporting samples from multiple distributions to a common probability domain. We then leverage the results on multi-marginal optimal transport problems to formulate FedOT as a min-max optimization problem and analyze its generalization and optimization properties. We discuss the results of several numerical experiments to evaluate the performance of FedOT under heterogeneous data distributions in federated learning problems.

Submitted 6 June, 2022; originally announced June 2022.

arXiv:2204.01155 [pdf, ps, other]

Byzantine-Robust Federated Linear Bandits

Authors: Ali Jadbabaie, Haochuan Li, Jian Qian, Yi Tian

Abstract: In this paper, we study a linear bandit optimization problem in a federated setting where a large collection of distributed agents collaboratively learn a common linear bandit model. Standard federated learning algorithms applied to this setting are vulnerable to Byzantine attacks on even a small fraction of agents. We propose a novel algorithm with a robust aggregation oracle that utilizes the geometric median. We prove that our proposed algorithm is robust to Byzantine attacks on fewer than half of agents and achieves a sublinear $\tilde{\mathcal{O}}({T^{3/4}})$ regret with $\mathcal{O}(\sqrt{T})$ steps of communication in $T$ steps. Moreover, we make our algorithm differentially private via a tree-based mechanism. Finally, if the level of corruption is known to be small, we show that using the geometric median of mean oracle for robust aggregation further improves the regret bound.

Submitted 3 April, 2022; originally announced April 2022.
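
The geometric median used for robust aggregation in the abstract above can be approximated with Weiszfeld's fixed-point iteration. The sketch below is a generic illustration of that iteration, not the paper's aggregation oracle; the function name `geometric_median`, the iteration count, the distance floor `eps`, and the sample data are assumptions for illustration.

```python
import math

def geometric_median(points, iters=200, eps=1e-12):
    """Approximate the geometric median of a list of vectors (Weiszfeld iteration)."""
    dim = len(points[0])
    # Start from the coordinate-wise mean, which a single outlier can drag far away.
    y = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            d = math.sqrt(sum((p[k] - y[k]) ** 2 for k in range(dim)))
            w = 1.0 / max(d, eps)  # floor the distance to avoid division by zero
            den += w
            for k in range(dim):
                num[k] += w * p[k]
        # New estimate: average of the points weighted by inverse distance.
        y = [n / den for n in num]
    return y

# Three agents report the same vector; one hypothetical corrupted agent reports an outlier.
reports = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [100.0, 100.0]]
robust = geometric_median(reports)
mean = [sum(r[k] for r in reports) / len(reports) for k in range(2)]
```

With a minority of corrupted reports, the geometric median stays near the honest value while the plain mean is pulled to (25, 25).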

arXiv:2203.15916 [pdf, other]

Current Implicit Policies May Not Eradicate COVID-19

Authors: Ali Jadbabaie, Arnab Sarker, Devavrat Shah

Abstract: Successful predictive modeling of epidemics requires an understanding of the implicit feedback control strategies which are implemented by populations to modulate the spread of contagion. While this task of capturing endogenous behavior can be achieved through intricate modeling assumptions, we find that a population's reaction to case counts can be described through a second order affine dynamical system with linear control which fits well to the data across different regions and times throughout the COVID-19 pandemic. The model fits the data well both in and out of sample across the 50 states of the United States, with comparable $R^2$ scores to state of the art ensemble predictions. In contrast to recent models of epidemics, rather than assuming that individuals directly control the contact rate which governs the spread of disease, we assume that individuals control the rate at which they vary their number of interactions, i.e. they control the derivative of the contact rate. We propose an implicit feedback law for this control input and verify that it correlates with policies taken throughout the pandemic. A key takeaway of the dynamical model is that the "stable" point of case counts is non-zero, i.e. COVID-19 will not be eradicated under the current collection of policies and strategies, and additional policies are needed to fully eradicate it quickly. Hence, we suggest alternative implicit policies which focus on making interventions (such as vaccinations and mobility restrictions) a function of cumulative case counts, for which our results suggest a better possibility of eradicating COVID-19.

Submitted 29 March, 2022; originally announced March 2022.

arXiv:2201.04960 [pdf, other]

Unifying Epidemic Models with Mixtures

Authors: Arnab Sarker, Ali Jadbabaie, Devavrat Shah

Abstract: The COVID-19 pandemic has emphasized the need for a robust understanding of epidemic models. Current models of epidemics are classified as either mechanistic or non-mechanistic: mechanistic models make explicit assumptions on the dynamics of disease, whereas non-mechanistic models make assumptions on the form of observed time series. Here, we introduce a simple mixture-based model which bridges the two approaches while retaining benefits of both. The model represents time series of cases and fatalities as a mixture of Gaussian curves, providing a flexible function class to learn from data compared to traditional mechanistic models. Although the model is non-mechanistic, we show that it arises as the natural outcome of a stochastic process based on a networked SIR framework. This allows learned parameters to take on a more meaningful interpretation compared to similar non-mechanistic models, and we validate the interpretations using auxiliary mobility data collected during the COVID-19 pandemic. We provide a simple learning algorithm to identify model parameters and establish theoretical results which show the model can be efficiently learned from data. Empirically, we find the model to have low prediction error. The model is available live at covidpredictions.mit.edu. Ultimately, this allows us to systematically understand the impacts of interventions on COVID-19, which is critical in developing data-driven solutions to controlling epidemics.

Submitted 7 January, 2022; originally announced January 2022.

arXiv:2201.01954 [pdf, other]

doi 10.1109/TIT.2023.3317168

Federated Optimization of Smooth Loss Functions

Authors: Ali Jadbabaie, Anuran Makur, Devavrat Shah

Abstract: In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server minimizes an ERM objective function using training data that is stored across $m$ clients. In this setting, the Federated Averaging (FedAve) algorithm is the staple for determining $ε$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, the convergence analysis of FedAve only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the loss function, our method first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients. Then, our method solves the ERM problem at the server using inexact gradient descent. To show that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity. Under some assumptions on the loss function, e.g., strong convexity in parameter, $η$-Hölder smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $φm(p/ε)^{Θ(d/η)}$ and that of FedAve scales like $φm(p/ε)^{3/4}$ (neglecting sub-dominant factors), where $φ\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small and the loss function is sufficiently smooth in the data, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a result on low rank approximation of latent variable models.

Submitted 3 January, 2024; v1 submitted 6 January, 2022; originally announced January 2022.

Comments: 31 pages, double column format, 2 figures

Journal ref: IEEE Transactions on Information Theory, vol. 69, no. 12, Dec. 2023

arXiv:2112.14862 [pdf, ps, other]

Time varying regression with hidden linear dynamics

Authors: Ali Jadbabaie, Horia Mania, Devavrat Shah, Suvrit Sra

Abstract: We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system. Counterintuitively, we show that when the underlying dynamics are stable the parameters of this model can be estimated from data by combining just two ordinary least squares estimates. We offer a finite sample guarantee on the estimation error of our method and discuss certain advantages it has over Expectation-Maximization (EM), which is the main approach proposed by prior work.

Submitted 29 December, 2021; originally announced December 2021.

Comments: 22 pages

arXiv:2112.09093 [pdf, other]

doi 10.1109/TAC.2023.3298549

Network Realization Functions for Optimal Distributed Control

Authors: Şerban Sabău, Andrei Sperilă, Cristian Oară, Ali Jadbabaie

Abstract: In this paper, we discuss a distributed control architecture, aimed at networks with linear and time-invariant dynamics, which is amenable to convex formulations for controller design. The proposed approach is well suited for large scale systems, since the resulting feedback schemes completely avoid the exchange of internal states, i.e., plant or controller states, among sub-controllers. Additionally, we provide state-space formulas for these sub-controllers, able to be implemented in a distributed manner.

Submitted 7 August, 2023; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: 8 pages, 6 figures

Journal ref: IEEE Transactions on Automatic Control, Early Access, 2023

arXiv:2110.06256 [pdf, other]

Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective

Authors: Jingzhao Zhang, Haochuan Li, Suvrit Sra, Ali Jadbabaie

Abstract: This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.

Submitted 17 June, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Journal ref: ICML 2022

arXiv:2108.02091 [pdf, other]

Which Bridges Are Weak Ties? Algebraic Topological Insights on Network Structure and Tie Strength

Authors: Arnab Sarker, Jean-Baptiste Seby, Austin R. Benson, Ali Jadbabaie

Abstract: Bridging relationships between individuals situated in different parts of a social network are important conduits for information and resources in social and organizational settings. Dyadic tie strength has often been used as an indicator for whether a relationship is bridging, under the assumption that bridging ties are always weak ties. However, recent empirical evidence suggests that bridging ties are often strong, forcing us to rethink the relationship between social network structure and dyadic tie strength. Here, we provide an analysis based on algebraic topology which clarifies this relationship between network structure and dyadic tie strength. Rather than model the network as a graph, we use a simplicial complex which can explicitly encode group interactions between three or more individuals. First, we show theoretically and empirically that Edge PageRank, an algebraic topological measure originally defined as an extension of the classical PageRank measure, is a valid continuous measure of how well a relationship acts as a bridge. Second, we use the tool of Hodge Decomposition, which allows us to decompose any flow in a simplicial complex into three orthogonal components, to clarify the relationship between dyadic tie strength and network structure. We find that individuals invest less in relationships associated with topological holes in the network, replicating and explaining recent empirical results that bridging relationships spanning short network distances tend to be weak, whereas those spanning longer distances are strong. Our results are validated on 15 large scale datasets and suggest the value of algebraic topological methods in empirical network analysis.

Submitted 5 January, 2023; v1 submitted 4 August, 2021; originally announced August 2021.

arXiv:2107.11868 [pdf, ps, other]

In Defense of Liquid Democracy

Authors: Daniel Halpern, Joseph Y. Halpern, Ali Jadbabaie, Elchanan Mossel, Ariel D. Procaccia, Manon Revel

Abstract: Fluid democracy is a voting paradigm that allows voters to choose between directly voting and transitively delegating their votes to other voters. While fluid democracy has been viewed as a system that can combine the best aspects of direct and representative democracy, it can also result in situations where few voters amass a large amount of influence. To analyze the impact of this shortcoming, we consider what has been called an epistemic setting, where voters decide on a binary issue for which there is a ground truth. Previous work has shown that under certain assumptions on the delegation mechanism, the concentration of power is so severe that fluid democracy is less likely to identify the ground truth than direct voting. We examine different, arguably more realistic, classes of mechanisms, and prove they behave well by ensuring that (with high probability) there is a limit on concentration of power. Our proofs demonstrate that delegations can be treated as stochastic processes and that they can be compared to well-known processes from the literature -- such as preferential attachment and multi-type branching processes -- that are sufficiently bounded for our purposes. Our results suggest that the concerns raised about fluid democracy can be overcome, thereby bolstering the case for this emerging paradigm.

Submitted 29 March, 2022; v1 submitted 25 July, 2021; originally announced July 2021.

arXiv:2104.11769 [pdf]

doi 10.1063/5.0055293

Fine and hyperfine interactions in $^{171}$YbOH and $^{173}$YbOH

Authors: Nickolas H. Pilgram, Arian Jadbabaie, Yi Zeng, Nicholas R. Hutzler, Timothy C. Steimle

Abstract: The odd isotopologues of ytterbium monohydroxide, $^{171,173}$YbOH, have been identified as promising molecules in which to measure parity (P) and time reversal (T) violating physics. Here we characterize the $\tilde{A}^{2}Π_{1/2}(0,0,0)-\tilde{X}^2Σ^+(0,0,0)$ band near 577 nm for these odd isotopologues. Both laser-induced fluorescence (LIF) excitation spectra of a supersonic molecular beam sample and absorption spectra of a cryogenic buffer-gas cooled sample were recorded. Additionally, a novel spectroscopic technique based on laser-enhanced chemical reactions is demonstrated and utilized in the absorption measurements. This technique is especially powerful for disentangling congested spectra. An effective Hamiltonian model is used to extract the fine and hyperfine parameters for the $\tilde{A}^{2}Π_{1/2}(0,0,0)$ and $\tilde{X}^2Σ^+(0,0,0)$ states. A comparison of the determined $\tilde{X}^2Σ^+(0,0,0)$ hyperfine parameters with recently predicted values (M. Denis, et al., J. Chem. Phys. $\bf{152}$, 084303 (2020), K. Gaul and R. Berger, Phys. Rev. A $\bf{101}$, 012508 (2020), J. Liu et al., J. Chem. Phys. $\bf{154}$, 064110 (2021)) is made. The measured hyperfine parameters provide experimental confirmation of the computational methods used to compute the P,T-violating coupling constants $W_d$ and $W_M$, which correlate P,T-violating physics to P,T-violating energy shifts in the molecule. The dependence of the fine and hyperfine parameters of the $\tilde{A}^{2}Π_{1/2}(0,0,0)$ and $\tilde{X}^2Σ^+(0,0,0)$ states for all isotopologues of YbOH is discussed and a comparison to isoelectronic YbF is made.

Submitted 23 April, 2021; originally announced April 2021.

Comments: 54 pages, 7 figures

Journal ref: J. Chem. Phys. 154, 244309 (2021)

arXiv:2104.11172 [pdf, ps, other]

Inference in Opinion Dynamics under Social Pressure

Authors: Ali Jadbabaie, Anuran Makur, Elchanan Mossel, Rabih Salhab

Abstract: We introduce a new opinion dynamics model where a group of agents holds two kinds of opinions: inherent and declared. Each agent's inherent opinion is fixed and unobservable by the other agents. At each time step, agents broadcast their declared opinions on a social network, which are governed by the agents' inherent opinions and social pressure. In particular, we assume that agents may declare opinions that are not aligned with their inherent opinions to conform with their neighbors. This raises the natural question: Can we estimate the agents' inherent opinions from observations of declared opinions? For example, agents' inherent opinions may represent their true political alliances (Democrat or Republican), while their declared opinions may model the political inclinations of tweets on social media. In this context, we may seek to predict the election results by observing voters' tweets, which do not necessarily reflect their political support due to social pressure. We analyze this question in the special case where the underlying social network is a complete graph. We prove that, as long as the population does not include large majorities, estimation of aggregate and individual inherent opinions is possible. On the other hand, large majorities force minorities to lie over time, which makes asymptotic estimation impossible.

Submitted 3 May, 2022; v1 submitted 22 April, 2021; originally announced April 2021.

arXiv:2104.08708 [pdf, other]

Complexity Lower Bounds for Nonconvex-Strongly-Concave Min-Max Optimization

Authors: Haochuan Li, Yi Tian, Jingzhao Zhang, Ali Jadbabaie

Abstract: We provide a first-order oracle complexity lower bound for finding stationary points of min-max optimization problems where the objective function is smooth, nonconvex in the minimization variable, and strongly concave in the maximization variable. We establish a lower bound of $Ω\left(\sqrt{κ}ε^{-2}\right)$ for deterministic oracles, where $ε$ defines the level of approximate stationarity and $κ$ is the condition number. Our analysis shows that the upper bound achieved in (Lin et al., 2020b) is optimal in the $ε$ and $κ$ dependence up to logarithmic factors. For stochastic oracles, we provide a lower bound of $Ω\left(\sqrt{κ}ε^{-2} + κ^{1/3}ε^{-4}\right)$. It suggests that there is a significant gap between the upper bound $\mathcal{O}(κ^3 ε^{-4})$ in (Lin et al., 2020a) and our lower bound in the condition number dependence.

Submitted 18 April, 2021; originally announced April 2021.

Comments: 20 pages, 1 figure

arXiv:2103.07079 [pdf, other]

Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality by supplementing it with another inequality that accounts for single-shuffle, which is a widely used without-replacement sampling scheme that shuffles only once in the beginning and is overlooked in the Recht-Ré conjecture. Instead of general positive semidefinite matrices, we restrict our attention to positive definite matrices with small enough condition numbers, which are more relevant to matrices that arise in the analysis of SGD. For such matrices, we conjecture that the means of matrix products corresponding to with- and without-replacement variants of SGD satisfy a series of spectral norm inequalities that can be summarized as: "single-shuffle SGD converges faster than random-reshuffle SGD, which is in turn faster than with-replacement SGD." We present theorems that support our conjecture by proving several special cases.

Submitted 11 March, 2021; originally announced March 2021.

Comments: 26 pages, 2 figures
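
The three sampling schemes compared in the abstract above differ only in how the order of sample indices is drawn for each epoch. The sketch below illustrates that difference; the function name, the scheme labels, and the seed argument are hypothetical, and no SGD update or convergence behavior is modeled.

```python
import random


def epoch_orders(n, epochs, scheme, seed=0):
    """Return one list of sample indices per epoch for the given scheme.

    scheme is one of (hypothetical labels):
      "single_shuffle"   - shuffle once, reuse the same order every epoch
      "random_reshuffle" - draw a fresh permutation every epoch
      "with_replacement" - draw n indices independently and uniformly
    """
    rng = random.Random(seed)
    base = list(range(n))
    if scheme == "single_shuffle":
        rng.shuffle(base)  # the only shuffle this scheme ever performs

    orders = []
    for _ in range(epochs):
        if scheme == "single_shuffle":
            orders.append(list(base))
        elif scheme == "random_reshuffle":
            perm = list(range(n))
            rng.shuffle(perm)
            orders.append(perm)
        elif scheme == "with_replacement":
            orders.append([rng.randrange(n) for _ in range(n)])
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return orders
```

Both without-replacement schemes visit every sample exactly once per epoch; only the with-replacement scheme can repeat or skip samples within an epoch.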

arXiv:2012.02847 [pdf, other]

Network Group Testing

Authors: Paolo Bertolotti, Ali Jadbabaie

Abstract: We consider the problem of identifying infected individuals in a population of size N. We introduce a group testing approach that uses significantly fewer than N tests when infection prevalence is low. The most common approach to group testing, Dorfman testing, groups individuals randomly. However, as communicable diseases spread from individual to individual through underlying social networks, our approach utilizes network information to improve performance. Network grouping, which groups individuals by community, weakly dominates Dorfman testing in terms of the expected number of tests used. Network grouping's outperformance is determined by the strength of community structure in the network. When networks have strong community structure, network grouping achieves the lower bound for two-stage testing procedures. As an empirical example, we consider the scenario of a university testing its population for COVID-19. Using social network data from a Danish university, we demonstrate network grouping requires significantly fewer tests than Dorfman. In contrast to many proposed group testing approaches, network grouping is simple for practitioners to implement. In practice, individuals can be grouped by family unit, social group, or work group.

Submitted 30 December, 2021; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: Updated to version presented at ICLR 2021 AI for Public Health Workshop
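
For context on the Dorfman baseline mentioned in the abstract above, the classical expected test count for two-stage pooled testing can be sketched as follows. It assumes infections are independent with a common prevalence p, which is the textbook setting and not the network model studied in the paper. The function name and the example numbers are hypothetical.

```python
def dorfman_expected_tests(n, k, p):
    """Expected number of tests for two-stage Dorfman testing.

    n: population size (assumed divisible by the group size k)
    k: group size
    p: infection probability, assumed independent across individuals

    Each group takes one pooled test; if the pool is positive, which
    happens with probability 1 - (1 - p)**k, all k members are retested.
    """
    if n % k != 0:
        raise ValueError("n must be divisible by k in this sketch")
    groups = n // k
    prob_group_positive = 1.0 - (1.0 - p) ** k
    return groups * (1.0 + k * prob_group_positive)


# Made-up example: 1000 people, pools of 10, 1% prevalence.
expected = dorfman_expected_tests(1000, 10, 0.01)
```

With these example numbers the expected count comes out well below the 1000 tests that individual testing would need, which is the low-prevalence saving the abstract refers to.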

arXiv:2011.10669 [pdf, other]

A General Framework for Distributed Inference with Uncertain Models

Authors: James Z. Hare, Cesar A. Uribe, Lance Kaplan, Ali Jadbabaie

Abstract: This paper studies the problem of distributed classification with a network of heterogeneous agents. The agents seek to jointly identify the underlying target class that best describes a sequence of observations. The problem is first abstracted to a hypothesis-testing framework, where we assume that the agents seek to agree on the hypothesis (target class) that best matches the distribution of observations. Non-Bayesian social learning theory provides a framework that solves this problem in an efficient manner by allowing the agents to sequentially communicate and update their beliefs for each hypothesis over the network. Most existing approaches assume that agents have access to exact statistical models for each hypothesis. However, in many practical applications, agents learn the likelihood models based on limited data, which induces uncertainty in the likelihood function parameters. In this work, we build upon the concept of uncertain models to incorporate the agents' uncertainty in the likelihoods by identifying a broad set of parametric distributions that allows the agents' beliefs to converge to the same result as a centralized approach. Furthermore, we empirically explore extensions to non-parametric models to provide a generalized framework of uncertain models in non-Bayesian social learning.

Submitted 20 November, 2020; originally announced November 2020.

arXiv:2011.02522 [pdf, ps, other]

Gradient-Based Empirical Risk Minimization using Local Polynomial Regression

Authors: Ali Jadbabaie, Anuran Makur, Devavrat Shah

Abstract: In this paper, we consider the problem of empirical risk minimization (ERM) of smooth, strongly convex loss functions using iterative gradient-based methods. A major goal of this literature has been to compare different algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD), by analyzing their rates of convergence to $ε$-approximate solutions. For example, the oracle complexity of GD is $O(n\log(ε^{-1}))$, where $n$ is the number of training samples. When $n$ is large, this can be expensive in practice, and SGD is preferred due to its oracle complexity of $O(ε^{-1})$. Such standard analyses only utilize the smoothness of the loss function in the parameter being optimized. In contrast, we demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD in important regimes. Specifically, at every iteration, our proposed algorithm performs local polynomial regression to learn the gradient of the loss function, and then estimates the true gradient of the ERM objective function. We establish that the oracle complexity of our algorithm scales like $\tilde{O}((p ε^{-1})^{d/(2η)})$ (neglecting sub-dominant factors), where $d$ and $p$ are the data and parameter space dimensions, respectively, and the gradient of the loss function belongs to an $η$-Hölder class with respect to the data. Our proof extends the analysis of local polynomial regression in non-parametric statistics to provide interpolation guarantees in multivariate settings, and also exploits tools from the inexact GD literature. Unlike GD and SGD, the complexity of our method depends on $d$ and $p$. However, when $d$ is small and the loss function exhibits modest smoothness in the data, our algorithm beats GD and SGD in oracle complexity for a very broad range of $p$ and $ε$.

Submitted 4 November, 2020; originally announced November 2020.

Comments: 34 pages

arXiv:2007.03562 [pdf, ps, other]

A Distributed Cubic-Regularized Newton Method for Smooth Convex Optimization over Networks

Authors: César A. Uribe, Ali Jadbabaie

Abstract: We propose a distributed, cubic-regularized Newton method for large-scale convex optimization over networks. The proposed method requires only local computations and communications and is suitable for federated learning applications over arbitrary network topologies. We show an $O(k^{-3})$ convergence rate when the cost function is convex with Lipschitz gradient and Hessian, with $k$ being the number of iterations. We further provide network-dependent bounds for the communication required in each step of the algorithm. We provide numerical experiments that validate our theoretical results.

Submitted 7 July, 2020; originally announced July 2020.

Comments: 22 pages, 2 figures. Preprint, under review

arXiv:2006.10293 [pdf, other]

GAT-GMM: Generative Adversarial Training for Gaussian Mixture Models

Authors: Farzan Farnia, William Wang, Subhro Das, Ali Jadbabaie

Abstract: Generative adversarial networks (GANs) learn the distribution of observed samples through a zero-sum game between two machine players, a generator and a discriminator. While GANs achieve great success in learning the complex distribution of image, sound, and text data, they perform suboptimally in learning multi-modal distribution-learning benchmarks including Gaussian mixture models (GMMs). In this paper, we propose Generative Adversarial Training for Gaussian Mixture Models (GAT-GMM), a minimax GAN framework for learning GMMs. Motivated by optimal transport theory, we design the zero-sum game in GAT-GMM using a random linear generator and a softmax-based quadratic discriminator architecture, which leads to a non-convex concave minimax optimization problem. We show that a Gradient Descent Ascent (GDA) method converges to an approximate stationary minimax point of the GAT-GMM optimization problem. In the benchmark case of a mixture of two symmetric, well-separated Gaussians, we further show this stationary point recovers the true parameters of the underlying GMM. We numerically support our theoretical findings by performing several experiments, which demonstrate that GAT-GMM can perform as well as the expectation-maximization algorithm in learning mixtures of two Gaussians.

Submitted 18 June, 2020; originally announced June 2020.
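
The Gradient Descent Ascent (GDA) update named in the abstract above takes a descent step in the minimization variable and an ascent step in the maximization variable. The sketch below runs simultaneous GDA on a toy objective chosen purely for illustration, f(x, y) = 0.5 x^2 + x y - 0.5 y^2. This toy problem is convex-concave and is not the GAT-GMM objective; the function name, step size, and iteration count are assumptions.

```python
def gda(grad_x, grad_y, x0, y0, step=0.1, iters=200):
    """Simultaneous gradient descent (in x) and ascent (in y)."""
    x, y = x0, y0
    for _ in range(iters):
        gx, gy = grad_x(x, y), grad_y(x, y)  # both gradients at the current point
        x, y = x - step * gx, y + step * gy  # descend in x, ascend in y
    return x, y


# Toy saddle problem f(x, y) = 0.5*x**2 + x*y - 0.5*y**2, saddle point at (0, 0).
x_star, y_star = gda(
    grad_x=lambda x, y: x + y,  # df/dx
    grad_y=lambda x, y: x - y,  # df/dy
    x0=1.0,
    y0=1.0,
)
```

On this toy problem the iterates contract toward the saddle point at the origin; nonconvex-concave problems such as the one in the paper require the more careful analysis the abstract describes.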

arXiv:2006.08907 [pdf, other]

Robust Federated Learning: The Case of Affine Distribution Shifts

Authors: Amirhossein Reisizadeh, Farzan Farnia, Ramtin Pedarsani, Ali Jadbabaie

Abstract: Federated learning is a distributed paradigm that aims at training models using samples distributed across multiple users in a network while keeping the samples on users' devices with the aim of efficiency and protecting users' privacy. In such settings, the training data is often statistically heterogeneous and manifests various distribution shifts across users, which degrades the performance of the learnt model. The primary goal of this paper is to develop a robust federated learning algorithm that achieves satisfactory performance against distribution shifts in users' samples. To achieve this goal, we first consider a structured affine distribution shift in users' data that captures the device-dependent data heterogeneity in federated settings. This perturbation model is applicable to various federated learning problems such as image classification where the images undergo device-dependent imperfections, e.g. different intensity, contrast, and brightness. To address affine distribution shifts across users, we propose a Federated Learning framework Robust to Affine distribution shifts (FLRA) that is provably robust against affine Wasserstein shifts to the distribution of observed samples. To solve the FLRA's distributed minimax problem, we propose a fast and efficient optimization method and provide convergence guarantees via a Gradient Descent Ascent (GDA) method. We further prove generalization error bounds for the learnt classifier to show proper generalization from empirical distribution of samples to the true underlying distribution. We perform several numerical experiments to empirically support FLRA. We show that an affine distribution shift indeed suffices to significantly decrease the performance of the learnt classifier in a new test user, and our proposed algorithm achieves a significant gain in comparison to standard federated learning and adversarial training methods.

Submitted 15 June, 2020; originally announced June 2020.

arXiv:2006.08189 [pdf, other]

Estimation of Skill Distributions

Authors: Ali Jadbabaie, Anuran Makur, Devavrat Shah

Abstract: In this paper, we study the problem of learning the skill distribution of a population of agents from observations of pairwise games in a tournament. These games are played among randomly drawn agents from the population. The agents in our model can be individuals, sports teams, or Wall Street fund managers. Formally, we postulate that the likelihoods of game outcomes are governed by the Bradley-Terry-Luce (or multinomial logit) model, where the probability of an agent beating another is the ratio between its skill level and the pairwise sum of skill levels, and the skill parameters are drawn from an unknown skill density of interest. The problem is, in essence, to learn a distribution from noisy, quantized observations. We propose a simple and tractable algorithm that learns the skill density with near-optimal minimax mean squared error scaling as $n^{-1+\varepsilon}$, for any $\varepsilon>0$, when the density is smooth. Our approach brings together prior work on learning skill parameters from pairwise comparisons with kernel density estimation from non-parametric statistics. Furthermore, we prove minimax lower bounds which establish minimax optimality of the skill parameter estimation technique used in our algorithm. These bounds utilize a continuum version of Fano's method along with a covering argument. We apply our algorithm to various soccer leagues and world cups, cricket world cups, and mutual funds. We find that the entropy of a learnt distribution provides a quantitative measure of skill, which provides rigorous explanations for popular beliefs about perceived qualities of sporting events, e.g., soccer league rankings. Finally, we apply our method to assess the skill distributions of mutual funds. Our results shed light on the abundance of low quality funds prior to the Great Recession of 2008, and the domination of the industry by more skilled funds after the financial crisis.

Submitted 15 June, 2020; originally announced June 2020.

Comments: 37 pages, 1 figure
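
The Bradley-Terry-Luce outcome rule stated in the abstract above (win probability equal to an agent's skill divided by the pairwise sum of skills) can be written directly. The function name and the example skill values below are hypothetical, and the sketch covers only the likelihood of a single game, not the density estimation procedure.

```python
def btl_win_probability(skill_i, skill_j):
    """Probability that agent i beats agent j under the BTL model.

    Skills are assumed to be strictly positive numbers.
    """
    if skill_i <= 0 or skill_j <= 0:
        raise ValueError("skills must be positive")
    return skill_i / (skill_i + skill_j)


# Made-up skills: an agent with skill 3.0 playing an agent with skill 1.0.
p_strong_wins = btl_win_probability(3.0, 1.0)
p_weak_wins = btl_win_probability(1.0, 3.0)
```

The two probabilities for a given pair always sum to one, and equal skills give each side an even chance.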

arXiv:2006.04429 [pdf, other]

Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity

Authors: Jingzhao Zhang, Hongzhou Lin, Subhro Das, Suvrit Sra, Ali Jadbabaie

Abstract: We study oracle complexity of gradient based methods for stochastic approximation problems. Though in many settings optimal algorithms and tight lower bounds are known for such problems, these optimal algorithms do not achieve the best performance when used in practice. We address this theory-practice gap by focusing on instance-dependent complexity instead of worst case complexity. In particular, we first summarize known instance-dependent complexity results and categorize them into three levels. We identify the domination relation between different levels and propose a fourth instance-dependent bound that dominates existing ones. We then provide a sufficient condition according to which an adaptive algorithm with moment estimation can achieve the proposed bound without knowledge of noise levels. Our proposed algorithm and its analysis provide a theoretical justification for the success of moment estimation as it achieves improved instance complexity.

Submitted 17 June, 2022; v1 submitted 8 June, 2020; originally announced June 2020.

Journal ref: ICML 2022

Showing 1–50 of 151 results for author: Jadbabaie, A