
Scalable Dual Coordinate Descent for Kernel Methods

Authors: Zishan Shao, Aditya Devarakonda

Abstract: Dual Coordinate Descent (DCD) and Block Dual Coordinate Descent (BDCD) are important iterative methods for solving convex optimization problems. In this work, we develop scalable DCD and BDCD methods for the kernel support vector machines (K-SVM) and kernel ridge regression (K-RR) problems. On distributed-memory parallel machines the scalability of these methods is limited by the need to communicate every iteration. On modern hardware where communication is orders of magnitude more expensive, the running time of the DCD and BDCD methods is dominated by communication cost. We address this communication bottleneck by deriving $s$-step variants of DCD and BDCD for solving the K-SVM and K-RR problems, respectively. The $s$-step variants reduce the frequency of communication by a tunable factor of $s$ at the expense of additional bandwidth and computation. The $s$-step variants compute the same solution as the existing methods in exact arithmetic. We perform numerical experiments to illustrate that the $s$-step variants are also numerically stable in finite-precision arithmetic, even for large values of $s$. We perform theoretical analysis to bound the computation and communication costs of the newly designed variants, up to leading order. Finally, we develop high performance implementations written in C and MPI and present scaling experiments performed on a Cray EX cluster. The new $s$-step variants achieved strong scaling speedups of up to $9.8\times$ over existing methods using up to $512$ cores.

Submitted 25 June, 2024; originally announced June 2024.

MSC Class: 65Y05 ACM Class: D.1.3; G.4; F.2.1

arXiv:2402.08677 [pdf, other]

Striped electronic phases in an incommensurately modulated van der Waals superlattice

Authors: Aravind Devarakonda, Alan Chen, Shiang Fang, David Graf, Markus Kriener, Austin J. Akey, David C. Bell, Takehito Suzuki, Joseph G. Checkelsky

Abstract: Electronic properties of crystals can be manipulated using spatially periodic modulations. Long-wavelength, incommensurate modulations are of particular interest, exemplified recently by moiré patterned van der Waals (vdW) heterostructures. Bulk vdW superlattices hosting interfaces between clean 2D layers represent scalable bulk analogs of vdW heterostructures and present a complementary venue to explore incommensurately modulated 2D states. Here we report the bulk vdW superlattice SrTa$_2$S$_5$ realizing an incommensurate 1D modulation of 2D transition metal dichalcogenide (TMD) $H$-TaS$_2$ layers. High-quality electronic transport in the $H$-TaS$_2$ layers, evidenced by quantum oscillations, is made anisotropic by the modulation and shows commensurability oscillations akin to lithographically modulated 2D systems. We also find unconventional, clean-limit superconductivity (SC) in SrTa$_2$S$_5$ with a pronounced suppression of interlayer coherence relative to intralayer coherence. Such a hierarchy can arise from pair-density wave (PDW) SC with mismatched spatial arrangement in adjacent superconducting layers. Examining the in-plane magnetic field $H_{ab}$ dependence of interlayer critical current density $J_c$, we find anisotropy with respect to $H_{ab}$ orientation: $J_c$ is maximized (minimized) when $H_{ab}$ is perpendicular (parallel) to the stripes, consistent with 1D PDW SC. From diffraction we find the structural modulation is shifted between adjacent $H$-TaS$_2$ layers, suggesting mismatched 1D PDW is seeded by the striped structure. With a high-mobility Fermi liquid in a coherently modulated structure, SrTa$_2$S$_5$ is a promising host for novel phenomena anticipated in clean, striped metals and superconductors. More broadly, SrTa$_2$S$_5$ establishes bulk vdW superlattices as macroscopic platforms to address long-standing predictions for modulated electronic phases.

Submitted 13 February, 2024; originally announced February 2024.

Comments: 19 pages, 4 figures

arXiv:2308.02772 [pdf, other]

Probing charge order of monolayer NbSe$_2$ within a bulk crystal

Authors: Doron Azoury, Edoardo Baldini, Aravind Devarakonda, Jiarui Li, Shiang Fang, Pheona Williams, Riccardo Comin, Joseph Checkelsky, Nuh Gedik

Abstract: Atomically thin transition metal dichalcogenides can exhibit markedly different electronic properties compared to their bulk counterparts. In the case of NbSe$_2$, the question of whether its charge density wave (CDW) phase is enhanced in the monolayer limit has been the subject of intense debate, primarily due to the difficulty of decoupling this order from its environment. Here, we address this challenge by using a misfit crystal that comprises NbSe$_2$ monolayers separated by SnSe rock-salt spacers, a structure that allows us to investigate a monolayer crystal embedded in a bulk matrix. We establish an effective monolayer electronic behavior of the misfit crystal by studying its transport properties and visualizing its electronic structure by angle-resolved photoemission measurements. We then investigate the emergence of the CDW by tracking the temperature dependence of its collective modes. Our findings reveal a nearly sixfold enhancement in the CDW transition temperature, providing compelling evidence for the profound impact of dimensionality on charge order formation in NbSe$_2$.

Submitted 4 August, 2023; originally announced August 2023.

arXiv:2307.16652 [pdf, other]

Sequential and Shared-Memory Parallel Algorithms for Partitioned Local Depths

Authors: Aditya Devarakonda, Grey Ballard

Abstract: In this work, we design, analyze, and optimize sequential and shared-memory parallel algorithms for partitioned local depths (PaLD). Given a set of data points and pairwise distances, PaLD is a method for identifying strength of pairwise relationships based on relative distances, enabling the identification of strong ties within dense and sparse communities even if their sizes and within-community absolute distances vary greatly. We design two algorithmic variants that perform community structure analysis through triplet comparisons of pairwise distances. We present theoretical analyses of computation and communication costs and prove that the sequential algorithms are communication optimal, up to constant factors. We introduce performance optimization strategies that yield sequential speedups of up to $29\times$ over a baseline sequential implementation and parallel speedups of up to $19.4\times$ over optimized sequential implementations using up to $32$ threads on an Intel multicore CPU.

Submitted 31 July, 2023; originally announced July 2023.

MSC Class: 68W10 ACM Class: D.1.3

arXiv:2011.08281 [pdf, other]

Avoiding Communication in Logistic Regression

Authors: Aditya Devarakonda, James Demmel

Abstract: Stochastic gradient descent (SGD) is one of the most widely used optimization methods for solving various machine learning problems. SGD solves an optimization problem by iteratively sampling a few data points from the input data, computing gradients for the selected data points, and updating the solution. However, in a parallel setting, SGD requires interprocess communication at every iteration. We introduce a new communication-avoiding technique for solving the logistic regression problem using SGD. This technique re-organizes the SGD computations into a form that communicates every $s$ iterations instead of every iteration, where $s$ is a tuning parameter. We prove theoretical flops, bandwidth, and latency upper bounds for SGD and its new communication-avoiding variant. Furthermore, we show experimental results that illustrate that the new Communication-Avoiding SGD (CA-SGD) method can achieve speedups of up to $4.97\times$ on a high-performance Infiniband cluster without altering the convergence behavior or accuracy.

Submitted 16 November, 2020; originally announced November 2020.

arXiv:1906.02065 [pdf]

doi 10.1126/science.aaz6643

Clean 2D superconductivity in a bulk van der Waals superlattice

Authors: Aravind Devarakonda, Hisashi Inoue, Shiang Fang, Cigdem Ozsoy-Keskinbora, Takehito Suzuki, Markus Kriener, Liang Fu, Efthimios Kaxiras, David C. Bell, Joseph G. Checkelsky

Abstract: Advances in low-dimensional superconductivity are often realized through improvements in material quality. Apart from a small group of organic materials, there is a near absence of clean-limit two-dimensional (2D) superconductors, which presents an impediment to the pursuit of numerous long-standing predictions for exotic superconductivity with fragile pairing symmetries. Here, we report the development of a bulk superlattice consisting of the transition metal dichalcogenide (TMD) superconductor 2$H$-niobium disulfide (2$H$-NbS$_2$) and a commensurate block layer that yields dramatically enhanced two-dimensionality, high electronic quality, and clean-limit inorganic 2D superconductivity. The structure of this material may naturally be extended to generate a distinct family of 2D superconductors, topological insulators, and excitonic systems based on TMDs with improved material properties.

Submitted 11 October, 2020; v1 submitted 5 June, 2019; originally announced June 2019.

Comments: Accepted version with revised title, discussion, structure, and figures. 38 pages, 4 figures

Journal ref: Science 370, 231-236 (2020)

arXiv:1712.06047 [pdf, other]

Avoiding Synchronization in First-Order Methods for Sparse Convex Optimization

Authors: Aditya Devarakonda, Kimon Fountoulakis, James Demmel, Michael W. Mahoney

Abstract: Parallel computing has played an important role in speeding up convex optimization methods for big data analytics and large-scale machine learning (ML). However, the scalability of these optimization methods is inhibited by the cost of communicating and synchronizing processors in a parallel setting. Iterative ML methods are particularly sensitive to communication cost since they often require communication every iteration. In this work, we extend well-known techniques from Communication-Avoiding Krylov subspace methods to first-order, block coordinate descent methods for Support Vector Machines and Proximal Least-Squares problems. Our Synchronization-Avoiding (SA) variants reduce the latency cost by a tunable factor of $s$ at the expense of a factor of $s$ increase in flops and bandwidth costs. We show that the SA-variants are numerically stable and can attain large speedups of up to $5.1\times$ on a Cray XC30 supercomputer.

Submitted 16 December, 2017; originally announced December 2017.

MSC Class: 68W10; 90C25 ACM Class: G.1.6

arXiv:1712.02029 [pdf, other]

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Authors: Aditya Devarakonda, Maxim Naumov, Michael Garland

Abstract: Training deep neural networks with Stochastic Gradient Descent, or its variants, requires careful choice of both learning rate and batch size. While smaller batch sizes generally converge in fewer training epochs, larger batch sizes offer more parallelism and hence better computational efficiency. We have developed a new training approach that, rather than statically choosing a single batch size for all epochs, adaptively increases the batch size during the training process. Our method delivers the convergence rate of small batch sizes while achieving performance similar to large batch sizes. We analyse our approach using the standard AlexNet, ResNet, and VGG networks operating on the popular CIFAR-10, CIFAR-100, and ImageNet datasets. Our results demonstrate that learning with adaptive batch sizes can improve performance by factors of up to 6.25 on 4 NVIDIA Tesla P100 GPUs while changing accuracy by less than 1% relative to training with fixed batch sizes.

Submitted 13 February, 2018; v1 submitted 5 December, 2017; originally announced December 2017.

Comments: 14 pages

MSC Class: 68T05 ACM Class: I.2.6; I.5.0

arXiv:1710.08883 [pdf, other]

Avoiding Communication in Proximal Methods for Convex Optimization Problems

Authors: Saeed Soori, Aditya Devarakonda, James Demmel, Mert Gurbuzbalaban, Maryam Mehri Dehnavi

Abstract: The fast iterative soft thresholding algorithm (FISTA) is used to solve convex regularized optimization problems in machine learning. Distributed implementations of the algorithm have become popular since they enable the analysis of large datasets. However, existing formulations of FISTA communicate data at every iteration, which reduces its performance on modern distributed architectures. The communication costs of FISTA, including bandwidth and latency costs, are closely tied to the mathematical formulation of the algorithm. This work reformulates FISTA to communicate data every k iterations and reduce data communication when operating on large data sets. We formulate the algorithm for two different optimization methods on the Lasso problem and show that the latency cost is reduced by a factor of k while bandwidth and floating-point operation costs remain the same. The convergence rates and stability properties of the reformulated algorithms are similar to the standard formulations. The performance of communication-avoiding FISTA and Proximal Newton methods is evaluated on 1 to 1024 nodes for multiple benchmarks and demonstrates average speedups of 3-10x with scaling properties that outperform the classical algorithms.

Submitted 24 October, 2017; originally announced October 2017.

arXiv:1612.04003 [pdf, other]

Avoiding communication in primal and dual block coordinate descent methods

Authors: Aditya Devarakonda, Kimon Fountoulakis, James Demmel, Michael W. Mahoney

Abstract: Primal and dual block coordinate descent methods are iterative methods for solving regularized and unregularized optimization problems. Distributed-memory parallel implementations of these methods have become popular in analyzing large machine learning datasets. However, existing implementations communicate at every iteration which, on modern data center and supercomputing architectures, often dominates the cost of floating-point computation. Recent results on communication-avoiding Krylov subspace methods suggest that large speedups are possible by re-organizing iterative algorithms to avoid communication. We show how applying similar algorithmic transformations can lead to primal and dual block coordinate descent methods that only communicate every $s$ iterations--where $s$ is a tuning parameter--instead of every iteration for the regularized least-squares problem. We show that the communication-avoiding variants reduce the number of synchronizations by a factor of $s$ on distributed-memory parallel machines without altering the convergence rate and attain strong scaling speedups of up to $6.1\times$ on a Cray XC30 supercomputer.

Submitted 1 May, 2017; v1 submitted 12 December, 2016; originally announced December 2016.

MSC Class: 68W10; 65F10 ACM Class: G.1.0; G.1.3; G.1.6

arXiv:1607.01335 [pdf, other]

Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Authors: Alex Gittens, Aditya Devarakonda, Evan Racah, Michael Ringenburg, Lisa Gerhardt, Jey Kottalam, Jialin Liu, Kristyn Maschhoff, Shane Canon, Jatin Chhugani, Pramod Sharma, Jiyan Yang, James Demmel, Jim Harrell, Venkat Krishnamurthy, Michael W. Mahoney, Prabhat

Abstract: We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely-used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall-and-skinny, which enables the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.

Submitted 20 September, 2016; v1 submitted 5 July, 2016; originally announced July 2016.

ACM Class: G.1.3; C.2.4

Showing 1–11 of 11 results for author: Devarakonda, A