Search | arXiv e-print repository

Stochastic Gradient Descent without Full Data Shuffle

Authors: Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jie** Ye, Ce Zhang

Abstract: Stochastic gradient descent (SGD) is the cornerstone of modern machine learning (ML) systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., TensorFlow/PyTorch and in-DB ML systems over large files. To address this impedance mismatch, various… ▽ More Stochastic gradient descent (SGD) is the cornerstone of modern machine learning (ML) systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., TensorFlow/PyTorch and in-DB ML systems over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study on existing data shuffling strategies, which reveals that all existing strategies have room for improvement -- they all suffer in terms of I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining comparable convergence rate of SGD as if a full shuffle were performed. We provide a non-trivial theoretical analysis of CorgiPile on its convergence behavior. We further integrate CorgiPile into PyTorch by designing new parallel/distributed shuffle operators inside a new CorgiPileDataSet API. We also integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile can achieve comparable convergence rate with the full shuffle based SGD for both deep learning and generalized linear models. For deep learning models on ImageNet dataset, CorgiPile is 1.5X faster than PyTorch with full data shuffle. For in-DB ML with linear models, CorgiPile is 1.6X-12.8X faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD. △ Less

Submitted 12 June, 2022; originally announced June 2022.

Comments: This technical report is an extension of our SIGMOD 2022 paper titled "In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle". https://doi.org/10.1145/3514221.3526150

arXiv:2104.10266 [pdf, ps, other]

doi 10.1016/j.ifacol.2021.11.160

Variance Reduction of Quadcopter Trajectory Tracking in Turbulent Wind

Authors: Asma Tabassum, Rohit K. S. S. Vuppala, He Bai, Kursat Kara

Abstract: We consider a quadcopter operating in a turbulent windy environment. The turbulent environment may be imposed on a quadcopter by structures, landscapes, terrains and most importantly by the unique physical phenomena in the lower atmosphere. Turbulence can negatively impact quadcopter's performance and operations. Modeling turbulence as a stochastic random input, we investigate control designs that… ▽ More We consider a quadcopter operating in a turbulent windy environment. The turbulent environment may be imposed on a quadcopter by structures, landscapes, terrains and most importantly by the unique physical phenomena in the lower atmosphere. Turbulence can negatively impact quadcopter's performance and operations. Modeling turbulence as a stochastic random input, we investigate control designs that can reduce the turbulence effects on the quadcopter's motion. In particular, we design a minimum cost variance (MCV) controller aiming to minimize the cost in terms of its weighted sum of mean and variance. We linearize the quadcopter dynamics and examine the MCV controller derived from a set of coupled algebraic Riccati equations (CARE) with full-state feedback. Our preliminary simulation results show reduction in variance and in mean trajectory tracking error compared to a traditional linear quadratic regulator (LQR). △ Less

Submitted 25 August, 2021; v1 submitted 20 April, 2021; originally announced April 2021.

arXiv:2007.06179 [pdf, ps, other]

doi 10.1371/journal.pone.0246092

Multifidelity Computing for Coupling Full and Reduced Order Models

Authors: Shady E. Ahmed, Omer San, Kursat Kara, Rami Younis, Adil Rasheed

Abstract: Hybrid physics-machine learning models are increasingly being used in simulations of transport processes. Many complex multiphysics systems relevant to scientific and engineering applications include multiple spatiotemporal scales and comprise a multifidelity problem sharing an interface between various formulations or heterogeneous computational entities. To this end, we present a robust hybrid a… ▽ More Hybrid physics-machine learning models are increasingly being used in simulations of transport processes. Many complex multiphysics systems relevant to scientific and engineering applications include multiple spatiotemporal scales and comprise a multifidelity problem sharing an interface between various formulations or heterogeneous computational entities. To this end, we present a robust hybrid analysis and modeling approach combining a physics-based full order model (FOM) and a data-driven reduced order model (ROM) to form the building blocks of an integrated approach among mixed fidelity descriptions toward predictive digital twin technologies. At the interface, we introduce a long short-term memory network to bridge these high and low-fidelity models in various forms of interfacial error correction or prolongation. The proposed interface learning approaches are tested as a new way to address ROM-FOM coupling problems solving nonlinear advection-diffusion flow situations with a bifidelity setup that captures the essence of a broad class of transport processes. △ Less

Submitted 12 February, 2021; v1 submitted 13 July, 2020; originally announced July 2020.

arXiv:2006.10112 [pdf, other]

doi 10.1103/PhysRevE.102.053304

Interface learning of multiphysics and multiscale systems

Authors: Shady E. Ahmed, Omer San, Kursat Kara, Rami Younis, Adil Rasheed

Abstract: Complex natural or engineered systems comprise multiple characteristic scales, multiple spatiotemporal domains, and even multiple physical closure laws. To address such challenges, we introduce an interface learning paradigm and put forth a data-driven closure approach based on memory embedding to provide physically correct boundary conditions at the interface. To enable the interface learning for… ▽ More Complex natural or engineered systems comprise multiple characteristic scales, multiple spatiotemporal domains, and even multiple physical closure laws. To address such challenges, we introduce an interface learning paradigm and put forth a data-driven closure approach based on memory embedding to provide physically correct boundary conditions at the interface. To enable the interface learning for hyperbolic systems by considering the domain of influence and wave structures into account, we put forth the concept of upwind learning towards a physics-informed domain decomposition. The promise of the proposed approach is shown for a set of canonical illustrative problems. We highlight that high-performance computing environments can benefit from this methodology to reduce communication costs among processing units in emerging machine learning ready heterogeneous platforms toward exascale era. △ Less

Submitted 31 October, 2020; v1 submitted 17 June, 2020; originally announced June 2020.

Journal ref: Phys. Rev. E 102, 053304 (2020)

arXiv:2005.13332 [pdf, other]

HyperLogLog Sketch Acceleration on FPGA

Authors: Amit Kulkarni, Monica Chiosa, Thomas B. Preußer, Kaan Kara, David Sidler, Gustavo Alonso

Abstract: Data sketches are a set of widely used approximated data summarizing techniques. Their fundamental property is sub-linear memory complexity on the input cardinality, an important aspect when processing streams or data sets with a vast base domain (URLs, IP addresses, user IDs, etc.). Among the many data sketches available, HyperLogLog has become the reference for cardinality counting (how many dis… ▽ More Data sketches are a set of widely used approximated data summarizing techniques. Their fundamental property is sub-linear memory complexity on the input cardinality, an important aspect when processing streams or data sets with a vast base domain (URLs, IP addresses, user IDs, etc.). Among the many data sketches available, HyperLogLog has become the reference for cardinality counting (how many distinct data items there are in a data set). Although it does not count every data item (to reduce memory consumption), it provides probabilistic guarantees on the result, and it is, thus, often used to analyze data streams. In this paper, we explore how to implement HyperLogLog on an FPGA to benefit from the parallelism available and the ability to process data streams coming from high-speed networks. Our multi-pipelined high-cardinality HyperLogLog implementation delivers 1.8x higher throughput than an optimized HyperLogLog running on a dual-socket Intel Xeon E5-2630 v3 system with a total of 16 cores and 32 hyper-threads. △ Less

Submitted 20 October, 2020; v1 submitted 24 May, 2020; originally announced May 2020.

Comments: This paper was accepted as a full paper to FPL 2020. The latest/full version of this paper is available: https://ieeexplore.ieee.org/document/9221525

arXiv:2004.01635 [pdf, other]

High Bandwidth Memory on FPGAs: A Data Analytics Perspective

Authors: Kaan Kara, Christoph Hagleitner, Dionysios Diamantopoulos, Dimitris Syrivelis, Gustavo Alonso

Abstract: FPGA-based data processing in datacenters is increasing in popularity due to the demands of modern workloads and the ensuing necessity for specialization in hardware. Driven by this trend, vendors are rapidly adapting reconfigurable devices to suit data and compute intensive workloads. Inclusion of High Bandwidth Memory (HBM) in FPGA devices is a recent example. HBM promises overcoming the bandwid… ▽ More FPGA-based data processing in datacenters is increasing in popularity due to the demands of modern workloads and the ensuing necessity for specialization in hardware. Driven by this trend, vendors are rapidly adapting reconfigurable devices to suit data and compute intensive workloads. Inclusion of High Bandwidth Memory (HBM) in FPGA devices is a recent example. HBM promises overcoming the bandwidth bottleneck, faced often by FPGA-based accelerators due to their throughput oriented design. In this paper, we study the usage and benefits of HBM on FPGAs from a data analytics perspective. We consider three workloads that are often performed in analytics oriented databases and implement them on FPGA showing in which cases they benefit from HBM: range selection, hash join, and stochastic gradient descent for linear model training. We integrate our designs into a columnar database (MonetDB) and show the trade-offs arising from the integration related to data movement and partitioning. In certain cases, FPGA+HBM based solutions are able to surpass the highest performance provided by either a 2-socket POWER9 system or a 14-core XeonE5 by up to 1.8x (selection), 12.9x (join), and 3.2x (SGD). △ Less

Submitted 2 April, 2020; originally announced April 2020.

arXiv:1903.03404 [pdf, other]

doi 10.14778/3317315.3317322

Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning (Technical Report)

Authors: Zeke Wang, Kaan Kara, Hantian Zhang, Gustavo Alonso, Onur Mutlu, Ce Zhang

Abstract: Learning from the data stored in a database is an important function increasingly available in relational engines. Methods using lower precision input data are of special interest given their overall higher efficiency but, in databases, these methods have a hidden cost: the quantization of the real value into a smaller number is an expensive step. To address the issue, in this paper we present MLW… ▽ More Learning from the data stored in a database is an important function increasingly available in relational engines. Methods using lower precision input data are of special interest given their overall higher efficiency but, in databases, these methods have a hidden cost: the quantization of the real value into a smaller number is an expensive step. To address the issue, in this paper we present MLWeaving, a data structure and hardware acceleration technique intended to speed up learning of generalized linear models in databases. ML-Weaving provides a compact, in-memory representation enabling the retrieval of data at any level of precision. MLWeaving also takes advantage of the increasing availability of FPGA-based accelerators to provide a highly efficient implementation of stochastic gradient descent. The solution adopted in MLWeaving is more efficient than existing designs in terms of space (since it can process any resolution on the same design) and resources (via the use of bit-serial multipliers). MLWeaving also enables the runtime tuning of precision, instead of a fixed precision level during the training. We illustrate this using a simple, dynamic precision schedule. Experimental results show MLWeaving achieves up to16 performance improvement over low-precision CPU implementations of first-order methods. △ Less

Submitted 28 March, 2019; v1 submitted 8 March, 2019; originally announced March 2019.

Comments: 18 pages

Journal ref: PVLDB, 2019

arXiv:1802.04907 [pdf, other]

doi 10.1109/TSP.2020.3010355

Compressive Sensing Using Iterative Hard Thresholding with Low Precision Data Representation: Theory and Applications

Authors: Nezihe Merve Gürel, Kaan Kara, Alen Stojanov, Tyler Smith, Thomas Lemmin, Dan Alistarh, Markus Püschel, Ce Zhang

Abstract: Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems. Lossy compression of data is an intriguing solution, but comes with its own drawbacks, such as potential signal loss, and the need for careful optimization of the compression ratio. In this work, we focus on a setting where this problem is especially acute: compressive sensin… ▽ More Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems. Lossy compression of data is an intriguing solution, but comes with its own drawbacks, such as potential signal loss, and the need for careful optimization of the compression ratio. In this work, we focus on a setting where this problem is especially acute: compressive sensing frameworks for interferometry and medical imaging. We ask the following question: can the precision of the data representation be lowered for all inputs, with recovery guarantees and practical performance? Our first contribution is a theoretical analysis of the normalized Iterative Hard Thresholding (IHT) algorithm when all input data, meaning both the measurement matrix and the observation vector are quantized aggressively. We present a variant of low precision normalized {IHT} that, under mild conditions, can still provide recovery guarantees. The second contribution is the application of our quantization framework to radio astronomy and magnetic resonance imaging. We show that lowering the precision of the data can significantly accelerate image recovery. We evaluate our approach on telescope data and samples of brain images using CPU and FPGA implementations achieving up to a 9x speed-up with negligible loss of recovery quality. △ Less

Submitted 22 December, 2020; v1 submitted 13 February, 2018; originally announced February 2018.

Comments: 19 pages, 5 figures, 1 table, in IEEE Transactions on Signal Processing Vol. 68, No. 7, pp. 4268-4282, 2020

arXiv:1705.05154 [pdf, other]

Layerwise Systematic Scan: Deep Boltzmann Machines and Beyond

Authors: Heng Guo, Kaan Kara, Ce Zhang

Abstract: For Markov chain Monte Carlo methods, one of the greatest discrepancies between theory and system is the scan order - while most theoretical development on the mixing time analysis deals with random updates, real-world systems are implemented with systematic scans. We bridge this gap for models that exhibit a bipartite structure, including, most notably, the Restricted/Deep Boltzmann Machine. The… ▽ More For Markov chain Monte Carlo methods, one of the greatest discrepancies between theory and system is the scan order - while most theoretical development on the mixing time analysis deals with random updates, real-world systems are implemented with systematic scans. We bridge this gap for models that exhibit a bipartite structure, including, most notably, the Restricted/Deep Boltzmann Machine. The de facto implementation for these models scans variables in a layerwise fashion. We show that the Gibbs sampler with a layerwise alternating scan order has its relaxation time (in terms of epochs) no larger than that of a random-update Gibbs sampler (in terms of variable updates). We also construct examples to show that this bound is asymptotically tight. Through standard inequalities, our result also implies a comparison on the mixing times. △ Less

Submitted 9 October, 2017; v1 submitted 15 May, 2017; originally announced May 2017.

Comments: v2: typo fixes and improved presentation

arXiv:1611.05402 [pdf, other]

The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning

Authors: Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, Ce Zhang

Abstract: Recently there has been significant interest in training machine-learning models at low precision: by reducing precision, one can reduce computation and communication by one order of magnitude. We examine training at reduced precision, both from a theoretical and practical perspective, and ask: is it possible to train models at end-to-end low precision with provable guarantees? Can this lead to co… ▽ More Recently there has been significant interest in training machine-learning models at low precision: by reducing precision, one can reduce computation and communication by one order of magnitude. We examine training at reduced precision, both from a theoretical and practical perspective, and ask: is it possible to train models at end-to-end low precision with provable guarantees? Can this lead to consistent order-of-magnitude speedups? We present a framework called ZipML to answer these questions. For linear models, the answer is yes. We develop a simple framework based on one simple but novel strategy called double sampling. Our framework is able to execute training at low precision with no bias, guaranteeing convergence, whereas naive quantization would introduce significant bias. We validate our framework across a range of applications, and show that it enables an FPGA prototype that is up to 6.5x faster than an implementation using full 32-bit precision. We further develop a variance-optimal stochastic quantization strategy and show that it can make a significant difference in a variety of settings. When applied to linear models together with double sampling, we save up to another 1.7x in data movement compared with uniform quantization. When training deep networks with quantized models, we achieve higher accuracy than the state-of-the-art XNOR-Net. Finally, we extend our framework through approximation to non-linear models, such as SVM. We show that, although using low-precision data induces bias, we can appropriately bound and control the bias. We find in practice 8-bit precision is often sufficient to converge to the correct solution. Interestingly, however, in practice we notice that our framework does not always outperform the naive rounding approach. We discuss this negative result in detail. △ Less

Submitted 19 June, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

arXiv:1306.4883 [pdf]

Fault-Tolerant Control of a 2 DOF Helicopter (TRMS System) Based on H_infinity

Authors: Abderrahmen Bouguerra, Djamel Saigaa, Kamel Kara, Samir Zeghlache, Keltoum Loukal

Abstract: In this paper, a Fault-Tolerant control of 2 DOF Helicopter (TRMS System) Based on H-infinity is presented. In particular, the introductory part of the paper presents a Fault-Tolerant Control (FTC), the first part of this paper presents a description of the mathematical model of TRMS, and the last part of the paper presented and a polytypic Unknown Input Observer (UIO) is synthesized using equalit… ▽ More In this paper, a Fault-Tolerant control of 2 DOF Helicopter (TRMS System) Based on H-infinity is presented. In particular, the introductory part of the paper presents a Fault-Tolerant Control (FTC), the first part of this paper presents a description of the mathematical model of TRMS, and the last part of the paper presented and a polytypic Unknown Input Observer (UIO) is synthesized using equalities and LMIs. This UIO is used to observe the faults and then compensate them, in this part the shown how to design a fault-tolerant control strategy for this particular class of non-linear systems. △ Less

Submitted 20 June, 2013; originally announced June 2013.

Comments: 6 pages, 11 figures, conference. In CEIT 2013

Showing 1–11 of 11 results for author: Kara, K