-
Stochastic Gradient Descent without Full Data Shuffle
Authors:
Lijie Xu,
Shuang Qiu,
Binhang Yuan,
Jiawei Jiang,
Cedric Renggli,
Shaoduo Gan,
Kaan Kara,
Guoliang Li,
Ji Liu,
Wentao Wu,
Jieping Ye,
Ce Zhang
Abstract:
Stochastic gradient descent (SGD) is the cornerstone of modern machine learning (ML) systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., TensorFlow/PyTorch and in-DB ML systems over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access).
In this paper, we first conduct a systematic empirical study on existing data shuffling strategies, which reveals that all existing strategies have room for improvement -- they all suffer in terms of I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate of SGD comparable to that obtained with a full shuffle. We provide a non-trivial theoretical analysis of CorgiPile on its convergence behavior. We further integrate CorgiPile into PyTorch by designing new parallel/distributed shuffle operators inside a new CorgiPileDataSet API. We also integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile can achieve a convergence rate comparable to that of full-shuffle-based SGD for both deep learning and generalized linear models. For deep learning models on the ImageNet dataset, CorgiPile is 1.5X faster than PyTorch with full data shuffle. For in-DB ML with linear models, CorgiPile is 1.6X-12.8X faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
Submitted 12 June, 2022;
originally announced June 2022.
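The abstract above describes CorgiPile only as a "hierarchical" shuffle that avoids a full data shuffle. As a rough illustration of that idea, and not the authors' implementation, the sketch below assumes a two-level scheme: the order of storage blocks is shuffled first, then the tuples of a few buffered blocks are shuffled in memory. The function name `two_level_shuffle`, the `buffer_blocks` parameter, and the toy data are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def two_level_shuffle(
    blocks: Sequence[Sequence[T]], buffer_blocks: int, seed: int = 0
) -> Iterator[T]:
    """Yield every tuple exactly once per epoch using a two-level shuffle.

    Assumed scheme (hypothetical, for illustration only):
      1. shuffle the order in which blocks are visited;
      2. load `buffer_blocks` blocks at a time and shuffle their tuples.
    """
    if buffer_blocks < 1:
        raise ValueError("buffer_blocks must be at least 1")
    rng = random.Random(seed)

    # Level 1: shuffle block ids only, so each block can still be read sequentially.
    order = list(range(len(blocks)))
    rng.shuffle(order)

    for start in range(0, len(order), buffer_blocks):
        # Fill the in-memory buffer with the next few blocks.
        buffer: List[T] = []
        for block_id in order[start:start + buffer_blocks]:
            buffer.extend(blocks[block_id])
        # Level 2: shuffle the tuples inside the buffer.
        rng.shuffle(buffer)
        yield from buffer


# Toy data: 6 blocks of 4 tuples; tuple value 10*b + i lives in block b.
blocks = [[10 * b + i for i in range(4)] for b in range(6)]
epoch = list(two_level_shuffle(blocks, buffer_blocks=2, seed=42))
```

In this sketch only one buffer of tuples is ever held in memory, and every block is read in one sequential pass, which is the trade-off between randomness and sequential I/O that the abstract refers to.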
-
Variance Reduction of Quadcopter Trajectory Tracking in Turbulent Wind
Authors:
Asma Tabassum,
Rohit K. S. S. Vuppala,
He Bai,
Kursat Kara
Abstract:
We consider a quadcopter operating in a turbulent windy environment. The turbulent environment may be imposed on a quadcopter by structures, landscapes, terrains and most importantly by the unique physical phenomena in the lower atmosphere. Turbulence can negatively impact a quadcopter's performance and operations. Modeling turbulence as a stochastic random input, we investigate control designs that can reduce the turbulence effects on the quadcopter's motion. In particular, we design a minimum cost variance (MCV) controller aiming to minimize the cost in terms of its weighted sum of mean and variance. We linearize the quadcopter dynamics and examine the MCV controller derived from a set of coupled algebraic Riccati equations (CARE) with full-state feedback. Our preliminary simulation results show a reduction in variance and in mean trajectory tracking error compared to a traditional linear quadratic regulator (LQR).
Submitted 25 August, 2021; v1 submitted 20 April, 2021;
originally announced April 2021.
-
Multifidelity Computing for Coupling Full and Reduced Order Models
Authors:
Shady E. Ahmed,
Omer San,
Kursat Kara,
Rami Younis,
Adil Rasheed
Abstract:
Hybrid physics-machine learning models are increasingly being used in simulations of transport processes. Many complex multiphysics systems relevant to scientific and engineering applications include multiple spatiotemporal scales and comprise a multifidelity problem sharing an interface between various formulations or heterogeneous computational entities. To this end, we present a robust hybrid analysis and modeling approach combining a physics-based full order model (FOM) and a data-driven reduced order model (ROM) to form the building blocks of an integrated approach among mixed fidelity descriptions toward predictive digital twin technologies. At the interface, we introduce a long short-term memory network to bridge these high and low-fidelity models in various forms of interfacial error correction or prolongation. The proposed interface learning approaches are tested as a new way to address ROM-FOM coupling problems solving nonlinear advection-diffusion flow situations with a bifidelity setup that captures the essence of a broad class of transport processes.
Submitted 12 February, 2021; v1 submitted 13 July, 2020;
originally announced July 2020.
-
Interface learning of multiphysics and multiscale systems
Authors:
Shady E. Ahmed,
Omer San,
Kursat Kara,
Rami Younis,
Adil Rasheed
Abstract:
Complex natural or engineered systems comprise multiple characteristic scales, multiple spatiotemporal domains, and even multiple physical closure laws. To address such challenges, we introduce an interface learning paradigm and put forth a data-driven closure approach based on memory embedding to provide physically correct boundary conditions at the interface. To enable interface learning for hyperbolic systems by taking the domain of influence and wave structures into account, we put forth the concept of upwind learning towards a physics-informed domain decomposition. The promise of the proposed approach is shown for a set of canonical illustrative problems. We highlight that high-performance computing environments can benefit from this methodology to reduce communication costs among processing units in emerging machine learning ready heterogeneous platforms toward the exascale era.
Submitted 31 October, 2020; v1 submitted 17 June, 2020;
originally announced June 2020.
-
HyperLogLog Sketch Acceleration on FPGA
Authors:
Amit Kulkarni,
Monica Chiosa,
Thomas B. Preußer,
Kaan Kara,
David Sidler,
Gustavo Alonso
Abstract:
Data sketches are a set of widely used approximated data summarizing techniques. Their fundamental property is sub-linear memory complexity on the input cardinality, an important aspect when processing streams or data sets with a vast base domain (URLs, IP addresses, user IDs, etc.). Among the many data sketches available, HyperLogLog has become the reference for cardinality counting (how many distinct data items there are in a data set). Although it does not count every data item (to reduce memory consumption), it provides probabilistic guarantees on the result, and it is, thus, often used to analyze data streams. In this paper, we explore how to implement HyperLogLog on an FPGA to benefit from the parallelism available and the ability to process data streams coming from high-speed networks. Our multi-pipelined high-cardinality HyperLogLog implementation delivers 1.8x higher throughput than an optimized HyperLogLog running on a dual-socket Intel Xeon E5-2630 v3 system with a total of 16 cores and 32 hyper-threads.
Submitted 20 October, 2020; v1 submitted 24 May, 2020;
originally announced May 2020.
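For readers unfamiliar with the sketch that the abstract above accelerates, the following is a minimal software version of textbook HyperLogLog. It is an illustrative assumption-laden sketch, not the FPGA design from the paper: the class name, the choice of SHA-1 truncated to 64 bits as the hash, and the default of 2^12 registers are all choices made here for illustration, and only the small-range (linear counting) correction is included.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog cardinality estimator with 2**p registers."""

    def __init__(self, p: int = 12) -> None:
        if not 7 <= p <= 16:
            raise ValueError("p must be in [7, 16] for the alpha constant used here")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an arbitrary choice).
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in `rest` (1-based);
        # an all-zero remainder gets the maximum rank 64 - p + 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=12)
for i in range(50_000):
    hll.add(f"user-{i}")          # 50,000 distinct items
for i in range(50_000):
    hll.add(f"user-{i % 1000}")   # duplicates do not change the registers
approx_distinct = hll.estimate()
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%, while the state stays at a few kilobytes regardless of the stream length.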
-
High Bandwidth Memory on FPGAs: A Data Analytics Perspective
Authors:
Kaan Kara,
Christoph Hagleitner,
Dionysios Diamantopoulos,
Dimitris Syrivelis,
Gustavo Alonso
Abstract:
FPGA-based data processing in datacenters is increasing in popularity due to the demands of modern workloads and the ensuing necessity for specialization in hardware. Driven by this trend, vendors are rapidly adapting reconfigurable devices to suit data and compute intensive workloads. Inclusion of High Bandwidth Memory (HBM) in FPGA devices is a recent example. HBM promises to overcome the bandwidth bottleneck often faced by FPGA-based accelerators due to their throughput-oriented design. In this paper, we study the usage and benefits of HBM on FPGAs from a data analytics perspective. We consider three workloads that are often performed in analytics-oriented databases and implement them on FPGA, showing in which cases they benefit from HBM: range selection, hash join, and stochastic gradient descent for linear model training. We integrate our designs into a columnar database (MonetDB) and show the trade-offs arising from the integration related to data movement and partitioning. In certain cases, FPGA+HBM based solutions are able to surpass the highest performance provided by either a 2-socket POWER9 system or a 14-core Xeon E5 by up to 1.8x (selection), 12.9x (join), and 3.2x (SGD).
Submitted 2 April, 2020;
originally announced April 2020.
-
Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning (Technical Report)
Authors:
Zeke Wang,
Kaan Kara,
Hantian Zhang,
Gustavo Alonso,
Onur Mutlu,
Ce Zhang
Abstract:
Learning from the data stored in a database is an important function increasingly available in relational engines. Methods using lower precision input data are of special interest given their overall higher efficiency but, in databases, these methods have a hidden cost: the quantization of the real value into a smaller number is an expensive step. To address the issue, in this paper we present MLWeaving, a data structure and hardware acceleration technique intended to speed up learning of generalized linear models in databases. MLWeaving provides a compact, in-memory representation enabling the retrieval of data at any level of precision. MLWeaving also takes advantage of the increasing availability of FPGA-based accelerators to provide a highly efficient implementation of stochastic gradient descent. The solution adopted in MLWeaving is more efficient than existing designs in terms of space (since it can process any resolution on the same design) and resources (via the use of bit-serial multipliers). MLWeaving also enables the runtime tuning of precision, instead of a fixed precision level during the training. We illustrate this using a simple, dynamic precision schedule. Experimental results show MLWeaving achieves up to 16x performance improvement over low-precision CPU implementations of first-order methods.
Submitted 28 March, 2019; v1 submitted 8 March, 2019;
originally announced March 2019.
-
Compressive Sensing Using Iterative Hard Thresholding with Low Precision Data Representation: Theory and Applications
Authors:
Nezihe Merve Gürel,
Kaan Kara,
Alen Stojanov,
Tyler Smith,
Thomas Lemmin,
Dan Alistarh,
Markus Püschel,
Ce Zhang
Abstract:
Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems. Lossy compression of data is an intriguing solution, but comes with its own drawbacks, such as potential signal loss, and the need for careful optimization of the compression ratio. In this work, we focus on a setting where this problem is especially acute: compressive sensing frameworks for interferometry and medical imaging. We ask the following question: can the precision of the data representation be lowered for all inputs, with recovery guarantees and practical performance? Our first contribution is a theoretical analysis of the normalized Iterative Hard Thresholding (IHT) algorithm when all input data, meaning both the measurement matrix and the observation vector, are quantized aggressively. We present a variant of low precision normalized IHT that, under mild conditions, can still provide recovery guarantees. The second contribution is the application of our quantization framework to radio astronomy and magnetic resonance imaging. We show that lowering the precision of the data can significantly accelerate image recovery. We evaluate our approach on telescope data and samples of brain images using CPU and FPGA implementations achieving up to a 9x speed-up with negligible loss of recovery quality.
Submitted 22 December, 2020; v1 submitted 13 February, 2018;
originally announced February 2018.
-
Layerwise Systematic Scan: Deep Boltzmann Machines and Beyond
Authors:
Heng Guo,
Kaan Kara,
Ce Zhang
Abstract:
For Markov chain Monte Carlo methods, one of the greatest discrepancies between theory and system is the scan order - while most theoretical development on the mixing time analysis deals with random updates, real-world systems are implemented with systematic scans. We bridge this gap for models that exhibit a bipartite structure, including, most notably, the Restricted/Deep Boltzmann Machine. The de facto implementation for these models scans variables in a layerwise fashion. We show that the Gibbs sampler with a layerwise alternating scan order has its relaxation time (in terms of epochs) no larger than that of a random-update Gibbs sampler (in terms of variable updates). We also construct examples to show that this bound is asymptotically tight. Through standard inequalities, our result also implies a comparison on the mixing times.
Submitted 9 October, 2017; v1 submitted 15 May, 2017;
originally announced May 2017.
-
The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning
Authors:
Hantian Zhang,
Jerry Li,
Kaan Kara,
Dan Alistarh,
Ji Liu,
Ce Zhang
Abstract:
Recently there has been significant interest in training machine-learning models at low precision: by reducing precision, one can reduce computation and communication by one order of magnitude. We examine training at reduced precision, both from a theoretical and practical perspective, and ask: is it possible to train models at end-to-end low precision with provable guarantees? Can this lead to consistent order-of-magnitude speedups? We present a framework called ZipML to answer these questions. For linear models, the answer is yes. We develop a simple framework based on one simple but novel strategy called double sampling. Our framework is able to execute training at low precision with no bias, guaranteeing convergence, whereas naive quantization would introduce significant bias. We validate our framework across a range of applications, and show that it enables an FPGA prototype that is up to 6.5x faster than an implementation using full 32-bit precision. We further develop a variance-optimal stochastic quantization strategy and show that it can make a significant difference in a variety of settings. When applied to linear models together with double sampling, we save up to another 1.7x in data movement compared with uniform quantization. When training deep networks with quantized models, we achieve higher accuracy than the state-of-the-art XNOR-Net. Finally, we extend our framework through approximation to non-linear models, such as SVM. We show that, although using low-precision data induces bias, we can appropriately bound and control the bias. We find in practice 8-bit precision is often sufficient to converge to the correct solution. Interestingly, however, in practice we notice that our framework does not always outperform the naive rounding approach. We discuss this negative result in detail.
Submitted 19 June, 2017; v1 submitted 16 November, 2016;
originally announced November 2016.
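The abstract above contrasts unbiased low-precision training with naive quantization. As a small illustration of the unbiasedness idea only, the sketch below shows plain stochastic rounding of a value in [0, 1] onto a uniform grid, so that the expected quantized value equals the input. It is not the paper's double sampling or its variance-optimal quantization; the function name, the `levels` parameter, and the sample values are hypothetical.

```python
import random


def stochastic_quantize(x: float, levels: int, rng: random.Random) -> float:
    """Round x in [0, 1] to a multiple of 1/levels such that E[result] == x.

    The value is rounded up with probability equal to its fractional
    position between the two neighbouring grid points, and down otherwise.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    scaled = x * levels
    low = int(scaled)                 # floor, since scaled is non-negative
    frac = scaled - low               # distance to the lower grid point, in [0, 1)
    round_up = rng.random() < frac    # round up with probability `frac`
    return (low + (1 if round_up else 0)) / levels


rng = random.Random(0)
samples = [stochastic_quantize(0.3, levels=4, rng=rng) for _ in range(20_000)]
mean_quantized = sum(samples) / len(samples)   # should be close to 0.3
nearest = round(0.3 * 4) / 4                   # deterministic rounding gives 0.25
```

Deterministic nearest rounding maps 0.3 to 0.25 every time, a systematic error of 0.05, whereas the stochastic version alternates between 0.25 and 0.5 and averages out to the original value.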
-
Fault-Tolerant Control of a 2 DOF Helicopter (TRMS System) Based on H_infinity
Authors:
Abderrahmen Bouguerra,
Djamel Saigaa,
Kamel Kara,
Samir Zeghlache,
Keltoum Loukal
Abstract:
In this paper, a fault-tolerant control of a 2 DOF helicopter (TRMS system) based on H-infinity is presented. The introductory part of the paper presents fault-tolerant control (FTC), and the first part describes the mathematical model of the TRMS. In the last part, a polytopic Unknown Input Observer (UIO) is synthesized using equalities and LMIs. This UIO is used to observe the faults and then compensate for them; this part also shows how to design a fault-tolerant control strategy for this particular class of non-linear systems.
Submitted 20 June, 2013;
originally announced June 2013.