-
AutoML for Multilayer Perceptron and FPGA Co-design
Authors:
Philip Colangelo,
Oren Segal,
Alex Speicher,
Martin Margala
Abstract:
State-of-the-art Neural Network Architectures (NNAs) are challenging to design and implement efficiently in hardware. In the past couple of years, this has led to an explosion in research and development of automatic Neural Architecture Search (NAS) tools. AutoML tools are now used to achieve state-of-the-art NNA designs and attempt to optimize for hardware usage and design. Much of the recent research in the auto-design of NNAs has focused on convolution networks and image recognition, ignoring the fact that a significant part of the workload in data centers is general-purpose deep neural networks. In this work, we develop and test a general multilayer perceptron (MLP) flow that can take arbitrary datasets as input and automatically produce optimized NNAs and hardware designs. We test the flow on six benchmarks. Our results show we exceed the performance of currently published MLP accuracy results and are competitive with non-MLP based results. We compare general and common GPU architectures with our scalable FPGA design and show we can achieve higher efficiency and higher throughput (outputs per second) for the majority of datasets. Further insights into the design space for both accurate networks and high-performing hardware show the power of co-design by correlating accuracy versus throughput, network size versus accuracy, and scaling to high-performance devices.
Submitted 13 September, 2020;
originally announced September 2020.
-
Evolutionary Cell Aided Design for Neural Network Architectures
Authors:
Philip Colangelo,
Oren Segal,
Alexander Speicher,
Martin Margala
Abstract:
Mathematical theory shows us that multilayer feedforward Artificial Neural Networks (ANNs) are universal function approximators, capable of approximating any measurable function to any desired degree of accuracy. In practice, designing practical and efficient neural network architectures requires significant effort and expertise. We present a novel software framework called Evolutionary Cell Aided Design (ECAD) meant to aid in the exploration and design of efficient Neural Network Architectures (NNAs) for reconfigurable hardware. Given a general neural network structure and a set of constraints and fitness functions, the framework will explore both the space of possible NNAs and the space of possible hardware designs, using evolutionary algorithms, and attempt to find the fittest co-design solutions according to a predefined set of goals. We test the framework on an image classification task and use the MNIST data set of handwritten digits with an Intel Arria 10 GX 1150 device as our target platform. We design and implement a modular and scalable 2D systolic array with enhancements for machine learning that can be used by the framework for the hardware search space. Our results demonstrate the ability to pair neural network design and hardware development together using an evolutionary algorithm and removing traditional human-in-the-loop development tasks. By running various experiments of the fittest solutions for neural network and hardware searches, we demonstrate the full end-to-end capabilities of the ECAD framework.
Submitted 11 May, 2019; v1 submitted 5 March, 2019;
originally announced March 2019.
-
Exploration of Low Numeric Precision Deep Learning Inference Using Intel FPGAs
Authors:
Philip Colangelo,
Nasibeh Nasiri,
Asit Mishra,
Eriko Nurvitadhi,
Martin Margala,
Kevin Nealis
Abstract:
CNNs have been shown to maintain reasonable classification accuracy when quantized to lower precisions. Quantizing to sub-8-bit activations and weights can result in accuracy falling below an acceptable threshold. Techniques exist for closing the accuracy gap of limited numeric precision, typically by increasing computation. This results in a trade-off between throughput and accuracy that can be tailored for different networks through various combinations of activation and weight data widths. Hardware architectures like FPGAs provide the opportunity for data-width-specific computation through unique logic configurations, leading to highly optimized processing that is unattainable by full precision networks. Ternary and binary weighted networks offer an efficient method of inference for 2-bit and 1-bit data respectively. Most hardware architectures can take advantage of the memory storage and bandwidth savings that come along with smaller datapaths, but very few architectures can take advantage of limited numeric precision at the computation level. In this paper, we present a hardware design for FPGAs that takes advantage of bandwidth, memory, power, and computation savings of limited numerical precision data. We provide insights into the trade-offs between throughput and accuracy for various networks and how they map to our framework. Further, we show how limited numeric precision computation can be efficiently mapped onto FPGAs for both ternary and binary cases. Starting with Arria 10, we show a 2-bit activation and ternary weighted AlexNet running in hardware that achieves 3,700 images per second on the ImageNet dataset with a top-1 accuracy of 0.49. Using a hardware modeler designed for our low numeric precision framework, we project performance, most notably for a 55.5 TOPS Stratix 10 device running a modified ResNet-34 with only 3.7% accuracy degradation compared with single precision.
Submitted 12 June, 2018;
originally announced June 2018.
-
A Foray into Efficient Mapping of Algorithms to Hardware Platforms on Heterogeneous Systems
Authors:
Oren Segal,
Nasibeh Nasiri,
Martin Margala
Abstract:
Heterogeneous computing can potentially offer significant performance and performance per watt improvements over homogeneous computing, but the question "what is the ideal mapping of algorithms to architectures?" remains an open one. In the past couple of years, new types of computing devices such as FPGAs have come into general computing use. In this work we attempt to add to the body of scientific knowledge by comparing kernel performance and performance per watt of seven key algorithms according to Berkeley's dwarf taxonomy. We do so using the Rodinia benchmark suite on three different high-end hardware architecture representatives from the CPU, GPU and FPGA families. We find results that support some distinct mappings between the architecture and performance per watt. Perhaps the most interesting finding is that, for our specific hardware representatives, FPGAs should be considered as alternatives to GPUs and CPUs in several key algorithms: N-body simulations, dense linear algebra and structured grid.
Submitted 23 May, 2016; v1 submitted 15 May, 2016;
originally announced May 2016.
-
SparkCL: A Unified Programming Framework for Accelerators on Heterogeneous Clusters
Authors:
Oren Segal,
Philip Colangelo,
Nasibeh Nasiri,
Zhuo Qian,
Martin Margala
Abstract:
We introduce SparkCL, an open source unified programming framework based on Java, OpenCL and the Apache Spark framework. The motivation behind this work is to bring unconventional compute cores such as FPGAs/GPUs/APUs/DSPs and future core types into mainstream programming use. The framework allows equal treatment of different computing devices under the Spark framework and introduces the ability to offload computations to acceleration devices. The new framework is seamlessly integrated into the standard Spark framework via a Java-OpenCL device programming layer which is based on Aparapi and a Spark programming layer that includes new kernel function types and modified Spark transformations and actions. The framework allows a single code base to target any type of compute core that supports OpenCL and easy integration of new core types into a Spark cluster.
Submitted 5 May, 2015;
originally announced May 2015.
-
High Level Programming for Heterogeneous Architectures
Authors:
Oren Segal,
Martin Margala,
Sai Rahul Chalamalasetti,
Mitch Wright
Abstract:
This work presents an effort to bridge the gap between abstract high-level programming and OpenCL by extending an existing high-level Java programming framework (APARAPI), based on OpenCL, so that it can be used to program FPGAs at a high level of abstraction and with increased ease of programmability. We run several real-world algorithms to assess the performance of the framework on both a low-end and a high-end system. On the low-end and high-end systems respectively, we observed up to 78-80 percent power reduction and 4.8X-5.3X speed increase running NBody simulation, as well as up to 65-80 percent power reduction and 6.2X-7X speed increase for a KMeans, MapReduce algorithm running on top of the Hadoop framework and APARAPI.
Submitted 21 August, 2014;
originally announced August 2014.