-
Parallel approach to sliding window sums
Authors:
Roman Snytsar,
Yatish Turakhia
Abstract:
Sliding window sums are widely used in bioinformatics applications, including sequence assembly, k-mer generation, hashing and compression. New vector algorithms which utilize the advanced vector extension (AVX) instructions available on modern processors, or the parallel compute units on GPUs and FPGAs, would provide a significant performance boost for the bioinformatics applications. We develop…
▽ More
Sliding window sums are widely used in bioinformatics applications, including sequence assembly, k-mer generation, hashing and compression. New vector algorithms which utilize the advanced vector extension (AVX) instructions available on modern processors, or the parallel compute units on GPUs and FPGAs, would provide a significant performance boost for the bioinformatics applications. We develop a generic vectorized sliding sum algorithm with speedup for window size w and number of processors P is O(P/w) for a generic sliding sum. For a sum with commutative operator the speedup is improved to O(P/log(w)). When applied to the genomic application of minimizer based k-mer table generation using AVX instructions, we obtain a speedup of over 5X.
△ Less
Submitted 3 September, 2019; v1 submitted 25 November, 2018;
originally announced November 2018.
-
HoLiSwap: Reducing Wire Energy in L1 Caches
Authors:
Yatish Turakhia,
Subhasis Das,
Tor M. Aamodt,
William J. Dally
Abstract:
This paper describes HoLiSwap a method to reduce L1 cache wire energy, a significant fraction of total cache energy, by swap** hot lines to the cache way nearest to the processor. We observe that (i) a small fraction (<3%) of cache lines (hot lines) serve over 60% of the L1 cache accesses and (ii) the difference in wire energy between the nearest and farthest cache subarray can be over 6…
▽ More
This paper describes HoLiSwap a method to reduce L1 cache wire energy, a significant fraction of total cache energy, by swap** hot lines to the cache way nearest to the processor. We observe that (i) a small fraction (<3%) of cache lines (hot lines) serve over 60% of the L1 cache accesses and (ii) the difference in wire energy between the nearest and farthest cache subarray can be over 6$\times$. Our method exploits this difference in wire energy to dynamically identify hot lines and swap them to the nearest physical way in a set-associative L1 cache. This provides up to 44% improvement in the wire energy (1.82% saving in overall system energy) with no impact on the cache miss rate and 0.13% performance drop. We also show that HoLiSwap can simplify way-prediction.
△ Less
Submitted 13 January, 2017;
originally announced January 2017.
-
Thread Progress Equalization: Dynamically Adaptive Power and Performance Optimization of Multi-threaded Applications
Authors:
Yatish Turakhia,
Guangshuo Liu,
Siddharth Garg,
Diana Marculescu
Abstract:
Dynamically adaptive multi-core architectures have been proposed as an effective solution to optimize performance for peak power constrained processors. In processors, the micro-architectural parameters or voltage/frequency of each core to be changed at run-time, thus providing a range of power/performance operating points for each core. In this paper, we propose Thread Progress Equalization (TPEq…
▽ More
Dynamically adaptive multi-core architectures have been proposed as an effective solution to optimize performance for peak power constrained processors. In processors, the micro-architectural parameters or voltage/frequency of each core to be changed at run-time, thus providing a range of power/performance operating points for each core. In this paper, we propose Thread Progress Equalization (TPEq), a run-time mechanism for power constrained performance maximization of multithreaded applications running on dynamically adaptive multicore processors. Compared to existing approaches, TPEq (i) identifies and addresses two primary sources of inter-thread heterogeneity in multithreaded applications, (ii) determines the optimal core configurations in polynomial time with respect to the number of cores and configurations, and (iii) requires no modifications in the user-level source code. Our experimental evaluations demonstrate that TPEq outperforms state-of-the-art run-time power/performance optimization techniques proposed in literature for dynamically adaptive multicores by up to 23%.
△ Less
Submitted 21 March, 2016;
originally announced March 2016.
-
Framework for Application Map** over Packet-Switched Network of FPGAs: Case Studies
Authors:
Vinay B. Y. Kumar,
Pinalkumar Engineer,
Mandar Datar,
Yatish Turakhia,
Saurabh Agarwal,
Sanket Diwale,
Sachin B. Patkar
Abstract:
The algorithm-to-hardware High-level synthesis (HLS) tools today are purported to produce hardware comparable in quality to handcrafted designs, particularly with user directive driven or domains specific HLS. However, HLS tools are not readily equipped for when an application/algorithm needs to scale. We present a (work-in-progress) semi-automated framework to map applications over a packet-switc…
▽ More
The algorithm-to-hardware High-level synthesis (HLS) tools today are purported to produce hardware comparable in quality to handcrafted designs, particularly with user directive driven or domains specific HLS. However, HLS tools are not readily equipped for when an application/algorithm needs to scale. We present a (work-in-progress) semi-automated framework to map applications over a packet-switched network of modules (single FPGA) and then to seamlessly partition such a network over multiple FPGAs over quasi-serial links. We illustrate the framework through three application case studies: LDPC Decoding, Particle Filter based Object Tracking, and Matrix Vector Multiplication over GF(2). Starting with high-level representations of each case application, we first express them in an intermediate message passing formulation, a model of communicating processing elements. Once the processing elements are identified, these are either handcrafted or realized using HLS. The rest of the flow is automated where the processing elements are plugged on to a configurable network-on-chip (CONNECT) topology of choice, followed by partitioning the 'on-chip' links to work seamlessly across chips/FPGAs.
△ Less
Submitted 27 August, 2015;
originally announced August 2015.