doi 10.1145/3623278.3624750

Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism

Authors: Mahyar Emami, Sahand Kashani, Keisuke Kamahori, Mohammad Sepehr Pourghannad, Ritik Raj, James R. Larus

Abstract: The demise of Moore's Law and Dennard Scaling has revived interest in specialized computer architectures and accelerators. Verification and testing of this hardware depend heavily upon cycle-accurate simulation of register-transfer-level (RTL) designs. The fastest software RTL simulators can simulate designs at 1–1000 kHz, i.e., more than three orders of magnitude slower than hardware. Improved simulators can increase designers' productivity by speeding design iterations and permitting more exhaustive exploration. One possibility is to exploit low-level parallelism, as RTL expresses considerable fine-grain concurrency. Unfortunately, state-of-the-art RTL simulators often perform best on a single core since modern processors cannot effectively exploit fine-grain parallelism. This work presents Manticore: a parallel computer designed to accelerate RTL simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution model to eliminate fine-grain synchronization overhead. It relies entirely on a compiler to schedule resources and communication, which is feasible since RTL code contains few divergent execution paths. With static scheduling, communication and synchronization no longer incur runtime overhead, making fine-grain parallelism practical. Moreover, static scheduling dramatically simplifies processor implementation, significantly increasing the number of cores that fit on a chip. Our 225-core FPGA implementation running at 475 MHz outperforms a state-of-the-art RTL simulator running on desktop and server computers in 8 out of 9 benchmarks.

Submitted 20 October, 2023; v1 submitted 23 January, 2023; originally announced January 2023.

arXiv:2201.00060 [pdf, other]

Statistical Program Slicing: a Hybrid Slicing Technique for Analyzing Deployed Software

Authors: Bogdan Alexandru Stoica, Swarup K. Sahoo, James R. Larus, Vikram S. Adve

Abstract: Dynamic program slicing can significantly reduce the code developers need to inspect by narrowing it down to only a subset of relevant program statements. However, despite an extensive body of research showing its usefulness, dynamic slicing still falls short of production-level use due to the high cost of runtime instrumentation. As an alternative, we propose statistical program slicing, a novel hybrid dynamic-static slicing technique that explores the trade-off between accuracy and runtime cost. Our approach relies on modern hardware support for control flow monitoring and a novel, cooperative heap memory tracing mechanism combined with static program analysis for data flow tracking. We evaluate statistical slicing for debugging on 21 failures from 6 widely deployed applications and show it recovers 94% of the program statements on a dynamic slice with only 5% overhead.

Submitted 31 December, 2021; originally announced January 2022.

arXiv:1908.10574 [pdf, other]

Parallel and Scalable Precise Clustering for Homologous Protein Discovery

Authors: Stuart Byma, Akash Dhasade, Adrian Altenhoff, Christophe Dessimoz, James R. Larus

Abstract: This paper presents a new, parallel implementation of clustering and demonstrates its utility in greatly speeding up the process of identifying homologous proteins. Clustering is a technique to reduce the number of comparisons needed to find similar pairs in a set of $n$ elements such as protein sequences. Precise clustering ensures that each pair of similar elements appears together in at least one cluster, so that similarities can be identified by all-to-all comparison in each cluster rather than on the full set. This paper introduces ClusterMerge, a new algorithm for precise clustering that uses transitive relationships among the elements to enable parallel and scalable implementations of this approach. We apply ClusterMerge to the important problem of finding similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of similar pairs found by a full $O(n^2)$ comparison, with only half as many operations. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604$\times$ on 768 cores (1400$\times$ faster than a comparable single-threaded clustering implementation), a strong scaling efficiency of 90%, and a weak scaling efficiency of nearly 100%.

Submitted 28 August, 2019; originally announced August 2019.

Comments: 11 pages, 11 figures. Submitted for publication

arXiv:1902.03238 [pdf, other]

IMPACT: Interval-based Multi-pass Proteomic Alignment with Constant Traceback

Authors: Sahand Kashani, Stuart Byma, James R. Larus

Abstract: Darwin is a genomics co-processor that achieved a 15000x acceleration on long read assembly through innovative hardware and algorithm co-design. Darwin's algorithms and hardware implementation were specifically designed for DNA analysis pipelines. This paper analyzes the feasibility of applying Darwin's algorithms to the problem of protein sequence alignment. In addition to a behavioral analysis of Darwin when aligning proteins, we propose an algorithmic improvement to Darwin's alignment algorithm, GACT, in the form of a multi-pass variant that increases its accuracy on protein sequence alignment. Concretely, our proposed multi-pass variant of GACT achieves on average 14% better alignment scores.

Submitted 9 February, 2019; originally announced February 2019.

arXiv:1902.00660 [pdf, other]

doi 10.1145/3297858.3304046

Fine-Grain Checkpointing with In-Cache-Line Logging

Authors: Nachshon Cohen, David T. Aksun, Hillel Avni, James R. Larus

Abstract: Non-Volatile Memory offers the possibility of implementing high-performance, durable data structures. However, achieving performance comparable to well-designed data structures in non-persistent (transient) memory is difficult, primarily because of the cost of ensuring the order in which memory writes reach NVM. Often, this requires flushing data to NVM and waiting a full memory round-trip time. In this paper, we introduce two new techniques: Fine-Grained Checkpointing, which ensures a consistent, quickly recoverable data structure in NVM after a system failure, and In-Cache-Line Logging, an undo-logging technique that enables recovery of earlier state without requiring cache-line flushes in the normal case. We implemented these techniques in the Masstree data structure, making it persistent and demonstrating the ease of applying them to a highly optimized system and their low (5.9-15.4%) runtime overhead cost.

Submitted 2 February, 2019; originally announced February 2019.

Comments: In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS 19), April 13, 2019, Providence, RI, USA

arXiv:1709.02610 [pdf, other]

doi 10.1145/3133891

Efficient Logging in Non-Volatile Memory by Exploiting Coherency Protocols

Authors: Nachshon Cohen, Michal Friedman, James R. Larus

Abstract: Non-volatile memory (NVM) technologies such as PCM, ReRAM and STT-RAM allow processors to directly write values to persistent storage at speeds that are significantly faster than previous durable media such as hard drives or SSDs. Many applications of NVM are constructed on a logging subsystem, which enables operations to appear to execute atomically and facilitates recovery from failures. Writes to NVM, however, pass through a processor's memory system, which can delay and reorder them and can impair the correctness and cost of logging algorithms. Reordering arises because of out-of-order execution in a CPU and the inter-processor cache coherence protocol. By carefully considering the properties of these reorderings, this paper develops a logging protocol that requires only one round trip to non-volatile memory while avoiding expensive computations. We show how to extend the logging protocol to build a persistent set (hash map) that also requires only a single round trip to non-volatile memory for insertion, updating, or deletion.

Submitted 8 September, 2017; originally announced September 2017.

Showing 1–6 of 6 results for author: Larus, J R