-
Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism
Authors:
Mahyar Emami,
Sahand Kashani,
Keisuke Kamahori,
Mohammad Sepehr Pourghannad,
Ritik Raj,
James R. Larus
Abstract:
The demise of Moore's Law and Dennard Scaling has revived interest in specialized computer architectures and accelerators. Verification and testing of this hardware depend heavily upon cycle-accurate simulation of register-transfer-level (RTL) designs. The fastest software RTL simulators can simulate designs at 1-1000 kHz, i.e., more than three orders of magnitude slower than hardware. Improved simulators can increase designers' productivity by speeding design iterations and permitting more exhaustive exploration. One possibility is to exploit low-level parallelism, as RTL expresses considerable fine-grain concurrency. Unfortunately, state-of-the-art RTL simulators often perform best on a single core since modern processors cannot effectively exploit fine-grain parallelism. This work presents Manticore: a parallel computer designed to accelerate RTL simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution model to eliminate fine-grain synchronization overhead. It relies entirely on a compiler to schedule resources and communication, which is feasible since RTL code contains few divergent execution paths. With static scheduling, communication and synchronization no longer incur runtime overhead, making fine-grain parallelism practical. Moreover, static scheduling dramatically simplifies processor implementation, significantly increasing the number of cores that fit on a chip. Our 225-core FPGA implementation running at 475 MHz outperforms a state-of-the-art RTL simulator running on desktop and server computers in 8 out of 9 benchmarks.
Submitted 20 October, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
Statistical Program Slicing: a Hybrid Slicing Technique for Analyzing Deployed Software
Authors:
Bogdan Alexandru Stoica,
Swarup K. Sahoo,
James R. Larus,
Vikram S. Adve
Abstract:
Dynamic program slicing can significantly reduce the code developers need to inspect by narrowing it down to only a subset of relevant program statements. However, despite an extensive body of research showing its usefulness, dynamic slicing still falls short of production-level use due to the high cost of runtime instrumentation.
As an alternative, we propose statistical program slicing, a novel hybrid dynamic-static slicing technique that explores the trade-off between accuracy and runtime cost. Our approach relies on modern hardware support for control flow monitoring and a novel, cooperative heap memory tracing mechanism combined with static program analysis for data flow tracking. We evaluate statistical slicing for debugging on 21 failures from 6 widely deployed applications and show it recovers 94% of the program statements on a dynamic slice with only 5% overhead.
Submitted 31 December, 2021;
originally announced January 2022.
-
Parallel and Scalable Precise Clustering for Homologous Protein Discovery
Authors:
Stuart Byma,
Akash Dhasade,
Adrian Altenhoff,
Christophe Dessimoz,
James R. Larus
Abstract:
This paper presents a new, parallel implementation of clustering and demonstrates its utility in greatly speeding up the process of identifying homologous proteins. Clustering is a technique to reduce the number of comparisons needed to find similar pairs in a set of $n$ elements such as protein sequences. Precise clustering ensures that each pair of similar elements appears together in at least one cluster, so that similarities can be identified by all-to-all comparison in each cluster rather than on the full set. This paper introduces ClusterMerge, a new algorithm for precise clustering that uses transitive relationships among the elements to enable parallel and scalable implementations of this approach. We apply ClusterMerge to the important problem of finding similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of similar pairs found by a full $O(n^2)$ comparison, with only half as many operations. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604$\times$ on 768 cores (1400$\times$ faster than a comparable single-threaded clustering implementation), a strong scaling efficiency of 90%, and a weak scaling efficiency of nearly 100%.
Submitted 28 August, 2019;
originally announced August 2019.
-
IMPACT: Interval-based Multi-pass Proteomic Alignment with Constant Traceback
Authors:
Sahand Kashani,
Stuart Byma,
James R. Larus
Abstract:
Darwin is a genomics co-processor that achieved a 15000x acceleration on long read assembly through innovative hardware and algorithm co-design. Darwin's algorithms and hardware implementation were specifically designed for DNA analysis pipelines. This paper analyzes the feasibility of applying Darwin's algorithms to the problem of protein sequence alignment. In addition to a behavioral analysis of Darwin when aligning proteins, we propose an algorithmic improvement to Darwin's alignment algorithm, GACT, in the form of a multi-pass variant that increases its accuracy on protein sequence alignment. Concretely, our proposed multi-pass variant of GACT achieves on average 14% better alignment scores.
Submitted 9 February, 2019;
originally announced February 2019.
-
Fine-Grain Checkpointing with In-Cache-Line Logging
Authors:
Nachshon Cohen,
David T. Aksun,
Hillel Avni,
James R. Larus
Abstract:
Non-Volatile Memory offers the possibility of implementing high-performance, durable data structures. However, achieving performance comparable to well-designed data structures in non-persistent (transient) memory is difficult, primarily because of the cost of ensuring the order in which memory writes reach NVM. Often, this requires flushing data to NVM and waiting a full memory round-trip time.
In this paper, we introduce two new techniques: Fine-Grained Checkpointing, which ensures a consistent, quickly recoverable data structure in NVM after a system failure, and In-Cache-Line Logging, an undo-logging technique that enables recovery of earlier state without requiring cache-line flushes in the normal case. We implemented these techniques in the Masstree data structure, making it persistent and demonstrating the ease of applying them to a highly optimized system and their low (5.9-15.4%) runtime overhead cost.
Submitted 2 February, 2019;
originally announced February 2019.
-
Efficient Logging in Non-Volatile Memory by Exploiting Coherency Protocols
Authors:
Nachshon Cohen,
Michal Friedman,
James R. Larus
Abstract:
Non-volatile memory (NVM) technologies such as PCM, ReRAM and STT-RAM allow processors to directly write values to persistent storage at speeds that are significantly faster than previous durable media such as hard drives or SSDs. Many applications of NVM are constructed on a logging subsystem, which enables operations to appear to execute atomically and facilitates recovery from failures. Writes to NVM, however, pass through a processor's memory system, which can delay and reorder them and can impair the correctness and cost of logging algorithms.
Reordering arises because of out-of-order execution in a CPU and the inter-processor cache coherence protocol. By carefully considering the properties of these reorderings, this paper develops a logging protocol that requires only one round trip to non-volatile memory while avoiding expensive computations. We show how to extend the logging protocol to build a persistent set (hash map) that also requires only a single round trip to non-volatile memory for insertion, updating, or deletion.
Submitted 8 September, 2017;
originally announced September 2017.