-
RobotPerf: An Open-Source, Vendor-Agnostic, Benchmarking Suite for Evaluating Robotics Computing System Performance
Authors:
Víctor Mayoral-Vilches,
Jason Jabbour,
Yu-Shun Hsiao,
Zishen Wan,
Martiño Crespo-Álvarez,
Matthew Stewart,
Juan Manuel Reina-Muñoz,
Prateek Nagras,
Gaurav Vikhe,
Mohammad Bakhshalipour,
Martin Pinzger,
Stefan Rass,
Smruti Panigrahi,
Giulio Corradi,
Niladri Roy,
Phillip B. Gibbons,
Sabrina M. Neuman,
Brian Plancher,
Vijay Janapa Reddi
Abstract:
We introduce RobotPerf, a vendor-agnostic benchmarking suite designed to evaluate robotics computing performance across a diverse range of hardware platforms using ROS 2 as its common baseline. The suite encompasses ROS 2 packages covering the full robotics pipeline and integrates two distinct benchmarking approaches: black-box testing, which measures performance by eliminating upper layers and replacing them with a test application, and grey-box testing, an application-specific measure that observes internal system states with minimal interference. Our benchmarking framework provides ready-to-use tools and is easily adaptable for the assessment of custom ROS 2 computational graphs. Drawing from the knowledge of leading robot architects and system architecture experts, RobotPerf establishes a standardized approach to robotics benchmarking. As an open-source initiative, RobotPerf remains committed to evolving with community input to advance the future of hardware-accelerated robotics.
Submitted 29 January, 2024; v1 submitted 17 September, 2023;
originally announced September 2023.
-
Speculative Path Planning
Authors:
Mohammad Bakhshalipour,
Mohamad Qadri,
Dominic Guri
Abstract:
Parallelization of A* path planning is mostly limited by the number of possible motions, which is far less than the level of parallelism that modern processors support. In this paper, we go beyond the limitations of traditional parallelism of A* and propose Speculative Path Planning to accelerate the search when there are abundant idle resources. The key idea of our approach is to predict future state expansions from patterns among expansions and to aggressively parallelize the computations of prospective states (i.e., pre-evaluate the expensive collision-checking operation of prospective nodes). This method allows us to maintain the same search order as vanilla A* and safeguard any optimality guarantees. We evaluate our method on various configurations and show that on a machine with 32 physical cores, our method improves performance by around 11x and 10x on average over the counterpart single-threaded and multi-threaded implementations, respectively. The code for our paper is available at https://github.com/bakhshalipour/speculative-path-planning.
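The idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the grid world, the `collision_free` stand-in for an expensive collision check, and the "keep moving in the same direction" predictor are all assumptions made for illustration. Speculative checks only fill a cache of futures, so the order in which A* expands nodes is the same as without speculation.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # 4-connected grid (assumption)


def speculative_astar(grid, start, goal, workers=4):
    """A* on a 0/1 occupancy grid with speculatively pre-evaluated collision checks."""
    rows, cols = len(grid), len(grid[0])

    def collision_free(cell):
        # Hypothetical stand-in for an expensive collision check.
        r, c = cell
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0

    def heuristic(cell):
        # Manhattan distance, which is consistent on a 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        checks = {}  # cell -> Future[bool]; shared by real and speculative requests

        def request(cell):
            if cell not in checks:
                checks[cell] = pool.submit(collision_free, cell)
            return checks[cell]

        best_cost = {start: 0}
        parent = {start: None}
        frontier = [(heuristic(start), 0, start)]
        closed = set()
        while frontier:
            _, cost, cell = heapq.heappop(frontier)
            if cell in closed:
                continue
            closed.add(cell)
            if cell == goal:
                path = []
                while cell is not None:
                    path.append(cell)
                    cell = parent[cell]
                return path[::-1]
            neighbours = [(cell[0] + dr, cell[1] + dc) for dr, dc in MOVES]
            pending = [(n, request(n)) for n in neighbours]  # checks needed now
            prev = parent[cell]
            if prev is not None:
                # Toy predictor (assumption): the search keeps its current heading,
                # so pre-check the neighbours of the predicted next expansion.
                ahead = (2 * cell[0] - prev[0], 2 * cell[1] - prev[1])
                for dr, dc in MOVES:
                    request((ahead[0] + dr, ahead[1] + dc))
            for nxt, future in pending:
                if not future.result():
                    continue
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    parent[nxt] = cell
                    heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # goal unreachable
```

In CPython, a speed-up would only appear if the real collision check releases the GIL or runs in separate processes; the sketch shows the control flow, not the reported performance.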
Submitted 14 February, 2021; v1 submitted 11 February, 2021;
originally announced February 2021.
-
A Survey on Recent Hardware Data Prefetching Approaches with An Emphasis on Servers
Authors:
Mohammad Bakhshalipour,
Mehran Shakerinava,
Fatemeh Golshan,
Ali Ansari,
Pejman Lotfi-Kamran,
Hamid Sarbazi-Azad
Abstract:
Data prefetching, i.e., the act of predicting an application's future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely used approach to hide the long latency of memory accesses. The fruitfulness of data prefetching is evident to both industry and academia: nowadays, almost every high-performance processor incorporates a few data prefetchers for capturing various access patterns of applications; besides, there is a myriad of proposals for data prefetching in the research literature, where each proposal enhances the efficiency of prefetching in a specific way. In this survey, we discuss the fundamental concepts in data prefetching and study state-of-the-art hardware data prefetching approaches. Additional Key Words and Phrases: Data Prefetching, Scale-Out Workloads, Server Processors, and Spatio-Temporal Correlation.
Submitted 1 September, 2020;
originally announced September 2020.
-
Die-Stacked DRAM: Memory, Cache, or MemCache?
Authors:
Mohammad Bakhshalipour,
HamidReza Zare,
Pejman Lotfi-Kamran,
Hamid Sarbazi-Azad
Abstract:
Die-stacked DRAM is a promising solution for satisfying the ever-increasing memory bandwidth requirements of multi-core processors. Manufacturing technology has enabled stacking several gigabytes of DRAM modules on the active die, thereby providing orders of magnitude higher bandwidth as compared to the conventional DIMM-based DDR memories. Nevertheless, die-stacked DRAM, due to its limited capacity, cannot accommodate entire datasets of modern big-data applications. Therefore, prior proposals use it either as a sizable memory-side cache or as a part of the software-visible main memory. Cache designs can adapt themselves to the dynamic variations of applications but suffer from the tag storage/latency/bandwidth overhead. On the other hand, memory designs eliminate the need for tags, and hence, provide efficient access to data, but are unable to capture the dynamic behaviors of applications due to their static nature.
In this work, we make a case for using the die-stacked DRAM partly as main memory and partly as a cache. We observe that in modern big-data applications there are many hot pages with a large number of accesses. Based on this observation, we propose to use a portion of the die-stacked DRAM as main memory to host hot pages, enabling serving a significant number of the accesses from the high-bandwidth DRAM without the overhead of tag-checking, and manage the rest of the DRAM as a cache, for capturing the dynamic behavior of applications. In this proposal, a software procedure pre-processes the application and determines hot pages, then asks the OS to map them to the memory portion of the die-stacked DRAM. The cache portion of the die-stacked DRAM is managed by hardware, caching data allocated in the off-chip memory.
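The pre-processing step described above (find hot pages, then map them to the memory portion) might look roughly like the following Python sketch. The function names, the 4 KiB page size, and the use of a plain access trace are assumptions for illustration; the abstract does not specify how the software procedure identifies hot pages.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes


def select_hot_pages(addresses, memory_pages, page_size=PAGE_SIZE):
    """Return the `memory_pages` most frequently accessed page numbers in a trace."""
    counts = Counter(addr // page_size for addr in addresses)
    return {page for page, _ in counts.most_common(memory_pages)}


def memory_portion_coverage(addresses, hot_pages, page_size=PAGE_SIZE):
    """Fraction of accesses that would hit pages mapped to the memory portion."""
    if not addresses:
        return 0.0
    served = sum(1 for addr in addresses if addr // page_size in hot_pages)
    return served / len(addresses)
```

Accesses to the selected pages would be served from the memory portion without tag checks; all remaining accesses would go through the hardware-managed cache portion.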
Submitted 24 September, 2018;
originally announced September 2018.
-
Making Belady-Inspired Replacement Policies More Effective Using Expected Hit Count
Authors:
Seyed Armin Vakil Ghahani,
Sara Mahdizadeh Shahri,
Mohammad Bakhshalipour,
Pejman Lotfi-Kamran,
Hamid Sarbazi-Azad
Abstract:
Memory-intensive workloads operate on massive amounts of data that cannot be captured by last-level caches (LLCs) of modern processors. Consequently, processors encounter frequent off-chip misses, and hence, lose a significant performance potential. One way to reduce the number of off-chip misses is through using a well-behaved replacement policy in the LLC. Existing processors employ a variation of the least recently used (LRU) policy to determine a victim for replacement. Unfortunately, there is a large gap between what LRU offers and what Belady's MIN, the optimal replacement policy, offers. Belady's MIN requires selecting a victim with the longest reuse distance, and hence, is infeasible due to the need to know the future. Consequently, Belady-inspired replacement policies use Belady's MIN to derive an indicator to help them choose a victim for replacement.
In this work, we show that the indicator that is used in the state-of-the-art Belady-inspired replacement policy is not decisive in picking a victim in a considerable number of cases, and hence, the policy has to rely on a standard metric (e.g., recency or frequency) to pick a victim, which is inefficient. We observe that there exist strong correlations among the hit counts of cache blocks in the same region of memory when Belady's MIN is the replacement policy. Taking advantage of this observation, we propose an expected-hit-count indicator for the memory regions and use it to improve the victim selection mechanism of Belady-inspired replacement policies when the main indicator is not decisive. Our proposal offers a 5.2% performance improvement over the baseline LRU and outperforms Hawkeye, which is the state-of-the-art replacement policy.
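The gap between LRU and Belady's MIN mentioned in this abstract can be seen in a small simulation. The Python sketch below models a single fully associative cache over a block trace, which is a simplifying assumption; it does not implement Hawkeye or the expected-hit-count indicator proposed in the paper.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count hits for a fully associative cache using LRU replacement."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits


def min_hits(trace, capacity):
    """Count hits using Belady's MIN, which needs the whole future trace."""
    # next_use[i] is the index of the next access to trace[i], or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # block -> index of its next use
    hits = 0
    for i, block in enumerate(trace):
        if block in cache:
            hits += 1
        elif len(cache) >= capacity:
            # Evict the block whose next use lies farthest in the future.
            victim = max(cache, key=cache.get)
            del cache[victim]
        cache[block] = next_use[i]
    return hits
```

On the cyclic trace `[1, 2, 3, 1, 2, 3, 1, 2, 3]` with a capacity of two blocks, LRU evicts every block just before it is reused, while MIN keeps some of them resident.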
Submitted 15 August, 2018;
originally announced August 2018.
-
Scale-Out Processors & Energy Efficiency
Authors:
Pouya Esmaili-Dokht,
Mohammad Bakhshalipour,
Behnam Khodabandeloo,
Pejman Lotfi-Kamran,
Hamid Sarbazi-Azad
Abstract:
Scale-out workloads like media streaming or Web search serve millions of users and operate on a massive amount of data, and hence, require enormous computational power. As the number of users increases and the size of data expands, even more computational power is necessary to run such workloads. Data centers with thousands of servers provide the computational power necessary for executing scale-out workloads. As operating data centers requires enormous capital outlay, it is important to optimize them to execute scale-out workloads efficiently. Server processors contribute significantly to the data center capital outlay, and hence, are a prime candidate for optimizations. Although data centers are power-constrained, and power consumption is one of the major components contributing to the total cost of ownership (TCO), a recently introduced scale-out design methodology optimizes server processors for data centers using performance per unit area. In this work, we use a more relevant performance-per-power metric as the optimization criterion for optimizing server processors and reevaluate the scale-out design methodology. Interestingly, we show that a scale-out processor that delivers the maximum performance per unit area also delivers the highest performance per unit power.
Submitted 14 August, 2018;
originally announced August 2018.
-
Parallelizing Bisection Root-Finding: A Case for Accelerating Serial Algorithms in Multicore Substrates
Authors:
Mohammad Bakhshalipour,
Hamid Sarbazi-Azad
Abstract:
Multicore architectures dominate today's processor market. Even though the number of cores and threads is already high and continues to grow, inherently serial algorithms do not benefit from the abundance of cores and threads. In this paper, we propose Runahead Computing, a technique that uses idle threads in a multi-threaded architecture to reduce the execution time of serial algorithms. Through detailed evaluations targeting both CPU and GPU platforms and a specific serial algorithm, our approach reduces the execution latency by up to 9x in our experiments.
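One way to picture how idle threads could help a serial algorithm such as bisection is sketched below in Python. This is an illustrative interpretation, not the Runahead Computing design from the paper: each round evaluates, in parallel, every midpoint that the next `depth` serial bisection steps could possibly visit, then replays those steps using the precomputed values. The function name and the `depth` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def speculative_bisection(f, lo, hi, tol=1e-9, depth=2):
    """Find a root of f in [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")

    n = 2 ** depth  # each round shrinks the interval by a factor of n
    with ThreadPoolExecutor(max_workers=n - 1) as pool:
        while hi - lo > tol:
            step = (hi - lo) / n
            # Interior grid points: all midpoints reachable in `depth` bisection steps.
            points = [lo + k * step for k in range(1, n)]
            values = list(pool.map(f, points))  # evaluated concurrently

            # Replay `depth` ordinary bisection steps on grid indices 0..n.
            left, right = 0, n
            while right - left > 1:
                mid = (left + right) // 2
                f_mid = values[mid - 1]
                if f_mid == 0:
                    return points[mid - 1]
                if f_lo * f_mid < 0:
                    right = mid  # sign change lies in the left half
                else:
                    left, f_lo = mid, f_mid  # sign change lies in the right half
            new_hi = hi if right == n else lo + right * step
            new_lo = lo if left == 0 else lo + left * step
            lo, hi = new_lo, new_hi
    return (lo + hi) / 2
```

For example, `speculative_bisection(lambda x: x * x - 2, 0.0, 2.0)` converges to the square root of 2. With `depth=2`, three of the function evaluations per round run concurrently, and two of them are discarded; that wasted work is the price of the shorter critical path. In CPython, a real speed-up would need a function that releases the GIL or a process pool.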
Submitted 10 May, 2018;
originally announced May 2018.