Search | arXiv e-print repository

arXiv:2406.20037 [pdf, other]

Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent

Authors: Michael Canesche, Gaurav Verma, Fernando Magno Quintao Pereira

Abstract: Machine-learning models consist of kernels, which are algorithms applying operations on tensors -- data indexed by a linear combination of natural numbers. Examples of kernels include convolutions, transpositions, and vectorial products. There are many ways to implement a kernel. These implementations form the kernel's optimization space. Kernel scheduling is the problem of finding the best implem… ▽ More Machine-learning models consist of kernels, which are algorithms applying operations on tensors -- data indexed by a linear combination of natural numbers. Examples of kernels include convolutions, transpositions, and vectorial products. There are many ways to implement a kernel. These implementations form the kernel's optimization space. Kernel scheduling is the problem of finding the best implementation, given an objective function -- typically execution speed. Kernel optimizers such as Ansor, Halide, and AutoTVM solve this problem via search heuristics, which combine two phases: exploration and exploitation. The first step evaluates many different kernel optimization spaces. The latter tries to improve the best implementations by investigating a kernel within the same space. For example, Ansor combines kernel generation through sketches for exploration and leverages an evolutionary algorithm to exploit the best sketches. In this work, we demonstrate the potential to reduce Ansor's search time while enhancing kernel quality by incorporating Droplet Search, an AutoTVM algorithm, into Ansor's exploration phase. The approach involves limiting the number of samples explored by Ansor, selecting the best, and exploiting it with a coordinate descent algorithm. By applying this approach to the first 300 kernels that Ansor generates, we usually obtain better kernels in less time than if we let Ansor analyze 10,000 kernels. This result has been replicated in 20 well-known deep-learning models (AlexNet, ResNet, VGG, DenseNet, etc.) running on four architectures: an AMD Ryzen 7 (x86), an NVIDIA A100 tensor core, an NVIDIA RTX 3080 GPU, and an ARM A64FX. A patch with this combined approach was approved in Ansor in February 2024. As evidence of the generality of this search methodology, a similar patch, achieving equally good results, was submitted to TVM's MetaSchedule in June 2024. △ Less

Submitted 28 June, 2024; originally announced June 2024.

Comments: 22 pages, 19 figures, original work

MSC Class: 68N20 ACM Class: D.3.4

arXiv:2406.06550 [pdf, other]

ChiBench: a Benchmark Suite for Testing Electronic Design Automation Tools

Authors: Rafael Sumitani, João Victor Amorim, Augusto Mafra, Mirlaine Crepalde, Fernando Magno Quintão Pereira

Abstract: Electronic Design Automation (EDA) tools are software applications used by engineers in the design, development, simulation, and verification of electronic systems and integrated circuits. These tools typically process specifications written in a Hardware Description Language (HDL), such as Verilog, SystemVerilog or VHDL. Thus, effective testing of these tools requires benchmark suites written in… ▽ More Electronic Design Automation (EDA) tools are software applications used by engineers in the design, development, simulation, and verification of electronic systems and integrated circuits. These tools typically process specifications written in a Hardware Description Language (HDL), such as Verilog, SystemVerilog or VHDL. Thus, effective testing of these tools requires benchmark suites written in these languages. However, while there exist some open benchmark suites for these languages, they tend to consist of only a handful of specifications. This paper, in contrast, presents ChiBench, a comprehensive suite comprising 50 thousand Verilog programs. These programs were sourced from GitHub repositories and curated using Verible's syntactic analyzer and Jasper(TM)'s HDL semantic analyzer. Since its inception, ChiBench has already revealed bugs in public tools like Verible's obfuscator and parser. In addition to explaining some of these case studies, this paper demonstrates how ChiBench can be used to evaluate the asymptotic complexity and code coverage of typical electronic design automation tools. △ Less

Submitted 24 May, 2024; originally announced June 2024.

Comments: 5 pages, 6 figures, 12 references

MSC Class: 68N15 ACM Class: D.3

arXiv:2308.14122 [pdf]

Preparing Reproducible Scientific Artifacts using Docker

Authors: Michael Canesche, Roland Leissa, Fernando Magno Quintão Pereira

Abstract: The pursuit of scientific knowledge strongly depends on the ability to reproduce and validate research results. It is a well-known fact that the scientific community faces challenges related to transparency, reliability, and the reproducibility of empirical published results. Consequently, the design and preparation of reproducible artifacts has a fundamental role in the development of science. Re… ▽ More The pursuit of scientific knowledge strongly depends on the ability to reproduce and validate research results. It is a well-known fact that the scientific community faces challenges related to transparency, reliability, and the reproducibility of empirical published results. Consequently, the design and preparation of reproducible artifacts has a fundamental role in the development of science. Reproducible artifacts comprise comprehensive documentation, data, and code that enable replication and validation of research findings by others. In this work, we discuss a methodology to construct reproducible artifacts based on Docker. Our presentation centers around the preparation of an artifact to be submitted to scientific venues that encourage or require this process. This report's primary audience are scientists working with empirical computer science; however, we believe that the presented methodology can be extended to other technology-oriented empirical disciplines. △ Less

Submitted 27 August, 2023; originally announced August 2023.

Comments: 15 pages. Manuscript not submitted to any conference, used as a guideline for authors who want to submit artifacts to artifact evaluation pages

arXiv:1706.03042 [pdf, other]

JetsonLEAP: a Framework to Measure Power on a Heterogeneous System-on-a-Chip Device

Authors: Tarsila Bessa, Christopher Gull, Pedro Quintão, Michael Frank, José Nacif, Fernando Magno Quintão Pereira

Abstract: Computer science marches towards energy-aware practices. This trend impacts not only the design of computer architectures, but also the design of programs. However, developers still lack affordable and accurate technology to measure energy consumption in computing systems. The goal of this paper is to mitigate such problem. To this end, we introduce JetsonLEAP, a framework that supports the implem… ▽ More Computer science marches towards energy-aware practices. This trend impacts not only the design of computer architectures, but also the design of programs. However, developers still lack affordable and accurate technology to measure energy consumption in computing systems. The goal of this paper is to mitigate such problem. To this end, we introduce JetsonLEAP, a framework that supports the implementation of energy-aware programs. JetsonLEAP consists of an embedded hardware, in our case, the Nvidia Tegra TK1 System-on-a-chip device, a circuit to control the flow of energy, of our own design, plus a library to instrument program parts. We discuss two different circuit setups. The most precise setup lets us reliably measure the energy spent by 225,000 instructions, the least precise, although more affordable setup, gives us a window of 975,000 instructions. To probe the precision of our system, we use it in tandem with a high-precision, high-cost acquisition system, and show that results do not differ in any significant way from those that we get using our simpler apparatus. Our entire infrastructure - board, power meter and both circuits - can be reproduced with about $500.00. To demonstrate the efficacy of our framework, we have used it to measure the energy consumed by programs running on ARM cores, on the GPU, and on a remote server. Furthermore, we have studied the impact of OpenACC directives on the energy efficiency of high-performance applications. △ Less

Submitted 29 March, 2017; originally announced June 2017.

Comments: 31 pages, 19 figures

Showing 1–4 of 4 results for author: Pereira, F M Q