Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent

Michael Canesche (UFMG, Brazil), Gaurav Verma (Stony Brook University, USA), and Fernando Magno Quintão Pereira (UFMG, Brazil)
Abstract.

Machine-learning models consist of kernels, which are algorithms applying operations on tensors—data indexed by a linear combination of natural numbers. Examples of kernels include convolutions, transpositions, and vectorial products. There are many ways to implement a kernel. These implementations form the kernel’s optimization space. Kernel scheduling is the problem of finding the best implementation, given an objective function—typically execution speed. Kernel optimizers such as Ansor, Halide, and AutoTVM solve this problem via search heuristics, which combine two phases: exploration and exploitation. The first step evaluates many different kernel optimization spaces. The latter tries to improve the best implementations by investigating a kernel within the same space. For example, Ansor combines kernel generation through sketches for exploration and leverages an evolutionary algorithm to exploit the best sketches. In this work, we demonstrate the potential to reduce Ansor’s search time while enhancing kernel quality by incorporating Droplet Search, an AutoTVM algorithm, into Ansor’s exploration phase. The approach involves limiting the number of samples explored by Ansor, selecting the best, and exploiting it with a coordinate descent algorithm. By applying this approach to the first 300 kernels that Ansor generates, we usually obtain better kernels in less time than if we let Ansor analyze 10,000 kernels. This result has been replicated in 20 well-known deep-learning models (AlexNet, ResNet, VGG, DenseNet, etc.) running on four architectures: an AMD Ryzen 7 (x86), an NVIDIA A100 tensor core, an NVIDIA RTX 3080 GPU, and an ARM A64FX. A patch with this combined approach was approved in Ansor in February 2024. As evidence of the generality of this search methodology, a similar patch, achieving equally good results, was submitted to TVM’s MetaSchedule in June 2024.

1. Introduction

A kernel is an algorithm that applies operations on tensors: chunks of memory indexed by a linear combination of natural numbers. A Tensor Compiler is a compilation infrastructure that generates code for kernels. As Ansel et al. (2024) explain, many tensor compilers, including TVM (Chen et al., 2018), nvFuser (Sarofeen et al., 2022) and NNC (Zolotukhin, 2021), follow a design probably inspired by Halide (Ragan-Kelley et al., 2013). These compilers separate the kernel’s semantics (what the kernel does) from its schedule (when the kernel does it). Since the same kernel semantics can be implemented in many different schedules, tensor compilers face a challenge called kernel scheduling: determining a suitable ordering for the operations a kernel performs on tensors. Kernel scheduling is typically addressed via heuristics because the Kernel Optimization Space (the set of all implementations of a kernel) is vast, as Li et al. (2021) explain.

The Apache TVM tensor compiler employs three distinct optimization infrastructures for solving kernel scheduling: AutoTVM (Chen et al., 2018), Ansor (Zheng et al., 2020) and MetaSchedule (Shao, 2021). AutoTVM finds parameters of kernel sketches. The sketch of a kernel represents the optimizations applied to the abstract description of that kernel, such as loop unrolling, splitting, interchange, and tiling. Many of these optimizations are parameterizable. Examples of parameters include the unrolling factor in loop unrolling and the width of the tiling window in loop tiling. AutoTVM assigns values to these parameters using various search heuristics. One of these heuristics is of interest in this paper: Droplet Search (Canesche et al., 2024). Droplet Search seeks the optimal configuration of an optimization template by determining a descent direction along the objective function that models the running time of the kernel. Ansor and MetaSchedule (in contrast to AutoTVM) have the capability to generate new sketches. In other words, these schedulers are not restricted to a single sequence of optimizations. Section 2.1 provides further details on how Ansor works, whereas Section 2.2 explains how Droplet Search works.

A Combined Search Infrastructure.

Both Ansor and MetaSchedule typically generate higher-quality kernels compared to AutoTVM’s Droplet Search, as they are not limited to a single search space. Each new sketch leads to the exploration of an entirely new search space. However, when constrained to a single sketch, Droplet Search tends to outperform these schedulers. Drawing on the terminology from recent work by Ding et al. (2023), Droplet Search’s coordinate descent approach is “hardware centric”, while Ansor’s genetic algorithm is “input centric”, better embodying the quality that Sorensen et al. (2019) call “performance portability”. In essence, Droplet Search, by traversing the search space contiguously, is sensitive to cache sizes and levels in the cache hierarchy. Nevertheless, despite Droplet Search’s tendency to identify optimal points within the search space, this algorithm is unable to navigate beyond this space due to its reliance on the initial sketch.

Inspired by these observations, this paper addresses the following research question: “Is it possible to combine the wide exploration phase of Ansor [1] with Droplet Search’s exploitation, thus obtaining the advantages of each approach?” By doing so, we can develop a version of Ansor that produces superior kernels compared to the original tool while also reducing search times. This paper brings evidence that such a combination is effective. The core idea presented in this work is as follows: Initially, we allow Ansor to explore the kernel optimization space, leveraging its “space travel ability” to test different sequences of optimizations during this exploration. Subsequently, following this initial exploration phase, we identify the most promising kernel space discovered by Ansor and employ AutoTVM’s Droplet Search, a line search algorithm, to find a good kernel within this space.

[1] This paper uses Ansor as the basis for experiments; however, similar results have been reproduced in MetaSchedule (Canesche, 2024). We focus on Ansor because it is documented in an academic work (Zheng et al., 2020).

Summary of Findings.

This paper describes findings of an eminently empirical nature. The search techniques discussed in Section 2 are not an original contribution of this work: Ansor’s exploration algorithm was designed by Zheng et al. (2020), and Droplet Search was incorporated into AutoTVM by Canesche et al. (2024), and further explored by Li et al. (2024). Nevertheless, combining these two techniques into a practical tool required a number of experimental observations and engineering decisions, which Section 3 organizes into six research questions.

The new search methodology that emerged out of this combination has been considered sufficiently practical to be approved into Apache TVM. A request for comments was submitted to the TVM community in December of 2023, and a patch was approved into Ansor in February of 2024. The patch has not yet been merged into an official release of TVM at the time of this submission. We have, subsequently, incorporated a coordinate descent exploitation phase into MetaSchedule (Shao, 2021), which uses different search algorithms than Ansor. These results are even better than those observed in Ansor. Consequently, a new request for comments was submitted to the TVM community in June 2024, together with a patch for MetaSchedule (Canesche, 2024).

The experiments in Section 3 show that the combined exploration-exploitation methodology outperforms the original implementation of Ansor in terms of kernel quality and search speed. Positive results are reported in four different processors (AMD R7-3700X, Fujitsu ARM A64FX, NVIDIA RTX 3080, and NVIDIA A100), and in 20 popular deep-learning models, including AlexNet, VGG, ResNet, MobileNet, Inception, GoogleNet, and DenseNet. As an illustration, by terminating Ansor after sampling 300 kernels and then optimizing the best candidate with Droplet Search, we outperformed Ansor running with a budget of 10,000 samples in all the $4 \times 20$ architecture-model pairs. For instance, the combined search approach, on MnasNet, yields kernel speedups of 1.59x, 1.02x, and 1.08x on x86, ARM, and NVIDIA platforms, respectively. The search time for the combined approach correspondingly decreases by 1.25x, 1.25x, and 1.18x on these architectures.

2. Exploration via Ansor; Exploitation via Droplet Search

A kernel is an abstract concept: it can be represented by operations on memory indexed by a linear combination of natural numbers. The actual implementation of a kernel is determined by its schedule. The schedule of a kernel determines in which order the different memory elements are accessed when the kernel runs. Following the terminology introduced in the original Ansor work (Zheng et al., 2020), a schedule is the combination of two notions: a sketch and an annotation of the sketch. Definition 2.1 enumerates these notions.

Definition 2.1 (The Kernel Search Space).

The naïve implementation of a kernel replaces each linear index in the abstract representation of the kernel with a loop. A sketch is a sequence of transformations, such as loop fusion, splitting, or tiling, that can be applied to the naïve implementation of the kernel. An annotation of the sketch is the set of parameters that control the effect of each optimization in the sketch, such as unrolling factor, length of tiling window, number of threads in parallelization, etc. We call an annotated sketch a “kernel”. A kernel is a concrete program: it effectively runs. Each sketch determines a kernel search space, which is the set of every valid way to annotate that sketch.

Example 2.2.

Figure 1 (a) shows an example of an abstract kernel. Figure 1 (b) shows a naïve implementation of the abstract kernel seen in Figure 1 (a). Figure 1 (c) shows two sketches produced after the application of different code optimizations onto the program in Figure 1 (b).

Figure 1. (a) Abstract view of a kernel. (b) Naïve implementation of the abstract kernel. (c) Two optimization sketches for the naïve kernel. (d) Different annotations for the sketches.

Figure 1 (d) shows different parameters of the sketches in Figure 1 (c). As introduced in Definition 2.1, the set of every valid configuration of annotations for a given sketch forms the search space of that sketch. This space has one dimension for each parameter that is allowed to vary. Figure 2 shows two views of the optimization space of the two sketches in Figure 1 (c).

Figure 2. (a) A three-dimensional view of the optimization space formed by the parameters P1 and P2 seen in Figure 1 c-i. (b) A three-dimensional view of the optimization space of parameters P9 and PC.

2.1. Space Exploration via Ansor

Ansor solves kernel scheduling in an iterative process that involves three phases:

Sketch generation: in this phase, new sketches are created. As explained in Definition 2.1, each sketch determines one kernel search space.

Sketch annotation: in this phase, an initial population of annotated sketches is created. Following Definition 2.1, each annotated kernel is a point in the search space determined by the sketch that provides the annotations.

Kernel evolution: in this phase, candidate kernels are sorted according to an estimation of their performance (via Ansor’s cost model). The best candidates are sampled (executed and timed), and this information is used to improve the cost model.

Sketch Generation

The generation of new sketches happens via the application of a small collection of rewriting rules, which Figure 3 enumerates [2]. Each rewriting rule provokes a code transformation, such as tiling, parallelization, or unrolling. The sketch generation rules are hardware-dependent: some rules, like those that bind loop iterations to threads, are only well-defined for GPUs, for instance. Notice that in addition to the rules seen in Figure 3, users can still add custom rules to the set of available sketch transformations.

[2] Rules implemented in https://github.com/apache/tvm/tree/main/src/auto_scheduler/search_policy on 02/01/2024.

Figure 3. Sketch generation rules. Ansor uses these rules to change the kernel search space. Each rule modifies a sketch, e.g., fusing, splitting or tiling loops. However, these rules do not change the annotations in the sketch.
Example 2.3.

Figure 4 (a) shows a variation of the sketch earlier seen in Figure 1 (b). Each loop has an unrolling factor, which will have to be annotated in the initialization phase of the autotuning process. Figure 4 (b) shows an application of the “Always inlining” rule onto Figure 4 (a). It is important to note that neither of these program representations qualifies as a “kernel”, since they are not executable. To transform them into concrete kernels, the annotation within the unrolling factor must be assigned an actual value.

Figure 4. (a) Sketch of the abstract kernel seen in Figure 1 (a). (b) Sketch that ensues from the application of the “Always inlining” rule. (c) Sketch that ensues from the initialization of the unrolling factor.

The Initialization of Annotations

Sketches are not executable programs: they have annotations which must be replaced with actual values. These values are the parameters of optimizations, as Definition 2.1 explains. Thus, as a second step of the iterative exploration approach of Ansor, an initial population of annotated sketches is produced via the application of “initialization rules”. Figure 5 summarizes the rules available in Ansor at the time this work was produced. Each initialization rule is parameterized by a probability distribution, which associates concrete values with the probability that they can be chosen. Example 2.4 shows how these distributions are used.

Figure 5. Rules that Ansor uses to create an initial population of kernels, which will be the starting point to the evolutionary search.
Example 2.4.

Figure 4 (c) shows the concrete kernel that comes out of the application of the “Unroll” rule (in Figure 5) to the sketch in Figure 4 (b). Figure 4 (c) also shows the probability distribution used to randomly choose the initial unrolling factor of 32.
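The initialization step of Example 2.4 can be illustrated with a minimal Python sketch. The candidate unrolling factors and their probabilities below are hypothetical placeholders, not the values that Ansor uses; the code only shows how a probability distribution turns a one-parameter sketch into a population of concrete kernels.

```python
import random

# Hypothetical unrolling factors and selection probabilities (assumptions
# made for illustration; they are not taken from Ansor).
UNROLL_CANDIDATES = [1, 2, 4, 8, 16, 32]
UNROLL_PROBABILITIES = [1 / 6] * 6


def sample_unroll_factor(rng: random.Random) -> int:
    """Draw one unrolling factor according to the distribution above."""
    return rng.choices(UNROLL_CANDIDATES, weights=UNROLL_PROBABILITIES, k=1)[0]


def initial_population(size: int, seed: int = 0) -> list[int]:
    """Annotate `size` copies of a one-parameter sketch with sampled factors."""
    rng = random.Random(seed)
    return [sample_unroll_factor(rng) for _ in range(size)]
```

Each element of the list returned by `initial_population` stands for one annotated sketch, that is, one point in the search space of the sketch.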

Evolution of the Annotated Sketch

To explore different kernel spaces, Ansor keeps a population of promising candidate kernels. This population is updated in an iterative process. At each iteration, the current population of candidates evolves through the application of the mutation strategies enumerated in Figure 6. To select the next set of candidates, Ansor does not run every kernel in the current population. Instead, it uses a cost model to select candidates that are likely to run efficiently. These promising candidates are executed, and the result of these samples is used to recalibrate the cost model; hence, improving the estimates of the next candidates. Periodically, the algorithm reports statistical information regarding the search progress, including maximum and minimum scores, population size, and mutation success rates. Upon completion of the specified number of iterations, the best-performing kernels are recorded as the output.

Figure 6. Mutation rules for evolutionary search. In contrast to the initialization rules seen in Figure 5, the mutation rules take the history of previous annotations when determining the next value of a given annotation.
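The interplay between mutation, cost model, and sampling described above can be pictured with the toy loop below. It is a simplified sketch under strong assumptions: the cost model is replaced by a nearest-neighbor lookup over previous measurements, the `mutate` and `measure` callables are hypothetical stand-ins for Ansor's mutation rules and for the execution of a kernel, and the population is assumed to be non-empty with at least one iteration.

```python
import random
from typing import Callable, Dict, List, Tuple

Kernel = Tuple[int, ...]  # one concrete value per annotation


def evolve(
    population: List[Kernel],
    mutate: Callable[[Kernel, random.Random], Kernel],
    measure: Callable[[Kernel], float],
    iterations: int,
    samples_per_iteration: int,
    seed: int = 0,
) -> Tuple[Kernel, float]:
    """Toy loop: mutate, rank by estimated cost, measure the best, recalibrate."""
    rng = random.Random(seed)
    size = len(population)
    measured: Dict[Kernel, float] = {}  # kernel -> observed running time

    def estimate(kernel: Kernel) -> float:
        # Stand-in cost model (an assumption, not Ansor's learned model):
        # predict the time of the closest kernel measured so far.
        if not measured:
            return 0.0
        nearest = min(
            measured,
            key=lambda m: sum(abs(a - b) for a, b in zip(m, kernel)),
        )
        return measured[nearest]

    for _ in range(iterations):
        # Evolve the current population through mutation.
        candidates = set(population) | {mutate(k, rng) for k in population}
        unmeasured = [k for k in candidates if k not in measured]
        # Rank by estimated cost; only the most promising kernels are executed.
        unmeasured.sort(key=lambda k: (estimate(k), k))
        for kernel in unmeasured[:samples_per_iteration]:
            measured[kernel] = measure(kernel)  # feeds back into `estimate`
        # The fastest measured kernels seed the next iteration.
        population = sorted(measured, key=measured.get)[:size]
    best = min(measured, key=measured.get)
    return best, measured[best]
```

The number of calls to `measure` is bounded by `iterations * samples_per_iteration`, which mirrors the role of the budget of trials discussed next.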

Termination in Ansor

The number of possible schedules is very large; hence, Ansor limits the number of schedules it evaluates with a budget of trials. Each trial consists of the observation of the execution of an actual schedule, which happens at the end of the evolutionary phase. Because a machine learning model contains many kernels (layers), an initial round of trials is partitioned among these layers. Layers are grouped into a worklist, and receive a quota of trials in round-robin fashion. After an initial round of optimizations, layers that run for a very short time are removed from this worklist. This process ensures that layers that run for the longest time are subject to more extensive optimizations. This approach is what Zheng et al. (2020) call “optimizing with gradient descent”. Because the budget of trials is fixed, Ansor is guaranteed to terminate.

Limitations of Ansor

The main limitation of Ansor is the fact that it is oblivious to the structure of the search space. For instance, if we observe a performance improvement by increasing the unrolling factor of a loop from 4 to 6, it is likely that if we increase it further to 8, another improvement will also be observed. However, if going to 8 results in performance degradation, then further increases are likely to not bring improvements either. Ansor’s exploitation approach, via an evolutionary algorithm, is not aware of this notion of neighborhood between kernels or of potential convex regions in the optimization space.

2.2. Space Exploitation via Droplet Search

Droplet Search (Canesche et al., 2024) is a kernel scheduling algorithm available in AutoTVM. AutoTVM differs from Ansor because it does not create new sketches. Rather, it is restricted to modifying the parameters of a single sketch, which is the origin of the optimization space. AutoTVM provides several independent scheduling approaches: random sampling, grid sampling, genetic sampling, etc. However, only Droplet Search will be of interest to this presentation [3]. Droplet Search is a variation of an exploitation algorithm called Coordinate Descent [4]. It relies on the premise that the parameters of a sketch can be arranged into a coordinate space. Figure 7 contains an annotated version of the algorithm.

[3] Section 3.4 shall compare Droplet Search with the other techniques available in AutoTVM when they are used as Ansor’s exploitation mechanism.

[4] It is unclear who invented Coordinate Descent. Descriptions of the algorithm can be found in classic textbooks (Zangwill, 1969). For a comprehensive overview, we recommend the work of Wright (2015).

Figure 7. The Droplet Search kernel scheduling algorithm. This pseudo-code is a simplified version of the original presentation of the algorithm, taken from (Canesche et al., 2024). We have removed speculation and parallelism from this version as these features are immaterial for the presentation of our ideas.
Example 2.5.

Let us assume a sketch formed by two optimizations: unrolling and tiling. For the sake of this example, unrolling supports five “unrolling factors”: $\{1,2,3,4,5\}$. These are the parameters of the loop unrolling optimization. Tiling is parameterized by the size of the tiling window. Let us assume the following sizes: $\{1,2,4,8,16\}$. The optimization space, in this case, is formed by $5 \times 5$ points, such as $(1,1)$, which means no optimization, or $(3,16)$, which indicates that the loop must be unrolled three times, and then tiled with a window of size 16. These points, e.g., $(1,1)$, $(3,16)$, etc., are the coordinates of the optimization space.

From the notion of coordinates, Droplet Search defines a neighborhood function: a function that returns the neighbors of a given coordinate. Intuitively, the neighbors of a coordinate are the points that are the closest to it. In Example 2.5, the neighbors of $(\mbox{unrolling}=3, \mbox{tiling}=8)$ would be the points $(2,8)$, $(4,8)$, $(3,4)$ and $(3,16)$. From this concept of neighborhood, Droplet Search works iteratively as follows:

  1. At iteration zero, let the best current candidate be the set of parameters that implement no optimization.

  2. Let $(c_1, c_2, \ldots, c_n)$ be the best set of parameters discovered up to iteration $i$.

    (a) If there exists $c_i'$, $1 \leq i \leq n$, such that $(c_1, \ldots, c_i', \ldots, c_n)$ yields a faster kernel than $(c_1, \ldots, c_i, \ldots, c_n)$, then update the current best candidate to use $c_i'$ instead of $c_i$.

    (b) If there is no such $c_i'$, then the search terminates.
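The iteration above can be sketched in a few lines of Python over the optimization space of Example 2.5. This is an illustrative simplification, not the implementation available in AutoTVM: it omits speculation and parallelism, it moves to the fastest improving neighbor (the steps above only require some improving neighbor), and the `run_time` callable is a hypothetical stand-in for compiling and timing a kernel.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, ...]  # one index per parameter, into that parameter's domain


def neighbors(point: Point, domains: Sequence[Sequence[int]]) -> List[Point]:
    """Points that differ from `point` by one step along a single coordinate."""
    result = []
    for axis, index in enumerate(point):
        for step in (-1, 1):
            moved = index + step
            if 0 <= moved < len(domains[axis]):
                result.append(point[:axis] + (moved,) + point[axis + 1:])
    return result


def coordinate_descent(
    domains: Sequence[Sequence[int]],
    run_time: Callable[[Tuple[int, ...]], float],
    start: Point,
) -> Tuple[Tuple[int, ...], float]:
    """Move to the fastest improving neighbor until no neighbor is faster."""

    def values(p: Point) -> Tuple[int, ...]:
        # Translate indices into concrete parameter values.
        return tuple(domains[axis][i] for axis, i in enumerate(p))

    best, best_time = start, run_time(values(start))
    while True:
        timed = [(run_time(values(n)), n) for n in neighbors(best, domains)]
        faster = [t for t in timed if t[0] < best_time]
        if not faster:
            return values(best), best_time  # step (b): no faster neighbor
        best_time, best = min(faster)  # step (a): update the best candidate
```

With `domains = [[1, 2, 3, 4, 5], [1, 2, 4, 8, 16]]`, calling `neighbors((2, 3), domains)` yields the indices of the four points listed in the text for (unrolling = 3, tiling = 8), and starting the descent at `(0, 0)` corresponds to the unoptimized point (1, 1).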

Limitations of Droplet Search

Droplet Search is a fast search algorithm when compared to Ansor or to other approaches available in AutoTVM (Canesche et al., 2024). However, it has two fundamental limitations:

  • Droplet Search is restricted to a single sketch. In other words, it can modify a sketch’s annotations, but it cannot create new sketches. This is a limitation of any search algorithm used in AutoTVM, but it is not a limitation of Ansor.

  • Droplet Search depends highly on the initial schedule it receives as the seed of the search procedure. If this initial schedule does not exist in the same convex region as the optimal schedule, then Droplet Search cannot find the optimal schedule.

By combining Ansor and Droplet Search, we hope to circumvent the limitations of both search techniques. Section 2.3 explains how these two approaches can be used together.

2.3. Combining Ansor with Droplet Search

To combine Droplet Search and Ansor, we determine two parameters:

  • $K$: the budget of trials of Ansor.

  • $N < K$: a subset of trials.

We then proceed as follows:

  1. Run Ansor on the target model using only $N$ trials.

  2. Give the best schedule found with $N$ trials to Droplet Search.

  3. Run Droplet Search up to convergence.

Figure 8 provides some intuition on this modus operandi. As the figure illustrates, the proposed technique seeks to use Ansor to explore the universe of sketches and then use Droplet Search as the core strategy to explore concrete representations of these sketches.

Figure 8. Coarse exploration of different kernel search spaces with Ansor, and careful exploitation of best candidate with Droplet Search.
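The three steps of the combination can be summarized by the driver below. It is a schematic sketch only: `explore` and `exploit` are hypothetical callables that stand for a run of Ansor limited to a given number of trials and for a run of Droplet Search up to convergence; they do not correspond to actual TVM functions.

```python
from typing import Callable, Tuple, TypeVar

Schedule = TypeVar("Schedule")


def combined_search(
    explore: Callable[[int], Tuple[Schedule, float]],
    exploit: Callable[[Schedule, float], Tuple[Schedule, float]],
    n_trials: int,
    k_budget: int,
) -> Tuple[Schedule, float]:
    """Explore with N < K trials, then refine the best schedule found."""
    if not 0 < n_trials < k_budget:
        raise ValueError("expected 0 < N < K")
    # Step 1: exploration restricted to N trials (hypothetical callable).
    seed, seed_time = explore(n_trials)
    # Steps 2 and 3: hand the best schedule to the exploitation phase.
    refined, refined_time = exploit(seed, seed_time)
    # Exploitation only accepts faster neighbors, so it should never return
    # a slower schedule; the guard makes that expectation explicit.
    if refined_time <= seed_time:
        return refined, refined_time
    return seed, seed_time
```

Keeping the two phases behind separate callables reflects the fact that the exploitation phase only needs the best schedule and its running time, and not the internal state of the exploration phase.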

In Section 3, we demonstrate that by choosing proper values for $K$ and $N$, we can outperform Ansor in two ways: first, producing faster end-to-end machine learning models; second, reducing the search time of Ansor. The modifications needed to add this combination to the current code base of Apache TVM are relatively small: up to 307 lines of code in the TVM repository (Release 17, Apache TVM v0.15.0).

3. Evaluation

This section evaluates the idea proposed in this paper. In particular, it seeks to demonstrate that by exploiting, via Droplet Search, a reduced set of samples explored by Ansor, it is possible to outperform Ansor itself. We shall refer to this new version of Ansor, which uses Droplet Search, as the Combined Approach. Henceforth, we denote it as DPAnsor, reserving Ansor for the original implementation of that tool. In what follows, we explore six research questions:

RQ1: How many samples does DPAnsor need to observe to produce kernels that outperform those produced by Ansor with 10,000 trials?

RQ2: How many samples can DPAnsor observe and still outperform Ansor in terms of search time when the latter uses 10,000 trials?

RQ3: How does the size of models impact the behavior of DPAnsor, in terms of kernel performance and search speed?

RQ4: How does Droplet Search compare to other search techniques available in AutoTVM, in terms of their ability to exploit Ansor’s results?

RQ5: How does the average number of samples that Droplet Search gauges per layer vary with the initial budget allocated to DPAnsor’s exploration phase?

RQ6: How does DPAnsor compare with PyTorch and Ansor in terms of the speed of the kernels that it produces, considering a range of different machine-learning kernels?

Before diving into the research questions, we explain our experimental setup. Notice that a fully containerized version of this methodology has been organized as a docker image, which is publicly available at https://github.com/lac-dcc/bennu.

Hardware and Software.

We evaluated the scheduling approaches on four different architectures, as shown in Figure 9. The hardware consists of a general-purpose desktop architecture (AMD Ryzen 7 (AMD, 2019)), a cluster-based machine (ARM A64FX (Ookami, 2022)), and two graphics processing units (NVIDIA A100 (Nvidia, 2020a) and NVIDIA RTX3080 (Nvidia, 2020b)). The experiments reported in this section use versions of Ansor and AutoTVM (Droplet Search) available at Apache TVM v0.13.0, released in July 2023. The version used for TensorFlow was 2.14, and PyTorch was 2.0+cu118.

Figure 9. The architectures evaluated in this report.

Benchmarks.

This section evaluates kernel scheduling across twenty neural networks. The first column of Figure 10 contains the complete list of these models. All these models are implemented using the ONNX representation to make a comparison between TVM, PyTorch and TensorFlow possible, as seen in Section 3.6. The models used in our study are sourced from the ONNX model zoo available at https://github.com/onnx/models.

Methodology.

A machine learning model forms a graph of kernels. Ansor optimizes machine learning models per kernel, assuming kernels can be independently optimized. It starts with a budget of trials, where each trial is a transformation that can be applied to a kernel. Let us call this budget $K$. Ansor ensures that each kernel receives a fraction of these $K$ trials. Currently, this initial fraction is $\mathit{min}(K/L, 64)$, where $L$ is the number of layers (kernels) in the model. After an initial round of optimizations, Ansor applies the remaining trials onto kernels that run for the longest time. This approach directs the optimization effort to the kernels that are more likely to contribute to the overall running time of the end-to-end model. In what follows, all the results we report are relative to a baseline version of Ansor equipped with a budget of 10,000 trials. We shall test DPAnsor with either $K = 1, 10, 25, 50, 100, 200, 300$, or $1{,}000$ trials. Suppose we choose $K = 100$, for instance. In that case, we will run Ansor with a budget of 100 trials, pick the best configuration (which results from the independent optimization of the kernels), and give this configuration to AutoTVM’s Droplet Search. We then let Droplet Search run until it reaches convergence.
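The initial partition of the budget can be written down directly from the expression $\mathit{min}(K/L, 64)$. The sketch below assumes integer (floor) division, which the text does not specify, and uses hypothetical function names; the way Ansor spends the remaining trials is not modeled.

```python
def initial_quota(k_budget: int, num_kernels: int, cap: int = 64) -> int:
    """Initial number of trials per kernel: min(K / L, 64)."""
    if num_kernels <= 0:
        raise ValueError("a model needs at least one kernel")
    # Floor division is an assumption of this sketch.
    return min(k_budget // num_kernels, cap)


def remaining_trials(k_budget: int, num_kernels: int) -> int:
    """Trials left for the longest-running kernels after the first round."""
    return k_budget - initial_quota(k_budget, num_kernels) * num_kernels
```

For a hypothetical model with 25 kernels, a budget of 10,000 trials gives each kernel the cap of 64 trials and leaves 8,400 trials for the longest-running kernels, whereas a budget of 100 trials gives each kernel 4 trials and leaves none.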

Confidence

Sections 3.1-3.4 show relative results. It could be possible that the running times presented for kernels and schedulers are so similar that they render our conclusions meaningless. However, such is not the case. Scheduling runs for a very long time. To give the reader an idea, Ansor, with a budget of 10,000 trials (our baseline), takes 11,195 seconds to schedule AlexNet, our smallest model. This time drops to 2,839 seconds considering DPAnsor with a budget of 300 trials. The running time of kernels is much shorter: end-to-end models run for a few seconds. However, in every experiment, we consider averages of three samples and only report speedups if the p-value (produced via the non-parametric Wilcoxon rank-sum test) for the difference between the two populations is below 0.01.

3.1. RQ1 – On the Quality of End-to-End Models

We consider that a version of an end-to-end model is better than another for a given architecture when it runs faster in that architecture. The execution time of a model is determined by the schedule of the kernels that constitute it. If we apply Droplet Search to the best model produced by Ansor after it observes 10,000 trials, we will likely improve the model (at least, we should not make it worse). However, this section shows that obtaining a better model via DPAnsor with a much lower budget is possible in four different architectures. For brevity, we analyze in detail only the results obtained on an x86 CPU. For the other architectures, we show only summary results.

Discussion: AMD Ryzen 7 (x86-64)

The x86 architecture represents a widely adopted instruction set architecture (ISA), serving as the basis for Intel and AMD processors. Figure 10 compares kernel speed and search time of models running on an x86-64 CPU. The top part of the figure (labeled “10k speedup execution time”) compares the running time of the kernels produced by Ansor and DPAnsor. The lower part (labeled “10k speedup tuning time”) compares the search time of these two approaches, and shall be discussed in Section 3.2. In both cases, bars above 1.0 denote improvements of DPAnsor over Ansor. In terms of kernel speed, sampling 25 configurations with DPAnsor is sufficient—in most models—to outperform Ansor with 10K samples. Speedups improve gradually as more samples are added to DPAnsor, to the point that with 1,000 samples, we see an average speedup (geometric mean) of 34%.

Figure 10. Comparative Analysis of Optimization Results on the x86 Architecture Using an AMD Ryzen 7 3700X Processor. Numbers show 𝙰𝚗𝚜𝚘𝚛/𝙳𝙿𝙰𝚗𝚜𝚘𝚛𝙰𝚗𝚜𝚘𝚛𝙳𝙿𝙰𝚗𝚜𝚘𝚛\mathtt{Ansor}/\mathtt{DPAnsor}typewriter_Ansor / typewriter_DPAnsor ratios. Thus, results higher than 1.0 (in blue) denote improvements of DPAnsor (this paper) over Ansor.
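The ratios and the geometric mean used to summarize them can be computed as in the sketch below. The function names are ours and serve only for illustration; the geometric mean is evaluated in log space, which is a common choice and not necessarily how the figures were produced.

```python
import math
from typing import Sequence


def speedup(ansor_time: float, dpansor_time: float) -> float:
    """Ansor/DPAnsor ratio: values above 1.0 favor DPAnsor."""
    return ansor_time / dpansor_time


def geometric_mean(ratios: Sequence[float]) -> float:
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

As an example, the AlexNet search times reported earlier (11,195 seconds for Ansor and 2,839 seconds for DPAnsor with 300 trials) give `speedup(11195, 2839)`, which is approximately 3.94.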

Discussion: Other Architectures

Figure 11 (Left) compares the relative speed of kernels produced by DPAnsor over Ansor in all the four architectures available for this study: in addition to the AMD x86 of Figure 10, we see an Nvidia A100 (Ampere), an Nvidia RTX3080 (Ampere), and an ARM A64FX (aarch64). In every case, speedups of DPAnsor relative to Ansor emerge consistently with $K = 300$. However, with $K = 100$, we have already recorded speedups on the Nvidia A100 and on the ARM A64FX. We failed to see meaningful speedups on the Nvidia RTX3080 because Ansor, with a budget of 10K samples, seems to be very close to achieving peak performance on this GPU. Even when we increased its budget of samples, we could not obtain better kernels for the 20 different models used in this study. Nevertheless, notice that DPAnsor achieves this peak performance with 300 samples (plus fewer than 50 samples probed by Droplet Search). On the A100 GPU, in contrast, after 300 samples, DPAnsor already delivers kernels 12% faster than those produced by Ansor.

Figure 11. Left: relative speed of kernels produced by DPAnsor over Ansor on four different architectures. Right: relative search time of DPAnsor over Ansor. Every bar is the geometric mean of relative times observed on 20 different models. Results above 1.0 represent improvements of DPAnsor over Ansor.

3.2. RQ2 – On the Search Time

We define a scheduling approach as faster than another if it requires less time to converge to an end-to-end model’s final, optimized version. The search time of Ansor encompasses the time spent applying optimizations to kernels, deriving new optimizations, and running the kernels themselves, with a limit set at 10,000 trials. On the other hand, the search time of DPAnsor involves all the steps of Ansor, constrained to a lower number of trials, along with the time it takes to run Droplet Search until convergence on the kernels that compose a model. Although we have restricted Droplet Search to a maximum of 100 trials, it typically converges well before reaching that limit. This section compares the search time between Ansor and DPAnsor.
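The two phases whose time is accounted for above can be sketched on a toy problem. This is a minimal illustration, assuming a hypothetical two-parameter cost function and a deterministic comparison between neighbors; it is neither Ansor's sketch generation nor the actual Droplet Search, which compares kernels statistically. All names (`explore`, `exploit`, `toy_cost`, `BOUNDS`) are invented for the example.

```python
import random


def explore(cost, bounds, k, rng):
    """Exploration: sample k random configurations and return the cheapest one."""
    samples = [tuple(rng.randint(lo, hi) for lo, hi in bounds) for _ in range(k)]
    return min(samples, key=cost)


def exploit(cost, bounds, seed, max_trials=100):
    """Exploitation: coordinate descent that probes +/-1 along each axis."""
    best, best_cost, trials = seed, cost(seed), 0
    improved = True
    while improved and trials < max_trials:
        improved = False
        for axis, (lo, hi) in enumerate(bounds):
            for step in (-1, 1):
                value = best[axis] + step
                if not lo <= value <= hi or trials >= max_trials:
                    continue  # skip out-of-range neighbors and respect the budget
                candidate = best[:axis] + (value,) + best[axis + 1:]
                trials += 1
                candidate_cost = cost(candidate)
                if candidate_cost < best_cost:
                    best, best_cost, improved = candidate, candidate_cost, True
    return best, trials


# Hypothetical search space and cost (stand-ins for schedule parameters and running time).
BOUNDS = [(0, 15), (0, 15)]


def toy_cost(cfg):
    return (cfg[0] - 7) ** 2 + (cfg[1] - 3) ** 2


seed = explore(toy_cost, BOUNDS, k=25, rng=random.Random(0))
best, trials_used = exploit(toy_cost, BOUNDS, seed)
```

In this toy setting, the exploitation phase stops as soon as no neighbor improves on the current point, so it usually spends far fewer than the 100 trials that bound it.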

Discussion

The lower part of Figure 10 compares search times on x86. For most models, DPAnsor is consistently faster than Ansor for any number of trials up to $K=300$. At 1,000 trials, Ansor becomes consistently faster. Also, Ansor tends to outperform DPAnsor for very large models. This happens because Droplet Search takes longer to converge on them: the more complex the model, the more room Droplet Search has to optimize it. Section 3.3 further discusses the impact of the model size on the behavior of DPAnsor. Figure 11 summarizes the search time comparison for the other architectures. The pattern is similar to the one observed on x86: DPAnsor is consistently faster when $K\leq 300$, and slower (except on the Nvidia RTX3080) at $K=1{,}000$.

3.3. RQ3 – On the Impact of Model Size

The behavior of DPAnsor, when compared to Ansor, varies with the model’s size. We summarize this variation with two observations:

  1. The larger the model, the fewer samples DPAnsor needs to observe to outperform Ansor, if Ansor uses a budget of 10,000 samples.

  2. The larger the model, the lower the benefit, in terms of search time, of DPAnsor over Ansor.

The rest of this section provides data to support these two conclusions.

Discussion: The Search vs Quality Slope

The kernel optimization technique impacts two core numbers: the search time and the speed of the final model. We can use these two quantities, search time ($S$) and model performance ($P$), to define an $S\times P$ line characterizing the behavior of the optimization technique. Figure 12 shows these lines for four models and two architectures: AMD’s x86 and NVIDIA’s Ampere. We chose these two architectures because x86 is the scenario where DPAnsor performs better, and Ampere is the scenario where it performs worse.

Figure 12. The search vs quality line that characterizes four models optimized in the AMD (Left) and in the NVIDIA (Right) setting: the larger the model, the lower the slope. Numbers on the X and Y axes are speedup/slowdown relative to Ansor with a budget of 10,000 trials. Numbers for the left chart are available in Figure 10, and numbers for the right chart were observed on the Nvidia RTX3080. The labels on the dots (1 and 1k) refer to the number of trials given to DPAnsor.

Figure 12 uses our two smallest and two largest models. The numbers on the axes show ratios between DPAnsor and Ansor; the latter using a budget of 10,000 trials. Each dot in Figure 12 refers to the number of trials that DPAnsor is allowed to observe before shifting to Droplet Search. The figure labels dots that refer to DPAnsor with one sample (its most restrictive scenario) and dots that refer to 1,000 samples (its least restrictive scenario).

The slopes of the lines in Figure 12 are always negative, meaning that as more samples are given to DPAnsor, the difference between its search time and Ansor’s reduces, but the quality of the kernels that it finds improves. However, the inclination changes with the size of the model. The larger the model, the lower the benefit of DPAnsor over Ansor in terms of search time; but the higher the relative benefit in terms of kernel speed. This result is due to Ansor’s fixed budget of 10,000 trials. In a small model, more trials are distributed to each layer; in a large model, each layer receives only a handful of trials.
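The sign of the slope can be checked with two endpoints of such a line. The sketch below assumes that the search-time ratio sits on the X axis and the kernel-quality ratio on the Y axis; the endpoint values are hypothetical and were chosen only to illustrate the trade-off.

```python
def slope(first, last):
    """Slope of the line through two (search_time_ratio, quality_ratio) points."""
    (x0, y0), (x1, y1) = first, last
    if x1 == x0:
        raise ValueError("points must differ in search-time ratio")
    return (y1 - y0) / (x1 - x0)


# Hypothetical endpoints (not measurements). Ratios above 1.0 favor DPAnsor.
one_trial = (8.0, 0.9)        # much faster search, slightly slower kernels
thousand_trials = (0.8, 1.3)  # slower search, faster kernels
line_slope = slope(one_trial, thousand_trials)
```

A negative value means that any gain in kernel quality is paid for with a smaller advantage in search time.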

Figure 13 provides further data that supports the previous observations. The figure shows how kernel quality and search time vary with the size of models. If we consider AlexNet, which has only 13 layers, each layer might receive, on average, $10^{4}/13$ trials. Inevitably, good kernels will emerge from this search. Thus, the benefit of the exploitation phase, which uses Droplet Search, tends to be smaller. On the other hand, if we consider DenseNet201, which has 113 layers, then Ansor allocates, on average, $10^{4}/113$ trials per layer: fewer than 100 samples for each kernel of the computational graph. This number is too low to effectively explore the space of possible kernel implementations. In this case, Droplet Search has more opportunity to improve the kernels that Ansor finds. Even a single trial to find the origin of the optimization space is already enough for DPAnsor to outperform Ansor in terms of the quality of the model.
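The per-layer averages quoted above follow from a simple division. The sketch assumes that the budget is spread uniformly over layers, which is a simplification: a scheduler may assign more trials to some layers than to others. The helper name is hypothetical.

```python
def trials_per_layer(budget, layers):
    """Average number of trials that each layer receives from a fixed budget."""
    return budget / layers


alexnet = trials_per_layer(10_000, 13)       # about 769 trials per layer
densenet201 = trials_per_layer(10_000, 113)  # about 88 trials per layer
```

The ratio between the two averages, roughly 8.7, equals the ratio between the layer counts (113/13).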

Figure 13. (Left) Variation in kernel quality ($\mathtt{Ansor}/\mathtt{DPAnsor}$) with model size. (Right) Variation in search time ($\mathtt{Ansor}/\mathtt{DPAnsor}$) with model size.

3.4. RQ4 – On the Search Technique

The core idea of this paper consists in using Droplet Search to exploit results produced by Ansor. Therefore, an immediate question from this proposal is: what if other search techniques are used instead of Droplet Search, as the basic exploitation technique? Indeed, AutoTVM, the framework hosting Droplet Search, provides four other search techniques that could fill the same role as Droplet Search. These techniques are described as follows:

Random: a random search on the space of valid optimization parameters. In this case, each sample is randomly produced. Search is stateless, meaning that the results of a kernel bear no influence on the choice of the next kernel.

Grid: a grid search on the space of valid parameters of the optimizations. In this case, each sample derives from a regular and exhaustive variation of each optimization parameter. Search keeps a minimum of state, namely, a counter per optimization parameter, that goes over the range of acceptable values.

GA: a genetic algorithm that treats the sequence of optimization parameters as chromosomes. Search is stateful: the running times of already seen kernels provide information to guide the synthesis of the next kernels via operations such as mutation, crossover and pruning.

XGB: search is based on gradient boosting, as implemented by the XGBoost Library (Chen and Guestrin, 2016). Similar to the genetic algorithm, the search is stateful, as the running time of kernels guides the construction of the search tree.

In the rest of this section, we analyze the behavior of DPAnsor, once its search technique (the “DP” part of DPAnsor) is replaced by each one of the other four algorithms available in AutoTVM.
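The difference between the two simplest techniques, random and grid search, can be sketched over a small parameter space. This is a simplified illustration rather than AutoTVM's implementation: the space and the function names are hypothetical, and the grid enumeration here starts from the first point of the space instead of a seed.

```python
import itertools
import random


def random_samples(space, n, rng):
    """Stateless search: each sample is drawn independently of earlier results."""
    return [tuple(rng.choice(values) for values in space) for _ in range(n)]


def grid_samples(space, n):
    """Grid search: visit the Cartesian product in order, one counter per parameter."""
    return list(itertools.islice(itertools.product(*space), n))


# Hypothetical space: two tiling factors and an unrolling depth.
space = [(1, 2, 4, 8), (1, 2, 4, 8), (0, 16, 64)]

first_grid_points = grid_samples(space, 4)
# -> [(1, 1, 0), (1, 1, 16), (1, 1, 64), (1, 2, 0)]
drawn = random_samples(space, 5, random.Random(0))
```

Neither function looks at the running time of a kernel, which is why neither technique can benefit much from a good seed.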

Methodology

Every experiment in this section allocates a budget of 300 samples to Ansor for each model that we autotune. Exploitation, via one of the different search techniques in AutoTVM, uses a budget of 100 trials to optimize each layer of AlexNet. Droplet Search converged before 100 trials in every layer, but all the other approaches run 100 samples, as they do not have a notion of convergence. Thus, each other approach samples 1,200 kernel configurations. Each search technique in AutoTVM exploits the best sequence of optimizations found by Ansor, which is the same for all of them. Following the methodology adopted in Section 3.3, we report results for the AMD x86 and the NVIDIA RTX 3080 boards. These are the best and worst scenarios observed for DPAnsor in Sections 3.1 and 3.2.

Discussion: AMD Ryzen 7 (x86-64)

Figure 14 compares Droplet Search against other AutoTVM methods on the x86 architecture. As an exploitation technique, Droplet Search demonstrates superior performance in terms of execution time and kernel speedup. The random search technique is completely oblivious to the best seed provided by Ansor. The grid search is almost oblivious to it: although the grid search starts from a likely good kernel, it diverges quickly from that configuration. The other techniques can benefit from Ansor’s exploration phase. We feed both GA and XGB with a promising seed; however, these search techniques still need random seeds to diversify the initial population of kernels from where they depart. Because the other search techniques lack a convergence criterion, the relative performance of Droplet Search in tuning time is even better: it is approximately 3x faster.

Figure 14. Comparison of different exploitation techniques used in tandem with Ansor on the x86-64 setting. Numbers are ratios of “$\mathit{SearchTechnique}/\mathit{DropletSearch}$”. Thus, numbers above 1.0 demonstrate the effectiveness of the technique proposed in this paper.

Discussion: CUDA RTX 3080 (Ampere)

In the GPU setting, Droplet Search is still a superior exploitation technique in terms of search time and kernel speed, as Figure 15 demonstrates. In contrast to the CPU setting, invalid kernels are common in the GPU scenario. AutoTVM does not provide any way to constrain the search technique to using parameters that yield correct kernels. We observe, for instance, that the Grid search often generates invalid kernels, as the number of threads on the X, Y, and Z dimensions exceeds the total number of threads in the streaming multiprocessor.
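The kind of check that would reject such kernels before they are measured can be sketched as follows. The limit of 1,024 threads per block is an assumption about the target device, not a number taken from our experiments, and the function name is hypothetical; real devices impose further limits (per dimension, per multiprocessor, on shared memory) that this sketch ignores.

```python
# Assumed limit; in practice this value should be queried from the target device.
MAX_THREADS_PER_BLOCK = 1024


def is_valid_launch(threads_x, threads_y, threads_z, limit=MAX_THREADS_PER_BLOCK):
    """Accept a launch only if all dimensions are positive and their product fits the limit."""
    dims = (threads_x, threads_y, threads_z)
    if any(d < 1 for d in dims):
        return False
    return threads_x * threads_y * threads_z <= limit


fits = is_valid_launch(32, 32, 1)      # 1,024 threads: within the assumed limit
too_many = is_valid_launch(32, 32, 2)  # 2,048 threads: rejected
```

A search technique that enumerates each dimension independently, as the grid search does, reaches such oversized products quickly.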

Figure 15. Comparison of different exploitation techniques used in tandem with Ansor on the CUDA setting. Numbers are ratios of “$\mathit{SearchTechnique}/\mathit{DropletSearch}$”. Thus, numbers above 1.0 demonstrate the effectiveness of the technique proposed in this paper.

Discussion: Universality of these Results

Droplet Search is not universally superior to the other search techniques available in AutoTVM, when used as an exploitation method in Ansor. For instance, if we analyze the behavior of these different search approaches per layer of AlexNet, we observe situations where Droplet Search yields slower kernels than the other methods. Figure 16 demonstrates this point. These subpar results happen due to the stochastic nature of all these autotuning techniques. Droplet Search might stop on a suboptimal kernel because its convergence criterion is statistical in nature: if the running time of kernels is considered similar with a confidence level of 95%, then the coordinate descent procedure stops. However, regarding search speed, we have not identified one single layer of AlexNet where Droplet Search would be slower than the other search approaches. Additionally, if we repeat the same analysis per layer of AlexNet on the GPU setting, we observe that Droplet Search is superior to the other search approaches in every layer. We omit these results from this paper for the sake of space.

Figure 16. Results of Figure 14, analyzed per layer of AlexNet.

3.5. RQ5: The Average Number of Droplet Search Samples

As hinted in Section 3.4, Droplet Search has a convergence criterion discussed in Section 3.3 of its description (Canesche et al., 2024). In this case, the search stops once there is no statistically significant difference between the current kernel and the kernels within its neighborhood. Thus, the more optimized the seed of the coordinate descent algorithm, the fewer iterations it is intuitively expected to take until convergence. This section investigates if this hypothesis is true.
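A stopping rule of this kind can be sketched with the Python standard library. The choice of test is our assumption for illustration: the sketch compares mean running times with a two-sided normal approximation at a 95% confidence level, which may differ from the test that Droplet Search actually implements. The function names and the sample timings are hypothetical.

```python
from statistics import NormalDist, mean, variance


def significantly_different(times_a, times_b, confidence=0.95):
    """Two-sided test on the difference of means (normal approximation).

    Each list needs at least two measurements so that a variance exists.
    """
    se = (variance(times_a) / len(times_a) + variance(times_b) / len(times_b)) ** 0.5
    if se == 0:
        return mean(times_a) != mean(times_b)
    z = abs(mean(times_a) - mean(times_b)) / se
    return z > NormalDist().inv_cdf(0.5 + confidence / 2)


def converged(current_times, neighbor_times, confidence=0.95):
    """Stop when no neighbor is significantly faster than the current kernel."""
    current_mean = mean(current_times)
    return not any(
        mean(times) < current_mean
        and significantly_different(current_times, times, confidence)
        for times in neighbor_times
    )


# Hypothetical running times, in milliseconds (illustrative values only).
current = [10.0, 10.1, 9.9, 10.0]
similar_neighbor = [10.05, 9.95, 10.1, 9.9]
faster_neighbor = [8.0, 8.1, 7.9, 8.0]
```

With these values, `converged(current, [similar_neighbor])` holds, whereas adding `faster_neighbor` to the neighborhood keeps the search going. A rule of this kind can stop on a kernel that is slower than a neighbor by a margin too small to be detected, which is one way a suboptimal result can arise.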

Discussion

We count the number of iterations of Droplet Search on our different architectures, considering different budgets for DPAnsor. Figure 17 shows this number for each model evaluated on the x86 setting. The figure reports a total number of trials and an average number of trials per layer. The general tendency is that the larger the budget of trials allocated to DPAnsor, the faster Droplet Search converges.

Figure 17. Number of trials sampled by Droplet Search until reaching convergence, considering the AMD x86-64 architecture. The average number of trials per layer divides the total number of trials by the number of layers in each deep learning model.

Figure 18 summarizes, for each architecture, the numbers earlier seen in Figure 17. Figure 18 reports averages per model (considering the 20 available models), and averages per layer. In the latter case, we divide the total sum of trials observed for all the models by the sum of the number of layers present in every model. In every case, the same tendency is evident: more trials sampled during exploration imply fewer trials sampled during exploitation. This result is intuitive: as previously mentioned, Droplet Search tends to reach stability sooner the closer to a local optimum it starts.
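The two averages can be sketched as follows. The trial counts are hypothetical placeholders; only the layer counts of AlexNet and DenseNet201 come from the text. Notice that the per-layer average is pooled over all models, so it generally differs from the mean of the per-model averages.

```python
def average_per_model(trials_by_model):
    """Mean number of Droplet Search trials per model."""
    return sum(trials_by_model.values()) / len(trials_by_model)


def average_per_layer(trials_by_model, layers_by_model):
    """Pooled average: total trials of all models divided by their total number of layers."""
    total_layers = sum(layers_by_model[name] for name in trials_by_model)
    return sum(trials_by_model.values()) / total_layers


# Hypothetical trial counts (not measurements); layer counts follow the text.
trials = {"AlexNet": 260, "DenseNet201": 1130}
layers = {"AlexNet": 13, "DenseNet201": 113}

per_model = average_per_model(trials)          # 695.0 trials per model
per_layer = average_per_layer(trials, layers)  # 1390 / 126, about 11 trials per layer
```

In this example the individual models average 20 and 10 trials per layer, yet the pooled figure is close to 11, because the larger model dominates the sum.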

Figure 18. Average number of trials per model (considering 20 models) for different architectures. The averages per layer are the quotient of the total number of trials for all the models divided by the total number of layers.

3.6. RQ6: Comparison with Well-Known Machine-Learning Frameworks

Thus far, the results we evaluated in this section are constrained to the Apache TVM software stack. To provide the reader with some perspective on these results, we shall compare Ansor and DPAnsor with two well-known deep-learning frameworks: PyTorch 2.0 (Paszke et al., 2019) and TensorFlow v2.15.0 (Singh et al., 2020). Recently, Ansel et al. (2024) have shown that PyTorch is able to outperform Ansor in many different workloads. However, in Ansel et al.’s setting, Ansor was used as a backend, without autotuning the target kernels. In this section we activate autotuning for Ansor and DPAnsor. To this end, we give Ansor a budget of 1,000 trials. In contrast, we use DPAnsor with a budget of either 100 or 300 trials before activating Droplet Search.

Figure 19. Comparison between the running time of kernels produced via TensorFlow, PyTorch, Ansor or DPAnsor running on two different GPUs. The black cells highlight fastest results; the gray cells indicate ties (with confidence level of 95%). TensorFlow and PyTorch do not perform tuning.

Discussion

Figure 19 summarizes the results of the comparison of different deep-learning tools. In our setting, PyTorch is consistently faster than TensorFlow on the two different graphics processing units available for these experiments. However, Ansor and DPAnsor outperform PyTorch on large workloads (matmul, conv2d, depthwise and pooling). Corroborating these results, the benefit of Ansor over PyTorch was also observed by Li et al. (2021). Notice that these results do not include two kernels, reduce and ReLU. In these two cases, neither Ansor nor DPAnsor has much room to perform optimizations. These kernels show minimal memory reuse: they visit each memory cell only once, and little benefit can be acquired from autotuning. As a consequence, the search time is very short when compared with the search time spent on the larger kernels.

4. Related Work

The design and implementation of systems to run machine learning models have been experiencing constant progress. Following Ding et al. (2023), we recognize two main approaches to run deep-learning models, which we call interpretation and compilation. In the former approach, frameworks such as PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016), and the ONNX runtime (Jin et al., 2020) implement machine learning models by binding operations to libraries such as cuDNN, cuBLAS, and CUTLASS. In the latter approach, tools such as TVM, Halide (Ragan-Kelley et al., 2013), TorchInductor (Ansel et al., 2024) or XLA generate code for the operations that constitute the model. This work concerns compilation; nevertheless, to provide some perspective to the reader about the relative effectiveness of these systems, Section 3.6 compares our approach with standard distributions of PyTorch and TensorFlow. Contrary to the findings in Section 3.6, Ansel et al. have recently shown that PyTorch, in compilation mode, consistently outperforms Ansor. However, in those experiments, Ansor was used solely as a code generator without performing any autotuning. In our setup, once autotuning is enabled, Ansor can outperform PyTorch in most kernels.

Design and Construction of Autotuners

The original presentations of AutoTVM (Chen et al., 2018) and Ansor (Zheng et al., 2020) outline the key techniques utilized in this paper. These tools explore optimizations typically described via the polyhedral model (Feautrier, 1996). Said optimizations, including fusion, tiling, and fission, are commonplace in tools such as Polly (Grosser et al., 2012), Graphite (Trifunovic et al., 2010) and PLuTo (Bondhugula et al., 2008). These tools do not solve the autotuning problem; however, they offer the basic infrastructure to do so, as Tavarageri et al. (2021) have demonstrated with their PolyDL framework.

Recent research focuses on pruning the search space to enhance the efficiency of autotuners. For instance, Tollenaere et al. (2023)’s search algorithm uses a cost model to avoid exploring regions of the kernel space unlikely to yield good schedules. However, building an accurate analytical model seems an elusive problem. In the words of Ritter and Hack (2024): “Modern processor designs use many techniques to improve overall performance that cause complex, irregular performance characteristics.” Additionally, there has been much work on adapting autotuners to deal with tensors whose shape is not statically known (Mururu et al., 2023; Pfaffe et al., 2019). For instance, DietCode (Zheng et al., 2022) and $\mathtt{SoD}^{2}$ (Niu et al., 2024) use a cost model similar to Tollenaere et al.’s, to represent the shape of a tensor as symbols that will only be known at runtime. We notice that autotuning is not restricted to scheduling the order of computations within a kernel. For instance, autotuning techniques can be used to choose the format to store tensors (sparse vs dense) (Ahrens et al., 2022; Dhandhania et al., 2021; Won et al., 2023), or the bitwidth used for quantization (Hubara et al., 2021; Kloberdanz and Le, 2023).

This paper does not propose a new kernel scheduling algorithm. Rather, it brings forward the observation that the combination of two well-established heuristics tends to bring much benefit to the generation of high-quality kernels. This benefit is measured not only in terms of the speed of the final code, but also in terms of the efficiency of the search technique. In this sense, an important consequence of our work is the fact that it makes Ansor more hardware-aware. By using coordinate descent as an exploitation approach, Ansor’s search remains circumscribed to a region that is likely to benefit more from the available hardware. Some research groups have, independently, shown that the design of hardware-centric (in contrast to input-centric) autotuners tends to accelerate the search process (Zhu et al., 2022; Ding et al., 2023; Li et al., 2024). This paper demonstrates this point without designing an algorithm that needs to be parameterized with hardware characteristics.

5. Conclusion

This paper has defended the thesis that state-of-the-art kernel scheduling methodologies can greatly benefit from a simple exploitation phase based on coordinate descent. To support this thesis, we have implemented a combined exploration/exploitation search methodology in Ansor. The new approach uses Ansor to explore different kernel optimization spaces and uses Droplet Search as a post-exploration phase. This methodology improves Ansor and Droplet Search in different ways:

  • It enhances Ansor’s capability to exploit “hardware boundaries”. The previous implementation of Ansor was not aware of the relationships between neighboring kernel schedules, as it lacked a concept of “distance” between the implementations of kernels. The new exploration phase improves Ansor’s ability to better adjust optimization parameters to hardware constraints, such as cache sizes and vector widths.

  • It addresses Droplet Search’s two limitations: its reliance on a well-defined seed (the initial kernel that initiates coordinate descent), and its inability to explore different kernel spaces. Previously, the seed and the kernel space were determined manually, requiring a programmer to provide Droplet Search with an initial annotated sketch (Canesche et al., 2024; Li et al., 2024). The proposed methodology automates the seed generation process by utilizing Ansor.

As Section 3 demonstrates, the proposed extension improves Ansor in terms of kernel quality and search time. A container to reproduce those experiments is available at https://github.com/lac-dcc/bennu. That implementation was submitted to the Ansor community in late 2023 (Canesche, 2023). A similar extension was later deployed onto TVM’s MetaSchedule (Canesche, 2024), achieving even better results than those seen in Section 3. Thus, we believe that this combination of a wide exploration phase and a fine-grained exploitation step implemented via coordinate descent is general enough to be incorporated into different kernel schedulers.

Acknowledgment

This project was sponsored by Cadence Design Systems, and the authors express their gratitude to Eric Stotzer and Vanderson Rosário for facilitating Cadence’s financial support. Additionally, the authors acknowledge the support of CNPq (grants 314645/2020-9 and 406377/2018-9), FAPEMIG (grant PPM-00333-18), and CAPES (Edital PrInt). Finally, the authors extend their appreciation to the TVM Community for the valuable suggestions from its members, which significantly contributed to the enhancement of this work.

References

  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: a system for large-scale machine learning. In OSDI (Savannah, GA, USA) (OSDI’16). USENIX Association, USA, 265–283.
  • Ahrens et al. (2022) Willow Ahrens, Fredrik Kjolstad, and Saman Amarasinghe. 2022. Autoscheduling for sparse tensor algebra with an asymptotic cost model. In PLDI (San Diego, CA, USA). Association for Computing Machinery, New York, NY, USA, 269–285. https://doi.org/10.1145/3519939.3523442
  • AMD (2019) AMD. 2019. AMD Ryzen™ 7 3700X. https://www.amd.com/en/product/8446. [Online; accessed 18-Jan-2024].
  • Ansel et al. (2024) Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, CK Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Michael Suo, Phil Tillet, Eikan Wang, Xiaodong Wang, William Wen, Shunting Zhang, Xu Zhao, Keren Zhou, Richard Zou, Ajit Mathews, Gregory Chanan, Peng Wu, and Soumith Chintala. 2024. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In ASPLOS. ACM, New York, USA, 623–630.
  • Bondhugula et al. (2008) Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In PLDI (Tucson, AZ, USA) (PLDI ’08). Association for Computing Machinery, New York, NY, USA, 101–113. https://doi.org/10.1145/1375581.1375595
  • Canesche (2023) Michael Canesche. 2023. [RFC] Combine Ansor and AutoTVM to Improve Scheduling. https://discuss.tvm.apache.org/t/rfc-combine-ansor-and-autotvm-to-improve-scheduling/16337. Experimental results are listed on a report that is publicly available at https://homepages.dcc.ufmg.br/~michaelcanesche/paper/bennu_meta_version.pdf.
  • Canesche (2024) Michael Canesche. 2024. [RFC] Adding an Exploitation Phase to MetaSchedule to Improve Scheduling. https://discuss.tvm.apache.org/t/rfc-metaschedule-adding-an-exploitation-phase-to-metaschedule-to-improve-scheduling/17365. Experimental results are listed on a report that is publicly available at https://homepages.dcc.ufmg.br/~michaelcanesche/paper/bennu_meta_version.pdf.
  • Canesche et al. (2024) Michael Canesche, Vanderson M. Rosario, Edson Borin, and Fernando Magno Quintão Pereira. 2024. The Droplet Search Algorithm for Kernel Scheduling. 25 pages. https://doi.org/10.1145/3650109 Just Accepted.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In KDD (San Francisco, California, USA). Association for Computing Machinery, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
  • Chen et al. (2018) Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated End-to-End optimizing compiler for deep learning. In OSDI. USENIX, Berkeley, USA, 578–594.
  • Dhandhania et al. (2021) Sunidhi Dhandhania, Akshay Deodhar, Konstantin Pogorelov, Swarnendu Biswas, and Johannes Langguth. 2021. Explaining the Performance of Supervised and Semi-Supervised Methods for Automated Sparse Matrix Format Selection. In ICPP (Lemont, IL, USA). Association for Computing Machinery, New York, NY, USA, Article 6, 10 pages. https://doi.org/10.1145/3458744.3474049
  • Ding et al. (2023) Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. 2023. Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs. In ASPLOS (Vancouver, BC, Canada). Association for Computing Machinery, New York, NY, USA, 370–384. https://doi.org/10.1145/3575693.3575702
  • Feautrier (1996) Paul Feautrier. 1996. Automatic Parallelization in the Polytope Model. In The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications. Springer-Verlag, Berlin, Heidelberg, 79–103.
  • Grosser et al. (2012) Tobias Grosser, Armin Größlinger, and Christian Lengauer. 2012. Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation. Parallel Process. Lett. 22, 4 (2012), 23 pages. https://doi.org/10.1142/S0129626412500107
  • Hubara et al. (2021) Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. 2021. Accurate Post Training Quantization With Small Calibration Sets. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, Maastricht, NL, 4466–4475.
  • Jin et al. (2020) Tian Jin, Gheorghe-Teodor Bercea, Tung D. Le, Tong Chen, Gong Su, Haruki Imai, Yasushi Negishi, Anh Leu, Kevin O’Brien, Kiyokuni Kawachiya, and Alexandre E. Eichenberger. 2020. Compiling ONNX Neural Network Models Using MLIR. arXiv:2008.08272 [cs.PL]
  • Kloberdanz and Le (2023) Eliska Kloberdanz and Wei Le. 2023. MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search. arXiv:2309.17341 [cs.LG]
  • Li et al. (2024) Chendi Li, Yufan Xu, Sina Mahdipour Saravani, and Ponnuswamy Sadayappan. 2024. Accelerated Auto-Tuning of GPU Kernels for Tensor Computations. In ICS (Kyoto, Japan). Association for Computing Machinery, New York, NY, USA, 549–561. https://doi.org/10.1145/3650200.3656626
  • Li et al. (2021) Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, and P. Sadayappan. 2021. Analytical characterization and design space exploration for optimization of CNNs. In ASPLOS (Virtual, USA). Association for Computing Machinery, New York, NY, USA, 928–942. https://doi.org/10.1145/3445814.3446759
  • Mururu et al. (2023) Girish Mururu, Sharjeel Khan, Bodhisatwa Chatterjee, Chao Chen, Chris Porter, Ada Gavrilovska, and Santosh Pande. 2023. Beacons: An End-to-End Compiler Framework for Predicting and Utilizing Dynamic Loop Characteristics. Proc. ACM Program. Lang. 7, OOPSLA2, Article 228 (oct 2023), 31 pages. https://doi.org/10.1145/3622803
  • Niu et al. (2024) Wei Niu, Gagan Agrawal, and Bin Ren. 2024. SoD2: Statically Optimizing Dynamic Deep Neural Network. arXiv:2403.00176 [cs.PL]
  • Nvidia (2020a) Nvidia. 2020a. Nvidia A100 Tensor Core GPU. https://www.nvidia.com/en-in/data-center/a100/. [Online; accessed 18-Jan-2024].
  • Nvidia (2020b) Nvidia. 2020b. Nvidia RTX3080 Tensor Core GPU. https://www.nvidia.com/en-in/geforce/graphics-cards/30-series/. [Online; accessed 18-Jan-2024].
  • Ookami (2022) Ookami. 2022. ACCESS Research Provider, A64FX Cluster. https://www.stonybrook.edu/ookami/. [Online; accessed 18-Jan-2024].
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: an imperative style, high-performance deep learning library. In NIPS. Curran Associates Inc., Red Hook, NY, USA, Article 721, 12 pages.
  • Pfaffe et al. (2019) Philip Pfaffe, Tobias Grosser, and Martin Tillmann. 2019. Efficient hierarchical online-autotuning: a case study on polyhedral accelerator mapping. In ICS (Phoenix, Arizona) (ICS ’19). Association for Computing Machinery, New York, NY, USA, 354–366. https://doi.org/10.1145/3330345.3330377
  • Ragan-Kelley et al. (2013) Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. SIGPLAN Not. 48, 6 (jun 2013), 519–530. https://doi.org/10.1145/2499370.2462176
  • Ritter and Hack (2024) Fabian Ritter and Sebastian Hack. 2024. Explainable Port Mapping Inference with Sparse Performance Counters for AMD’s Zen Architectures. arXiv:2403.16063 [cs.PF]
  • Sarofeen et al. (2022) Christian Sarofeen, Piotr Bialecki, Jie Jiang, Kevin Stephano, Masaki Kozuki, Neal Vaidya, and Stas Bekman. 2022. Introducing nvFuser, a deep learning compiler for PyTorch. https://pytorch.org/blog/introducing-nvfuser-a-deep-learning-compiler-for-pytorch/. [Online; accessed 15-Mar-2024].
  • Shao (2021) Junru Shao. 2021. [RFC] Meta Schedule (AutoTensorIR). https://discuss.tvm.apache.org/t/rfc-meta-schedule-autotensorir/10120.
  • Singh et al. (2020) Pramod Singh and Avinash Manure. 2020. Introduction to TensorFlow 2.0. 24 pages.
  • Sorensen et al. (2019) Tyler Sorensen, Sreepathi Pai, and Alastair F. Donaldson. 2019. One Size Doesn’t Fit All: Quantifying Performance Portability of Graph Applications on GPUs. In IISWC. IEEE, New York, US, 155–166. https://doi.org/10.1109/IISWC47752.2019.9042139
  • Tavarageri et al. (2021) Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Bharat Kaul, Gagandeep Goyal, and Ramakrishna Upadrasta. 2021. PolyDL: Polyhedral Optimizations for Creation of High-performance DL Primitives. ACM Trans. Archit. Code Optim. 18, 1, Article 11 (jan 2021), 27 pages. https://doi.org/10.1145/3433103
  • Tollenaere et al. (2023) Nicolas Tollenaere, Guillaume Iooss, Stéphane Pouget, Hugo Brunie, Christophe Guillon, Albert Cohen, P. Sadayappan, and Fabrice Rastello. 2023. Autotuning Convolutions Is Easier Than You Think. ACM Trans. Archit. Code Optim. 20, 2, Article 20 (mar 2023), 24 pages. https://doi.org/10.1145/3570641
  • Trifunovic et al. (2010) Konrad Trifunovic, Albert Cohen, David Edelsohn, Feng Li, Tobias Grosser, Harsha Jagasia, Razya Ladelsky, Sebastian Pop, Jan Sjödin, and Ramakrishna Upadrasta. 2010. GRAPHITE Two Years After: First Lessons Learned From Real-World Polyhedral Compilation. In GROW. HAL, Pisa, Italy, 17 pages. https://inria.hal.science/inria-00551516
  • Won et al. (2023) Jaeyeon Won, Charith Mendis, Joel S. Emer, and Saman Amarasinghe. 2023. WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program. In ASPLOS (Vancouver, BC, Canada). Association for Computing Machinery, New York, NY, USA, 920–934. https://doi.org/10.1145/3575693.3575742
  • Wright (2015) Stephen J. Wright. 2015. Coordinate Descent Algorithms. Math. Program. 151, 1 (jun 2015), 3–34. https://doi.org/10.1007/s10107-015-0892-3
  • Zangwill (1969) W. Zangwill. 1969. Nonlinear Programming, A Unified Approach (1st ed.). Prentice Hall, USA.
  • Zheng et al. (2022) Bojian Zheng, Ziheng Jiang, Cody Hao Yu, Haichen Shen, Joshua Fromm, Yizhi Liu, Yida Wang, Luis Ceze, Tianqi Chen, and Gennady Pekhimenko. 2022. DietCode: Automatic Optimization for Dynamic Tensor Programs. In MLSys (Vancouver, BC, Canada). Marculescu and Chi, New York, NY, USA, 848–863. https://doi.org/10.1145/3575693.3575702
  • Zheng et al. (2020) Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In OSDI. USENIX Association, USA, Article 49, 17 pages.
  • Zhu et al. (2022) Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. 2022. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning. In OSDI, Marcos K. Aguilera and Hakim Weatherspoon (Eds.). USENIX Association, New York, USA, 233–248. https://www.usenix.org/conference/osdi22/presentation/zhu
  • Zolotukhin (2021) Mikhail Zolotukhin. 2021. NNC walkthrough: how PyTorch ops get fused. https://dev-discuss.pytorch.org/t/nnc-walkthrough-how-pytorch-ops-get-fused/125. [Online; accessed 15-Mar-2024].