Scorch: A Library for Sparse Deep Learning

Bobby Yan1, Alexander J. Root1, Trevor Gale1, David Broman2,1, Fredrik Kjolstad1
1Stanford University   2KTH Royal Institute of Technology
1{bobbyyan,ajroot,tgale,broman,kjolstad}@cs.stanford.edu, 2[email protected]
Abstract

The rapid growth in the size of deep learning models strains the capabilities of traditional dense computation paradigms. Leveraging sparse computation has become increasingly popular for training and deploying large-scale models, but existing deep learning frameworks lack extensive support for sparse operations. To bridge this gap, we introduce Scorch, a library that seamlessly integrates efficient sparse tensor computation into the PyTorch ecosystem, with an initial focus on inference workloads on CPUs. Scorch provides a flexible and intuitive interface for sparse tensors, supporting diverse sparse data structures. Scorch introduces a compiler stack that automates key optimizations, including automatic loop ordering, tiling, and format inference. Combined with a runtime that adapts its execution to both dense and sparse data, Scorch delivers substantial speedups over hand-written PyTorch Sparse (torch.sparse) operations without sacrificing usability. More importantly, Scorch enables efficient computation of complex sparse operations that lack hand-optimized PyTorch implementations. This flexibility is crucial for exploring novel sparse architectures. We demonstrate Scorch’s ease of use and performance gains on diverse deep learning models across multiple domains. With only minimal code changes, Scorch achieves 1.05–5.78× speedups over PyTorch Sparse on end-to-end tasks. Scorch’s seamless integration and performance gains make it a valuable addition to the PyTorch ecosystem. We believe Scorch will enable wider exploration of sparsity as a tool for scaling deep learning and inform the development of other sparse libraries.

1 Introduction

The rapid increase in model size and sophistication has driven the success of deep learning across domains like computer vision, natural language processing, and speech recognition. This growth strains the capabilities of traditional dense computation paradigms, as model parameters often exceed available processing resources and memory. Sparse computation techniques offer a promising solution by focusing compute and storage only on the most relevant elements [25].

Sparsity arises naturally in many contexts within deep learning and can generally be categorized into three groups: data sparsity, weight sparsity, and activation sparsity. It arises in data, as many emerging application domains involve inherently sparse data representations. For instance, real-world graphs often exhibit power-law distributions in node connectivity, yielding sparse adjacency matrices [59]. Similarly, the high-dimensional feature spaces used in recommender systems [46] and natural language processing [12] are highly sparse.

Sparsity also arises within models, where weight and activation sparsity can provide significant efficiency benefits. Techniques like pruning [24, 8, 41] have shown that weights can be sparsified with little impact on accuracy. Sparse architectures, such as mixture-of-experts (MoEs) [53, 36, 16] and sparse transformers [12, 6, 30], sparsify activations using sparsely-gated conditional computation, improving model capability without proportional increases in the memory or computational cost.

However, it is challenging for researchers and end users to take full advantage of sparsity in deep learning due to limitations in existing deep learning frameworks. Mainstream frameworks like PyTorch [48], JAX [9], and TensorFlow [1] provide sparse support for only a handful of operations on a few selected sparse data structures (tensor formats). As a result, leveraging sparsity in these frameworks often demands significant engineering effort to develop custom implementations for new operations. These frameworks lack the flexibility for researchers to explore the diverse range of potentially beneficial sparse data structures and computations in modern deep learning architectures.

Indeed, nearly every domain where sparsity arises has one-off libraries with hand-written and hand-optimized kernels, such as PyTorch Geometric [17] for graph learning, TensorLy [33] for sparse tensor factorization, and MegaBlocks [20] for MoEs. Although these libraries effectively target their specific applications, they do not generalize to other domains, resulting in a fragmented software ecosystem for sparse deep learning. The different abstractions create barriers to adoption, as they require users to learn new APIs and paradigms for each library. This also results in duplicated developer effort, as each domain-specific library implements its own sparse kernels and optimizations. Therefore, there is an opportunity for a unified framework that efficiently handles sparse operations and representations across various domains in deep learning.

To tackle these challenges, we introduce Scorch (available under the MIT license at https://anonymous.4open.science/r/scorch), the first library to integrate comprehensive and efficient sparse tensor computation capabilities into PyTorch. Scorch adds sparsity support to existing PyTorch operations, allowing researchers and practitioners to introduce sparsity into their models by simply declaring one or more tensors to be sparse. For instance, weight matrices can be made sparse, sparse gating can be used in architectures like MoEs for conditional computation, and sparse adjacency matrices can be efficiently processed in graph neural networks (GNNs). By preserving PyTorch’s interface, Scorch enables the natural and seamless expression of sparse models without requiring a separate ecosystem of sparse operations. Our initial work focuses on accelerating compute operations on CPUs, which enables sparse inference workloads and lays a foundation for general sparse computing in PyTorch. We leave GPU acceleration and auto-differentiation as future work.

The key insight behind Scorch is that it is possible to provide comprehensive sparsity support with minimal API changes to existing tensor frameworks. Sparsity is just a property of the values of a tensor and should not require a separate programming model. Everything doable with dense tensors should also be possible with sparse tensors. Scorch is designed to enable this functionality with a single line of code: import scorch as torch. Scorch enables general sparsity support by integrating state-of-the-art sparse code generation infrastructure [31, 13, 52] into a modern ML framework. However, prior work on sparse compilation lacks several technical components necessary to achieve such minimal API changes without sacrificing performance. We identify several key missing components and make the following technical contributions:

  • A fast auto-scheduler for optimizing loop ordering, fusion, and the insertion of temporary data structures in the generated sparse kernels.

  • A tiling algorithm that optimizes sparse compute kernels to improve cache locality.

  • A format inference algorithm that determines the output format for each operation based on the nature of the operation and the input formats.

We evaluate Scorch on models across a range of domains, including graph neural networks, sparse autoencoders, and sparse transformers. We demonstrate speedups of 1.05–5.78× over PyTorch Sparse. This paper illustrates how to design a high-performance end-to-end sparse ML library. By enabling the natural expression and efficient execution of sparse models, Scorch can facilitate greater research exploration and adoption of sparsity.

2 Related Work

Deep learning frameworks. Several existing frameworks aim to support deep learning models for training and inference, but native sparse support in mainstream frameworks like PyTorch [48], JAX [9], and TensorFlow [1] is limited in generality, efficiency, and usability. Their sparsity support focuses on select formats like COO (coordinate list) and CSR (compressed sparse row), and a limited number of sparse operations like SpMM and SpMV. While libraries like MKL-Sparse and cuSPARSE provide low-level sparse primitives, they do not support high-order tensor operations, most formats across all operations, or fusion. Consequently, libraries that build support for general sparse tensor operations on top of them suffer from excessive data movement, which leads to poor performance.
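To make the two formats named above concrete, the following minimal sketch converts COO triplets into the three CSR arrays. It is written in plain Python for illustration; the function name coo_to_csr and the example matrix are assumptions of this sketch and are not taken from Scorch or PyTorch.

```python
def coo_to_csr(num_rows, rows, cols, vals):
    """Convert COO triplets (assumed free of duplicates) to CSR arrays."""
    # Sort entries by (row, column) so that each row's entries are contiguous.
    order = sorted(range(len(vals)), key=lambda n: (rows[n], cols[n]))
    # Count the entries in each row, then prefix-sum the counts into row pointers.
    indptr = [0] * (num_rows + 1)
    for r in rows:
        indptr[r + 1] += 1
    for r in range(num_rows):
        indptr[r + 1] += indptr[r]
    indices = [cols[n] for n in order]
    data = [vals[n] for n in order]
    return indptr, indices, data


# 3x4 example matrix with four nonzeros, given as unsorted COO triplets:
# [[5, 0, 0, 1],
#  [0, 0, 0, 0],
#  [0, 2, 3, 0]]
rows, cols, vals = [2, 0, 2, 0], [1, 0, 2, 3], [2.0, 5.0, 3.0, 1.0]
indptr, indices, data = coo_to_csr(3, rows, cols, vals)
```

COO stores one (row, column, value) triplet per nonzero, while CSR replaces the row array with row pointers, so the nonzeros of row i occupy positions indptr[i] to indptr[i + 1] of the other two arrays.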

Domain-specific libraries like PyTorch Geometric (PyG) [17] and Deep Graph Library (DGL) [57] have been developed to target sparse computation in specific applications. While they provide powerful capabilities within their target domains, they lack the generality to support other applications and sparse data structures. Scorch introduces a unified sparse tensor abstraction and compiler infrastructure to enable efficient and composable sparse deep learning across multiple domains.

Tensor algebra and machine learning compilation. Many compiler frameworks have been proposed for optimizing tensor operations and neural networks, such as Halide [49], TVM [11], XLA [22], Glow [50], and Tensor Comprehensions [55]. These compilers introduce new intermediate representations (IRs) and tensor algebra compilation techniques. However, they primarily focus on dense tensor computations and lack first-class support for sparse code generation.

Recent work on sparse tensor algebra compilers, such as TACO [31], Tiramisu [5], MLIR Sparse [7], and SparseTIR [60], has made significant progress in code generation for sparse tensor algebra expressions. However, these compilers are not integrated with deep learning frameworks, making it challenging to use them for sparse deep learning workloads. They also lack the efficient and general automatic optimization algorithms necessary for high-performance sparse computation. Scorch incorporates ideas from the sparse code generation literature into the PyTorch ecosystem and adds automatic optimization machinery to provide a seamless, end-to-end approach to productive and performant sparse deep learning.

3 Design

Scorch’s design is guided by three principles that aim to balance performance, usability, and generality in the context of sparse deep learning:

Seamless integration with PyTorch. To facilitate adoption and usability, a sparse deep learning library must integrate seamlessly with popular deep learning frameworks. Scorch is designed as a natural extension of PyTorch, with a familiar and intuitive interface for defining and manipulating sparse tensors. Users can leverage their existing PyTorch knowledge and codebases and easily switch between dense and sparse computation. This enables smooth adoption of Scorch and allows researchers to focus on developing new models rather than learning new frameworks.

Unified abstractions for dense and sparse computation. Deep learning computations often involve a mix of dense and sparse tensors, but existing frameworks provide limited support for sparse tensors in their operators. For instance, PyTorch’s einsum and reshape operations do not support sparse tensors. To address this limitation, a sparse library should offer comprehensive operators that can freely operate on any mix of dense and sparse tensors, allowing users to express complex computations in natural tensor notation. This includes element-wise and non-linear operations, linear algebra, einsum operations, and tensor manipulations like reshape and transpose. Scorch accomplishes this by extending popular PyTorch operations to support sparse tensors. By providing unified operator support, Scorch lets users leverage the full power of PyTorch’s API while taking advantage of efficient sparse computation under the hood.

Robust automatic performance optimization. Achieving high performance on sparse workloads requires careful co-optimization of storage, loop orderings, use of intermediates, and tiling schemes. These choices collectively determine both the asymptotic complexity of a sparse kernel and the constant factors contributing to the runtime. The compiler stack of Scorch is capable of making reasonable decisions across this scheduling space using a set of carefully designed heuristics. Scorch’s optimization algorithms are designed around the following hierarchy of goals: First, format inference and loop ordering decisions should match those implemented in the state-of-the-art hand-written kernels. Second, loop ordering and storage decisions should ensure good worst-case asymptotic complexity to avoid unpredictable performance cliffs. Third, tiling should be applied to dense loops within sparse kernels to optimize cache usage. This heuristic hierarchy enables Scorch to match the performance of existing hand-written kernels and achieve predictable performance on unknown kernels, without burdening the user with manual performance tuning.

3.1 Programming Model

Scorch introduces a unified programming model for sparse and dense deep learning. Users opt into sparse execution via an import statement that makes standard PyTorch operations compatible with sparse tensors. Figure 1 illustrates the simplicity and flexibility of Scorch’s programming model and how it improves productivity without sacrificing performance. The first example shows a Sampled Dense-Dense Matrix Multiplication (SDDMM) operation, commonly used in GNNs and recommender systems. It takes a sparse matrix $A$ and two dense matrices $B$ and $C$ as inputs and computes the sparse matrix $D$ as the element-wise product of $A$ and the matrix product of $B$ and $C$, i.e., $D = A \odot (BC)$. PyTorch does not support fused SDDMM for sparse COO inputs due to the lack of a hand-written kernel. Scorch enables users to express this computation concisely using einsum. By leveraging its compiler, Scorch automatically generates an efficient fused kernel for SDDMM. The generated kernel co-iterates over the sparse and dense tensors, avoiding the asymptotically more expensive dense matrix multiplication in the worst case. As seen in this example, fusion is crucial for performance, as it can reduce the overall computational complexity. By supporting mixed sparse and dense inputs to operators like einsum, Scorch unlocks a wide range of sparse operations that would be difficult or impossible to express using PyTorch’s fixed set of hand-written sparse kernels.
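The complexity argument for fusing SDDMM can be illustrated with a small plain-Python sketch. It does not use Scorch or PyTorch: the sparse matrix is modeled as a dictionary from coordinates to values, and the function names are assumptions made for this illustration. The unfused version forms the full dense product before sampling, whereas the fused version computes one dot product per stored nonzero.

```python
def sddmm_unfused(a_coo, B, C):
    """Compute the full dense product B @ C, then sample it at A's nonzeros."""
    n, k, m = len(B), len(C), len(C[0])
    # O(n * m * k) work, regardless of how sparse A is.
    BC = [[sum(B[i][t] * C[t][j] for t in range(k)) for j in range(m)]
          for i in range(n)]
    return {(i, j): a * BC[i][j] for (i, j), a in a_coo.items()}


def sddmm_fused(a_coo, B, C):
    """Iterate over A's nonzeros only: one length-k dot product per nonzero."""
    k = len(C)
    # O(nnz(A) * k) work.
    return {(i, j): a * sum(B[i][t] * C[t][j] for t in range(k))
            for (i, j), a in a_coo.items()}


# Toy inputs: A is 3x2 with two nonzeros, B is 3x2, C is the 2x2 identity.
A = {(0, 1): 2.0, (2, 0): -1.0}
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = [[1.0, 0.0], [0.0, 1.0]]
D_unfused = sddmm_unfused(A, B, C)
D_fused = sddmm_fused(A, B, C)
```

Both functions return the same sparse result; only the amount of work differs, which is the effect that a generated fused kernel is meant to capture.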

Figure 1: Scorch SDDMM and sparse attention examples
Figure 2: Scorch architecture

3.2 System Overview

Figure 2 shows an overview of the architecture of Scorch. Its compiler lowers high-level sparse tensor operations into optimized kernels for efficient execution. Compilation begins by transforming Python code to a domain-specific intermediate representation (IR) that exposes optimization opportunities. The compiler then applies the following optimization passes to the IR:

  • Automatic scheduling: Applies loop nest optimizations, tiling, and parallelization techniques specifically tuned for sparse tensor computations.

  • Format inference: Automatically determines the most suitable output storage format for each operation based on its inputs and the operation.

  • Code generation: Translates the optimized IR into efficient low-level kernels specialized for the specific storage formats of the input and output tensors.

A lightweight dispatcher orchestrates the execution of kernels. For common operations like SpMM, the dispatcher retrieves pre-compiled kernels from cache to save compilation time. For kernels that are not in the cache, the dispatcher invokes the above compiler at runtime to generate C++ code, which is then compiled and dynamically linked into the library.
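The caching behavior of such a dispatcher can be sketched as follows. This is a hypothetical illustration, not Scorch’s implementation: the class KernelDispatcher, the cache key (operation name plus input formats), and the stand-in toy_compile function are all assumptions, and the "kernels" are ordinary Python callables rather than compiled C++ code.

```python
class KernelDispatcher:
    """Cache kernels by (operation, input formats); hypothetical sketch."""

    def __init__(self, compile_fn):
        self._compile = compile_fn   # assumed signature: (op, formats) -> callable
        self._cache = {}
        self.compilations = 0        # counts cache misses for illustration

    def dispatch(self, op, formats, *args):
        key = (op, tuple(formats))
        kernel = self._cache.get(key)
        if kernel is None:           # cache miss: compile once, then reuse
            kernel = self._compile(op, formats)
            self._cache[key] = kernel
            self.compilations += 1
        return kernel(*args)


def toy_compile(op, formats):
    """Stand-in for code generation: returns a Python callable per operation."""
    if op == "add":
        return lambda x, y: x + y
    if op == "mul":
        return lambda x, y: x * y
    raise NotImplementedError(op)


dispatcher = KernelDispatcher(toy_compile)
first = dispatcher.dispatch("mul", ["csr", "dense"], 3, 4)    # compiles
second = dispatcher.dispatch("mul", ["csr", "dense"], 5, 6)   # cache hit
third = dispatcher.dispatch("mul", ["coo", "dense"], 2, 2)    # new formats, compiles
```

Keying on the formats as well as the operation reflects the statement above that generated kernels are specialized to the storage formats of their inputs and outputs.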

4 Optimizations

Scorch introduces several optimizations to ensure the generated code has good and predictable performance. Efficient execution of sparse tensor operations requires careful consideration of the loop order, tiling, and the use of temporary tensor data structures. Suboptimal choices can lead to poor cache utilization, inefficient memory access patterns, and unnecessary data movement. In fact, the wrong loop order in a sparse loop nest can even change its asymptotic complexity [3]. To address this challenge, Scorch introduces new algorithms for auto-scheduling sparse kernels that determine the loop order, workspace insertion, and tiling strategy.

4.1 Automatic Loop Ordering and Workspace Insertion

Unlike dense computations, where different loop orders have the same asymptotic complexity, the choice of loop order can impact the worst-case asymptotic complexity and thus the observed performance of sparse computations [3]. Our loop ordering algorithm is designed to provide predictable performance to users by avoiding loop orders with poor asymptotic complexity. However, optimizing solely for the worst-case complexity may lead to bad average-case performance. We provide a fast heuristic auto-scheduler that achieves good average-case performance while avoiding the most asymptotically inefficient loop orderings.

Consider the sparse matrix-matrix multiplication (SpGEMM), $C_{ik} = \sum_j A_{ij} B_{jk}$, where the inputs $A$, $B$ and the output $C$ are stored in CSR format. Figure 3 illustrates three possible loop orderings for SpGEMM: outer product, Gustavson, and inner product. The loops iterate over the coordinates of different dimensions of the matrices, and the arrows denote the data structures for each matrix. For example, the $A$ matrix in Gustavson stores $i$ coordinates (denoted $A_i$) followed by $j$ coordinates (denoted $A_j$); in the figure, $T_i$ denotes the set of coordinates in the dimension of tensor $T$ indexed by index variable $i$. The inner product algorithm has worse asymptotic complexity than the other two algorithms because the data structures of neither $A$ nor $B$ connect the coordinates in the two outer loops. They thus have to be iterated densely, meaning every component of the result has to be visited, even those that result in zeros. The other algorithms do not have this problem and thus perform much better. Moreover, algorithms that perform intersections between sparse matrices in higher loops (early filtering) perform better than those that perform intersections in inner loops. Scorch’s loop ordering algorithm avoids bad asymptotic behavior, even at the expense of transposes to allow matrices to be iterated in different orders.
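The difference between the Gustavson and inner product orders can be seen in a small plain-Python sketch that counts loop iterations. As a simplified stand-in for CSR, each matrix is a list of rows and each row is a sorted list of (column, value) pairs; the function names and the step counters are assumptions of this sketch.

```python
def spgemm_gustavson(A, B, num_cols):
    """Loop order (i, j, k): scatter scaled rows of B into a 1D workspace."""
    C, steps = [], 0
    for row in A:                          # i loop over rows of A
        workspace = [0.0] * num_cols       # dense temporary for one output row
        touched = set()
        for j, a in row:                   # j loop over nonzeros of A[i, :]
            for k, b in B[j]:              # k loop over nonzeros of B[j, :]
                workspace[k] += a * b
                touched.add(k)
                steps += 1
        C.append([(k, workspace[k]) for k in sorted(touched)])
    return C, steps


def spgemm_inner(A, B_cols, num_cols):
    """Loop order (i, k, j): visit every (i, k) pair, then intersect on j."""
    C, steps = [], 0
    for row in A:                          # i loop
        out = []
        for k in range(num_cols):          # dense k loop: nothing links i to k
            col, p, q, acc, hit = B_cols[k], 0, 0, 0.0, False
            steps += 1
            while p < len(row) and q < len(col):   # two-pointer intersection on j
                if row[p][0] == col[q][0]:
                    acc += row[p][1] * col[q][1]
                    hit = True
                    p += 1
                    q += 1
                elif row[p][0] < col[q][0]:
                    p += 1
                else:
                    q += 1
            if hit:
                out.append((k, acc))
        C.append(out)
    return C, steps


def transpose(M, num_cols):
    """Rows of pairs to columns of pairs (needed by the inner product order)."""
    cols = [[] for _ in range(num_cols)]
    for i, row in enumerate(M):
        for k, v in row:
            cols[k].append((i, v))
    return cols


# Two 4x4 matrices with three nonzeros each.
A = [[(0, 1.0)], [], [(1, 2.0), (3, 1.0)], []]
B = [[(2, 3.0)], [(0, 1.0)], [], [(0, 4.0)]]
C_gustavson, steps_gustavson = spgemm_gustavson(A, B, 4)
C_inner, steps_inner = spgemm_inner(A, transpose(B, 4), 4)
```

On this input both orders produce the same result, but the Gustavson order performs 3 inner iterations while the inner product order visits all 16 output positions. The sketch also shows the two side costs mentioned in the text: the Gustavson order needs a workspace, and the inner product order needs a transposed copy of $B$.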

Figure 3: SpGEMM loop orders
Figure 4: Tiling SpMM ($C_{ik} = \sum_j A_{ij} B_{jk}$)

Our algorithm (Algorithm 1) first collects and sorts index variables in the tensor expression by descending sparsity level. The sparsity level of an index variable is approximated by the presence of intersections (filters) between levels. In this case, $j \in A_j \cap B_j$ is the only sparse filter as it involves an intersection between the sparse level $A_j$ and the dense level $B_j$. The tie between $i$ and $k$ is broken by choosing the order that requires fewer transposes. This gives an initial loop order $\mathcal{L} = [j, i, k]$.

The algorithm then takes a greedy approach and iteratively evaluates the cost of pushing a sparse filter down the loop nest. In the SpGEMM example, the algorithm considers the net cost of moving $j$ to $\textit{pos} = 1$: the filter no longer applies to the rows of $A$ in the $i$ loop, but the 2D workspace can be replaced by a 1D workspace since we no longer scatter into the $i$ dimension, and $A$ no longer needs to be transposed to iterate in $j, i$ order. As the net cost is negative, the loop order is updated to $[i, j, k]$. In the next iteration, we consider the net cost of moving $j$ to $\textit{pos} = 2$, i.e., the inner product order $[i, k, j]$: while the 1D workspace can be eliminated since we no longer scatter into the result, the filter no longer applies to any of the tensors, and $B$ needs to be transposed to be iterated in the $k, j$ order. Moreover, as there are no input data structures that allow us to iterate sparsely from $i$ to $k$, it incurs a large penalty in the cost function. This makes the net cost positive, so the loop order remains unchanged.

Algorithm 1 Loop Ordering and Workspace Insertion
Input: Tensor expression $E$, input tensors $\mathcal{T}$, output tensor $T_{out}$.
Output: Loop order $\mathcal{L}$ for efficient computation.
$\mathcal{I} \leftarrow$ GetIndexVariables($E$)    ▷ Set of all index variables in the expression
$\mathcal{S} \leftarrow$ SortBySparsity($\mathcal{I}$, $E$)    ▷ Sort index variables by descending sparsity
$\mathcal{L} \leftarrow$ InitLoopOrder($\mathcal{S}$)
for all $i \in \mathcal{S}$ do    ▷ Loop ordering
    currentPos $\leftarrow$ GetPosition($\mathcal{L}$, $i$)    ▷ Current position of index $i$ in $\mathcal{L}$
    for pos $\leftarrow$ currentPos to $|\mathcal{L}| + 1$ do
        if Cost($\mathcal{L}$, $i$, pos) $< 0$ then    ▷ Net cost of pushing $i$ to position pos
            $\mathcal{L} \leftarrow$ MoveToPosition($\mathcal{L}$, $i$, pos)
$G \leftarrow$ InitGraph($\mathcal{I}$)    ▷ Initialize a directed graph with nodes $\mathcal{I}$
for all $T \in \mathcal{T} \cup \{T_{out}\}$ do
    $\pi \leftarrow$ GetModeOrdering($T$)    ▷ Get the mode ordering of tensor $T$
    for $k \leftarrow 1$ to $|\mathcal{I}(T)| - 1$ do
        AddEdgeToGraph($G$, $\pi[k]$, $\pi[k+1]$)    ▷ Add edge to capture ordering constraint
while ContainsCycles($G$) do
    $e \leftarrow$ RemoveCheapestEdge($G$)    ▷ Remove the cheapest edge from the graph
    $\mathcal{L} \leftarrow$ UpdateLoopOrder($\mathcal{L}$, $e$)    ▷ Update the loop order by transposing the tensor
if HasSparseLevels($T_{out}$) then    ▷ Workspace insertion
    $\mathcal{I}_{red} \leftarrow$ GetReductionVariables($\mathcal{T}$, $T_{out}$)    ▷ Get reduction variables
    if ShouldInsertWorkspace($\mathcal{I}_{red}$, $\mathcal{I}(T_{out})$, $\mathcal{L}$) then
        $W \leftarrow$ InsertWorkspace($i$)    ▷ Insert a workspace over loop $i$
        $(\mathcal{L}_p, \mathcal{L}_c) \leftarrow$ SplitLoopOrder($\mathcal{L}$, $i$)    ▷ Split the loop order
        $\mathcal{L}_p \leftarrow$ UpdateProducerLoop($\mathcal{L}_p$, $W$, $E$)    ▷ Update the producer loop
        $\mathcal{L}_c \leftarrow$ UpdateConsumerLoop($\mathcal{L}_c$, $T_{out}$, $W$)    ▷ Update the consumer loop
return $\mathcal{L}$
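The constraint-graph step of Algorithm 1 can be illustrated in isolation. The sketch below builds one edge per pair of adjacent levels in each tensor’s mode ordering and checks whether a loop order consistent with all tensors exists. It is an assumption-laden simplification: the function names are hypothetical, cycle detection uses Kahn’s topological sort (the paper does not say which method Scorch uses), and the cost model that selects the cheapest edge to remove is omitted because the text does not specify it.

```python
def ordering_edges(mode_orderings):
    """Collect one edge (u, v) per adjacent pair of levels: loop u encloses loop v."""
    edges = set()
    for order in mode_orderings.values():
        edges.update(zip(order, order[1:]))
    return edges


def loop_order(mode_orderings):
    """Return a loop order consistent with every tensor, or None on a cycle."""
    edges = ordering_edges(mode_orderings)
    variables = sorted({v for order in mode_orderings.values() for v in order})
    indegree = {v: 0 for v in variables}
    for _, v in edges:
        indegree[v] += 1
    ready = [v for v in variables if indegree[v] == 0]
    order = []
    while ready:                       # Kahn's algorithm
        u = ready.pop(0)
        order.append(u)
        for a, b in sorted(edges):
            if a == u:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    # Variables left unordered can only be explained by a cycle.
    return order if len(order) == len(variables) else None


# SpGEMM with A, B, and C all stored row-major (CSR): no conflict.
spgemm = loop_order({"A": ("i", "j"), "B": ("j", "k"), "C": ("i", "k")})

# C_ij = A_ij * B_ji with every tensor stored row-major: A wants i before j,
# B wants j before i, so the graph has a cycle and no loop order fits.
conflict = loop_order({"A": ("i", "j"), "B": ("j", "i"), "C": ("i", "j")})

# Transposing B (storing it in i, j order) removes the offending edge.
resolved = loop_order({"A": ("i", "j"), "B": ("i", "j"), "C": ("i", "j")})
```

In the conflicting case, removing an edge corresponds to transposing one tensor so that its storage order agrees with the chosen loop order, which matches the comment on UpdateLoopOrder in Algorithm 1.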

4.2 Automatic Tiling

Tiling is a crucial optimization that improves cache utilization and reduces memory traffic by partitioning the iteration space into smaller blocks (tiles) that fit in the cache. Selecting which loops to tile is challenging, as suboptimal choices may hurt performance. Scorch introduces a novel sparse tiling algorithm that analyzes the tensor expression to determine which loops to tile based on several key observations:

First, opportunities for tiling can be inferred from the index variables present in each tensor access. In the SpMM example $C_{ik} = \sum_j A_{ij} B_{jk}$, there are three tensor accesses: $C[i,k]$, $A[i,j]$, and $B[j,k]$. If a tensor access is missing an index variable present in the full expression, it means that tensor access is reused across the loop corresponding to the missing index. For instance, $C[i,k]$ is missing $j$, indicating that it is reused across the $j$ loop. Thus, tiling the $i$ and $k$ loops would improve cache locality for $C$ if everything were dense. Second, tiling sparse dimensions is often counterproductive, as it requires performing expensive searches in the sparse data structures. Although it may be beneficial in highly tuned systems, Scorch avoids it in order to provide robust and predictable performance. In SpMM, for a CSR input $A$, the $j$ loop should not be tiled since $A_j$ is a sparse dimension. Third, not all dense dimensions should be tiled, as illustrated in Figure 4. While tiling $i$ in addition to $k$ would improve reuse of the rows of $A$, it would also require a larger working set for $B$ since the tiles of $k$ would be exhausted more frequently. In the dense case, the $j$ loop can be tiled to offset this, but since sparse dimensions are not tiled, tiling $i$ should be avoided. In general, tiling loops whose index variables are parent index variables of a sparse dimension should be avoided.
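The loop structure that results from tiling only the dense $k$ loop of SpMM can be sketched as follows. This is plain Python for illustration, so it shows the iteration order rather than any measurable cache benefit; the row-of-pairs representation of $A$, the function names, and the tile size are assumptions of this sketch.

```python
def spmm(A, B, num_cols):
    """Untiled reference: C[i][k] = sum_j A[i][j] * B[j][k], A as rows of pairs."""
    C = [[0.0] * num_cols for _ in A]
    for i, row in enumerate(A):
        for j, a in row:
            for k in range(num_cols):
                C[i][k] += a * B[j][k]
    return C


def spmm_tiled_k(A, B, num_cols, tile):
    """Tile only the dense k loop; the sparse j loop is traversed as stored."""
    C = [[0.0] * num_cols for _ in A]
    for k0 in range(0, num_cols, tile):        # outer loop over k tiles
        k1 = min(k0 + tile, num_cols)
        for i, row in enumerate(A):            # i is not tiled
            for j, a in row:                   # sparse j: no tiling, no searches
                for k in range(k0, k1):        # tile-wide slice of B[j] and C[i]
                    C[i][k] += a * B[j][k]
    return C


# A is 3x3 and sparse; B is 3x5 and dense. The tile size does not divide 5.
A = [[(0, 1.0), (2, 2.0)], [], [(1, 3.0)]]
B = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [0.0, 1.0, 0.0, 1.0, 0.0],
     [2.0, 2.0, 2.0, 2.0, 2.0]]
C_reference = spmm(A, B, 5)
C_tiled = spmm_tiled_k(A, B, 5, tile=2)
```

Within one $k$ tile, every row of $B$ that is touched contributes only a tile-wide slice, which is the working-set reduction the text describes.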

These insights form the basis of Scorch’s tiling algorithm, which can be summarized as follows: First, for each tensor access, collect its index variables if the set of index variables in the tensor access is a strict subset of the set of all index variables in the expression, as they indicate data reuse. Then, remove any index variables that correspond to a sparse dimension of any tensor. Finally, loops whose index variables are parents of a sparse index variable should not be tiled. The full algorithm is provided in Section C.1.
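The three steps summarized above can be sketched in a few lines of plain Python. The input encoding (each tensor access as a list of index variables paired with a level format, in storage order) and the function name loops_to_tile are assumptions of this sketch, as is the choice to treat every level stored above a sparse level in the same tensor as a parent; the full algorithm in the appendix may differ.

```python
def loops_to_tile(accesses):
    """accesses: {tensor: [(index_var, "dense" | "sparse"), ...]} in storage order."""
    all_vars = {v for levels in accesses.values() for v, _ in levels}

    # Step 1: an access that omits some index variable is reused across that loop,
    # so its own index variables become tiling candidates.
    candidates = set()
    for levels in accesses.values():
        present = {v for v, _ in levels}
        if present < all_vars:                 # strict subset
            candidates |= present

    # Step 2: drop variables that index a sparse level of any tensor.
    sparse_vars = {v for levels in accesses.values()
                   for v, fmt in levels if fmt == "sparse"}
    candidates -= sparse_vars

    # Step 3: drop variables stored above a sparse level in the same tensor.
    parents = set()
    for levels in accesses.values():
        for pos, (_, fmt) in enumerate(levels):
            if fmt == "sparse":
                parents |= {v for v, _ in levels[:pos]}
    return candidates - parents


# SpMM with a CSR matrix A and dense B and C.
spmm_tiles = loops_to_tile({
    "C": [("i", "dense"), ("k", "dense")],
    "A": [("i", "dense"), ("j", "sparse")],
    "B": [("j", "dense"), ("k", "dense")],
})

# The same expression with every tensor dense.
dense_tiles = loops_to_tile({
    "C": [("i", "dense"), ("k", "dense")],
    "A": [("i", "dense"), ("j", "dense")],
    "B": [("j", "dense"), ("k", "dense")],
})
```

For SpMM with a CSR input, only the $k$ loop survives all three steps, which agrees with the discussion of Figure 4; in the fully dense case all three loops remain candidates.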

4.3 Format Inference

Tensor format inference is another key optimization in Scorch that automatically determines the storage format of the output tensor based on the storage formats of the input tensors and the semantics of the tensor expression. By inferring storage formats amenable to good performance, Scorch can generate efficient sparse kernels without requiring the user to manually specify the output format.

Scorch’s tensor format inference algorithm operates on a per-dimension basis, considering the interaction between the storage formats of each dimension in the input tensors. The intuition behind the algorithm is that multiplying a sparse level with any other level (sparse or dense) results in a sparse level in the output, while adding a dense level to any other level yields a dense level in the output. Appendix D contains details of the algorithm and its theoretical properties.
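The per-dimension intuition can be written down as a small rule table. The sketch below covers only element-wise multiplication and addition of operands that share the same index order; the function names are hypothetical, and the two cases the text does not spell out (a product of only dense levels, and a sum of only sparse levels) are filled in here as assumptions: dense and sparse, respectively. The actual algorithm in the appendix also handles cases this sketch ignores.

```python
def infer_level(op, operand_levels):
    """Infer one output level from the operand levels that share its index."""
    if op == "mul":
        # A product is zero wherever any factor is zero (intersection),
        # so a single sparse operand level makes the output level sparse.
        return "sparse" if "sparse" in operand_levels else "dense"
    if op == "add":
        # A sum can be nonzero wherever any term is nonzero (union),
        # so a single dense operand level makes the output level dense.
        return "dense" if "dense" in operand_levels else "sparse"
    raise ValueError(f"unsupported operation: {op}")


def infer_format(op, operand_formats):
    """Apply the per-level rule dimension by dimension."""
    return [infer_level(op, levels) for levels in zip(*operand_formats)]


csr = ["dense", "sparse"]       # dense rows, compressed columns
dense = ["dense", "dense"]

mul_format = infer_format("mul", [csr, dense])
add_format = infer_format("add", [csr, dense])
add_sparse_format = infer_format("add", [csr, csr])
```

Under these rules, multiplying a CSR matrix element-wise by a dense matrix keeps the CSR-like format, while adding the two yields a dense result.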

5 Evaluation

In this section, we demonstrate that Scorch adds general sparse support to PyTorch and achieves performance comparable to or better than hand-tuned sparse kernels on end-to-end models across several domains: graph neural networks, sparse autoencoders, and sparse transformers. We perform the benchmarks on an Apple M1 Ultra CPU (3.2 GHz, 20 cores) with 64 GB of memory.

5.1 Graph Neural Networks

We evaluate Scorch on the task of node classification using Graph Convolutional Networks (GCNs) [29]. GCNs are a popular class of graph neural networks that learn node representations by iteratively aggregating features from neighboring nodes. The core operation in a GCN layer is:

$$\mathbf{H}^{(l+1)}=\sigma\left(\hat{\mathbf{A}}\mathbf{H}^{(l)}\mathbf{W}^{(l)}\right) \tag{1}$$

where $\mathbf{H}^{(l)}$ are the node features at layer $l$, $\hat{\mathbf{A}}$ is the normalized sparse adjacency matrix, $\mathbf{W}^{(l)}$ are the learnable weights, and $\sigma$ is a non-linear activation function.
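To make Equation 1 concrete, the following pure-Python sketch computes one GCN layer with the adjacency matrix stored in CSR form and ReLU as the activation. It is only an illustration of the arithmetic: the helper names are hypothetical, and this is neither Scorch's generated kernel nor the PyTorch implementation used in the benchmarks.

```python
from typing import List

Matrix = List[List[float]]


def spmm_csr(indptr, indices, data, dense: Matrix) -> Matrix:
    """Multiply a CSR matrix by a dense matrix, visiting only stored nonzeros."""
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(len(indptr) - 1)]
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            k, a_ik = indices[p], data[p]
            for j in range(n_cols):
                out[i][j] += a_ik * dense[k][j]
    return out


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def gcn_layer(indptr, indices, data, h: Matrix, w: Matrix) -> Matrix:
    """Compute ReLU(A_hat @ H @ W) with A_hat stored in CSR."""
    aggregated = spmm_csr(indptr, indices, data, h)
    return [[max(0.0, v) for v in row] for row in matmul(aggregated, w)]


# Two nodes; A_hat = [[0.5, 0.5], [0.5, 0.5]] in CSR form.
indptr, indices, data = [0, 2, 4], [0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]
h = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, -2.0], [0.0, 1.0]]
print(gcn_layer(indptr, indices, data, h, w))  # [[2.0, 0.0], [2.0, 0.0]]
```

The sparse product is applied first so that the work of the aggregation step is proportional to the number of stored edges rather than to the square of the node count.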

We compare the inference performance of Scorch against implementations in PyTorch, PyG, and DGL on four benchmark datasets: Cora [43], Citeseer [21], PubMed [51], and OGBN-arXiv [26].

Figure 5 shows the inference times of the GCN model and the normalized speedups relative to PyTorch on the four datasets. Scorch achieves an average speedup of 2.08×. The speedup achieved by Scorch can be attributed to several key optimizations in its SpMM kernel. Scorch applies tiling to improve cache locality and reduce memory traffic. It also uses pragma unroll directives to encourage loop unrolling, which can help reduce loop overhead and improve instruction-level parallelism.

Notably, the performance of PyG and DGL relative to PyTorch varies with problem size. For smaller datasets like Citeseer and Cora, PyG and DGL achieve significant speedups over PyTorch. However, their advantage diminishes as the problem size grows, as on the larger PubMed and OGBN-arXiv datasets. This can be explained by the different algorithm these libraries use for neighborhood aggregation. Instead of using SpMM directly, they use a gather-scatter algorithm that indexes the source node features with the edge indices and scatters the results into the destination nodes. While the gather-scatter algorithm is faster than SpMM for smaller graphs, it does not scale well to larger problems, as is evident from the decreasing normalized speedups of PyG and DGL relative to PyTorch.
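For readers unfamiliar with the gather-scatter formulation, here is a minimal pure-Python sketch of edge-wise neighborhood aggregation. The function name and the edge-list layout (`src`, `dst`, `weight`) are assumptions made for illustration; this is not the PyG or DGL implementation.

```python
def gather_scatter(src, dst, weight, h):
    """Aggregate neighbor features edge by edge."""
    n_feat = len(h[0])
    # Gather: materialize one weighted message per edge from its source node.
    messages = [[weight[e] * x for x in h[src[e]]] for e in range(len(src))]
    # Scatter: add each message into the row of its destination node.
    out = [[0.0] * n_feat for _ in range(len(h))]
    for e, msg in enumerate(messages):
        for j in range(n_feat):
            out[dst[e]][j] += msg[j]
    return out


# Same two-node graph as a fully connected adjacency with weight 0.5 per edge.
src = [0, 1, 0, 1]
dst = [0, 0, 1, 1]
weight = [0.5, 0.5, 0.5, 0.5]
h = [[1.0, 2.0], [3.0, 4.0]]
print(gather_scatter(src, dst, weight, h))  # [[2.0, 3.0], [2.0, 3.0]]
```

The result matches what an SpMM with the corresponding adjacency matrix would produce. Note that this sketch materializes one message per edge, so its intermediate buffer has as many rows as the graph has edges; whether that is what limits PyG and DGL on larger graphs is not established by the measurements above.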

Figure 5: GCN Performance

5.2 Sparse Autoencoders

We evaluate the performance of sparse autoencoders implemented using Scorch and compare it against the PyTorch implementation. Autoencoders are unsupervised learning models that learn efficient data representations by encoding the input into a lower-dimensional latent space and then reconstructing the original input from the encoding. Sparse autoencoders introduce sparsity in the latent space, which can improve the interpretability and generalization of the learned representations [47]. The sparsity is typically achieved through regularization techniques like L1 penalty or KL-divergence loss. We train and test the autoencoders on four benchmark datasets: MNIST [35], CIFAR-10 [34], CIFAR-100 [34], and CelebA [39].

The sparse autoencoder architecture consists of an encoder with a sparse linear layer followed by ReLU activation, and a decoder with a dense linear layer followed by sigmoid activation. The models are trained using the mean squared error (MSE) loss and the Adam optimizer with a learning rate of 0.01. We train the models for 10 epochs with a batch size of 64.

Figure 6 shows the speedup achieved by Scorch over PyTorch for sparse autoencoder inference on the different datasets. Scorch achieves speedups of 1.42× to 5.78× over PyTorch, with the speedup being more pronounced on larger datasets like CIFAR-100 and CelebA. This is because the sparse linear layer in the encoder dominates the computation time, and Scorch’s optimized sparse kernels provide acceleration over PyTorch’s sparse linear algebra routines.

Figure 6: Sparse Autoencoder Performance

5.3 Sparse Transformers

Figure 7: Sparse Transformer Performance

We evaluate Scorch on the task of text classification using sparse transformer models, specifically the BigBird architecture [61]. BigBird leverages sparse attention mechanisms, including global, window, and random attention, to scale to long sequences with linear memory complexity. We benchmark the inference performance of BigBird implemented using Scorch versus the standard dense PyTorch implementation on three datasets: IMDB [42], AG News [64], and Yahoo Answers [64].
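The three attention patterns can be illustrated with a token-level sketch that builds the set of allowed (query, key) pairs. This is a simplification: BigBird itself operates on blocks of tokens, and the function name, its parameters, and the choice of the first tokens as global tokens are assumptions made for illustration.

```python
import random
from typing import Set, Tuple


def bigbird_style_mask(
    seq_len: int, num_global: int, window: int, num_random: int, seed: int = 0
) -> Set[Tuple[int, int]]:
    """Return the (query, key) pairs that are allowed to attend."""
    rng = random.Random(seed)
    allowed: Set[Tuple[int, int]] = set()
    for q in range(seq_len):
        # Global attention: the first num_global tokens attend everywhere
        # and are attended to by every token.
        for g in range(min(num_global, seq_len)):
            allowed.add((q, g))
            allowed.add((g, q))
        # Window attention: each token sees neighbors within `window` positions.
        for k in range(max(0, q - window), min(seq_len, q + window + 1)):
            allowed.add((q, k))
        # Random attention: each token sees a few randomly chosen keys.
        for k in rng.sample(range(seq_len), min(num_random, seq_len)):
            allowed.add((q, k))
    return allowed


mask = bigbird_style_mask(seq_len=64, num_global=2, window=1, num_random=2)
print(len(mask), "allowed pairs out of", 64 * 64)
```

Each query contributes at most `2 * num_global + 2 * window + 1 + num_random` pairs, so the number of stored entries grows linearly with the sequence length rather than quadratically.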

Figure 7 shows the inference times of BigBird. Scorch achieves average speedups of 1.24×, 1.14×, and 1.05× in wall-clock time on the three datasets, respectively, compared to PyTorch. Scorch’s ability to generate efficient kernels for sparse attention operations allows it to accelerate inference on real-world text classification tasks with varying sequence lengths and sparsity patterns.

5.4 Standard Kernels

We evaluate the performance of Scorch on several standard sparse kernels: Sparse Matrix-Vector Multiplication (SpMV), Sparse Matrix-Matrix Multiplication (SpMM), Sparse Matrix-Sparse Matrix Multiplication (SpMSpM), and Sampled Dense-Dense Matrix Multiplication (SDDMM). These operations are fundamental building blocks in many sparse deep learning models. We benchmark on matrices from the SuiteSparse Matrix Collection [14].

Figure 8 shows the absolute runtime of Scorch and torch.sparse for the four kernels across a range of matrix sizes and sparsity levels. For SpMV and SpMSpM, Scorch achieves performance similar to PyTorch as these operations have limited opportunities for optimization. However, for SpMM, Scorch exhibits better scaling and outperforms PyTorch for all but the smallest problems ($<10^{2}$ nonzeros). Scorch’s ability to generate efficient kernels becomes more advantageous as the problem size grows. More significant speedups are observed for SDDMM, where Scorch is orders of magnitude faster than PyTorch on bigger problems ($>10^{3}$ nonzeros). PyTorch cannot fuse the SDDMM operation and must perform a dense matrix multiplication followed by an element-wise multiplication with a sparse matrix. In contrast, Scorch is able to generate a single fused kernel that is asymptotically faster.
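The asymptotic difference between the unfused and fused SDDMM strategies can be seen in a small pure-Python sketch. The sparse matrix is represented here as a dictionary from coordinates to values, and both function names are hypothetical; neither function is the code PyTorch runs or the kernel Scorch generates.

```python
def sddmm_unfused(s_coo, a, b):
    """Dense product first, then sample it at the nonzeros of S."""
    product = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
    return {(i, j): v * product[i][j] for (i, j), v in s_coo.items()}


def sddmm_fused(s_coo, a, b):
    """Compute one dot product per stored nonzero of S and nothing else."""
    return {
        (i, j): v * sum(a[i][k] * b[k][j] for k in range(len(b)))
        for (i, j), v in s_coo.items()
    }


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
s = {(0, 1): 2.0, (1, 0): 1.0}  # sparse sampling matrix with two nonzeros

print(sddmm_fused(s, a, b))  # {(0, 1): 44.0, (1, 0): 43.0}
```

For an m×k matrix `a`, a k×n matrix `b`, and a sampling matrix with nnz stored entries, the unfused version performs on the order of m·n·k multiplications, while the fused version performs on the order of nnz·k.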

These results demonstrate the performance benefits of Scorch’s sparse code generation and optimizations, especially for operations like SDDMM where there are more opportunities for optimization. Scorch relieves users of the burden of handwriting and hand-optimizing sparse kernels while offering competitive performance. Appendix E contains additional details on the experimental setup.

Figure 8: Performance on key sparse operations

6 Limitations and Conclusion

Limitations. While Scorch supports a wide range of sparse tensor operations, it does not yet provide complete coverage of all possible operations. In particular, Scorch does not yet support automatic differentiation with sparse tensors, which is a problem outside the scope of this paper but a direction for future work to enable end-to-end training of sparse models. The current implementation of the compiler stack supports lowering to CPU-optimized C++ code. Additional performance may be achieved by lowering to Triton or CUDA code, and GPU code generation is left as future work.

Conclusion. Scorch is a PyTorch-based framework that accelerates sparse deep learning by providing efficient sparse computation capabilities and seamless integration with PyTorch. Scorch enables users to leverage the benefits of sparsity with minimal modifications to their existing code. Experimental evaluation demonstrates Scorch’s effectiveness in accelerating sparse ML workloads across various domains, achieving significant speedups over PyTorch Sparse. With its generality and seamless PyTorch integration, Scorch makes using sparsity in deep learning more accessible and paves the way for more exploration of efficient sparse models.

Acknowledgments and Disclosure of Funding

This work is supported in part by PRISM, one of seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, by the Swedish Research Council (Grant No. 2018-04329), and by Digital Futures. Alexander J. Root is supported by an NSF Graduate Research Fellowship. We thank James Dong, Christophe Gyurgyik, Scott Kovach, Rubens Lacouture, Shiv Sundram, and Rohan Yadav for valuable discussion and feedback on early drafts of the paper.

References

  • Abadi et al. [2016] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016. URL https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.
  • Adams et al. [2019] Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. Learning to optimize halide with tree search and random programs. ACM Trans. Graph., 38(4):121:1–121:12, 2019. doi: 10.1145/3306346.3322967. URL https://doi.org/10.1145/3306346.3322967.
  • Ahrens et al. [2022] Willow Ahrens, Fredrik Kjolstad, and Saman Amarasinghe. Autoscheduling for sparse tensor algebra with an asymptotic cost model. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2022, page 269–285, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392655. doi: 10.1145/3519939.3523442. URL https://doi.org/10.1145/3519939.3523442.
  • Ansel et al. [2024] Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, C. K. Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Shunting Zhang, Michael Suo, Phil Tillet, Xu Zhao, Eikan Wang, Keren Zhou, Richard Zou, Xiaodong Wang, Ajit Mathews, William Wen, Gregory Chanan, Peng Wu, and Soumith Chintala. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’24, page 929–947, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400703850. doi: 10.1145/3620665.3640366. URL https://doi.org/10.1145/3620665.3640366.
  • Baghdadi et al. [2019] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman P. Amarasinghe. Tiramisu: A polyhedral compiler for expressing fast and portable code. In Mahmut Taylan Kandemir, Alexandra Jimborean, and Tipp Moseley, editors, IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2019, Washington, DC, USA, February 16-20, 2019, pages 193–205. IEEE, 2019. doi: 10.1109/CGO.2019.8661197. URL https://doi.org/10.1109/CGO.2019.8661197.
  • Beltagy et al. [2020] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer. CoRR, abs/2004.05150, 2020. URL https://arxiv.longhoe.net/abs/2004.05150.
  • Bik et al. [2022] Aart J. C. Bik, Penporn Koanantakool, Tatiana Shpeisman, Nicolas Vasilache, Bixia Zheng, and Fredrik Kjolstad. Compiler support for sparse tensor computations in MLIR. ACM Trans. Archit. Code Optim., 19(4):50:1–50:25, 2022. doi: 10.1145/3544559. URL https://doi.org/10.1145/3544559.
  • Blalock et al. [2020] Davis W. Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John V. Guttag. What is the State of Neural Network Pruning? In Inderjit S. Dhillon, Dimitris S. Papailiopoulos, and Vivienne Sze, editors, Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020. mlsys.org, 2020. URL https://proceedings.mlsys.org/book/296.pdf.
  • Bradbury et al. [2018] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  • Buluç et al. [2009] Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’09, page 233–244, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605586069. doi: 10.1145/1583991.1584053. URL https://doi.org/10.1145/1583991.1584053.
  • Chen et al. [2018] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: an automated end-to-end optimizing compiler for deep learning. In Andrea C. Arpaci-Dusseau and Geoff Voelker, editors, 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October 8-10, 2018, pages 578–594. USENIX Association, 2018. URL https://www.usenix.org/conference/osdi18/presentation/chen.
  • Child et al. [2019] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019. URL http://arxiv.longhoe.net/abs/1904.10509.
  • Chou et al. [2018] Stephen Chou, Fredrik Kjolstad, and Saman P. Amarasinghe. Format abstraction for sparse tensor algebra compilers. Proc. ACM Program. Lang., 2(OOPSLA):123:1–123:30, 2018. doi: 10.1145/3276493. URL https://doi.org/10.1145/3276493.
  • Davis and Hu [2011] Timothy A. Davis and Yifan Hu. The university of florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, 2011. doi: 10.1145/2049662.2049663. URL https://doi.org/10.1145/2049662.2049663.
  • Elbadawi et al. [2021] Moe Elbadawi, Simon Gaisford, and Abdul W. Basit. Advanced machine-learning techniques in drug discovery. Drug Discovery Today, 26(3):769–777, 2021. ISSN 1359-6446. doi: https://doi.org/10.1016/j.drudis.2020.12.003. URL https://www.sciencedirect.com/science/article/pii/S1359644620305213.
  • Fedus et al. [2022] William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res., 23:120:1–120:39, 2022. URL http://jmlr.org/papers/v23/21-0998.html.
  • Fey and Lenssen [2019] Matthias Fey and Jan Eric Lenssen. Fast Graph Representation Learning with PyTorch Geometric. CoRR, abs/1903.02428, 2019. URL http://arxiv.longhoe.net/abs/1903.02428.
  • Frankle and Carbin [2019] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
  • Gale et al. [2020] Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen. Sparse GPU kernels for deep learning. In Christine Cuicchi, Irene Qualters, and William T. Kramer, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, page 17. IEEE/ACM, 2020. doi: 10.1109/SC41405.2020.00021. URL https://doi.org/10.1109/SC41405.2020.00021.
  • Gale et al. [2023] Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. MegaBlocks: Efficient Sparse Training with Mixture-of-Experts. Proceedings of Machine Learning and Systems, 5, 2023. URL https://doi.org/10.48550/arXiv.2211.15841.
  • Giles et al. [1998] C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. Citeseer: an automatic citation indexing system. In Proceedings of the Third ACM Conference on Digital Libraries, DL ’98, page 89–98, New York, NY, USA, 1998. Association for Computing Machinery. ISBN 0897919653. doi: 10.1145/276675.276685. URL https://doi.org/10.1145/276675.276685.
  • Google [2024] Google. Open xla, 2024. URL https://openxla.org/xla.
  • Hamilton et al. [2017] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 1024–1034, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html.
  • Han et al. [2016] Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.longhoe.net/abs/1510.00149.
  • Hoefler et al. [2021] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res., 22:241:1–241:124, 2021. URL http://jmlr.org/papers/v22/21-0366.html.
  • Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for Machine Learning on Graphs. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/fb60d411a5c5b72b2e7d3527cfc84fd0-Abstract.html.
  • Kanakagiri and Solomonik [2023] Raghavendra Kanakagiri and Edgar Solomonik. Minimum cost loop nests for contraction of a sparse tensor with a tensor network. arXiv preprint arXiv:2307.05740, 2023. URL https://doi.org/10.48550/arXiv.2307.05740.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.longhoe.net/abs/1412.6980.
  • Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=SJU4ayYgl.
  • Kitaev et al. [2020] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB.
  • Kjolstad et al. [2017] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman P. Amarasinghe. The tensor algebra compiler. Proc. ACM Program. Lang., 1(OOPSLA):77:1–77:29, 2017. doi: 10.1145/3133901. URL https://doi.org/10.1145/3133901.
  • Kjolstad et al. [2019] Fredrik Kjolstad, Willow Ahrens, Shoaib Kamil, and Saman P. Amarasinghe. Tensor algebra compilation with workspaces. In Mahmut Taylan Kandemir, Alexandra Jimborean, and Tipp Moseley, editors, IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2019, Washington, DC, USA, February 16-20, 2019, pages 180–192. IEEE, 2019. doi: 10.1109/CGO.2019.8661185. URL https://doi.org/10.1109/CGO.2019.8661185.
  • Kossaifi et al. [2019] Jean Kossaifi, Yannis Panagakis, Anima Anandkumar, and Maja Pantic. Tensorly: Tensor learning in python. Journal of Machine Learning Research, 20(26):1–6, 2019. URL http://jmlr.org/papers/v20/18-277.html.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791. URL https://doi.org/10.1109/5.726791.
  • Lepikhin et al. [2021] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb.
  • Li et al. [2018] Jiajia Li, Jimeng Sun, and Richard Vuduc. Hicoo: Hierarchical storage of sparse tensors. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 238–252, 2018. doi: 10.1109/SC.2018.00022.
  • Liu and Vinter [2015] Weifeng Liu and Brian Vinter. Csr5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, page 339–350, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450335591. doi: 10.1145/2751205.2751209. URL https://doi.org/10.1145/2751205.2751209.
  • Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 3730–3738. IEEE Computer Society, 2015. doi: 10.1109/ICCV.2015.425. URL https://doi.org/10.1109/ICCV.2015.425.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Luo et al. [2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5068–5076. IEEE Computer Society, 2017. doi: 10.1109/ICCV.2017.541. URL https://doi.org/10.1109/ICCV.2017.541.
  • Maas et al. [2011] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.
  • McCallum et al. [2000] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Inf. Retr., 3(2):127–163, 2000. doi: 10.1023/A:1009953814988. URL https://doi.org/10.1023/A:1009953814988.
  • Mocanu et al. [2018] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383, 2018. ISSN 2041-1723. doi: 10.1038/s41467-018-04316-3. URL https://www.nature.com/articles/s41467-018-04316-3. Publisher: Nature Publishing Group.
  • Mostafa and Wang [2019] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 4646–4655. PMLR, 2019. URL http://proceedings.mlr.press/v97/mostafa19a.html.
  • Naumov et al. [2019] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. Deep learning recommendation model for personalization and recommendation systems. CoRR, abs/1906.00091, 2019. URL https://arxiv.longhoe.net/abs/1906.00091.
  • Ng et al. [2011] Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011. URL https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
  • Ragan-Kelley et al. [2013] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, page 519–530, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450320146. doi: 10.1145/2491956.2462176. URL https://doi.org/10.1145/2491956.2462176.
  • Rotem et al. [2018] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Nadathur Satish, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. Glow: Graph lowering compiler techniques for neural networks. CoRR, abs/1805.00907, 2018. URL http://arxiv.longhoe.net/abs/1805.00907.
  • Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Mag., 29(3):93–106, 2008. doi: 10.1609/AIMAG.V29I3.2157. URL https://doi.org/10.1609/aimag.v29i3.2157.
  • Senanayake et al. [2020] Ryan Senanayake, Changwan Hong, Ziheng Wang, Amalee Wilson, Stephen Chou, Shoaib Kamil, Saman Amarasinghe, and Fredrik Kjolstad. A sparse iteration space transformation framework for sparse tensor algebra. Proc. ACM Program. Lang., 4(OOPSLA), nov 2020. doi: 10.1145/3428226. URL https://doi.org/10.1145/3428226.
  • Shazeer et al. [2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=B1ckMDqlg.
  • Sze et al. [2017] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE, 105(12):2295–2329, 2017. doi: 10.1109/JPROC.2017.2761740. URL https://doi.org/10.1109/JPROC.2017.2761740.
  • Vasilache et al. [2018] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR, abs/1802.04730, 2018. URL http://arxiv.longhoe.net/abs/1802.04730.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
  • Wang et al. [2019] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, et al. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019. URL http://arxiv.longhoe.net/abs/1909.01315.
  • Wen et al. [2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2074–2082, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/41bfd20a38bb1b0bec75acf0845530a7-Abstract.html.
  • Wu et al. [2021] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE Trans. Neural Networks Learn. Syst., 32(1):4–24, 2021. doi: 10.1109/TNNLS.2020.2978386. URL https://doi.org/10.1109/TNNLS.2020.2978386.
  • Ye et al. [2023] Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, and Luis Ceze. Sparsetir: Composable abstractions for sparse compilation in deep learning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS 2023, page 660–678, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450399180. doi: 10.1145/3582016.3582047. URL https://doi.org/10.1145/3582016.3582047.
  • Zaheer et al. [2020] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/c8512d142a2d849725f31a9a7a361ab9-Abstract.html.
  • Zhang et al. [2024] Genghan Zhang, Olivia Hsu, and Fredrik Kjolstad. Compilation of modular and general sparse workspaces. arXiv preprint arXiv:2404.04541, 2024. URL https://arxiv.longhoe.net/abs/2404.04541.
  • Zhang et al. [2019] Wen Zhang, Kanghong Jing, Feng Huang, Yanlin Chen, Bo-Sheng Li, Jinghao Li, and Jing Gong. Sflln: A sparse feature learning ensemble method with linear neighborhood regularization for predicting drug-drug interactions. Inf. Sci., 497:189–201, 2019. URL https://api.semanticscholar.org/CorpusID:182751868.
  • Zhang et al. [2015] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657, 2015. URL https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html.
  • Zheng et al. [2020a] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4-6, 2020, pages 863–879. USENIX Association, 2020a. URL https://www.usenix.org/conference/osdi20/presentation/zheng.
  • Zheng et al. [2020b] Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System. In James R. Larus, Luis Ceze, and Karin Strauss, editors, ASPLOS ’20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020, pages 859–873. ACM, 2020b. doi: 10.1145/3373376.3378508. URL https://doi.org/10.1145/3373376.3378508.

Appendix A Compilation

Scorch is designed as a layer on top of PyTorch that adds sparse functionality to existing operations. Scorch intercepts calls to PyTorch operations and dispatches them to either the traditional PyTorch infrastructure or to the Scorch compiler.

Scorch dispatches dense computations to the existing PyTorch computing infrastructure. This includes both tensor operations that operate on fully dense tensors and the fully dense sub-computations of blocked-sparse operations. By reusing the existing dense functionality, we make it easier for users to transition to Scorch, as dense operations continue to execute in the same execution environment and with the same performance as before. Moreover, we reduce the complexity of the Scorch compiler, as it does not need to be tuned to match existing dense computing libraries on completely dense operations.

For kernels with at least one sparse tensor, Scorch first performs format inference to decide the format of the output tensor. Then, Scorch generates a fused loop-level intermediate representation (IR) along the lines of TACO’s Concrete Index Notation IR [32]. The code generation algorithm for the loop-level IR also optimizes loop ordering, as described in Section 4.1, inserting intermediate temporary tensors as necessary. Unlike in dense tensor computations, the loop order of a sparse tensor computation affects its asymptotic complexity [3]. For example, changing a sparse matrix multiplication from an inner-product algorithm to a linear-combination-of-rows algorithm can lead to an arbitrarily large speedup. Loop ordering optimization is therefore critical, and one of the contributions of Scorch is an algorithm that quickly finds a loop order that avoids poor asymptotic complexity.
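To make the effect of loop order concrete, the following sketch counts inner-loop iterations for two loop orders of a sparse-sparse matrix multiplication. It is an illustration only, not code generated by Scorch: the dictionary-of-rows layout and both function names are assumptions chosen for brevity.

```python
# Illustrative only: a sparse matrix is a dict {row: {col: value}}.

def spgemm_inner_product(A, B_cols, n_rows, n_cols):
    """i-j-k order: one sparse dot product per output coordinate (i, j)."""
    C, steps = {}, 0
    for i in range(n_rows):
        for j in range(n_cols):
            steps += 1  # every (i, j) pair is visited, even when the result is zero
            row, col = A.get(i, {}), B_cols.get(j, {})
            acc = sum(v * col[k] for k, v in row.items() if k in col)
            if acc != 0:
                C.setdefault(i, {})[j] = acc
    return C, steps


def spgemm_row_combination(A, B):
    """i-k-j order: each output row is a linear combination of rows of B."""
    C, steps = {}, 0
    for i, row in A.items():
        for k, a in row.items():
            for j, b in B.get(k, {}).items():
                steps += 1  # work is proportional to the number of nonzero products
                out_row = C.setdefault(i, {})
                out_row[j] = out_row.get(j, 0) + a * b
    return C, steps


n = 200
A = {i: {i: 2.0} for i in range(n)}  # diagonal matrix
B = {i: {i: 3.0} for i in range(n)}  # diagonal, so its column layout is identical
C_inner, steps_inner = spgemm_inner_product(A, B, n, n)
C_rows, steps_rows = spgemm_row_combination(A, B)
print(steps_inner, steps_rows)
```

On these diagonal inputs both loop orders produce the same result, but the inner-product order visits all n² output coordinates while the row-combination order visits only the n nonzero products, and the gap grows without bound as n increases.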

Next, Scorch tiles any dense loops inside sparse computations to improve cache utilization. Such loops appear because a sparse tensor computation often also contains dense tensors (e.g., SpMM and SDDMM) and because a sparse tensor may have some dimensions that are stored in a dense array (e.g., the first dimension of a CSR matrix is dense). The tiling algorithm for dense loops in Scorch (Section 4.2) is designed to avoid introducing costly random searches in sparse data structures. In particular, Scorch does not tile sparse loops: although doing so can sometimes improve performance, it requires searching the sparse data structures, and it is difficult for a compiler to determine whether the cache benefits outweigh the search cost. We therefore designed Scorch to leave sparse loops untiled so that the compiler produces predictably good performance.
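As a rough illustration of tiling only the dense loop, the sketch below computes SpMM with a CSR operand and tiles the dense column loop while the sparse loop is walked in storage order. This is a hand-written Python analogue, not Scorch's generated C++; the function name, argument names, and the tile size are hypothetical.

```python
def spmm_csr_tiled(pos, crd, vals, B, n_rows, n_cols, tile=2):
    """Compute C = A @ B with A in CSR (pos, crd, vals) and B a dense list of rows."""
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for j0 in range(0, n_cols, tile):            # outer tile loop over the dense index j
        j1 = min(j0 + tile, n_cols)
        for i in range(n_rows):                  # dense row dimension of A
            for p in range(pos[i], pos[i + 1]):  # sparse loop: walked in order, never searched
                k, a = crd[p], vals[p]
                for j in range(j0, j1):          # inner tile loop over j
                    C[i][j] += a * B[k][j]
    return C


# A = [[1, 0, 2], [0, 0, 3]] in CSR, multiplied by a dense 3x5 matrix.
pos, crd, vals = [0, 2, 3], [0, 2, 2], [1.0, 2.0, 3.0]
B = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
C = spmm_csr_tiled(pos, crd, vals, B, n_rows=2, n_cols=5)
```

Because the tile loop only restricts the range of the dense index, the compressed dimension of A is still traversed sequentially inside each tile and no lookup into the sparse structure is introduced.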

Finally, the tiled loops are lowered to C++ code, JIT compiled and linked via PyTorch’s custom C++ extensions loader. A high-level description of this process is provided in Algorithm 2.

Algorithm 2 Scorch Kernel Compilation
Input: Tensor expression $E$ with input tensors $\mathcal{T}$ and untyped output tensor $T_{out}$.
if all $\mathcal{T}$ are dense then
    return torch.compile($E$) ▷ Dynamic dispatch for dense kernels.
$T_{out}$.format ← InferFormat($E$, $\mathcal{T}$) ▷ Appendix D
$\mathcal{L}$ ← CompileToLoops($E$, $\mathcal{T}$, $T_{out}$) ▷ Algorithm 1
$\mathcal{L}_{\mathrm{tiled}}$ ← TileOperations($E$, $\mathcal{T}$, $\mathcal{L}$) ▷ Algorithm 3
return LowerToC($\mathcal{L}_{\mathrm{tiled}}$)

The current implementation of Scorch compiles a single user-provided expression at a time and does not fuse across expressions. We leave extending our compiler infrastructure to fuse sparse operations across expressions to future work. Such work could, for example, use a tracing facility such as the one introduced in PyTorch 2.0 [4] to capture the computational graph that serves as the basis for global optimization.

Appendix B Implementation Details

Scorch is implemented as a Python library, with its compiler written in Python as well. The Scorch compiler generates C++ code and pybind bindings, which are then compiled and loaded using PyTorch’s custom C++ extension machinery.

B.1 Sparse Tensor Representation

Scorch represents sparse tensors using a unified abstraction that can represent different underlying physical sparse storage formats like COO, CSR, and DCSR. This allows specifying computations independently of storage details, leaving the task of generating low-level code that interacts with specific data structures to the compiler.

The Scorch implementation has two primary classes that manage sparse tensors. A tensor class contains general information about the tensor that is independent of specific storage, such as its name, shape, and component type. The tensor class also has a reference to an object of a storage class. A tensor storage contains the physical storage of a tensor, including a storage descriptor (i.e., a format), arrays storing sparse indices that identify the stored (typically non-zero) values of a tensor, such as coordinate lists and CSR row pointers, and an array containing the stored values of the tensor.
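The split between storage-independent metadata and physical storage can be pictured with the following minimal sketch. The class and field names here are hypothetical and are not Scorch's actual classes; the sketch only mirrors the two roles described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TensorStorage:
    """Physical storage: a format descriptor, index arrays, and a values array."""
    level_formats: Tuple[str, ...]   # e.g. ("dense", "compressed") for CSR
    indices: Dict[str, List[int]]    # e.g. CSR row pointers and column coordinates
    values: List[float]


@dataclass
class Tensor:
    """Storage-independent information plus a reference to a storage object."""
    name: str
    shape: Tuple[int, ...]
    dtype: str
    storage: TensorStorage

    def nnz(self) -> int:
        """Number of explicitly stored values."""
        return len(self.storage.values)


# The 2x3 matrix [[1, 0, 2], [0, 0, 3]] stored in CSR.
A = Tensor(
    name="A",
    shape=(2, 3),
    dtype="float32",
    storage=TensorStorage(
        level_formats=("dense", "compressed"),
        indices={"pos": [0, 2, 3], "crd": [0, 2, 2]},
        values=[1.0, 2.0, 3.0],
    ),
)
```

Swapping the `storage` object for one with a different descriptor changes how the values are laid out without touching the name, shape, or component type.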

B.2 Sparse Tensor Operations

Scorch supports arbitrary tensor contractions and element-wise operations on sparse tensors. This includes, but is not limited to:

  • sparse matrix-vector multiplication,

  • sparse matrix-matrix multiplication,

  • tensor contractions,

  • element-wise sparse addition, subtraction, and multiplication, and

  • fused operations.

At the core, Scorch can compile generalized einsum contraction operations. The data structures that store each tensor in these operations can be determined separately. The code generator builds on prior work on sparse tensor algebra code generation from the TACO line of work [31, 32]. Such code generators generate code that co-iterates over any number of data structures stored in different formats. Thus, they support generating fused code that operates on tensors stored in disparate formats.
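Co-iteration can be illustrated on two sparse vectors. The sketch below is a simplified Python analogue of the merge loops such code generators emit, assuming each vector is a list of (coordinate, value) pairs sorted by coordinate; the function name and this representation are assumptions made for illustration.

```python
import operator


def coiterate(a, b, op, union):
    """Merge two sparse vectors given as sorted (coordinate, value) lists.

    union=False visits coordinates present in both operands (multiplication);
    union=True visits coordinates present in either operand (addition).
    """
    out, pa, pb = [], 0, 0
    while pa < len(a) and pb < len(b):
        (ia, va), (ib, vb) = a[pa], b[pb]
        if ia == ib:
            out.append((ia, op(va, vb)))
            pa += 1
            pb += 1
        elif ia < ib:
            if union:
                out.append((ia, op(va, 0.0)))
            pa += 1
        else:
            if union:
                out.append((ib, op(0.0, vb)))
            pb += 1
    if union:  # one operand is exhausted; the rest of the other passes through
        out.extend((i, op(v, 0.0)) for i, v in a[pa:])
        out.extend((i, op(0.0, v)) for i, v in b[pb:])
    return out


x = [(0, 1.0), (2, 2.0), (5, 3.0)]
y = [(2, 10.0), (3, 20.0), (5, 30.0)]
product = coiterate(x, y, operator.mul, union=False)  # intersection of coordinates
total = coiterate(x, y, operator.add, union=True)     # union of coordinates
```

Multiplication only needs the coordinates that both operands store, while addition needs every coordinate stored by either, which is why the two operations lead to different loop structures.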

B.3 Sparse Temporary Workspaces

One key feature in Scorch that enables general sparse computations is the support for sparse temporary tensors, often called sparse workspaces [62]. Such temporaries are useful in kernels where the loop order causes scattering into the result tensor, but where the result tensor data structure does not support random inserts (e.g., a linear-combination-of-rows matrix multiplication with a CSR result). The sparse temporary tensors in Scorch allow computations to scatter and gather results in arbitrary order into a sparse output tensor of any format, while also offering significant memory efficiency by storing temporary intermediate tensors in a compressed format, avoiding the need to materialize intermediate zero values as in dense tensors. In Scorch, a red-black tree is used to store intermediate non-zero values. We chose red-black trees because they are simple and inherently provide the features we need, namely the ability to insert coordinate-value pairs and the ability to iterate over them in sorted order. Workspaces improve performance by avoiding unnecessary format materialization and compression.
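The role of a workspace can be sketched with a row-wise SpGEMM that writes a CSR result. Python has no built-in red-black tree, so this sketch substitutes a dictionary that is sorted when each row is flushed; that offers the same insert and ordered-iteration interface but is not the data structure Scorch uses. The function and variable names are hypothetical.

```python
def spgemm_csr(a_pos, a_crd, a_vals, b_pos, b_crd, b_vals, n_rows):
    """Compute C = A @ B with A, B, and C all in CSR, using a per-row workspace."""
    c_pos, c_crd, c_vals = [0], [], []
    for i in range(n_rows):
        workspace = {}  # stand-in for the ordered workspace: coordinate -> partial sum
        for pa in range(a_pos[i], a_pos[i + 1]):
            k, a = a_crd[pa], a_vals[pa]
            for pb in range(b_pos[k], b_pos[k + 1]):
                j = b_crd[pb]  # column coordinates arrive in scattered order
                workspace[j] = workspace.get(j, 0.0) + a * b_vals[pb]
        for j in sorted(workspace):  # ordered traversal lets CSR be appended to
            c_crd.append(j)
            c_vals.append(workspace[j])
        c_pos.append(len(c_crd))
    return c_pos, c_crd, c_vals


# A = [[1, 0, 2], [0, 0, 3]] and B = [[0, 4, 0], [5, 0, 0], [6, 0, 7]].
a_pos, a_crd, a_vals = [0, 2, 3], [0, 2, 2], [1.0, 2.0, 3.0]
b_pos, b_crd, b_vals = [0, 1, 2, 4], [1, 0, 0, 2], [4.0, 5.0, 6.0, 7.0]
c_pos, c_crd, c_vals = spgemm_csr(a_pos, a_crd, a_vals, b_pos, b_crd, b_vals, n_rows=2)
```

In row 0 the products land at column 1 first and then at columns 0 and 2, so writing them straight into the CSR arrays would break the sorted order; the workspace absorbs the scatter and releases the entries in order.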

Appendix C Auto-scheduling Algorithms

C.1 Tiling

The tiling algorithm analyzes the tensor expression and determines which loops to tile based on several key observations, as described in Section 4.2. The pseudocode for the tiling algorithm is provided in Algorithm 3.

Algorithm 3 Tiling Sparse Tensor Operations
Input: A tensor expression $E$ containing input tensors $\mathcal{T}$, and a loop structure $\mathcal{L}$.
Output: Tiled loop structure $\mathcal{L}_{\mathrm{tiled}}$ for efficient computation.
$\mathcal{W} \leftarrow \emptyset$ ▷ Initialize the working set of index variables
$\mathcal{S} \leftarrow$ GetIndexVariables($E$) ▷ Get the set of all index variables in the expression
for each tensor access $T[\mathcal{I}]$ in the expression $E$ do
    if $\mathcal{I} \subset \mathcal{S}$ then ▷ Check if indices of $T$ are a strict subset of $\mathcal{S}$
        $\mathcal{W} \leftarrow \mathcal{W} \cup \mathcal{I}$ ▷ Add indices to the working set
for all $i \in \mathcal{W}$ do ▷ Remove sparse index variables
    if there exists a tensor $T \in \mathcal{T}$ where the dimension of $T$ corresponding to $i$ is sparse then
        $\mathcal{W} \leftarrow \mathcal{W} \setminus \{i\}$
for all $i \in \mathcal{W}$ do ▷ Remove index variables that are parents of sparse dimensions
    if there exists a sparse index variable $j$ such that $i$ is a parent of $j$ in the loop nest $\mathcal{L}$ then
        $\mathcal{W} \leftarrow \mathcal{W} \setminus \{i\}$
for all $i \in \mathcal{W}$ do ▷ Tiling
    $(i_{\text{outer}}, i_{\text{inner}}) \leftarrow$ TileIndex($i$) ▷ Split loop $i$ into outer and inner loops
    $\mathcal{L} \leftarrow$ ReorderLoops($\mathcal{L}$, $i_{\text{outer}}$, 0) ▷ Reorder $i_{\text{outer}}$ to outermost in the loop nest
return $\mathcal{L}_{\mathrm{tiled}} \leftarrow \mathcal{L}$
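The selection logic of Algorithm 3 can be traced on SpMM, $C_{ij} = \sum_k A_{ik} B_{kj}$, with a small Python sketch. The encoding is an assumption made for illustration: tensor accesses are tuples of index names, loops are a list ordered from outermost to innermost, and "parent" is read as "any enclosing loop", which is one possible interpretation of the pseudocode.

```python
def select_tiled_indices(accesses, sparse_indices, loop_order):
    """Return the index variables that Algorithm 3 would tile."""
    all_vars = set().union(*accesses.values())
    working = set()
    for idx in accesses.values():
        if set(idx) < all_vars:       # strict subset of all index variables
            working |= set(idx)
    working -= set(sparse_indices)    # remove sparse index variables
    for i in list(working):           # remove loops that enclose a sparse loop
        depth = loop_order.index(i)
        if any(j in sparse_indices for j in loop_order[depth + 1:]):
            working.discard(i)
    return working


def tile_loops(loop_order, to_tile):
    """Split each selected loop and move its outer half to the front of the nest."""
    loops = list(loop_order)
    for i in [v for v in loop_order if v in to_tile]:
        loops[loops.index(i)] = f"{i}_inner"  # the inner loop stays in place
        loops.insert(0, f"{i}_outer")         # the outer loop becomes outermost
    return loops


# SpMM with A stored in CSR: index k runs over A's compressed dimension.
accesses = {"C": ("i", "j"), "A": ("i", "k"), "B": ("k", "j")}
loop_order = ["i", "k", "j"]
selected = select_tiled_indices(accesses, sparse_indices={"k"}, loop_order=loop_order)
tiled = tile_loops(loop_order, selected)
```

Here only `j` survives: `k` is sparse and `i` encloses the sparse loop, so the tiled nest becomes `j_outer`, `i`, `k`, `j_inner`.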

Appendix D Tensor Formats and Format Inference

Scorch uses the sparse tensor format description language proposed by Kjolstad et al. [31] and Chou et al. [13]. When users create a tensor, it is dense by default, as in vanilla PyTorch. Scorch extends the PyTorch tensor constructor to allow the user to specify that a tensor should be sparse. Sparse tensors are stored in the coordinate format by default, but users can also specify a different tensor format. The Scorch tensor format description language is flexible and supports many different types of data structures, such as compressed vectors, compressed matrices, compressed tensors, blocked-sparse matrices and tensors, dense tensors, and coordinate matrices and tensors.

Although users must specify whether a tensor defined using a constructor is sparse, Scorch automatically infers the (dense or sparse) data structures of tensors that result from some operations. So if the user multiplies two sparse matrices (SpGEMM), Scorch will infer that the result should be stored in a sparse matrix, as well as the specific sparse data structure. And if the user multiplies a sparse matrix with a dense matrix (SpMM), Scorch automatically infers that the result should be stored in a dense array.

D.1 Tensor Formats

A tensor format describes the data structures that store the non-zero elements of a tensor. Following Kjolstad et al. [31], we allow a separate description of the data structure of each dimension of a tensor. For example, a compressed sparse row (CSR) matrix stores the set of rows in a dense data structure and each row in a compressed data structure, while a doubly-compressed sparse row (DCSR) matrix stores both dimensions in a compressed data structure. We support three types of per-dimension data structures that can be composed in any way to store a tensor of any dimensionality:

  • dense: All elements, including zeros, are explicitly stored.

  • compressed: Non-zeros are stored using compressed index arrays storing their coordinates.

  • coordinate: Non-zeros are stored as a list of coordinate-value pairs.

The full format of a tensor is determined by both the ordering of dimensions and the data structure for each dimension. For example, a matrix (2D tensor) can be stored in many ways: a row-major dense matrix (dense rows, dense columns), a row-major compressed sparse row (CSR) matrix (dense rows, compressed columns), a column-major compressed sparse column (CSC) matrix (dense columns, compressed rows), a doubly compressed sparse row (DCSR) matrix (compressed rows, compressed columns), and a row-major coordinate list (COO) matrix (coordinate rows, coordinate columns).

Our way to specify coordinate tensors simplifies their specification in prior work by Chou et al. [13]. They introduced a singleton data structure that needed to be composed with a duplicate version of a compressed data structure, so a coordinate matrix became (unordered compressed rows, singleton columns). Scorch, on the other hand, represents coordinate tensors more uniformly using a coordinate data structure, so a coordinate matrix becomes (coordinate rows, coordinate columns).
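To show how the per-dimension choices change what is physically stored, the sketch below builds the index arrays for one small matrix in COO, CSR, and DCSR. The array names (`pos`, `crd`, and so on) and helper functions are assumptions made for illustration and do not reflect Scorch's internal layout.

```python
def to_coo(M):
    """(coordinate rows, coordinate columns): one coordinate pair per non-zero."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(M):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return {"rows": rows, "cols": cols, "vals": vals}


def to_csr(M):
    """(dense rows, compressed columns): every row gets a pos entry, even if empty."""
    pos, crd, vals = [0], [], []
    for row in M:
        for j, v in enumerate(row):
            if v != 0:
                crd.append(j)
                vals.append(v)
        pos.append(len(crd))
    return {"pos": pos, "crd": crd, "vals": vals}


def to_dcsr(M):
    """(compressed rows, compressed columns): empty rows are not stored at all."""
    row_crd, col_pos, col_crd, vals = [], [0], [], []
    for i, row in enumerate(M):
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0]
        if not nonzeros:
            continue
        row_crd.append(i)
        for j, v in nonzeros:
            col_crd.append(j)
            vals.append(v)
        col_pos.append(len(col_crd))
    return {"row_pos": [0, len(row_crd)], "row_crd": row_crd,
            "col_pos": col_pos, "col_crd": col_crd, "vals": vals}


M = [[0, 0, 0], [5, 0, 6], [0, 0, 0], [0, 7, 0]]  # rows 0 and 2 are empty
coo, csr, dcsr = to_coo(M), to_csr(M), to_dcsr(M)
```

CSR spends one `pos` entry on each of the four rows, whereas DCSR records only the two non-empty rows, which matters when most rows are empty.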

D.2 Format Inference Algorithm

Given a tensor computation expressed as an einsum operation or as a general tensor algebra expression, we provide an algorithm to infer the format of the output tensor based on the formats of the input tensors. Our inference algorithm determines the storage format for each level of the output tensor independently and thus naturally extends to tensors of any dimensionality.

The inference algorithm traverses the tensor expression and predicts the format of the result by separately predicting the data structure of each of its dimensions. This is done using basic algebraic reasoning: the element-wise multiplication of a sparse vector by any other vector type should be sparse, since the result will be at least as sparse as either operand. Similarly, the element-wise addition of a dense vector with any other vector type should be dense, since the result will be at least as dense as either operand. Furthermore, when two sparse vectors are added, Scorch makes the result sparse. Although the result is the union of the operands, and thus denser than either operand, we elect to make Scorch conservative. The reason for this conservativeness is that instantiating a dense tensor when the result is very sparse can asymptotically increase memory and compute usage, while instantiating a sparse tensor when the result is dense only increases memory and compute usage by a constant factor. Finally, when two dense vectors are multiplied, Scorch makes the result dense. We provide the inference algorithm in Algorithm 4, where compressed and coordinate formats are both considered Sparse.

Algorithm 4 Tensor Format Inference
procedure Infer($E$, $lvl$)
    match $E$
    | $A + B$ → match Infer($A$, $lvl$), Infer($B$, $lvl$) with
        | Dense, _ → return Dense
        | _, Dense → return Dense
        | _, _ → return Sparse
    | $A * B$ → match Infer($A$, $lvl$), Infer($B$, $lvl$) with
        | Sparse, _ → return Sparse
        | _, Sparse → return Sparse
        | _, _ → return Dense
    | $T$ → TensorFormat($T$, $lvl$)
    end

D.3 Example

Consider the following tensor computation:

$$D_{ij} = \sum_{k} A_{ik} B_{kj} + C_{ij} \qquad (2)$$

Suppose the input tensors have the following formats:

  • $A$: CSR (dense $i$, compressed $k$)

  • $B$: CSR (dense $k$, compressed $j$)

  • $C$: DCSR (compressed $i$, compressed $j$)

To infer the format of the output tensor $D$, we apply the format inference algorithm:

  1. First, we consider the multiplication sub-expression $T_{ij} = \sum_{k} A_{ik} B_{kj}$:

    • For $T[i]$: dense $A[i]$ contributes through multiplication, so $T[i]$ should be dense.

    • For $T[j]$: compressed $B[j]$ contributes through multiplication, so $T[j]$ should be compressed.

  2. Next, we consider the addition of $T$ and $C$ to form the final output $D$:

    • For $D[i]$: dense $T[i]$ contributes through addition, while compressed $C[i]$ contributes through addition. Since $T[i]$ is dense, $D[i]$ should be dense.

    • For $D[j]$: compressed $T[j]$ contributes through addition, while compressed $C[j]$ contributes through addition. Therefore, $D[j]$ should be compressed.

    Therefore, the final output tensor $D$ should be stored in CSR format.

This example demonstrates how the format inference algorithm handles expressions with a combination of multiplications and additions by recursively applying the inference rules and combining the results.
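The walk-through can be reproduced with a short Python rendering of the inference rules. The encoding is an assumption made for illustration: expressions are nested tuples, each tensor's format is a dictionary from index variable to level type, and an operand that does not carry the queried index is skipped, which matches how the example treats $B$ when inferring $T[i]$.

```python
DENSE, SPARSE = "Dense", "Sparse"


def infer(expr, lvl, formats):
    """Infer whether the output dimension indexed by `lvl` is Dense or Sparse."""
    if isinstance(expr, str):        # leaf: look up the tensor's level format
        kind = formats[expr].get(lvl)
        if kind is None:
            return None              # this tensor is not indexed by lvl
        return DENSE if kind == "dense" else SPARSE
    op, lhs, rhs = expr
    a, b = infer(lhs, lvl, formats), infer(rhs, lvl, formats)
    if a is None or b is None:       # only one operand carries this index
        return a if b is None else b
    if op == "+":
        return DENSE if DENSE in (a, b) else SPARSE
    if op == "*":
        return SPARSE if SPARSE in (a, b) else DENSE
    raise ValueError(f"unsupported operator: {op}")


formats = {
    "A": {"i": "dense", "k": "compressed"},       # CSR
    "B": {"k": "dense", "j": "compressed"},       # CSR
    "C": {"i": "compressed", "j": "compressed"},  # DCSR
}
D = ("+", ("*", "A", "B"), "C")
d_format = (infer(D, "i", formats), infer(D, "j", formats))
```

The result is a dense row dimension and a sparse column dimension, which corresponds to the CSR conclusion reached above.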

D.4 Discussion

The tensor format inference algorithm is implemented in the Scorch framework as part of the compiler pipeline. When a tensor computation is encountered, Scorch analyzes the formats of the input tensors and applies the format inference algorithm to determine an appropriate format for the output tensor(s).

The inferred formats are then used to guide the code generation process, ensuring that the appropriate sparse or dense kernels are generated. This automatic format inference capability lets Scorch optimize tensor computations based on the data structures of the input data, resulting in improved memory efficiency and computational performance.

By integrating format inference into the compiler workflow, Scorch enables the seamless mixing of sparse and dense tensors in deep learning models, without requiring manual specification of the result of each and every compute operation. This simplifies the development process and lets users focus on the high-level logic of their models while leveraging the benefits of sparse computations where appropriate.

One limitation of the current algorithm is that it does not consider potential trade-offs between storage efficiency and computational efficiency. In some cases, it may be beneficial to use a denser format for the output tensor to enable more efficient computation, even if it requires more storage. Extending the format inference algorithm to take into account these trade-offs is an interesting direction for future work.

Overall, format inference is a key component of Scorch that enables efficient and transparent sparse tensor computation. By automatically inferring an appropriate output format, Scorch simplifies the user experience and supports high performance across a wide range of sparse tensor operations and sparsity patterns.

Appendix E Experimental Details

All experiments were performed on an Apple M1 Ultra CPU (3.2 GHz, 20 cores) with 64 GB of memory.

E.1 Standard Kernels Benchmark Details

We run each benchmark 10 times and report the average runtimes in Figure 8. For SpMV, we perform a matrix-vector multiplication of each SuiteSparse matrix with a randomly generated dense vector. For SpMM, we perform the matrix multiplication of each SuiteSparse matrix with a randomly generated dense matrix. For SpMSpM, we truncate any non-square SuiteSparse matrices to be square and matrix multiply them with their transpose. For SDDMM, we perform the element-wise multiplication of each SuiteSparse matrix with the matrix multiplication of two randomly generated dense matrices.
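For reference, two of the benchmarked kernels can be written out as plain loops over a CSR matrix. These are unoptimized definitions intended only to pin down what each kernel computes; the function names and argument layout are hypothetical and this is not the benchmark code.

```python
def spmv_csr(pos, crd, vals, x):
    """SpMV: y[i] = sum over k of S[i, k] * x[k], with S stored in CSR."""
    return [sum(vals[p] * x[crd[p]] for p in range(pos[i], pos[i + 1]))
            for i in range(len(pos) - 1)]


def sddmm_csr(pos, crd, vals, A, B):
    """SDDMM: out[i, j] = S[i, j] * (A @ B)[i, j], only at stored positions of S."""
    out = []
    for i in range(len(pos) - 1):
        for p in range(pos[i], pos[i + 1]):
            j = crd[p]
            dot = sum(A[i][k] * B[k][j] for k in range(len(B)))
            out.append(vals[p] * dot)
    return out  # shares the pos and crd arrays of S


# S = [[1, 0, 2], [0, 0, 3]] in CSR.
pos, crd, vals = [0, 2, 3], [0, 2, 2], [1.0, 2.0, 3.0]
y = spmv_csr(pos, crd, vals, x=[1.0, 2.0, 3.0])
A = [[1.0, 2.0], [3.0, 4.0]]            # dense 2x2
B = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # dense 2x3
sampled = sddmm_csr(pos, crd, vals, A, B)
```

SDDMM never forms the full dense product: it evaluates one dot product per stored entry of the sparse matrix, so the output keeps the sparsity pattern of S.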

Matrix Formats and Characteristics.

For the SpMV, SpMM, and SDDMM benchmarks, we use sparse matrices in the Compressed Sparse Row (CSR) format, which is widely supported across sparse libraries. However, for SpMSpM, PyTorch does not support matrix multiplication between two CSR matrices on non-Intel machines due to the absence of the Intel Math Kernel Library (MKL). The specific error message is: “addmm: computation on CPU is not implemented for SparseCsr + SparseCsr @ SparseCsr without MKL.”

To work around this limitation and still evaluate SpMSpM performance, we use the COO (Coordinate) format for SpMSpM. Additionally, since not all matrices in the SuiteSparse collection are square, we truncate non-square matrices to be square and multiply them with their transpose to obtain a valid SpMSpM operation.

PyTorch Performance on Small SpMM Problems.

We observe that PyTorch Sparse outperforms Scorch on SpMM for small problems with fewer than $10^{2}$ nonzeros. This is because PyTorch’s SpMM implementation is optimized for small matrices. Their gather-scatter approach is more efficient for small inputs but does not scale well to larger matrices. Scorch’s generated kernels have some overhead that dominates for small problems, but they scale much better to larger matrices, resulting in the speedups seen in Figure 8.

E.2 Graph Neural Networks

We evaluate the inference performance of Graph Convolutional Networks (GCNs) [29] on four node classification datasets: Cora [43], Citeseer [21], PubMed [51], and OGBN-arXiv [26]. The model architecture and hyperparameters are listed in Table 1. We use the Adam optimizer with a learning rate of 0.01 and weight decay of 5e-4 to train the model for 200 epochs with a batch size of 1 (full-batch). Dropout with a rate of 0.5 is applied to the hidden layer during training.

For each dataset, we train the models on the training set and evaluate the inference time and accuracy on the test set using an Apple M1 Ultra CPU with 64 GB of memory. The experiments are run in PyTorch 2.2.1, PyTorch Geometric 2.5.0, DGL 2.0.0, and Scorch 0.1.0. We run each inference experiment 50 times with 5 warm-up runs and report the average speedups of various frameworks relative to PyTorch, as well as the absolute inference times, in Figure 5.
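The measurement protocol (a few untimed warm-up runs followed by averaged timed runs) can be sketched with the Python standard library. This is a generic harness written under that assumption, not the script used for the paper; `benchmark` and `speedup` are hypothetical helper names.

```python
import statistics
import time


def benchmark(fn, runs=50, warmup=5):
    """Return the mean wall-clock time of fn() in seconds over `runs` timed calls."""
    for _ in range(warmup):  # warm-up calls are executed but not timed
        fn()
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.mean(times)


def speedup(baseline_seconds, candidate_seconds):
    """Speedup of the candidate relative to the baseline (values above 1 are faster)."""
    return baseline_seconds / candidate_seconds


mean_seconds = benchmark(lambda: sum(range(10_000)))
```

Warm-up runs keep one-time costs, such as just-in-time compilation of a kernel, out of the reported average.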

To ensure a fair comparison, we use the same GCN architecture across all frameworks. The PyTorch and Scorch implementations use a custom GCN layer, while PyG and DGL use their built-in GCN layers. The same model is trained with PyG and the trained weights are loaded into the PyTorch, Scorch, PyG, and DGL models after adjusting for any differences in the parameter shapes and orderings.

| Hyperparameter | Value |
|---|---|
| Hidden channels | 128 |
| Activation function | ReLU |
| Dropout rate | 0.5 |
| Optimizer | Adam |
| Learning rate | 0.01 |
| Weight decay | 5e-4 |
| Batch size | 1 (full-batch) |
| Training epochs | 200 |

Table 1: GCN model architecture and hyperparameters for node classification.

E.3 Sparse Autoencoders

We evaluate the performance of sparse autoencoders on four datasets: MNIST [35], CIFAR-10 [34], CIFAR-100 [34], and CelebA [39]. The datasets are preprocessed as follows:

  • MNIST: The images are converted to tensors and flattened into a 1D vector of size 784 ($28 \times 28$).

  • CIFAR-10 and CIFAR-100: The images are converted to grayscale, then to tensors, and flattened into a 1D vector of size 1024 ($32 \times 32$).

  • CelebA: The images are converted to grayscale, resized to $64 \times 64$, then to tensors, and flattened into a 1D vector of size 4096 ($64 \times 64$).

The sparse autoencoder architecture and hyperparameters are summarized in Table 2. The encoder is a sparse linear layer followed by ReLU activation, and the decoder is a dense linear layer followed by sigmoid activation.

| Component | Details |
|---|---|
| Encoder | Sparse Linear (input size → 256) |
| Encoder Activation | ReLU |
| Decoder | Dense Linear (256 → input size) |
| Decoder Activation | Sigmoid |
| Loss Function | Mean Squared Error (MSE) |
| Optimizer | Adam |
| Learning Rate | 0.01 |
| Batch Size | 64 |
| Training Epochs | 10 |

Table 2: Sparse autoencoder architecture and hyperparameters.

The models are trained using the mean squared error (MSE) loss and the Adam optimizer [28] with a learning rate of 0.01. We use a batch size of 64 and train the models for 10 epochs. The experiments are conducted on an Apple M1 Ultra CPU with 64GB of memory.

During inference, we measure the average reconstruction loss and the wall-clock time on the test set. The reconstruction loss is computed using the MSE loss between the input images and the reconstructed images. The wall-clock time includes the time taken to convert the input data to a sparse CSR tensor, run the forward pass through the model, and compute the loss.

The experiments are implemented in PyTorch 2.2.1 and Scorch 0.1.0. To ensure a fair comparison, we use the same model architecture, hyperparameters, and random seed across both frameworks. We run each inference experiment 50 times with 5 warm-up runs and report the average speedups of Scorch relative to PyTorch, as well as the absolute inference times, in Figure 6.

E.4 Sparse Transformers

Sparse transformers are variants of the Transformer architecture [56] that leverage sparse attention mechanisms to improve computational efficiency and scalability. In this evaluation, we implement BigBird [61], a specific type of sparse transformer that employs a combination of global, sliding window, and random sparse attention patterns.

BigBird is designed to handle long sequences while maintaining a manageable computational complexity. It achieves this by using a sparse attention mechanism that attends to a subset of tokens in the sequence, rather than attending to all tokens as in the standard transformer. The sparse attention pattern in BigBird consists of three components:

  1. Global attention: Attend to all tokens in fixed-size blocks at regular intervals.

  2. Sliding window attention: Attend to neighboring tokens within a fixed-size window that slides over the sequence.

  3. Random attention: Attend to a fixed number of randomly selected tokens in the sequence.

By combining these attention patterns, sparse transformers can capture both local and global dependencies while significantly reducing the computational cost compared to the standard Transformer.
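The three components can be combined into a single block-level attention mask, as in the simplified sketch below. It works at block granularity and is not the BigBird reference implementation; the function name and the parameters `global_stride`, `window`, and `n_random` are hypothetical, and the values used are chosen only for illustration.

```python
import random


def bigbird_block_mask(n_blocks, global_stride=4, window=3, n_random=2, seed=0):
    """Return mask[q][k] == True where query block q attends to key block k."""
    rng = random.Random(seed)
    half = window // 2
    mask = [[False] * n_blocks for _ in range(n_blocks)]
    for q in range(n_blocks):
        for k in range(n_blocks):
            # Global: blocks at regular intervals attend everywhere and are attended by all.
            is_global = q % global_stride == 0 or k % global_stride == 0
            # Sliding window: neighbouring blocks around the diagonal.
            in_window = abs(q - k) <= half
            mask[q][k] = is_global or in_window
        # Random: a fixed number of additional key blocks per query block.
        candidates = [k for k in range(n_blocks) if not mask[q][k]]
        for k in rng.sample(candidates, min(n_random, len(candidates))):
            mask[q][k] = True
    return mask


mask = bigbird_block_mask(n_blocks=8)
attended = sum(sum(row) for row in mask)  # number of block pairs that are computed
```

Only the block pairs marked `True` need to be computed, so for long sequences the mask is a sparse matrix and the attention scores become a sampled product of the query and key matrices.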

Sparse transformers have been successfully applied to various natural language processing tasks, such as text classification, question answering, and language modeling, where the input sequences can be very long. They have also shown promising results in other domains, such as genomics and time series analysis, where the ability to handle long sequences is crucial.

We evaluate the inference performance of the BigBird model [61] on three text classification datasets: AG News [64], IMDB [42], and Yahoo Answers [64]. The model architecture and hyperparameters are listed in Table 3. We use the AdamW optimizer [40] with a learning rate of 0.001 to train the models for 5 epochs with a batch size of 64. The sparse attention is configured to use a block size of 16, 2 global tokens, 2 random blocks, and 2 sliding blocks.

For each dataset, we train the models on the training set and evaluate the inference time on the test set using an Apple M1 Ultra CPU. The experiments are implemented in PyTorch 2.2.1 and Scorch 0.1.0. We run each inference experiment 10 times with 5 warm-up runs and report the average speedups of Scorch relative to PyTorch, as well as the absolute inference times, in Figure 7.

| Hyperparameter | Value |
|---|---|
| Embedding size | 128 |
| Hidden size | 128 |
| Intermediate size | 256 |
| Number of hidden layers | 2 |
| Number of attention heads | 4 |
| Hidden activation | gelu |
| Attention dropout | 0.1 |
| Hidden dropout | 0.1 |
| Sparse attention block size | 16 |
| Sparse attention global tokens | 2 |
| Sparse attention random blocks | 2 |
| Sparse attention sliding blocks | 2 |
| Optimizer | AdamW |
| Learning rate | 0.001 |
| Batch size | 64 |
| Training epochs | 5 |

Table 3: BigBird model architecture and hyperparameters for text classification.

Appendix F Extended Related Work

Sparse Tensor Formats. To avoid unnecessary computation and storage of zero values, various compressed sparse tensor formats have been developed. These aim to store only the meaningful non-zero values and indices required to fully represent a sparse tensor. Popular formats that store irregular tensors include compressed sparse row/column storage (CSR/CSC) which compactly stores non-zero values and their row/column indices. Coordinate format (COO) uses tuples of indices and corresponding values. Block sparse formats partition tensors into dense sub-blocks for efficiency. Compressed sparse fiber exploits skewed sparsity patterns by orienting storage along one dimension. These formats are all supported by Scorch.

There are also many tensor formats that are not supported by Scorch. Many of these have been developed to take advantage of additional structure in the tensors, such as ELLPACK, which assumes a bounded number of non-zeros per row, and the diagonal matrix format, which assumes all non-zeros are located on a small number of diagonals. Other formats store additional data so that matrices can be traversed efficiently in both row- and column-major order, such as the CSB matrix format [10] and the blocked matrix format used by MegaBlocks [20]. Finally, several tensor formats have been proposed in recent years to further improve the performance of tensor algebra operations at the expense of more complex data structures, such as CSR5 [38] and HiCOO [37]. In the current version of Scorch, we chose to focus on the data structures that account for the vast majority of use, to manage engineering complexity. However, we believe that adding more formats to Scorch, while supporting any operation on any combination of data structures, is interesting future work.

Each format makes different trade-offs between compression, indexing overhead, and read/write efficiency for varied sparsity characteristics. Choosing appropriate sparse storage and operations can significantly reduce computational complexity and memory usage compared to dense defaults. However, current deep learning frameworks lack native support for sparse data structures across all compute operations.

Sparse Kernels. In addition to compressed storage, many specialized sparse kernels have also been created, including for sparse matrix multiplication (SpMM), sparse matrix-vector multiply (SpMV), sparse convolution, and more. Key optimizations in these kernels include techniques like iteration space tiling, loop reordering to improve locality, vectorization, and load balancing to handle irregular sparsity. Highly optimized sparse libraries like NVIDIA cuSPARSE and Intel MKL provide some essential sparse operations, but composing and scheduling sparse kernels in deep learning frameworks remains a challenging problem.
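As a reference point for the kernels named above, here is a minimal, unoptimized SpMV over a CSR matrix in plain Python. It is an illustrative sketch only: it applies none of the tiling, vectorization, or load-balancing optimizations discussed here, and the name `csr_spmv` is hypothetical rather than taken from Scorch, cuSPARSE, or MKL.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x where A is stored in CSR form."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for i in range(num_rows):
        acc = 0.0
        # Visit only the stored non-zeros of row i.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[p] * x[col_idx[p]]
        y[i] = acc
    return y


# CSR arrays for the 3x4 matrix
#   [[5, 0, 0, 1],
#    [0, 0, 0, 0],
#    [0, 2, 3, 0]]
row_ptr = [0, 2, 2, 4]
col_idx = [0, 3, 1, 2]
values = [5.0, 1.0, 2.0, 3.0]
x = [1.0, 2.0, 3.0, 4.0]

y = csr_spmv(row_ptr, col_idx, values, x)
# y == [9.0, 0.0, 13.0]
```

The indirect read `x[col_idx[p]]` is the source of the irregular memory accesses, and the varying trip count of the inner loop is the source of the load imbalance, that optimized kernels must address.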

Despite the potential benefits of sparsity, adoption in deep learning still faces multiple key challenges: the lack of versatile software frameworks to natively exploit sparsity in the data and models; the mismatch between existing dense architectures like GPUs/TPUs and sparse computation; the difficulty of fusing sparse operations with other layers in an end-to-end model; the irregular memory access patterns of sparse kernels impairing performance compared to dense routines; the increased difficulty of parallelization and load balancing due to sparse connectivity; and finding the right abstraction level to balance performance and productivity.

Auto-scheduling. Scorch contributes techniques for automatically determining good schedules for sparse kernels. These are related to auto-scheduling approaches in compilers. The automatic scheduling of computational kernels is challenging due to the exponential search space of possible optimizations like loop transforms and data layouts. Prior auto-scheduling techniques for dense [2, 66, 65] and sparse tensors [3, 27] rely on cost modeling, optimization heuristics, and black-box search. Many auto-schedulers for dense tensor programs exist, such as the Halide auto-scheduler [2], FlexTensor [66], and Ansor [65]. Some recent work has focused on auto-scheduling for sparse tensor algebra. CIN-P [3] enumerates schedules based on asymptotic costs to find optimal schedules for general sparse tensor algebra expressions. SpTTN-Cyclops [27] tunes contraction paths and index orders for SpTTN (contractions of a single large sparse tensor with several dense tensors) kernels using both enumeration and efficient search. Scorch contributes a lightweight heuristic-based auto-scheduler for general sparse kernels that leverages the structure of sparse tensor operations to efficiently explore the search space of different tensor kernels. Moreover, we show in our evaluation that the Scorch heuristic auto-scheduler is able to find good implementations of many operations.
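The following toy sketch illustrates how the structure of sparse operands can shrink the schedule search space. It is not Scorch's auto-scheduler: it only enumerates loop orders for the matrix product C[i,j] = sum over k of A[i,k] * B[k,j] and keeps those that visit each compressed operand's indices in storage order, under the simplifying assumption that a CSR operand can only be traversed efficiently row-first. The function name `valid_loop_orders` is hypothetical.

```python
from itertools import permutations


def valid_loop_orders(indices, level_orders):
    """Return loop orders that respect every operand's storage order.

    indices: the index variables of the expression, e.g. "ijk".
    level_orders: one tuple per compressed operand, listing its index
        variables from outermost to innermost storage level.
    """
    def respects(order, levels):
        positions = [order.index(ix) for ix in levels]
        return positions == sorted(positions)

    return [
        order
        for order in permutations(indices)
        if all(respects(order, levels) for levels in level_orders)
    ]


# All 3! = 6 loop orders are candidates when both operands are dense.
all_orders = list(permutations("ijk"))

# A in CSR (i outer, k inner), B dense: i must come before k.
a_sparse = valid_loop_orders("ijk", [("i", "k")])

# A and B both in CSR: i before k, and k before j.
both_sparse = valid_loop_orders("ijk", [("i", "k"), ("k", "j")])
# Only ("i", "k", "j") remains, the row-by-row ordering.
```

A real scheduler has more options than this filter suggests, such as transposing an operand or changing its format, which is part of why the full search space is so large.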

Sparsity in Deep Learning. Sparsity is inherent in many emerging domains for deep learning, including recommender systems [46], drug discovery [63, 15], and web-scale graph analytics [23]. For example, in recommendation systems, user-item interaction matrices are often sparse, as each user interacts with a small subset of the catalog. Molecular graph representations are similarly sparsely connected. Large-scale graphs such as web and social network graphs also exhibit power law distributions, resulting in highly sparse adjacency matrices.

Despite this sparsity, the default approach in deep learning relies heavily on dense matrix multiplication and convolution operations [54]. Using dense matrix multiplication algorithms on extremely sparse data wastes both computation and memory. The massive computational and memory burdens incurred by ignoring sparsity quickly become prohibitive as data continues growing in scale and dimensionality [25].
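A back-of-the-envelope calculation shows the scale of the memory waste. The sketch below compares dense and CSR storage for a hypothetical 100,000 × 100,000 matrix with 10 non-zeros per row, assuming 4-byte values and 4-byte indices; the matrix dimensions and helper names are illustrative assumptions, not figures from the evaluation.

```python
def dense_bytes(rows, cols, value_bytes=4):
    """Bytes needed to store every entry of a dense matrix."""
    return rows * cols * value_bytes


def csr_bytes(rows, nnz, value_bytes=4, index_bytes=4):
    """Bytes for CSR: one value and one column index per non-zero,
    plus a row-pointer array with rows + 1 entries."""
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes


rows = cols = 100_000
nnz = 10 * rows  # hypothetical: 10 non-zeros per row

dense = dense_bytes(rows, cols)  # 40_000_000_000 bytes (40 GB)
sparse = csr_bytes(rows, nnz)    # 8_400_004 bytes (about 8.4 MB)
```

Under these assumptions the dense representation is more than 4,000 times larger than the CSR one, and a dense matrix-vector product would perform 10 billion multiplications where only 1 million involve a non-zero.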

Several works have developed sparse neural network architectures. Methods like sparse evolutionary training [44] and dynamic sparse reparameterization [45] induce sparsity during training through pruning or regularization. The lottery ticket hypothesis [18] posits that dense networks contain sparse subnetworks that, when trained in isolation from their original initialization, can match the accuracy of the full network. Structured sparsity techniques induce sparsity in regular patterns, such as block sparsity and channel sparsity, to optimize memory access and computation for specific hardware architectures [58]. While these methods demonstrate the potential benefits of sparsity in deep learning, they often require custom sparse kernels to be implemented for the specific sparse operations used [19]. Scorch enables researchers to more easily explore novel sparse architectures by providing a general framework for efficient sparse computation. Rather than needing to implement custom kernels, users can rely on Scorch’s compiler to automatically generate performant code for their particular sparse operations. This reduces the engineering burden and allows research to focus on modeling innovations.