-
Curved Space-Filling Tiles Using Voronoi Decomposition with Line, and Curve Segments Closed Under Wallpaper Symmetries
Authors:
Haard Panchal,
Ergun Akleman,
Vinayak Krishnamurthy,
Tolga Talha Yildiz,
Varda Grover
Abstract:
In this paper, we present a new approach to obtain symmetric tiles with curved edges. Our approach is based on using higher-order Voronoi sites that are closed under wallpaper symmetries. The resulting Voronoi tessellations provide us with symmetric tiles with curved edges. We have developed a web application that provides real-time tile design. Our application can be found at https://voronoi.viz.tamu.edu. One of our key findings in this paper is that not all symmetry operations are useful for creating curved tiles. In particular, all symmetries that use a mirror operation produce straight lines, which are useless for creating new tiles. This result is interesting because it suggests that we need to avoid mirror transformations to produce unusual space-filling tiles in 2D and 3D using Voronoi tessellations.
Submitted 23 October, 2023;
originally announced October 2023.
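The mirror finding in the abstract above can be illustrated with a minimal sketch. It assumes a single line-segment site and its reflection across the y-axis; the helper names (`dist_point_segment`, `mirror_x`, `on_bisector`) are hypothetical and not taken from the paper or its web application. It only checks that points on the mirror axis are equidistant from the site and its mirror image, so the Voronoi boundary between that pair lies on a straight line.

```python
import math


def dist_point_segment(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    len2 = dx * dx + dy * dy
    # Parameter of the closest point on the segment, clamped to [0, 1].
    t = 0.0 if len2 == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / len2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def mirror_x(p):
    """Reflect a point across the y-axis (the assumed mirror line)."""
    return (-p[0], p[1])


def on_bisector(p, seg, tol=1e-9):
    """True if p is equidistant from a segment site and its mirror image."""
    a, b = seg
    d_site = dist_point_segment(p, a, b)
    d_mirrored = dist_point_segment(p, mirror_x(a), mirror_x(b))
    return abs(d_site - d_mirrored) <= tol


# Hypothetical site: one segment to the right of the mirror axis.
site = ((1.0, 0.0), (3.0, 2.0))
axis_points = [(0.0, float(y)) for y in range(-2, 5)]
print(all(on_bisector(p, site) for p in axis_points))  # True
print(on_bisector((0.5, 1.0), site))                   # False
```

This covers only the boundary between one site and its own reflection; it does not reproduce the curved boundaries that non-mirror symmetries generate.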
-
Axon: A Language for Dynamic Shapes in Deep Learning Graphs
Authors:
Alexander Collins,
Vinod Grover
Abstract:
Axon is a language that enables shape and rank inference for tensors in Deep Learning graphs. It aims to make shapes implicit and inferred, in a similar manner to how types are implicit and inferred in many functional programming languages. Tensor dimensions are represented by expressions consisting of symbolic variables, constants, and arithmetic operators. Tensor shapes can be expressed as either a sequence of these dimension expressions, as a symbolic variable, or as an appending of other shapes. This allows complex constraints on shapes to be expressed. Axon is functional in style, with a type system similar to Standard ML, extended to include shape information. It provides a suite of built-in operators over tensors, including pointwise arithmetic operators, maps, reduction, loops and user-defined functions. We describe a shape inference algorithm based on constraint solving which infers information about shapes, from both shape information provided by the programmer and the structure of the program. This allows fully automatic inference of the shapes of tensors for complex Deep Learning graphs. This approach reduces programmer effort when specifying graphs, as tensor shapes are not explicit, allows composition of Deep Learning graphs while maintaining input and output tensor shape compatibility, and aids in automated error detection by identifying shape mismatches at runtime.
Submitted 5 October, 2022;
originally announced October 2022.
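The idea of inferring shapes by solving constraints, as described in the abstract above, can be sketched in a few lines. This is not Axon's algorithm or syntax: it is a simplified illustration in Python that assumes dimensions are either known integers or symbolic variable names, and it handles only equality constraints (no arithmetic dimension expressions or shape appending). All function names are hypothetical.

```python
def resolve(dim, subst):
    """Follow substitutions until reaching an int or an unbound variable."""
    while isinstance(dim, str) and dim in subst:
        dim = subst[dim]
    return dim


def unify(a, b, subst):
    """Constrain dimensions a and b to be equal, updating subst in place."""
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return
    if isinstance(a, str):
        subst[a] = b          # bind symbolic variable a
    elif isinstance(b, str):
        subst[b] = a          # bind symbolic variable b
    else:
        raise ValueError(f"shape mismatch: {a} != {b}")


def matmul_shape(x, y, subst):
    """Shape rule for a rank-2 product: (m, k) @ (k, n) -> (m, n)."""
    (m, k1), (k2, n) = x, y
    unify(k1, k2, subst)      # inner dimensions must agree
    return (m, n)


def concretize(shape, subst):
    """Replace every solved variable in a shape by its value."""
    return tuple(resolve(d, subst) for d in shape)


subst = {}
out = matmul_shape(("batch", "hidden"), (128, 10), subst)
print(concretize(out, subst))                    # ('batch', 10)
print(concretize(("batch", "hidden"), subst))    # ('batch', 128)
```

Here the programmer never states that `hidden` is 128; the constraint from the matrix product determines it, while `batch` stays symbolic. Calling `matmul_shape((32, 64), (128, 10), {})` raises `ValueError`, which is the kind of mismatch such a solver can report.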
-
Probabilistic Programming with CuPPL
Authors:
Alexander Collins,
Vinod Grover
Abstract:
Probabilistic Programming Languages (PPLs) are a powerful tool in machine learning, allowing highly expressive generative models to be expressed succinctly. They couple complex inference algorithms, implemented by the language, with an expressive modelling language that allows a user to implement any computable function as the generative model. Such languages are usually implemented on top of existing high-level programming languages and do not make use of hardware accelerators. PPLs that do make use of accelerators exist, but restrict the expressivity of the language in order to do so. In this paper, we present a language and toolchain that generates highly efficient code for both CPUs and GPUs. The language is functional in style, and the toolchain is built on top of LLVM. Our implementation uses delimited continuations on CPU to perform inference, and custom CUDA codes on GPU. We obtain significant speedups across a suite of PPL workloads, compared to other state-of-the-art approaches on CPU. Furthermore, our compiler can also generate efficient code that runs on CUDA GPUs.
Submitted 16 October, 2020;
originally announced October 2020.
-
Automatic Kernel Generation for Volta Tensor Cores
Authors:
Somashekaracharya G. Bhaskaracharya,
Julien Demouth,
Vinod Grover
Abstract:
A commonly occurring computation idiom in neural networks is to perform some pointwise operations on the result of a matrix multiplication. Such a sequence of operations is typically represented as a computation graph in deep learning compilers. When compiling to a GPU target, these computations can be individually mapped to manually tuned implementations provided by libraries such as cuBLAS and cuDNN. These libraries also provide off-the-shelf support for targeting tensor cores in NVIDIA GPUs, which can lead to huge performance boosts through their specialized support for mixed-precision matrix math. Alternatively, tensor cores can be programmed directly using CUDA APIs or inline assembly instructions, which opens up the possibility of generating efficient CUDA kernels automatically for such computations.
Automatic kernel generation is particularly crucial when it is beneficial to generate efficient code for an entire computation graph by fusing several operations into a single device function instead of invoking a separate kernel for each of them. Polyhedral compilation techniques provide a systematic approach for the analysis and transformation of a sequence of affine loop-nests. In this paper, we describe a polyhedral approach to generate efficient CUDA kernels for matrix multiplication using inline assembly instructions for programming tensor cores on NVIDIA Volta GPUs. Furthermore, we build on this approach to generate fused kernels for computation sequences involving matrix multiplication and pointwise operations such as bias addition, ReLU activation, etc. Experimental evaluation of these techniques shows that automatically generated kernels can provide significantly better performance than manually tuned library implementations, with speedups ranging up to 2.55X.
Submitted 1 August, 2020; v1 submitted 22 June, 2020;
originally announced June 2020.
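The fusion idiom in the abstract above (pointwise operations applied to the result of a matrix multiplication) can be illustrated with a plain Python sketch. It is not a CUDA kernel, does not use tensor cores, and is not the paper's polyhedral code generator; the function names are hypothetical. It only contrasts three separate passes, each materializing a full intermediate matrix, with a single loop nest that applies bias addition and ReLU as each output element is produced.

```python
def matmul_bias_relu_unfused(a, b, bias):
    """Three separate passes: matmul, bias add, ReLU (two full intermediates)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
         for i in range(rows)]
    c = [[c[i][j] + bias[j] for j in range(cols)] for i in range(rows)]
    return [[max(0.0, v) for v in row] for row in c]


def matmul_bias_relu_fused(a, b, bias):
    """One loop nest: each element is accumulated, biased, and clamped in place."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            # Pointwise epilogue applied before the value is ever stored.
            row.append(max(0.0, acc + bias[j]))
        out.append(row)
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]
print(matmul_bias_relu_fused(a, b, bias))  # [[1.5, 0.0], [3.5, 0.0]]
```

On a GPU, the unfused form would correspond to one kernel launch per pass with intermediates written to and read back from device memory, which is the overhead that fusion avoids.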
-
Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs
Authors:
Bastian Hagedorn,
Archibald Samuel Elliott,
Henrik Barthels,
Rastislav Bodik,
Vinod Grover
Abstract:
Achieving high-performance GPU kernels requires optimizing algorithm implementations to the targeted GPU architecture. It is of utmost importance to fully use the compute and memory hierarchy, as well as available specialised hardware. Currently, vendor libraries like cuBLAS and cuDNN provide the best performing implementations of GPU algorithms. However, the task of the library programmer is incredibly challenging: for each provided algorithm, high-performance implementations have to be developed for all commonly used architectures, input sizes, and different storage formats. These implementations are generally provided as optimized assembly code because performance-critical architectural features are only exposed at this level. This prevents reuse between different implementations of even the same algorithm, as simple differences can have major effects on low-level implementation details. In this paper we introduce Fireiron, a DSL and compiler which allows the specification of high-performance GPU implementations as compositions of simple and reusable building blocks. We show how to use Fireiron to optimize matrix multiplication implementations, achieving performance matching hand-coded CUDA kernels, even when using specialised hardware such as NVIDIA Tensor Cores, and outperforming state-of-the-art implementations provided by cuBLAS by more than 2x.
Submitted 13 March, 2020;
originally announced March 2020.
-
Automatic acceleration of Numpy applications on GPUs and multicore CPUs
Authors:
Mahesh Ravishankar,
Vinod Grover
Abstract:
Frameworks like Numpy are a popular choice for application developers from varied fields, ranging from image processing to bio-informatics to machine learning. Numpy is often used to develop prototypes or for deployment since it provides efficient implementations of operations involving arrays. Such an approach requires every operation to be executed eagerly. The result of each operation needs to be stored in memory, which increases the memory footprint of the application. It also increases the bandwidth requirements since all uses must read from this memory. We propose an approach that records the sequence of Numpy operations for deferred execution. When the values of an array are needed, for example when the values are stored to disk or displayed on screen, the sequence of operations required to compute these values is compiled into a function and executed. This removes the need to store/load intermediates in slow memory, resulting in better performance. In cases where the library implementation is more efficient (like matrix-matrix multiply), those are used instead. The approach also allows us to seamlessly target both multicore CPUs and NVIDIA GPUs, thereby porting the Numpy application to these architectures without changing the user program. The benefit of the approach is evaluated by targeting computation samples from various domains, and on average an order of magnitude performance improvement over Numpy is observed.
Submitted 11 January, 2019;
originally announced January 2019.
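The record-then-execute approach in the abstract above can be sketched as follows. This is a hypothetical, pure-Python stand-in rather than the authors' system: the `Lazy` class and its methods are invented for illustration, plain lists take the place of Numpy arrays, and the recorded expression is interpreted element by element instead of being compiled into a function for a CPU or GPU. It assumes all list operands have the same length.

```python
import operator


class Lazy:
    """Node in a recorded expression graph over equal-length lists of floats."""

    def __init__(self, op=None, args=(), data=None):
        self.op = op        # binary function, or None for a leaf
        self.args = args    # child Lazy nodes
        self.data = data    # list (array leaf) or float (scalar leaf)

    @staticmethod
    def _wrap(x):
        return x if isinstance(x, Lazy) else Lazy(data=x)

    def __add__(self, other):
        # Record the operation; nothing is computed yet.
        return Lazy(operator.add, (self, Lazy._wrap(other)))

    def __mul__(self, other):
        return Lazy(operator.mul, (self, Lazy._wrap(other)))

    def _length(self):
        """Length of the first array leaf found, or None for scalars only."""
        if self.op is None:
            return len(self.data) if isinstance(self.data, list) else None
        for arg in self.args:
            n = arg._length()
            if n is not None:
                return n
        return None

    def _at(self, i):
        """Evaluate element i of the whole expression, with no temporaries."""
        if self.op is None:
            return self.data[i] if isinstance(self.data, list) else self.data
        left, right = self.args
        return self.op(left._at(i), right._at(i))

    def evaluate(self):
        """Single pass over the output; called only when values are needed."""
        return [self._at(i) for i in range(self._length())]


a = Lazy(data=[1.0, 2.0, 3.0])
b = Lazy(data=[10.0, 20.0, 30.0])
expr = (a + b) * 2.0          # recorded, not executed
print(expr.evaluate())        # [22.0, 44.0, 66.0]
```

An eager library would allocate a temporary array for `a + b` before multiplying; here each output element is produced in one step, which is the memory-traffic saving the abstract describes.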