Search | arXiv e-print repository

CAT: Cellular Automata on Tensor cores

Authors: Cristóbal A. Navarro, Felipe A. Quezada, Enzo Meneses, Héctor Ferrada, Nancy Hitschfeld

Abstract: Cellular automata (CA) are simulation models that can produce complex emergent behaviors from simple local rules. Although state-of-the-art GPU solutions are already fast due to their data-parallel nature, their performance can rapidly degrade in CA with a large neighborhood radius. With the inclusion of tensor cores across the entire GPU ecosystem, interest has grown in finding ways to leverage these fast units outside the field of artificial intelligence, which was their original purpose. In this work, we present CAT, a GPU tensor core approach that can accelerate CA in which the cell transition function acts on a weighted summation of its neighborhood. CAT is evaluated theoretically, using an extended PRAM cost model, as well as empirically using the Larger Than Life (LTL) family of CA as case studies. The results confirm that the cost model is accurate, showing that CAT exhibits constant time throughout the entire radius range $1 \le r \le 16$, and its theoretical speedups agree with the empirical results. At low radius $r=1,2$, CAT is competitive and is only surpassed by the fastest state-of-the-art GPU solution. Starting from $r=3$, CAT progressively outperforms all other approaches, reaching speedups of up to $101\times$ over a GPU baseline and up to $\sim 14\times$ over the fastest state-of-the-art GPU approach. In terms of energy efficiency, CAT is competitive in the range $1 \le r \le 4$ and from $r \ge 5$ it is the most energy efficient approach.
As for performance scaling across GPU architectures, CAT shows a promising trend that, if it continues in future generations, would increase its performance at a higher rate than classical GPU solutions. The results obtained in this work make CAT an attractive GPU approach for scientists who need to study emerging phenomena on CA with a large neighborhood radius.
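
The idea of a transition function that acts on a summation of the neighborhood can be illustrated with plain matrix products. The sketch below is not CAT's kernel; it only assumes a square grid, a zero (non-periodic) boundary, unit weights, and a sum that includes the cell itself. Under those assumptions, multiplying the grid on both sides by a 0/1 band matrix yields every cell's $(2r+1) \times (2r+1)$ neighborhood sum, and the number of matrix products stays at two regardless of $r$. All function names are hypothetical.

```python
def band(n, r):
    """n x n 0/1 band matrix: entry (i, j) is 1 when |i - j| <= r."""
    return [[1 if abs(i - j) <= r else 0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def neighborhood_sums_mm(grid, r):
    """Neighborhood sums computed as B @ grid @ B (two matrix products)."""
    b = band(len(grid), r)
    return matmul(matmul(b, grid), b)


def neighborhood_sums_direct(grid, r):
    """Reference: explicit loop over each cell's clipped square neighborhood."""
    n = len(grid)
    return [[sum(grid[y][x]
                 for y in range(max(0, i - r), min(n, i + r + 1))
                 for x in range(max(0, j - r), min(n, j + r + 1)))
             for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    grid = [[(3 * i + 5 * j) % 2 for j in range(6)] for i in range(6)]
    print(neighborhood_sums_mm(grid, 2) == neighborhood_sums_direct(grid, 2))
```

The direct version touches $(2r+1)^2$ cells per output, while the matrix version performs the same two products for any radius, which is the kind of workload fixed-size matrix-multiply units are built for.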

Submitted 25 June, 2024; originally announced June 2024.

Comments: 15 pages

arXiv:2306.10959 [pdf, other]

RaViTT: Random Vision Transformer Tokens

Authors: Felipe A. Quezada, Carlos F. Navarro, Cristian Muñoz, Manuel Zamorano, Jorge Jara-Wilde, Violeta Chang, Cristóbal A. Navarro, Mauricio Cerda

Abstract: Vision Transformers (ViTs) have successfully been applied to image classification problems where large annotated datasets are available. On the other hand, when fewer annotations are available, such as in biomedical applications, image augmentation techniques like introducing image variations or combinations have been proposed. However, regarding ViT patch sampling, less has been explored outside grid-based strategies. In this work, we propose Random Vision Transformer Tokens (RaViTT), a random patch sampling strategy that can be incorporated into existing ViTs. We experimentally evaluated RaViTT for image classification, comparing it with a baseline ViT and state-of-the-art (SOTA) augmentation techniques in 4 datasets, including ImageNet-1k and CIFAR-100. Results show that RaViTT increases the accuracy of the baseline in all datasets and outperforms the SOTA augmentation techniques in 3 out of 4 datasets by a significant margin of +1.23% to +4.32%. Interestingly, RaViTT accuracy improvements can be achieved even with fewer tokens, thus reducing the computational load of any ViT model for a given accuracy value.

Submitted 19 June, 2023; originally announced June 2023.

Comments: 9 pages, 6 figures

MSC Class: 68T07

arXiv:2306.03282 [pdf, other]

Accelerating Range Minimum Queries with Ray Tracing Cores

Authors: Enzo Meneses, Cristóbal A. Navarro, Héctor Ferrada, Felipe A. Quezada

Abstract: During the last decade, GPU technology has shifted from pure general purpose computation to the inclusion of application specific integrated circuits (ASICs), such as Tensor Cores and Ray Tracing (RT) cores. Although these special purpose GPU cores were designed to further accelerate specific fields such as AI and real-time rendering, recent research has managed to exploit them to accelerate other tasks that typically used regular GPU computing. In this work we present RTXRMQ, a new approach that can compute range minimum queries (RMQs) with RT cores. The main contribution is the proposal of a geometric solution for RMQ, where elements become triangles that are placed and shaped according to the element's value and position in the array, respectively, such that the closest hit of a ray launched from a point given by the query parameters corresponds to the result of that query. Experimental results show that RTXRMQ is currently best suited for small query ranges relative to the problem size, achieving speedups of up to $5\times$ and $2.3\times$ over state of the art CPU (HRMQ) and GPU (LCA) approaches, respectively. Although for medium and large query ranges RTXRMQ is currently surpassed by LCA, it remains competitive, being $2.5\times$ and $4\times$ faster than HRMQ, which is a highly parallel CPU approach.
Furthermore, performance scaling experiments across the latest RTX GPU architectures show that if the current RT scaling trend continues, then RTXRMQ's performance would scale at a higher rate than HRMQ and LCA, making the approach even more relevant for future high performance applications that employ batches of RMQs.

Submitted 5 June, 2023; originally announced June 2023.

Comments: 17 Figures

arXiv:2303.10581 [pdf, other]

An Evaluation of GPU Filters for Accelerating the 2D Convex Hull

Authors: Roberto Carrasco, Héctor Ferrada, Cristóbal A. Navarro, Nancy Hitschfeld

Abstract: The Convex Hull algorithm is one of the most important algorithms in computational geometry, with many applications such as in computer graphics, robotics, and data mining. Despite the advances in new algorithms in this area, it is often necessary to improve performance in order to solve more significant problems quickly or in real time. This work presents an experimental evaluation of GPU filters to reduce the cost of computing the 2D convex hull. The technique first performs a preprocessing of the input set, filtering all points within an eight-vertex polygon in logarithmic time, to obtain a reduced set of candidate points. We use parallel computation and the Manhattan distance as a metric to find the vertices of the polygon and perform the point filtering. For the filtering stage we study different approaches, from custom CUDA kernels to libraries such as Thrust and CUB. Three types of point distributions are tested: a normal distribution (favorable case), circumference (the worst case), and a case where points are shifted randomly from the circumference (intermediate case). Experimental evaluation shows that the GPU filtering algorithm can be up to 23x faster than a sequential CPU implementation, and the whole convex hull computation can be up to 30x faster than the fastest implementation provided by the CGAL library.

Submitted 19 March, 2023; originally announced March 2023.

arXiv:2209.12310 [pdf, other]

Accelerating the Convex Hull Computation with a Parallel GPU Algorithm

Authors: Alan Keith, Héctor Ferrada, Cristóbal A. Navarro

Abstract: The convex hull is a fundamental geometrical structure for many applications where groups of points must be enclosed or represented by a convex polygon. Although efficient sequential convex hull algorithms exist, and are constantly being used in applications, their computation time is often considered an issue for time-sensitive tasks such as real-time collision detection, clustering or image processing for virtual reality, among others, where fast response times are required. In this work we propose a parallel GPU-based adaptation of heaphull, which is a state of the art CPU algorithm that computes the convex hull by first doing an efficient filtering stage followed by the actual convex hull computation. More specifically, this work parallelizes the filtering stage, adapting it to the GPU programming model as a series of parallel reductions. Experimental evaluation shows that the proposed implementation significantly improves the performance of the convex hull computation, reaching up to $4\times$ of speedup over the sequential CPU-based heaphull and between $3\times \sim 4\times$ over existing GPU based approaches.

Submitted 25 September, 2022; originally announced September 2022.

Comments: 7 pages, in Spanish language

arXiv:2209.00117 [pdf, other]

GPU Voronoi Diagrams for Random Moving Seeds

Authors: Rodrigo Stevenson, Cristóbal A. Navarro

Abstract: The Voronoi Diagram is a geometrical structure that is widely used in scientific or technological applications where proximity is a relevant aspect to consider, and it also resembles natural phenomena such as cellular banks, rock formations or bee hives, among others. Typically, computing the Voronoi Diagram is done in a static context, that is, the location of the input seeds is defined once and does not change. In this work we study the dynamic case where seeds move, which leads to a dynamic Voronoi Diagram that changes over time. In particular, we consider uniform random moving seeds, for which we propose the \textit{dynamic Jump Flooding Algorithm} (dJFA), a variant of JFA that uses fewer iterations than the standard JFA. An experimental evaluation shows that dJFA achieves a speedup of up to $\sim 5.3 \times$ over JFA, while maintaining a similarity of at least $88\%$ and close to $100\%$ in many cases. These results are a step towards real-time GPU-based computation of dynamic Voronoi diagrams for any particle simulation.

Submitted 31 August, 2022; originally announced September 2022.

Comments: 6 pages

arXiv:2209.00103 [pdf, other]

GGArray: A Dynamically Growable GPU Array

Authors: Enzo Meneses, Cristóbal A. Navarro, Héctor Ferrada

Abstract: We present a dynamically Growable GPU array (GGArray) fully implemented in GPU that does not require synchronization with the host. The idea is to improve the programming of GPU applications that require dynamic memory, by offering a structure that does not require pre-allocating GPU VRAM for the worst case scenario. The GGArray is based on the LFVector, by utilizing an array of them in order to take advantage of the GPU architecture and the synchronization offered by thread blocks. This structure is compared to other state of the art ones such as a pre-allocated static array and a semi-static array that needs to be resized through communication with the host. Experimental evaluation shows that the GGArray has a competitive insertion and resize performance, but it is slower for regular parallel memory accesses. Given the results, the GGArray is a potentially useful structure for applications with high uncertainty on the memory usage as well as applications that have phases, such as an insertion phase followed by a regular GPU phase. In such cases, the GGArray can be used for the first phase and then data can be flattened for the second phase in order to allow the classical GPU memory accesses which are faster. These results constitute a step towards achieving a parallel efficient C++ like vector for modern GPU architectures.

Submitted 7 September, 2022; v1 submitted 31 August, 2022; originally announced September 2022.

Comments: 8 pages

arXiv:2208.11617 [pdf, other]

A Scalable and Energy Efficient GPU Thread Map for m-Simplex Domains

Authors: Cristóbal A. Navarro, Felipe A. Quezada, Benjamin Bustos, Nancy Hitschfeld, Rolando Kindelan

Abstract: This work proposes a new GPU thread map for $m$-simplex domains, that scales its speedup with dimension and is energy efficient compared to other state of the art approaches. The main contributions of this work are i) the formulation of the new block-space map $\mathcal{H}: \mathbb{Z}^m \mapsto \mathbb{Z}^m$ for regular orthogonal simplex domains, which is analyzed in terms of resource usage, and ii) the experimental evaluation in terms of speedup over a bounding box approach and energy efficiency as elements per second per Watt. Results from the analysis show that $\mathcal{H}$ has a potential speedup of up to $2\times$ and $6\times$ for $2$ and $3$-simplices, respectively. Experimental evaluation shows that $\mathcal{H}$ is competitive for $2$-simplices, reaching $1.2\times \sim 2.0\times$ of speedup for different tests, which is on par with the fastest state of the art approaches. For $3$-simplices $\mathcal{H}$ reaches up to $1.3\times \sim 6.0\times$ of speedup making it the fastest of all. The extension of $\mathcal{H}$ to higher dimensional $m$-simplices is feasible and has a potential speedup that scales as $m!$ given a proper selection of parameters $r, β$ which are the scaling and replication factors, respectively. In terms of energy consumption, although $\mathcal{H}$ is among the highest in power consumption, it compensates by its short duration, making it one of the most energy efficient approaches. Lastly, further improvements with Tensor and Ray Tracing Cores are analyzed, giving insights to leverage each one of them.
The results obtained in this work show that $\mathcal{H}$ is a scalable and energy efficient map that can contribute to the efficiency of GPU applications when they need to process $m$-simplex domains, such as Cellular Automata or PDE simulations.

Submitted 12 September, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

Comments: 13 pages

arXiv:2206.02255 [pdf, other]

Modeling GPU Dynamic Parallelism for Self Similar Density Workloads

Authors: Felipe A. Quezada, Cristóbal A. Navarro, Miguel Romero, Cristhian Aguilera

Abstract: Dynamic Parallelism (DP) is a runtime feature of the GPU programming model that allows GPU threads to execute additional GPU kernels, recursively. Apart from making the programming of parallel hierarchical patterns easier, DP can also speed up problems that exhibit a heterogeneous data layout by focusing, through a subdivision process, the finite GPU resources on the sub-regions that exhibit more parallelism. However, doing an optimal subdivision process is not trivial, as there are different parameters that play an important role in the final performance of DP. Moreover, the current programming abstraction for DP also introduces an overhead that can penalize the final performance. In this work we present a subdivision cost model for problems that exhibit self similar density (SSD) workloads (such as fractals), in order to understand what parameters provide the fastest subdivision approach. Also, we introduce a new subdivision implementation, named \textit{Adaptive Serial Kernels} (ASK), as a smaller overhead alternative to CUDA's Dynamic Parallelism. Using the cost model on the Mandelbrot Set as a case study shows that the optimal scheme is to start with an initial subdivision between $g=[2,16]$, then keep subdividing in regions of $r=2,4$, and stop when regions reach a size of $B \sim 32$. The experimental results agree with the theoretical parameters, confirming the usability of the cost model.
In terms of performance, the proposed ASK approach runs up to $\sim 60\%$ faster than Dynamic Parallelism in the Mandelbrot set, and up to $12\times$ faster than a basic exhaustive implementation, whereas DP is up to $7.5\times$.

Submitted 5 June, 2022; originally announced June 2022.

Comments: submitted to Journal

arXiv:2201.00613 [pdf, other]

Squeeze: Efficient Compact Fractals for Tensor Core GPUs

Authors: Felipe A. Quezada, Cristóbal A. Navarro, Nancy Hitschfeld, Benjamin Bustos

Abstract: This work presents Squeeze, an efficient compact fractal processing scheme for tensor core GPUs. By combining discrete-space transformations between compact and expanded forms, one can do data-parallel computation on a fractal with neighborhood access without needing to expand the fractal in memory. The space transformations are formulated as two GPU tensor-core accelerated thread maps, $λ(ω)$ and $ν(ω)$, which act as compact-to-expanded and expanded-to-compact space functions, respectively. The cost of the maps is $\mathcal{O}(\log_2 \log_s(n))$ time, with $n$ being the side of a $n \times n$ embedding for the fractal in its expanded form, and $s$ the linear scaling factor. The proposed approach works for any fractal that belongs to the Non-overlapping-Bounding-Boxes (NBB) class of discrete fractals, and can be extended to three dimensions as well. Experimental results using a discrete Sierpinski Triangle as a case study show up to $\sim12\times$ of speedup and a memory reduction factor of up to $\sim 315\times$ with respect to a GPU-based expanded-space bounding box approach. These results show that the proposed compact approach will allow the scientific community to efficiently tackle problems that up to now could not fit into GPU memory.

Submitted 3 January, 2022; originally announced January 2022.

arXiv:2110.12952 [pdf, other]

Accelerating Compact Fractals with Tensor Core GPUs

Authors: Felipe A. Quezada, Cristóbal A. Navarro

Abstract: This work presents a GPU thread mapping approach that allows doing fast parallel stencil-like computations on discrete fractals using their compact representation. The intuition is to employ two GPU tensor-core accelerated thread maps, $λ(ω)$ and $ν(ω)$, which act as threadspace-to-dataspace and dataspace-to-threadspace functions, respectively. By combining these maps, threads can access compact space and interact with their neighbors. The cost of the maps is $\mathcal{O}(\log \log(n))$ time, with $n$ being the side of a $n \times n$ embedding for the fractal in its expanded form. The technique works on any fractal that belongs to the Non-overlapping-Bounding-Boxes (NBB) class of discrete fractals, and can be extended to three dimensions as well. Results using an A100 GPU on the Sierpinski Triangle as a case study show up to $\sim11\times$ of speedup and a memory usage reduction of $234\times$ with respect to a Bounding Box approach. These results show that the proposed compact approach can allow the scientific community to tackle larger problems that did not fit in GPU memory before, and run even faster than a bounding box approach.

Submitted 25 October, 2021; originally announced October 2021.

Comments: Tech Report

arXiv:2004.13475 [pdf, other]

Efficient GPU Thread Mapping on Embedded 2D Fractals

Authors: Cristóbal A. Navarro, Felipe A. Quezada, Nancy Hitschfeld, Raimundo Vega, Benjamin Bustos

Abstract: This work proposes a new approach for mapping GPU threads onto a family of discrete embedded 2D fractals. A block-space map $λ: \mathbb{Z}_{\mathbb{E}}^{2} \mapsto \mathbb{Z}_{\mathbb{F}}^{2}$ is proposed, from Euclidean parallel space $\mathbb{E}$ to embedded fractal space $\mathbb{F}$, that maps in $\mathcal{O}(\log_2 \log_2(n))$ time and uses no more than $\mathcal{O}(n^\mathbb{H})$ threads with $\mathbb{H}$ being the Hausdorff dimension of the fractal, making it parallel space efficient. When compared to a bounding-box (BB) approach, $λ(ω)$ offers a sub-exponential improvement in parallel space and a monotonically increasing speedup for $n \ge n_0$. The Sierpinski gasket fractal is used as a particular case study and the experimental performance results show that $λ(ω)$ reaches up to $9\times$ of speedup over the bounding-box approach. A tensor-core based implementation of $λ(ω)$ is also proposed for modern GPUs, providing up to $\sim40\%$ of extra performance. The results obtained in this work show that doing efficient GPU thread mapping on fractal domains can significantly improve the performance of several applications that work with this type of geometry.

Submitted 25 April, 2020; originally announced April 2020.

Comments: 20 Pages. arXiv admin note: text overlap with arXiv:1706.04552

ACM Class: C.1.4; G.2.0

arXiv:2001.05585 [pdf, ps, other]

GPU Tensor Cores for fast Arithmetic Reductions

Authors: Cristóbal A. Navarro, Roberto Carrasco, Ricardo J. Barrientos, Javier A. Riquelme, Raimundo Vega

Abstract: This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n)=5 \log_{m^2}{n}$ and its speedup is $S=\dfrac{4}{5} \log_{2}{m^2}$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2 \times$ faster than a conventional GPU reduction implementation, and preserves the numerical precision because the sub-results of each chain of $R$ MMAs are kept as 32-bit floating point values, before all being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R=4,5$ MMAs per block, while large thread-blocks work best with $R=1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
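
As a rough illustration of encoding a reduction as matrix products, the sketch below sums each $m \times m$ tile by multiplying it on both sides by an all-ones matrix, then repeats on the partial sums until one value remains, which gives about $\log_{m^2} n$ levels. It is a CPU sketch under stated assumptions (integer inputs, zero padding, $m \ge 2$), not the paper's kernel: the chain of $R$ MMAs per block and the per-level step count are not modeled, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def tile_sum_mma(tile):
    """Sum an m x m tile using two products with an all-ones matrix."""
    m = len(tile)
    ones = [[1] * m for _ in range(m)]
    col_sums = matmul(ones, tile)   # every row holds the tile's column sums
    total = matmul(col_sums, ones)  # every entry now holds the full tile sum
    return total[0][0]


def reduce_with_tiles(values, m):
    """Reduce a list by repeatedly summing m*m-sized tiles (requires m >= 2)."""
    if m < 2:
        raise ValueError("m must be at least 2")
    if not values:
        return 0
    size = m * m
    while len(values) > 1:
        # Zero-pad so the list splits evenly into tiles of m*m values.
        padded = values + [0] * (-len(values) % size)
        values = [
            tile_sum_mma([padded[t + r * m: t + (r + 1) * m] for r in range(m)])
            for t in range(0, len(padded), size)
        ]
    return values[0]


if __name__ == "__main__":
    print(reduce_with_tiles(list(range(1000)), 4) == sum(range(1000)))
```

With $m = 4$ and 1000 inputs the list shrinks from 1000 to 63 to 4 to 1 values, that is, three levels.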

Submitted 15 January, 2020; originally announced January 2020.

Comments: 14 pages, 11 figures

arXiv:1903.03640 [pdf, ps, other]

doi 10.29007/zlmg

Analyzing GPU Tensor Core Potential for Fast Reductions

Authors: Roberto Carrasco, Raimundo Vega, Cristóbal A. Navarro

Abstract: The Nvidia GPU architecture has introduced new computing elements such as the \textit{tensor cores}, which are special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating \textit{Deep Learning} applications. In this work we present the idea of using tensor cores for a different purpose such as the parallel arithmetic reduction problem, and propose a new GPU tensor-core based algorithm as well as analyze its potential performance benefits in comparison to a traditional GPU-based one. The proposed method encodes the reduction of $n$ numbers as a set of $m\times m$ MMA tensor-core operations (for Nvidia's Volta architecture $m=16$) and takes advantage of the fact that each MMA operation takes just one GPU cycle. When analyzing the cost under a simplified GPU computing model, the result is that the new algorithm manages to reduce a problem of $n$ numbers in $T(n) = 5\log_{m^2}(n)$ steps with a speedup of $S = \frac{4}{5}\log_2(m^2)$.
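
As a worked instance of the two formulas above, take the Volta value $m = 16$ stated in the abstract. The input size $n = 2^{24}$ is an arbitrary choice made here for illustration only.

```latex
\begin{align*}
S &= \tfrac{4}{5}\log_2(m^2) = \tfrac{4}{5}\log_2(256) = \tfrac{4}{5}\cdot 8 = 6.4,\\
T(2^{24}) &= 5\log_{256}\!\left(2^{24}\right) = 5\cdot\tfrac{24}{8} = 15 \text{ steps}.
\end{align*}
```

These are values of the simplified cost model, not measured results.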

Submitted 8 March, 2019; originally announced March 2019.

Comments: This paper was presented in the SCCC 2018 Conference, November 5

Journal ref: 37th International Conference of the Chilean Computer Science Society, SCCC 2018, November 5-9, Santiago, Chile, 2018

arXiv:1706.04552 [pdf, ps, other]

Block-space GPU Mapping for Embedded Sierpiński Gasket Fractals

Authors: Cristóbal A. Navarro, Benjamín Bustos, Raimundo Vega, Nancy Hitschfeld

Abstract: This work studies the problem of GPU thread mapping for a Sierpiński gasket fractal embedded in a discrete Euclidean space of $n \times n$. A block-space map $λ: \mathbb{Z}_{\mathbb{E}}^{2} \mapsto \mathbb{Z}_{\mathbb{F}}^{2}$ is proposed, from Euclidean parallel space $\mathbb{E}$ to embedded fractal space $\mathbb{F}$, that maps in $\mathcal{O}(\log_2 \log_2(n))$ time and uses no more than $\mathcal{O}(n^\mathbb{H})$ threads with $\mathbb{H} \approx 1.58...$ being the Hausdorff dimension, making it parallel space efficient. When compared to a bounding-box map, $λ(ω)$ offers a sub-exponential improvement in parallel space and a monotonically increasing speedup once $n > n_0$. Experimental performance tests show that in practice $λ(ω)$ can produce performance improvement at any block-size once $n > n_0 = 2^8$, reaching approximately $10\times$ of speedup for $n=2^{16}$ under optimal block configurations.
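
To give a feel for the gap between $n^\mathbb{H}$ and the $n^2$ cells of a bounding box, the sketch below counts the cells of a discrete gasket for $n = 2^k$. It assumes one common embedding, in which cell $(x, y)$ belongs to the gasket when the bitwise AND of its coordinates is zero; the paper's embedding and orientation may differ, and the function names are hypothetical. Under this embedding the count is exactly $3^k = n^{\log_2 3}$.

```python
import math


def in_gasket(x, y):
    """Assumed embedding: (x, y) is a gasket cell when x AND y has no set bits."""
    return (x & y) == 0


def gasket_cells(n):
    """Count gasket cells inside the n x n bounding box by brute force."""
    return sum(1 for y in range(n) for x in range(n) if in_gasket(x, y))


if __name__ == "__main__":
    hausdorff = math.log2(3)  # about 1.585, the exponent H in n^H
    for k in (2, 4, 8):
        n = 2 ** k
        cells = gasket_cells(n)
        # Compare the fractal's cell count with the bounding box's n*n cells.
        print(n, cells, n * n, round(n ** hausdorff))
```

The ratio between bounding-box cells and gasket cells is $(4/3)^k$, so the fraction of wasted threads in a bounding-box launch grows with every doubling of $n$.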

Submitted 14 June, 2017; originally announced June 2017.

Comments: 7 pages, 8 Figures

arXiv:1610.07394 [pdf, other]

Possibilities of Recursive GPU Mapping for Discrete Orthogonal Simplices

Authors: Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

Abstract: The problem of parallel thread mapping is studied for the case of discrete orthogonal $m$-simplices. The possibility of a $O(1)$ time recursive block-space map $λ: \mathbb{Z}^m \mapsto \mathbb{Z}^m$ is analyzed from the point of view of parallel space efficiency and potential performance improvement. The $2$-simplex and $3$-simplex are analyzed as special cases, where constant time maps are found, providing a potential improvement of up to $2\times$ and $6\times$ more efficient than a bounding-box approach, respectively. For the general case it is shown that finding an efficient recursive parallel space for an $m$-simplex depends on the choice of two parameters, for which some insights are provided which can lead to a volume that matches the $m$-simplex for $n>n_0$, making parallel space approximately $m!$ times more efficient than a bounding-box.

Submitted 24 October, 2016; originally announced October 2016.

arXiv:1609.01490 [pdf, ps, other]

A Non-linear GPU Thread Map for Triangular Domains

Authors: Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

Abstract: There is a stage in the GPU computing pipeline where a grid of thread-blocks, in \textit{parallel space}, is mapped onto the problem domain, in \textit{data space}. Since the parallel space is restricted to a box type geometry, the mapping approach is typically a $k$-dimensional bounding box (BB) that covers a $p$-dimensional data space. Threads that fall inside the domain perform computations while threads that fall outside are discarded at runtime. In this work we study the case of mapping threads efficiently onto triangular domain problems and propose a block-space linear map $λ(ω)$, based on the properties of the lower triangular matrix, that reduces the number of unnecessary threads from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$. Performance results for global memory accesses show an improvement of up to $18\%$ with respect to the \textit{bounding-box} approach, placing $λ(ω)$ in second place, below the \textit{rectangular-box} approach and above the \textit{recursive-partition} and \textit{upper-triangular} approaches. For shared memory scenarios $λ(ω)$ was the fastest approach, achieving a $7\%$ performance improvement while preserving thread locality. The results obtained in this work make $λ(ω)$ an interesting map for efficient GPU computing on parallel problems that define a triangular domain with or without neighborhood interactions. The extension to tetrahedral domains is analyzed, with applications to triplet-interaction n-body problems.

Submitted 6 September, 2016; originally announced September 2016.

Comments: 16 pages, 7 Figures
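
To make the idea of a block-space map on a lower-triangular domain concrete, the following is a minimal host-side sketch and not the authors' implementation. It assumes blocks are numbered row by row over the lower triangle (diagonal included), so that block index $ω$ lies in row $i$ when $i(i+1)/2 \le ω < (i+1)(i+2)/2$. The names `block_map` and `triangular_blocks` are hypothetical, and an exact integer square root is used here in place of whatever floating-point expression a GPU kernel would use.

```python
from math import isqrt


def block_map(w: int) -> tuple[int, int]:
    """Map a linear block index w >= 0 to (i, j) with 0 <= j <= i.

    Row i is the largest integer with i*(i+1)/2 <= w, which is equivalent
    to (2i + 1)^2 <= 8w + 1, hence the integer square root below.
    """
    i = (isqrt(8 * w + 1) - 1) // 2
    j = w - i * (i + 1) // 2  # offset of w inside row i
    return i, j


def triangular_blocks(n: int) -> list[tuple[int, int]]:
    """Enumerate the n*(n+1)/2 blocks of an n x n lower-triangular block grid."""
    return [block_map(w) for w in range(n * (n + 1) // 2)]


if __name__ == "__main__":
    n = 4
    blocks = triangular_blocks(n)
    # A bounding box would launch n*n blocks; this enumeration launches
    # n*(n+1)/2, all of which land inside the triangle.
    print(len(blocks), "blocks instead of", n * n)
    print(blocks)
```

With this numbering every launched block falls inside the triangle, so the only unnecessary threads left are those in the diagonal blocks that lie above the diagonal, in line with the $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ reduction stated in the abstract.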

arXiv:1606.08881 [pdf, ps, other]

Potential benefits of a block-space GPU approach for discrete tetrahedral domains

Authors: Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

Abstract: The study of data-parallel domain re-organization and thread-mapping techniques is relevant, as they can increase the efficiency of GPU computations when working on spatial discrete domains with non-box-shaped geometry. In this work we study the potential benefits of applying a succinct data re-organization of a tetrahedral data-parallel domain of size $\mathcal{O}(n^3)$ combined with an efficient block-space GPU map of the form $g:\mathbb{N} \rightarrow \mathbb{N}^3$. Results from the analysis suggest that in theory the combination of these two optimizations produces a significant performance improvement, as block-based data re-organization allows a coalesced one-to-one correspondence at local thread-space while $g(λ)$ produces an efficient block-space spatial correspondence between groups of data and groups of threads, reducing the number of unnecessary threads from $O(n^3)$ to $O(n^2ρ^3)$, where $ρ$ is the linear block-size and typically $ρ^3 \ll n$. From the analysis, we obtained that a block-based succinct data re-organization can provide up to $2\times$ improved performance over a linear data organization, while the map can be up to $6\times$ more efficient than a bounding-box approach. The results from this work can serve as a useful guide for more efficient GPU computation on tetrahedral domains found in spin lattice, finite element and special n-body problems, among others.

Submitted 28 June, 2016; originally announced June 2016.

arXiv:1508.06268 [pdf, other]

doi 10.1016/j.cpc.2016.04.007

Adaptive Multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model

Authors: C. A. Navarro, Wei Huang, Youjin Deng

Abstract: We present an adaptive multi-GPU Exchange Monte Carlo method designed for the simulation of the 3D Random Field Model. The algorithm design is based on a two-level parallelization scheme that allows the method to scale its performance in the presence of faster GPUs as well as multiple GPUs. The set of temperatures is adapted according to the exchange rate observed from short trial runs, leading to an increased exchange rate at zones where the exchange process is sporadic. Performance results show that parallel tempering is an ideal strategy for implementation on the GPU, and runs between one and two orders of magnitude faster than a single-core CPU version, with multi-GPU scaling being approximately $99\%$ efficient. The results obtained extend the possibilities of simulation to sizes of $L = 32, 64$ for a workstation with two GPUs.

Submitted 22 September, 2015; v1 submitted 25 August, 2015; originally announced August 2015.

Comments: 15 pages, 10 figures

Journal ref: Computer Physics Communications, Volume 205, August 2016, pp 48-60

arXiv:1308.1419 [pdf, ps, other]

Improving the GPU space of computation under triangular domain problems

Authors: Cristobal A. Navarro, Nancy Hitschfeld

Abstract: There is a stage in the GPU computing pipeline where a grid of thread-blocks is mapped to the problem domain. Normally, this grid is a k-dimensional bounding box that covers a k-dimensional problem no matter its shape. Threads that fall inside the problem domain perform computations; otherwise they are discarded at runtime. For problems with non-square geometry, this is not always the best idea because part of the space of computation is executed without any practical use. Two-dimensional triangular domain problems, also called td-problems, are a particular case of interest. Problems such as the Euclidean distance map, LU decomposition, collision detection and simulations over triangular tiled domains are all td-problems and they appear frequently in many areas of science. In this work, we propose an improved GPU mapping function g(lambda) that maps any lambda block to a unique location (i, j) in the triangular domain. The mapping is based on the properties of the lower triangular matrix and it works at a block level, thus not compromising thread organization within a block. The theoretical improvement from using g(lambda) is upper bounded as I < 2 and the number of wasted blocks is reduced from O(n^2) to O(n). We compare our strategy with other proposed methods: the upper-triangular mapping (UTM), the rectangular box (RB) and the recursive partition (REC). Our experimental results on NVIDIA's Kepler GPU architecture show that g(lambda) is between 12% and 15% faster than the bounding box (BB) strategy. When compared to the other strategies, our mapping runs significantly faster than UTM and it is as fast as RB in practical use, with the advantage that thread organization is not compromised, as it is in RB. This work also contributes by presenting, for the first time, a fair comparison of all existing strategies running the same experiments on the same hardware.

Submitted 6 August, 2013; originally announced August 2013.

Comments: 6 pages, 9 Figures

arXiv:1305.6325 [pdf, ps, other]

doi 10.1109/HPCC.and.EUC.2013.27

Multi-core computation of transfer matrices for strip lattices in the Potts model

Authors: Cristobal A. Navarro, Fabrizio Canfora, Nancy Hitschfeld Kahler

Abstract: The transfer-matrix technique is a convenient way of studying strip lattices in the Potts model since the computational costs depend just on the periodic part of the lattice and not on the whole. However, even when the cost is reduced, the transfer-matrix technique is still an NP-hard problem since the time T(|V|, |E|) needed to compute the matrix grows exponentially as a function of the graph width. In this work, we present a parallel transfer-matrix implementation that scales performance under multi-core architectures. The construction of the matrix is based on several repetitions of the deletion-contraction technique, allowing parallelism suitable to multi-core machines. Our experimental results show that the multi-core implementation achieves speedups of 3.7X with p = 4 processors and 5.7X with p = 8. The efficiency of the implementation lies between 60% and 95%, achieving the best balance of speedup and efficiency at p = 4 processors for actual multi-core architectures. The algorithm also takes advantage of the lattice symmetry, making the transfer matrix computation run up to 2X faster than its non-symmetric counterpart and use up to a quarter of the original space.

Submitted 13 August, 2013; v1 submitted 27 May, 2013; originally announced May 2013.

Showing 1–21 of 21 results for author: Navarro, C A