A Scalable and Energy Efficient GPU Thread Map for m-Simplex Domains

Authors: Cristóbal A. Navarro, Felipe A. Quezada, Benjamin Bustos, Nancy Hitschfeld, Rolando Kindelan

Abstract: This work proposes a new GPU thread map for $m$-simplex domains, that scales its speedup with dimension and is energy efficient compared to other state of the art approaches. The main contributions of this work are i) the formulation of the new block-space map $\mathcal{H}: \mathbb{Z}^m \mapsto \mathbb{Z}^m$ for regular orthogonal simplex domains, which is analyzed in terms of resource usage, and ii) the experimental evaluation in terms of speedup over a bounding box approach and energy efficiency as elements per second per Watt. Results from the analysis show that $\mathcal{H}$ has a potential speedup of up to $2\times$ and $6\times$ for $2$ and $3$-simplices, respectively. Experimental evaluation shows that $\mathcal{H}$ is competitive for $2$-simplices, reaching $1.2\times \sim 2.0\times$ of speedup for different tests, which is on par with the fastest state of the art approaches. For $3$-simplices $\mathcal{H}$ reaches up to $1.3\times \sim 6.0\times$ of speedup making it the fastest of all. The extension of $\mathcal{H}$ to higher dimensional $m$-simplices is feasible and has a potential speedup that scales as $m!$ given a proper selection of parameters $r, β$ which are the scaling and replication factors, respectively. In terms of energy consumption, although $\mathcal{H}$ is among the highest in power consumption, it compensates by its short duration, making it one of the most energy efficient approaches. Lastly, further improvements with Tensor and Ray Tracing Cores are analyzed, giving insights to leverage each one of them.
The results obtained in this work show that $\mathcal{H}$ is a scalable and energy efficient map that can contribute to the efficiency of GPU applications when they need to process $m$-simplex domains, such as Cellular Automata or PDE simulations.

Submitted 12 September, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

Comments: 13 pages

arXiv:2201.00613 [pdf, other]

Squeeze: Efficient Compact Fractals for Tensor Core GPUs

Authors: Felipe A. Quezada, Cristóbal A. Navarro, Nancy Hitschfeld, Benjamin Bustos

Abstract: This work presents Squeeze, an efficient compact fractal processing scheme for tensor core GPUs. By combining discrete-space transformations between compact and expanded forms, one can do data-parallel computation on a fractal with neighborhood access without needing to expand the fractal in memory. The space transformations are formulated as two GPU tensor-core accelerated thread maps, $λ(ω)$ and $ν(ω)$, which act as compact-to-expanded and expanded-to-compact space functions, respectively. The cost of the maps is $\mathcal{O}(\log_2 \log_s(n))$ time, with $n$ being the side of an $n \times n$ embedding for the fractal in its expanded form, and $s$ the linear scaling factor. The proposed approach works for any fractal that belongs to the Non-overlapping-Bounding-Boxes (NBB) class of discrete fractals, and can be extended to three dimensions as well. Experimental results using a discrete Sierpinski Triangle as a case study show up to $\sim12\times$ of speedup and a memory reduction factor of up to $\sim 315\times$ with respect to a GPU-based expanded-space bounding box approach. These results show that the proposed compact approach will allow the scientific community to efficiently tackle problems that up to now could not fit into GPU memory.

Submitted 3 January, 2022; originally announced January 2022.

arXiv:2103.14785 [pdf, other]

A Comprehensive Review of the Video-to-Text Problem

Authors: Jesus Perez-Martin, Benjamin Bustos, Silvio Jamil F. Guimarães, Ivan Sipiran, Jorge Pérez, Grethel Coello Said

Abstract: Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information is related to videos, this takes us into Video-Text Research, which includes several challenging tasks such as video question answering, video summarization with natural language, and video-to-text and text-to-video conversion. This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description. This association can be mainly made by retrieving the most relevant descriptions from a corpus or generating a new one given a context video. These two ways represent essential tasks for Computer Vision and Natural Language Processing communities, called text retrieval from video task and video captioning/description task. These two tasks are substantially more complex than predicting or retrieving a single sentence from an image. The spatiotemporal information present in videos introduces diversity and complexity regarding the visual content and the structure of associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze twenty-six benchmark datasets, showing their drawbacks and strengths for the problem requirements. We also show the progress that researchers have made on each dataset, we cover the challenges in the field, and we discuss future research directions.

Submitted 30 November, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: 66 pages, 6 figures. Accepted by Artificial Intelligence Review

arXiv:2103.03764 [pdf, other]

A Convolutional Architecture for 3D Model Embedding

Authors: Arniel Labrada, Benjamin Bustos, Ivan Sipiran

Abstract: During the last years, many advances have been made in tasks like 3D model retrieval, 3D model classification, and 3D model segmentation. The typical 3D representations such as point clouds, voxels, and polygon meshes are mostly suitable for rendering purposes, while their use for cognitive processes (retrieval, classification, segmentation) is limited due to their high redundancy and complexity. We propose a deep learning architecture to handle 3D models as an input. We combine this architecture with other standard architectures like Convolutional Neural Networks and autoencoders for computing 3D model embeddings. Our goal is to represent a 3D model as a vector with enough information to substitute the 3D model for high-level tasks. Since this vector is a learned representation which tries to capture the relevant information of a 3D model, we show that the embedding representation conveys semantic information that helps to deal with the similarity assessment of 3D objects. Our experiments show the benefit of computing the embeddings of a 3D model data set and use them for effective 3D Model Retrieval.

Submitted 5 March, 2021; originally announced March 2021.

arXiv:2004.13475 [pdf, other]

Efficient GPU Thread Mapping on Embedded 2D Fractals

Authors: Cristóbal A. Navarro, Felipe A. Quezada, Nancy Hitschfeld, Raimundo Vega, Benjamin Bustos

Abstract: This work proposes a new approach for mapping GPU threads onto a family of discrete embedded 2D fractals. A block-space map $λ: \mathbb{Z}_{\mathbb{E}}^{2} \mapsto \mathbb{Z}_{\mathbb{F}}^{2}$ is proposed, from Euclidean parallel space $\mathbb{E}$ to embedded fractal space $\mathbb{F}$, that maps in $\mathcal{O}(\log_2 \log_2(n))$ time and uses no more than $\mathcal{O}(n^\mathbb{H})$ threads with $\mathbb{H}$ being the Hausdorff dimension of the fractal, making it parallel space efficient. When compared to a bounding-box (BB) approach, $λ(ω)$ offers a sub-exponential improvement in parallel space and a monotonically increasing speedup for $n \ge n_0$. The Sierpinski gasket fractal is used as a particular case study and the experimental performance results show that $λ(ω)$ reaches up to $9\times$ of speedup over the bounding-box approach. A tensor-core based implementation of $λ(ω)$ is also proposed for modern GPUs, providing up to $\sim40\%$ of extra performance. The results obtained in this work show that doing efficient GPU thread mapping on fractal domains can significantly improve the performance of several applications that work with this type of geometry.

Submitted 25 April, 2020; originally announced April 2020.

Comments: 20 Pages. arXiv admin note: text overlap with arXiv:1706.04552

ACM Class: C.1.4; G.2.0

arXiv:2002.01462 [pdf]

Semantic Search of Memes on Twitter

Authors: Jesus Perez-Martin, Benjamin Bustos, Magdalena Saldana

Abstract: Memes are becoming a useful source of data for analyzing behavior on social media. However, a problem to tackle is how to correctly identify a meme. As the number of memes published every day on social media is huge, there is a need for automatic methods for classifying and searching in large meme datasets. This paper proposes and compares several methods for automatically classifying images as memes. Also, we propose a method that allows us to implement a system for retrieving memes from a dataset using a textual query. We experimentally evaluate the methods using a large dataset of memes collected from Twitter users in Chile, which was annotated by a group of experts. Though some of the evaluated methods are effective, there is still room for improvement.

Submitted 20 May, 2020; v1 submitted 4 February, 2020; originally announced February 2020.

Comments: Computational Methods Interest Group of the 70th International Communication Association Conference, May 2020. Virtual conference presentation link: https://player.vimeo.com/video/418320378

arXiv:1706.04552 [pdf, ps, other]

Block-space GPU Mapping for Embedded Sierpiński Gasket Fractals

Authors: Cristóbal A. Navarro, Benjamín Bustos, Raimundo Vega, Nancy Hitschfeld

Abstract: This work studies the problem of GPU thread mapping for a Sierpiński gasket fractal embedded in a discrete Euclidean space of $n \times n$. A block-space map $λ: \mathbb{Z}_{\mathbb{E}}^{2} \mapsto \mathbb{Z}_{\mathbb{F}}^{2}$ is proposed, from Euclidean parallel space $\mathbb{E}$ to embedded fractal space $\mathbb{F}$, that maps in $\mathcal{O}(\log_2 \log_2(n))$ time and uses no more than $\mathcal{O}(n^\mathbb{H})$ threads with $\mathbb{H} \approx 1.58...$ being the Hausdorff dimension, making it parallel space efficient. When compared to a bounding-box map, $λ(ω)$ offers a sub-exponential improvement in parallel space and a monotonically increasing speedup once $n > n_0$. Experimental performance tests show that in practice $λ(ω)$ can produce performance improvement at any block-size once $n > n_0 = 2^8$, reaching approximately $10\times$ of speedup for $n=2^{16}$ under optimal block configurations.

Submitted 14 June, 2017; originally announced June 2017.

Comments: 7 pages, 8 Figures

arXiv:1610.07394 [pdf, other]

Possibilities of Recursive GPU Mapping for Discrete Orthogonal Simplices

Authors: Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

Abstract: The problem of parallel thread mapping is studied for the case of discrete orthogonal $m$-simplices. The possibility of an $O(1)$ time recursive block-space map $λ: \mathbb{Z}^m \mapsto \mathbb{Z}^m$ is analyzed from the point of view of parallel space efficiency and potential performance improvement. The $2$-simplex and $3$-simplex are analyzed as special cases, where constant time maps are found, providing a potential improvement of up to $2\times$ and $6\times$ more efficient than a bounding-box approach, respectively. For the general case it is shown that finding an efficient recursive parallel space for an $m$-simplex depends on the choice of two parameters, for which some insights are provided which can lead to a volume that matches the $m$-simplex for $n>n_0$, making parallel space approximately $m!$ times more efficient than a bounding-box.

Submitted 24 October, 2016; originally announced October 2016.

arXiv:1609.01490 [pdf, ps, other]

A Non-linear GPU Thread Map for Triangular Domains

Authors: Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

Abstract: There is a stage in the GPU computing pipeline where a grid of thread-blocks, in \textit{parallel space}, is mapped onto the problem domain, in \textit{data space}. Since the parallel space is restricted to a box type geometry, the mapping approach is typically a $k$-dimensional bounding box (BB) that covers a $p$-dimensional data space. Threads that fall inside the domain perform computations while threads that fall outside are discarded at runtime. In this work we study the case of mapping threads efficiently onto triangular domain problems and propose a block-space linear map $λ(ω)$, based on the properties of the lower triangular matrix, that reduces the number of unnecessary threads from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$. Performance results for global memory accesses show an improvement of up to $18\%$ with respect to the \textit{bounding-box} approach, placing $λ(ω)$ in second place below the \textit{rectangular-box} approach and above the \textit{recursive-partition} and \textit{upper-triangular} approaches. For shared memory scenarios $λ(ω)$ was the fastest approach achieving $7\%$ of performance improvement while preserving thread locality. The results obtained in this work make $λ(ω)$ an interesting map for efficient GPU computing on parallel problems that define a triangular domain with or without neighborhood interactions. The extension to tetrahedral domains is analyzed, with applications to triplet-interaction n-body applications.

Submitted 6 September, 2016; originally announced September 2016.

Comments: 16 pages, 7 Figures

arXiv:1606.08881 [pdf, ps, other]

Potential benefits of a block-space GPU approach for discrete tetrahedral domains

Authors: Cristóbal A. Navarro, Benjamín Bustos, Nancy Hitschfeld

Abstract: The study of data-parallel domain re-organization and thread-mapping techniques are relevant topics as they can increase the efficiency of GPU computations when working on spatial discrete domains with non-box-shaped geometry. In this work we study the potential benefits of applying a succinct data re-organization of a tetrahedral data-parallel domain of size $\mathcal{O}(n^3)$ combined with an efficient block-space GPU map of the form $g:\mathbb{N} \rightarrow \mathbb{N}^3$. Results from the analysis suggest that in theory the combination of these two optimizations produce significant performance improvement as block-based data re-organization allows a coalesced one-to-one correspondence at local thread-space while $g(λ)$ produces an efficient block-space spatial correspondence between groups of data and groups of threads, reducing the number of unnecessary threads from $O(n^3)$ to $O(n^2ρ^3)$ where $ρ$ is the linear block-size and typically $ρ^3 \ll n$. From the analysis, we obtained that a block based succinct data re-organization can provide up to $2\times$ improved performance over a linear data organization while the map can be up to $6\times$ more efficient than a bounding box approach. The results from this work can serve as a useful guide for a more efficient GPU computation on tetrahedral domains found in spin lattice, finite element and special n-body problems, among others.

Submitted 28 June, 2016; originally announced June 2016.

arXiv:1102.4258 [pdf, other]

SHREC 2011: robust feature detection and description benchmark

Authors: E. Boyer, A. M. Bronstein, M. M. Bronstein, B. Bustos, T. Darom, R. Horaud, I. Hotz, Y. Keller, J. Keustermans, A. Kovnatsky, R. Litman, J. Reininghaus, I. Sipiran, D. Smeets, P. Suetens, D. Vandermeulen, A. Zaharescu, V. Zobel

Abstract: Feature-based approaches have recently become very popular in computer vision and image analysis applications, and are becoming a promising direction in shape retrieval. SHREC'11 robust feature detection and description benchmark simulates the feature detection and description stages of feature-based shape retrieval algorithms. The benchmark tests the performance of shape feature detectors and descriptors under a wide variety of transformations. The benchmark allows evaluating how algorithms cope with certain classes of transformations and strength of the transformations that can be dealt with. The present paper is a report of the SHREC'11 robust feature detection and description benchmark results.

Submitted 21 February, 2011; originally announced February 2011.

Comments: This is a full version of the SHREC'11 report published in 3DOR

Showing 1–11 of 11 results for author: Bustos, B