Showing 1–2 of 2 results for author: Adinets, A

Search v0.5.6 released 2020-02-24

arXiv:2206.01784 [pdf]

cs.DC cs.DS

Onesweep: A Faster Least Significant Digit Radix Sort for GPUs

Authors: Andy Adinets, Duane Merrill

Abstract: We present Onesweep, a least-significant digit (LSD) radix sorting algorithm for large GPU sorting problems residing in global memory. Our parallel algorithm employs a method of single-pass prefix sum that only requires ~2n global read/write operations for each digit-binning iteration. This exhibits a significant reduction in last-level memory traffic versus contemporary GPU radix sorting implementations, where each iteration of digit binning requires two passes through the dataset totaling ~3n global memory operations. On the NVIDIA A100 GPU, our approach achieves 29.4 GKey/s when sorting 256M random 32-bit keys. Compared to CUB, the current state-of-the-art GPU LSD radix sort, our approach provides a speedup of ~1.5x. For 32-bit keys with varied distributions, our approach provides more consistent performance compared to HRS, the current state-of-the-art GPU MSD radix sort, and outperforms it in almost all cases.

Submitted 3 June, 2022; originally announced June 2022.

Comments: 12 pages, 11 figures, 2 tables

ACM Class: F.2.2; D.1.3

arXiv:1806.11248 [pdf, other]

cs.LG stat.ML

XGBoost: Scalable GPU Accelerated Learning

Authors: Rory Mitchell, Andrey Adinets, Thejaswi Rao, Eibe Frank

Abstract: We describe the multi-GPU gradient boosting algorithm implemented in the XGBoost library (https://github.com/dmlc/xgboost). Our algorithm allows fast, scalable training on multi-GPU systems with all of the features of the XGBoost library. We employ data compression techniques to minimise the usage of scarce GPU memory while still allowing highly efficient implementation. Using our algorithm we show that it is possible to process 115 million training instances in under three minutes on a publicly available cloud computing instance. The algorithm is implemented using end-to-end GPU parallelism, with prediction, gradient calculation, feature quantisation, decision tree construction and evaluation phases all computed on device.

Submitted 28 June, 2018; originally announced June 2018.
