-
CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
Authors:
Pranav Ajit Nair,
Arun Sai Suggala
Abstract:
Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-t…
▽ More
Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses coordinate descent to minimize the layer-wise reconstruction loss to achieve high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. Through extensive evaluation on the PaLM2 model family, we demonstrate that CDQuant consistently outperforms GPTQ across diverse model sizes and quantization levels. In particular, for INT2 quantization of PaLM2-Otter, CDQuant achieves a 10% reduction in perplexity compared to GPTQ.
△ Less
Submitted 26 June, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training
Authors:
Muhammad Adnan,
Amar Phanishayee,
Janardhan Kulkarni,
Prashant J. Nair,
Divya Mahajan
Abstract:
In this paper, we present a novel technique to search for hardware architectures of accelerators optimized for end-to-end training of deep neural networks (DNNs). Our approach addresses both single-device and distributed pipeline and tensor model parallel scenarios, latter being addressed for the first time. The search optimized accelerators for training relevant metrics such as throughput/TDP und…
▽ More
In this paper, we present a novel technique to search for hardware architectures of accelerators optimized for end-to-end training of deep neural networks (DNNs). Our approach addresses both single-device and distributed pipeline and tensor model parallel scenarios, latter being addressed for the first time. The search optimized accelerators for training relevant metrics such as throughput/TDP under a fixed area and power constraints. However, with the proliferation of specialized architectures and complex distributed training mechanisms, the design space exploration of hardware accelerators is very large. Prior work in this space has tried to tackle this by reducing the search space to either a single accelerator execution that too only for inference, or tuning the architecture for specific layers (e.g., convolution). Instead, we take a unique heuristic-based critical path-based approach to determine the best use of available resources (power and area) either for a set of DNN workloads or each workload individually. First, we perform local search to determine the architecture for each pipeline and tensor model stage. Specifically, the system iteratively generates architectural configurations and tunes the design using a novel heuristic-based approach that prioritizes accelerator resources and scheduling to critical operators in a machine learning workload. Second, to address the complexities of distributed training, the local search selects multiple (k) designs per stage. A global search then identifies an accelerator from the top-k sets to optimize training throughput across the stages. We evaluate this work on 11 different DNN models. Compared to a recent inference-only work Spotlight, our method converges to a design in, on average, 31x less time and offers 12x higher throughput. Moreover, designs generated using our method achieve 12% throughput improvement over TPU architecture.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
Accelerating Recommender Model Training by Dynamically Skip** Stale Embeddings
Authors:
Yassaman Ebrahimzadeh Maboud,
Muhammad Adnan,
Divya Mahajan,
Prashant J. Nair
Abstract:
Training recommendation models pose significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variatio…
▽ More
Training recommendation models pose significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variation, resulting in saturation. Consequently, updates to these embeddings lack any contribution to model quality. This paper presents Slipstream, a software framework that identifies stale embeddings on the fly and skips their updates to enhance performance. This capability enables Slipstream to achieve substantial speedup, optimize CPU-GPU bandwidth usage, and eliminate unnecessary memory access. SlipStream showcases training time reductions of 2x, 2.4x, 1.2x, and 1.175x across real-world datasets and configurations, compared to Baseline XDL, Intel-optimized DRLM, FAE, and Hotline, respectively.
△ Less
Submitted 21 March, 2024;
originally announced April 2024.
-
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
Authors:
Muhammad Adnan,
Akhil Arunkumar,
Gaurav Jain,
Prashant J. Nair,
Ilya Soloveychik,
Purushotham Kamath
Abstract:
Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phas…
▽ More
Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs.
This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer's reduction of KV cache reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.
△ Less
Submitted 5 April, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Tandem Transformers for Inference Efficient LLMs
Authors:
Aishwarya P S,
Pranav Ajit Nair,
Yashas Samaga,
Toby Boyd,
Sanjiv Kumar,
Prateek Jain,
Praneeth Netrapalli
Abstract:
The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face limitations: either relying on less accurate smaller models for generation or failing to fully leverage the base LLM's representations.
We introduce a novel architectu…
▽ More
The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face limitations: either relying on less accurate smaller models for generation or failing to fully leverage the base LLM's representations.
We introduce a novel architecture, Tandem transformers, to address these issues. This architecture uniquely combines (1) a small autoregressive model and (2) a large model operating in block mode (processing multiple tokens simultaneously). The small model's predictive accuracy is substantially enhanced by granting it attention to the large model's richer representations. On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance. We further incorporate the tandem model within the speculative decoding (SPEED) framework where the large model validates tokens from the small model. This ensures that the Tandem of PaLM2-Bison and PaLM2-Gecko achieves substantial speedup (around 1.14x faster than using vanilla PaLM2-Gecko in SPEED) while maintaining identical downstream task accuracy.
△ Less
Submitted 26 March, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
Amortized Reparametrization: Efficient and Scalable Variational Inference for Latent SDEs
Authors:
Kevin Course,
Prasanth B. Nair
Abstract:
We consider the problem of inferring latent stochastic differential equations (SDEs) with a time and memory cost that scales independently with the amount of data, the total length of the time series, and the stiffness of the approximate differential equations. This is in stark contrast to typical methods for inferring latent differential equations which, despite their constant memory cost, have a…
▽ More
We consider the problem of inferring latent stochastic differential equations (SDEs) with a time and memory cost that scales independently with the amount of data, the total length of the time series, and the stiffness of the approximate differential equations. This is in stark contrast to typical methods for inferring latent differential equations which, despite their constant memory cost, have a time complexity that is heavily dependent on the stiffness of the approximate differential equation. We achieve this computational advancement by removing the need to solve differential equations when approximating gradients using a novel amortization strategy coupled with a recently derived reparametrization of expectations under linear SDEs. We show that, in practice, this allows us to achieve similar performance to methods based on adjoint sensitivities with more than an order of magnitude fewer evaluations of the model in training.
△ Less
Submitted 16 December, 2023;
originally announced December 2023.
-
Predicting the Age of Astronomical Transients from Real-Time Multivariate Time Series
Authors:
Hali Huang,
Daniel Muthukrishna,
Prajna Nair,
Zimi Zhang,
Michael Fausnaugh,
Torsha Majumder,
Ryan J. Foley,
George R. Ricker
Abstract:
Astronomical transients, such as supernovae and other rare stellar explosions, have been instrumental in some of the most significant discoveries in astronomy. New astronomical sky surveys will soon record unprecedented numbers of transients as sparsely and irregularly sampled multivariate time series. To improve our understanding of the physical mechanisms of transients and their progenitor syste…
▽ More
Astronomical transients, such as supernovae and other rare stellar explosions, have been instrumental in some of the most significant discoveries in astronomy. New astronomical sky surveys will soon record unprecedented numbers of transients as sparsely and irregularly sampled multivariate time series. To improve our understanding of the physical mechanisms of transients and their progenitor systems, early-time measurements are necessary. Prioritizing the follow-up of transients based on their age along with their class is crucial for new surveys. To meet this demand, we present the first method of predicting the age of transients in real-time from multi-wavelength time-series observations. We build a Bayesian probabilistic recurrent neural network. Our method can accurately predict the age of a transient with robust uncertainties as soon as it is initially triggered by a survey telescope. This work will be essential for the advancement of our understanding of the numerous young transients being detected by ongoing and upcoming astronomical surveys.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications
Authors:
Bo Fang,
Xinyi Li,
Harvey Dam,
Cheng Tan,
Siva Kumar Sastry Hari,
Timothy Tsai,
Ignacio Laguna,
Dingwen Tao,
Ganesh Gopalakrishnan,
Prashant Nair,
Kevin Barker,
Ang Li
Abstract:
Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet such demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is the support of mixed-precision enabled GEMM. For DNN models, lower-precision FP data formats and computation offer acceptable correctness but significan…
▽ More
Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet such demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is the support of mixed-precision enabled GEMM. For DNN models, lower-precision FP data formats and computation offer acceptable correctness but significant performance, area, and memory footprint improvement. While promising, the mixed-precision computation on error resilience remains unexplored. To this end, we develop a fault injection framework that systematically injects fault into the mixed-precision computation results. We investigate how the faults affect the accuracy of machine learning applications. Based on the error resilience characteristics, we offer lightweight error detection and correction solutions that significantly improve the overall model accuracy if the models experience hardware faults. The solutions can be efficiently integrated into the accelerator's pipelines.
△ Less
Submitted 9 November, 2023;
originally announced November 2023.
-
Ad-Rec: Advanced Feature Interactions to Address Covariate-Shifts in Recommendation Networks
Authors:
Muhammad Adnan,
Yassaman Ebrahimzadeh Maboud,
Divya Mahajan,
Prashant J. Nair
Abstract:
Recommendation models are vital in delivering personalized user experiences by leveraging the correlation between multiple input features. However, deep learning-based recommendation models often face challenges due to evolving user behaviour and item features, leading to covariate shifts. Effective cross-feature learning is crucial to handle data distribution drift and adapting to changing user b…
▽ More
Recommendation models are vital in delivering personalized user experiences by leveraging the correlation between multiple input features. However, deep learning-based recommendation models often face challenges due to evolving user behaviour and item features, leading to covariate shifts. Effective cross-feature learning is crucial to handle data distribution drift and adapting to changing user behaviour. Traditional feature interaction techniques have limitations in achieving optimal performance in this context.
This work introduces Ad-Rec, an advanced network that leverages feature interaction techniques to address covariate shifts. This helps eliminate irrelevant interactions in recommendation tasks. Ad-Rec leverages masked transformers to enable the learning of higher-order cross-features while mitigating the impact of data distribution drift. Our approach improves model quality, accelerates convergence, and reduces training time, as measured by the Area Under Curve (AUC) metric. We demonstrate the scalability of Ad-Rec and its ability to achieve superior model quality through comprehensive ablation studies.
△ Less
Submitted 28 August, 2023;
originally announced August 2023.
-
FLuID: Mitigating Stragglers in Federated Learning using Invariant Dropout
Authors:
Irene Wang,
Prashant J. Nair,
Divya Mahajan
Abstract:
Federated Learning (FL) allows machine learning models to train locally on individual mobile devices, synchronizing model updates via a shared server. This approach safeguards user privacy; however, it also generates a heterogeneous training environment due to the varying performance capabilities across devices. As a result, straggler devices with lower performance often dictate the overall traini…
▽ More
Federated Learning (FL) allows machine learning models to train locally on individual mobile devices, synchronizing model updates via a shared server. This approach safeguards user privacy; however, it also generates a heterogeneous training environment due to the varying performance capabilities across devices. As a result, straggler devices with lower performance often dictate the overall training time in FL. In this work, we aim to alleviate this performance bottleneck due to stragglers by dynamically balancing the training load across the system. We introduce Invariant Dropout, a method that extracts a sub-model based on the weight update threshold, thereby minimizing potential impacts on accuracy. Building on this dropout technique, we develop an adaptive training framework, Federated Learning using Invariant Dropout (FLuID). FLuID offers a lightweight sub-model extraction to regulate computational intensity, thereby reducing the load on straggler devices without affecting model quality. Our method leverages neuron updates from non-straggler devices to construct a tailored sub-model for each straggler based on client performance profiling. Furthermore, FLuID can dynamically adapt to changes in stragglers as runtime conditions shift. We evaluate FLuID using five real-world mobile clients. The evaluations show that Invariant Dropout maintains baseline model efficiency while alleviating the performance bottleneck of stragglers through a dynamic, runtime approach.
△ Less
Submitted 26 September, 2023; v1 submitted 5 July, 2023;
originally announced July 2023.
-
Domain Aligned Prefix Averaging for Domain Generalization in Abstractive Summarization
Authors:
Pranav Ajit Nair,
Sukomal Pal,
Pradeepika Verma
Abstract:
Domain generalization is hitherto an underexplored area applied in abstractive summarization. Moreover, most existing works on domain generalization have sophisticated training algorithms. In this paper, we propose a lightweight, weight averaging based, Domain Aligned Prefix Averaging approach to domain generalization for abstractive summarization. Given a number of source domains, our method firs…
▽ More
Domain generalization is hitherto an underexplored area applied in abstractive summarization. Moreover, most existing works on domain generalization have sophisticated training algorithms. In this paper, we propose a lightweight, weight averaging based, Domain Aligned Prefix Averaging approach to domain generalization for abstractive summarization. Given a number of source domains, our method first trains a prefix for each one of them. These source prefixes generate summaries for a small number of target domain documents. The similarity of the generated summaries to their corresponding documents is used for calculating weights required to average source prefixes. In DAPA, prefix tuning allows for lightweight finetuning, and weight averaging allows for the computationally efficient addition of new source domains. When evaluated on four diverse summarization domains, DAPA shows comparable or better performance against the baselines, demonstrating the effectiveness of its prefix averaging scheme.
△ Less
Submitted 29 May, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
The Role of Output Vocabulary in T2T LMs for SPARQL Semantic Parsing
Authors:
Debayan Banerjee,
Pranav Ajit Nair,
Ricardo Usbeck,
Chris Biemann
Abstract:
In this work, we analyse the role of output vocabulary for text-to-text (T2T) models on the task of SPARQL semantic parsing. We perform experiments within the the context of knowledge graph question answering (KGQA), where the task is to convert questions in natural language to the SPARQL query language. We observe that the query vocabulary is distinct from human vocabulary. Language Models (LMs)…
▽ More
In this work, we analyse the role of output vocabulary for text-to-text (T2T) models on the task of SPARQL semantic parsing. We perform experiments within the the context of knowledge graph question answering (KGQA), where the task is to convert questions in natural language to the SPARQL query language. We observe that the query vocabulary is distinct from human vocabulary. Language Models (LMs) are pre-dominantly trained for human language tasks, and hence, if the query vocabulary is replaced with a vocabulary more attuned to the LM tokenizer, the performance of models may improve. We carry out carefully selected vocabulary substitutions on the queries and find absolute gains in the range of 17% on the GrailQA dataset.
△ Less
Submitted 24 May, 2023;
originally announced May 2023.
-
GETT-QA: Graph Embedding based T2T Transformer for Knowledge Graph Question Answering
Authors:
Debayan Banerjee,
Pranav Ajit Nair,
Ricardo Usbeck,
Chris Biemann
Abstract:
In this work, we present an end-to-end Knowledge Graph Question Answering (KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text pre-trained language model. The model takes a question in natural language as input and produces a simpler form of the intended SPARQL query. In the simpler form, the model does not directly produce entity and relation IDs. Instead, it produces correspondin…
▽ More
In this work, we present an end-to-end Knowledge Graph Question Answering (KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text pre-trained language model. The model takes a question in natural language as input and produces a simpler form of the intended SPARQL query. In the simpler form, the model does not directly produce entity and relation IDs. Instead, it produces corresponding entity and relation labels. The labels are grounded to KG entity and relation IDs in a subsequent step. To further improve the results, we instruct the model to produce a truncated version of the KG embedding for each entity. The truncated KG embedding enables a finer search for disambiguation purposes. We find that T5 is able to learn the truncated KG embeddings without any change of loss function, improving KGQA performance. As a result, we report strong results for LC-QuAD 2.0 and SimpleQuestions-Wikidata datasets on end-to-end KGQA over Wikidata.
△ Less
Submitted 28 March, 2023; v1 submitted 23 March, 2023;
originally announced March 2023.
-
Biologically inspired ChaosNet architecture for Hypothetical Protein Classification
Authors:
Sneha K H,
Adhithya Sudeesh,
Pramod P Nair,
Prashanth Suravajhala
Abstract:
ChaosNet is a type of artificial neural network framework developed for classification problems and is influenced by the chaotic property of the human brain. Each neuron of the ChaosNet architecture is the one-dimensional chaotic map called the Generalized Luroth Series (GLS). The addition of GLS as neurons in ChaosNet makes the computations straightforward while utilizing the advantageous element…
▽ More
ChaosNet is a type of artificial neural network framework developed for classification problems and is influenced by the chaotic property of the human brain. Each neuron of the ChaosNet architecture is the one-dimensional chaotic map called the Generalized Luroth Series (GLS). The addition of GLS as neurons in ChaosNet makes the computations straightforward while utilizing the advantageous elements of chaos. With substantially less data, ChaosNet has been demonstrated to do difficult classification problems on par with or better than traditional ANNs. In this paper, we use Chaosnet to perform a functional classification of Hypothetical proteins [HP], which is indeed a topic of great interest in bioinformatics. The results obtained with significantly lesser training data are compared with the standard machine learning techniques used in the literature.
△ Less
Submitted 5 February, 2023;
originally announced February 2023.
-
Physics-informed Neural Networks approach to solve the Blasius function
Authors:
Greeshma Krishna,
Malavika S Nair,
Pramod P Nair,
Anil Lal S
Abstract:
Deep learning techniques with neural networks have been used effectively in computational fluid dynamics (CFD) to obtain solutions to nonlinear differential equations. This paper presents a physics-informed neural network (PINN) approach to solve the Blasius function. This method eliminates the process of changing the non-linear differential equation to an initial value problem. Also, it tackles t…
▽ More
Deep learning techniques with neural networks have been used effectively in computational fluid dynamics (CFD) to obtain solutions to nonlinear differential equations. This paper presents a physics-informed neural network (PINN) approach to solve the Blasius function. This method eliminates the process of changing the non-linear differential equation to an initial value problem. Also, it tackles the convergence issue arising in the conventional series solution. It is seen that this method produces results that are at par with the numerical and conventional methods. The solution is extended to the negative axis to show that PINNs capture the singularity of the function at $η=-5.69$
△ Less
Submitted 5 February, 2023; v1 submitted 30 December, 2022;
originally announced January 2023.
-
Scalable and Secure Row-Swap: Efficient and Safe Row Hammer Mitigation in Memory Systems
Authors:
Jeonghyun Woo,
Gururaj Saileshwar,
Prashant J. Nair
Abstract:
As Dynamic Random Access Memories (DRAM) scale, they are becoming increasingly susceptible to Row Hammer. By rapidly activating rows of DRAM cells (aggressor rows), attackers can exploit inter-cell interference through Row Hammer to flip bits in neighboring rows (victim rows). A recent work, called Randomized Row-Swap (RRS), proposed proactively swap** aggressor rows with randomly selected rows…
▽ More
As Dynamic Random Access Memories (DRAM) scale, they are becoming increasingly susceptible to Row Hammer. By rapidly activating rows of DRAM cells (aggressor rows), attackers can exploit inter-cell interference through Row Hammer to flip bits in neighboring rows (victim rows). A recent work, called Randomized Row-Swap (RRS), proposed proactively swap** aggressor rows with randomly selected rows before an aggressor row can cause Row Hammer.
Our paper observes that RRS is neither secure nor scalable. We first propose the `Juggernaut attack pattern' that breaks RRS in under 1 day. Juggernaut exploits the fact that the mitigative action of RRS, a swap operation, can itself induce additional target row activations, defeating such a defense. Second, this paper proposes a new defense Secure Row-Swap mechanism that avoids the additional activations from swap (and unswap) operations and protects against Juggernaut. Furthermore, this paper extends Secure Row-Swap with attack detection to defend against even future attacks. While this provides better security, it also allows for securely reducing the frequency of swaps, thereby enabling Scalable and Secure Row-Swap. The Scalable and Secure Row-Swap mechanism provides years of Row Hammer protection with 3.3X lower storage overheads as compared to the RRS design. It incurs only a 0.7% slowdown as compared to a not-secure baseline for a Row Hammer threshold of 1200.
△ Less
Submitted 23 December, 2022;
originally announced December 2022.
-
The Dirty Secret of SSDs: Embodied Carbon
Authors:
Swamit Tannu,
Prashant J. Nair
Abstract:
Scalable Solid-State Drives (SSDs) have ushered in a transformative era in data storage and accessibility, spanning both data centers and portable devices. However, the strides made in scaling this technology can bear significant environmental consequences. On a global scale, a notable portion of semiconductor manufacturing relies on electricity derived from coal and natural gas sources. A strikin…
▽ More
Scalable Solid-State Drives (SSDs) have ushered in a transformative era in data storage and accessibility, spanning both data centers and portable devices. However, the strides made in scaling this technology can bear significant environmental consequences. On a global scale, a notable portion of semiconductor manufacturing relies on electricity derived from coal and natural gas sources. A striking example of this is the manufacturing process for a single Gigabyte of Flash memory, which emits approximately 0.16 Kg of CO2 - a considerable fraction of the total carbon emissions attributed to the system. Remarkably, the manufacturing of storage devices alone contributed to an estimated 20 million metric tonnes of CO2 emissions in the year 2021.
In light of these environmental concerns, this paper delves into an analysis of the sustainability trade-offs inherent in Solid-State Drives (SSDs) when compared to traditional Hard Disk Drives (HDDs). Moreover, this study proposes methodologies to gauge the embodied carbon costs associated with storage systems effectively. The research encompasses four key strategies to enhance the sustainability of storage systems. In summation, this paper critically addresses the embodied carbon issues associated with SSDs, comparing them with HDDs, and proposes a comprehensive framework of strategies to enhance the sustainability of storage systems.
△ Less
Submitted 28 September, 2023; v1 submitted 8 July, 2022;
originally announced July 2022.
-
Upper Bounds to Genome Rearrangement Problem using Prefix Transpositions
Authors:
Pramod P Nair
Abstract:
A Genome rearrangement problem studies large-scale mutations on a set of DNAs in living organisms. Various rearrangements like reversals, transpositions, translocations, fissions, fusions, and combinations and different variations have been studied extensively by computational biologists and computer scientists over the past four decades. From a mathematical point of view, a genome is represented…
▽ More
A Genome rearrangement problem studies large-scale mutations on a set of DNAs in living organisms. Various rearrangements like reversals, transpositions, translocations, fissions, fusions, and combinations and different variations have been studied extensively by computational biologists and computer scientists over the past four decades. From a mathematical point of view, a genome is represented by a permutation. The genome rearrangement problem is interpreted as a problem that transforms one permutation into another in a minimum number of moves under certain constraints depending on the chosen rearrangements. Finding the minimum number of moves is equivalent to sorting the permutation with the given rearrangement. A transposition is an operation on a permutation that moves a sublist of a permutation to a different position in the same permutation. A \emph{Prefix Transposition}, as the name suggests, is a transposition that moves a sublist which is a prefix of the permutation.
In this thesis, we study prefix transpositions on permutations and present a better upper bound for sorting permutations with prefix transpositions. A greedy algorithm called the \emph{generalised sequence length algorithm} is defined as an extension of the sequence length algorithm where suitable alternate moves are also considered. This algorithm is used to sequentially improve the upper bound to $n-\log_{3.3} n$ and $n-\log_3 n$. In the latter part of the thesis, we defined the concept of a \emph{block}. We used it along with the greedy moves of the generalised sequence length algorithm to get an upper bound of $n-\log_2 n$ to sort permutations by prefix transpositions.
△ Less
Submitted 10 May, 2022;
originally announced May 2022.
-
Modern Baselines for SPARQL Semantic Parsing
Authors:
Debayan Banerjee,
Pranav Ajit Nair,
Jivat Neet Kaur,
Ricardo Usbeck,
Chris Biemann
Abstract:
In this work, we focus on the task of generating SPARQL queries from natural language questions, which can then be executed on Knowledge Graphs (KGs). We assume that gold entity and relations have been provided, and the remaining task is to arrange them in the right order along with SPARQL vocabulary, and input tokens to produce the correct SPARQL query. Pre-trained Language Models (PLMs) have not…
▽ More
In this work, we focus on the task of generating SPARQL queries from natural language questions, which can then be executed on Knowledge Graphs (KGs). We assume that gold entity and relations have been provided, and the remaining task is to arrange them in the right order along with SPARQL vocabulary, and input tokens to produce the correct SPARQL query. Pre-trained Language Models (PLMs) have not been explored in depth on this task so far, so we experiment with BART, T5 and PGNs (Pointer Generator Networks) with BERT embeddings, looking for new baselines in the PLM era for this task, on DBpedia and Wikidata KGs. We show that T5 requires special input tokenisation, but produces state of the art performance on LC-QuAD 1.0 and LC-QuAD 2.0 datasets, and outperforms task-specific models from previous works. Moreover, the methods enable semantic parsing for questions where a part of the input needs to be copied to the output query, thus enabling a new paradigm in KG semantic parsing.
△ Less
Submitted 14 September, 2023; v1 submitted 27 April, 2022;
originally announced April 2022.
-
Heterogeneous Acceleration Pipeline for Recommendation System Training
Authors:
Muhammad Adnan,
Yassaman Ebrahimzadeh Maboud,
Divya Mahajan,
Prashant J. Nair
Abstract:
Recommendation models rely on deep learning networks and large embedding tables, resulting in computationally and memory-intensive processes. These models are typically trained using hybrid CPU-GPU or GPU-only configurations. The hybrid mode combines the GPU's neural network acceleration with the CPUs' memory storage and supply for embedding tables but may incur significant CPU-to-GPU transfer tim…
▽ More
Recommendation models rely on deep learning networks and large embedding tables, resulting in computationally and memory-intensive processes. These models are typically trained using hybrid CPU-GPU or GPU-only configurations. The hybrid mode combines the GPU's neural network acceleration with the CPUs' memory storage and supply for embedding tables but may incur significant CPU-to-GPU transfer time. In contrast, the GPU-only mode utilizes High Bandwidth Memory (HBM) across multiple GPUs for storing embedding tables. However, this approach is expensive and presents scaling concerns.
This paper introduces Hotline, a heterogeneous acceleration pipeline that addresses these concerns. Hotline develops a data-aware and model-aware scheduling pipeline by leveraging the insight that only a few embedding entries are frequently accessed (popular). This approach utilizes CPU main memory for non-popular embeddings and GPUs' HBM for popular embeddings. To achieve this, Hotline accelerator fragments a mini-batch into popular and non-popular micro-batches. It gathers the necessary working parameters for non-popular micro-batches from the CPU, while GPUs execute popular micro-batches. The hardware accelerator dynamically coordinates the execution of popular embeddings on GPUs and non-popular embeddings from the CPU's main memory. Real-world datasets and models confirm Hotline's effectiveness, reducing average end-to-end training time by 2.2x compared to Intel-optimized CPU-GPU DLRM baseline.
△ Less
Submitted 28 April, 2024; v1 submitted 11 April, 2022;
originally announced April 2022.
-
TQSim: A Case for Reuse-Focused Tree-Based Quantum Circuit Simulation
Authors:
Meng Wang,
Rui Huang,
Swamit Tannu,
Prashant Nair
Abstract:
Quantum computers can speed up computationally hard problems. However, to realize their full potential, we must mitigate qubit errors (from noise) by develo** noise-aware algorithms, compilers, and architectures. Thus, simulating quantum programs on classical computers with different noise models is a de-facto tool that is used by researchers and practitioners. Unfortunately, noisy quantum simul…
▽ More
Quantum computers can speed up computationally hard problems. However, to realize their full potential, we must mitigate qubit errors (from noise) by develo** noise-aware algorithms, compilers, and architectures. Thus, simulating quantum programs on classical computers with different noise models is a de-facto tool that is used by researchers and practitioners. Unfortunately, noisy quantum simulators iteratively execute the same circuit across multiple trials (shots), thereby incurring high-performance overheads. To address this, we propose a noisy simulation technique called Tree-Based Quantum Circuit Simulation (TQSim). TQSim exploits the reusability of the intermediate results during the noisy simulation and reduces computation. TQSim dynamically partitions a circuit into several subcircuits. It then reuses the intermediate results from these subcircuits during computation. As compared to a noisy Qulacs-based baseline simulator, TQSim achieves an average speedup of 2.51x across 48 different benchmark circuits. Additionally, across benchmarks, TQSim produces results with a normalized fidelity that is within the 0.016 range of the baseline normalized fidelity.
△ Less
Submitted 25 March, 2022;
originally announced March 2022.
-
Fixed-Point and Objective Convergence of Plug-and-Play Algorithms
Authors:
Pravin Nair,
Ruturaj G. Gavaskar,
Kunal N. Chaudhury
Abstract:
A standard model for image reconstruction involves the minimization of a data-fidelity term along with a regularizer, where the optimization is performed using proximal algorithms such as ISTA and ADMM. In plug-and-play (PnP) regularization, the proximal operator (associated with the regularizer) in ISTA and ADMM is replaced by a powerful image denoiser. Although PnP regularization works surprisin…
▽ More
A standard model for image reconstruction involves the minimization of a data-fidelity term along with a regularizer, where the optimization is performed using proximal algorithms such as ISTA and ADMM. In plug-and-play (PnP) regularization, the proximal operator (associated with the regularizer) in ISTA and ADMM is replaced by a powerful image denoiser. Although PnP regularization works surprisingly well in practice, its theoretical convergence -- whether convergence of the PnP iterates is guaranteed and if they minimize some objective function -- is not completely understood even for simple linear denoisers such as nonlocal means. In particular, while there are works where either iterate or objective convergence is established separately, a simultaneous guarantee on iterate and objective convergence is not available for any denoiser to our knowledge. In this paper, we establish both forms of convergence for a special class of linear denoisers. Notably, unlike existing works where the focus is on symmetric denoisers, our analysis covers non-symmetric denoisers such as nonlocal means and almost any convex data-fidelity. The novelty in this regard is that we make use of the convergence theory of averaged operators and we work with a special inner product (and norm) derived from the linear denoiser; the latter requires us to appropriately define the gradient and proximal operators associated with the data-fidelity term. We validate our convergence results using image reconstruction experiments.
△ Less
Submitted 21 April, 2021;
originally announced April 2021.
-
Weak Form Generalized Hamiltonian Learning
Authors:
Kevin L. Course,
Trefor W. Evans,
Prasanth B. Nair
Abstract:
We present a method for learning generalized Hamiltonian decompositions of ordinary differential equations given a set of noisy time series measurements. Our method simultaneously learns a continuous time model and a scalar energy function for a general dynamical system. Learning predictive models in this form allows one to place strong, high-level, physics inspired priors onto the form of the lea…
▽ More
We present a method for learning generalized Hamiltonian decompositions of ordinary differential equations given a set of noisy time series measurements. Our method simultaneously learns a continuous time model and a scalar energy function for a general dynamical system. Learning predictive models in this form allows one to place strong, high-level, physics inspired priors onto the form of the learnt governing equations for general dynamical systems. Moreover, having shown how our method extends and unifies some previous work in deep learning with physics inspired priors, we present a novel method for learning continuous time models from the weak form of the governing equations which is less computationally taxing than standard adjoint methods.
△ Less
Submitted 11 April, 2021;
originally announced April 2021.
-
Pushing the Limits of Capsule Networks
Authors:
Prem Nair,
Rohan Doshi,
Stefan Keselj
Abstract:
Convolutional neural networks use pooling and other downscaling operations to maintain translational invariance for detection of features, but in their architecture they do not explicitly maintain a representation of the locations of the features relative to each other. This means they do not represent two instances of the same object in different orientations the same way, like humans do, and so…
▽ More
Convolutional neural networks use pooling and other downscaling operations to maintain translational invariance for detection of features, but in their architecture they do not explicitly maintain a representation of the locations of the features relative to each other. This means they do not represent two instances of the same object in different orientations the same way, like humans do, and so training them often requires extensive data augmentation and exceedingly deep networks. A team at Google Brain recently made news with an attempt to fix this problem: Capsule Networks. While a normal CNN works with scalar outputs representing feature presence, a CapsNet works with vector outputs representing entity presence. We want to stress test CapsNet in various incremental ways to better understand their performance and expressiveness. In broad terms, the goals of our investigation are: (1) test CapsNets on datasets that are like MNIST but harder in a specific way, and (2) explore the internal embedding space and sources of error for CapsNets.
△ Less
Submitted 14 March, 2021;
originally announced March 2021.
-
Accelerating Recommendation System Training by Leveraging Popular Choices
Authors:
Muhammad Adnan,
Yassaman Ebrahimzadeh Maboud,
Divya Mahajan,
Prashant J. Nair
Abstract:
Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representation of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation model…
▽ More
Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representation of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models is evolving to require increasing data and compute resources. The highly parallel neural networks portion of these models can benefit from GPU acceleration however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper deep dives into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed with a few embedding entries being accessed up to 10000x more. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reduces the data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the executions of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3x and 1.52x in comparison to XDL CPU-only and XDL CPU-GPU execution while maintaining baseline accuracy
△ Less
Submitted 28 September, 2021; v1 submitted 28 February, 2021;
originally announced March 2021.
-
Group Equivariant Deep Reinforcement Learning
Authors:
Arnab Kumar Mondal,
Pratheeksha Nair,
Kaleem Siddiqi
Abstract:
In Reinforcement Learning (RL), Convolutional Neural Networks(CNNs) have been successfully applied as function approximators in Deep Q-Learning algorithms, which seek to learn action-value functions and policies in various environments. However, to date, there has been little work on the learning of symmetry-transformation equivariant representations of the input environment state. In this paper,…
▽ More
In Reinforcement Learning (RL), Convolutional Neural Networks(CNNs) have been successfully applied as function approximators in Deep Q-Learning algorithms, which seek to learn action-value functions and policies in various environments. However, to date, there has been little work on the learning of symmetry-transformation equivariant representations of the input environment state. In this paper, we propose the use of Equivariant CNNs to train RL agents and study their inductive bias for transformation equivariant Q-value approximation. We demonstrate that equivariant architectures can dramatically enhance the performance and sample efficiency of RL agents in a highly symmetric environment while requiring fewer parameters. Additionally, we show that they are robust to changes in the environment caused by affine transformations.
△ Less
Submitted 30 June, 2020;
originally announced July 2020.
-
Quadruply Stochastic Gaussian Processes
Authors:
Trefor W. Evans,
Prasanth B. Nair
Abstract:
We introduce a stochastic variational inference procedure for training scalable Gaussian process (GP) models whose per-iteration complexity is independent of both the number of training points, $n$, and the number basis functions used in the kernel approximation, $m$. Our central contributions include an unbiased stochastic estimator of the evidence lower bound (ELBO) for a Gaussian likelihood, as…
▽ More
We introduce a stochastic variational inference procedure for training scalable Gaussian process (GP) models whose per-iteration complexity is independent of both the number of training points, $n$, and the number basis functions used in the kernel approximation, $m$. Our central contributions include an unbiased stochastic estimator of the evidence lower bound (ELBO) for a Gaussian likelihood, as well as a stochastic estimator that lower bounds the ELBO for several other likelihoods such as Laplace and logistic. Independence of the stochastic optimization update complexity on $n$ and $m$ enables inference on huge datasets using large capacity GP models. We demonstrate accurate inference on large classification and regression datasets using GPs and relevance vector machines with up to $m = 10^7$ basis functions.
△ Less
Submitted 4 June, 2020;
originally announced June 2020.
-
Towards Fairness in Visual Recognition: Effective Strategies for Bias Mitigation
Authors:
Zeyu Wang,
Klint Qinami,
Ioannis Christos Karakozis,
Kyle Genova,
Prem Nair,
Kenji Hata,
Olga Russakovsky
Abstract:
Computer vision models learn to perform a task by capturing relevant statistics from training data. It has been shown that models learn spurious age, gender, and race correlations when trained for seemingly unrelated tasks like activity recognition or image captioning. Various mitigation techniques have been presented to prevent models from utilizing or learning such biases. However, there has bee…
▽ More
Computer vision models learn to perform a task by capturing relevant statistics from training data. It has been shown that models learn spurious age, gender, and race correlations when trained for seemingly unrelated tasks like activity recognition or image captioning. Various mitigation techniques have been presented to prevent models from utilizing or learning such biases. However, there has been little systematic comparison between these techniques. We design a simple but surprisingly effective visual recognition benchmark for studying bias mitigation. Using this benchmark, we provide a thorough analysis of a wide range of techniques. We highlight the shortcomings of popular adversarial training approaches for bias mitigation, propose a simple but similarly effective alternative to the inference-time Reducing Bias Amplification method of Zhao et al., and design a domain-independent training technique that outperforms all other methods. Finally, we validate our findings on the attribute classification task in the CelebA dataset, where attribute presence is known to be correlated with the gender of people in the image, and demonstrate that the proposed technique is effective at mitigating real-world gender bias.
△ Less
Submitted 2 April, 2020; v1 submitted 26 November, 2019;
originally announced November 2019.
-
Touché: Towards Ideal and Efficient Cache Compression By Mitigating Tag Area Overheads
Authors:
Seokin Hong,
Bulent Abali,
Alper Buyuktosunoglu,
Michael B. Healy,
Prashant J. Nair
Abstract:
Compression is seen as a simple technique to increase the effective cache capacity. Unfortunately, compression techniques either incur tag area overheads or restrict data placement to only include neighboring compressed cache blocks to mitigate tag area overheads. Ideally, we should be able to place arbitrary compressed cache blocks without any placement restrictions and tag area overheads.
This…
▽ More
Compression is seen as a simple technique to increase the effective cache capacity. Unfortunately, compression techniques either incur tag area overheads or restrict data placement to only include neighboring compressed cache blocks to mitigate tag area overheads. Ideally, we should be able to place arbitrary compressed cache blocks without any placement restrictions and tag area overheads.
This paper proposes Touché, a framework that enables storing multiple arbitrary compressed cache blocks within a physical cacheline without any tag area overheads. The Touché framework consists of three components. The first component, called the ``Signature'' (SIGN) engine, creates shortened signatures from the tag addresses of compressed blocks. Due to this, the SIGN engine can store multiple signatures in each tag entry. On a cache access, the physical cacheline is accessed only if there is a signature match (which has a negligible probability of false positive). The second component, called the ``Tag Appended Data'' (TADA) mechanism, stores the full tag addresses with data. TADA enables Touché to detect false positive signature matches by ensuring that the actual tag address is available for comparison. The third component, called the ``Superblock Marker'' (SMARK) mechanism, uses a unique marker in the tag entry to indicate the occurrence of compressed cache blocks from neighboring physical addresses in the same cacheline. Touché is completely hardware-based and achieves an average speedup of 12\% (ideal 13\%) when compared to an uncompressed baseline.
△ Less
Submitted 2 September, 2019;
originally announced September 2019.
-
Fast High-Dimensional Kernel Filtering
Authors:
Pravin Nair,
Kunal N. Chaudhury
Abstract:
The bilateral and nonlocal means filters are instances of kernel-based filters that are popularly used in image processing. It was recently shown that fast and accurate bilateral filtering of grayscale images can be performed using a low-rank approximation of the kernel matrix. More specifically, based on the eigendecomposition of the kernel matrix, the overall filtering was approximated using spa…
▽ More
The bilateral and nonlocal means filters are instances of kernel-based filters that are popularly used in image processing. It was recently shown that fast and accurate bilateral filtering of grayscale images can be performed using a low-rank approximation of the kernel matrix. More specifically, based on the eigendecomposition of the kernel matrix, the overall filtering was approximated using spatial convolutions, for which efficient algorithms are available. Unfortunately, this technique cannot be scaled to high-dimensional data such as color and hyperspectral images. This is simply because one needs to compute/store a large matrix and perform its eigendecomposition in this case. We show how this problem can be solved using the Nyström method, which is generally used for approximating the eigendecomposition of large matrices. The resulting algorithm can also be used for nonlocal means filtering. We demonstrate the effectiveness of our proposal for bilateral and nonlocal means filtering of color and hyperspectral images. In particular, our method is shown to be competitive with state-of-the-art fast algorithms, and moreover it comes with a theoretical guarantee on the approximation error.
△ Less
Submitted 18 January, 2019;
originally announced January 2019.
-
An Optical Flow-Based Approach for Minimally-Divergent Velocimetry Data Interpolation
Authors:
Berkay Kanberoglu,
Dhritiman Das,
Priya Nair,
Pavan Turaga,
David Frakes
Abstract:
Three-dimensional (3D) biomedical image sets are often acquired with in-plane pixel spacings that are far less than the out-of-plane spacings between images. The resultant anisotropy, which can be detrimental in many applications, can be decreased using image interpolation. Optical flow and/or other registration-based interpolators have proven useful in such interpolation roles in the past. When a…
▽ More
Three-dimensional (3D) biomedical image sets are often acquired with in-plane pixel spacings that are far less than the out-of-plane spacings between images. The resultant anisotropy, which can be detrimental in many applications, can be decreased using image interpolation. Optical flow and/or other registration-based interpolators have proven useful in such interpolation roles in the past. When acquired images are comprised of signals that describe the flow velocity of fluids, additional information is available to guide the interpolation process. In this paper, we present an optical-flow based framework for image interpolation that also minimizes resultant divergence in the interpolated data.
△ Less
Submitted 20 December, 2018;
originally announced December 2018.
-
Fast High-Dimensional Bilateral and Nonlocal Means Filtering
Authors:
Pravin Nair,
Kunal. N. Chaudhury
Abstract:
Existing fast algorithms for bilateral and nonlocal means filtering mostly work with grayscale images. They cannot easily be extended to high-dimensional data such as color and hyperspectral images, patch-based data, flow-fields, etc. In this paper, we propose a fast algorithm for high-dimensional bilateral and nonlocal means filtering. Unlike existing approaches, where the focus is on approximati…
▽ More
Existing fast algorithms for bilateral and nonlocal means filtering mostly work with grayscale images. They cannot easily be extended to high-dimensional data such as color and hyperspectral images, patch-based data, flow-fields, etc. In this paper, we propose a fast algorithm for high-dimensional bilateral and nonlocal means filtering. Unlike existing approaches, where the focus is on approximating the data (using quantization) or the filter kernel (via analytic expansions), we locally approximate the kernel using weighted and shifted copies of a Gaussian, where the weights and shifts are inferred from the data. The algorithm emerging from the proposed approximation essentially involves clustering and fast convolutions, and is easy to implement. Moreover, a variant of our algorithm comes with a guarantee (bound) on the approximation error, which is not enjoyed by existing algorithms. We present some results for high-dimensional bilateral and nonlocal means filtering to demonstrate the speed and accuracy of our proposal. Moreover, we also show that our algorithm can outperform state-of-the-art fast approximations in terms of accuracy and timing.
△ Less
Submitted 6 November, 2018;
originally announced November 2018.
-
Discretely Relaxing Continuous Variables for tractable Variational Inference
Authors:
Trefor W. Evans,
Prasanth B. Nair
Abstract:
We explore a new research direction in Bayesian variational inference with discrete latent variable priors where we exploit Kronecker matrix algebra for efficient and exact computations of the evidence lower bound (ELBO). The proposed "DIRECT" approach has several advantages over its predecessors; (i) it can exactly compute ELBO gradients (i.e. unbiased, zero-variance gradient estimates), eliminat…
▽ More
We explore a new research direction in Bayesian variational inference with discrete latent variable priors where we exploit Kronecker matrix algebra for efficient and exact computations of the evidence lower bound (ELBO). The proposed "DIRECT" approach has several advantages over its predecessors; (i) it can exactly compute ELBO gradients (i.e. unbiased, zero-variance gradient estimates), eliminating the need for high-variance stochastic gradient estimators and enabling the use of quasi-Newton optimization methods; (ii) its training complexity is independent of the number of training points, permitting inference on large datasets; and (iii) its posterior samples consist of sparse and low-precision quantized integers which permit fast inference on hardware limited devices. In addition, our DIRECT models can exactly compute statistical moments of the parameterized predictive posterior without relying on Monte Carlo sampling. The DIRECT approach is not practical for all likelihoods, however, we identify a popular model structure which is practical, and demonstrate accurate inference using latent variables discretized as extremely low-precision 4-bit quantized integers. While the ELBO computations considered in the numerical studies require over $10^{2352}$ log-likelihood evaluations, we train on datasets with over two-million points in just seconds.
△ Less
Submitted 9 January, 2019; v1 submitted 12 September, 2018;
originally announced September 2018.
-
Exploiting Structure for Fast Kernel Learning
Authors:
Trefor W. Evans,
Prasanth B. Nair
Abstract:
We propose two methods for exact Gaussian process (GP) inference and learning on massive image, video, spatial-temporal, or multi-output datasets with missing values (or "gaps") in the observed responses. The first method ignores the gaps using sparse selection matrices and a highly effective low-rank preconditioner is introduced to accelerate computations. The second method introduces a novel app…
▽ More
We propose two methods for exact Gaussian process (GP) inference and learning on massive image, video, spatial-temporal, or multi-output datasets with missing values (or "gaps") in the observed responses. The first method ignores the gaps using sparse selection matrices and a highly effective low-rank preconditioner is introduced to accelerate computations. The second method introduces a novel approach to GP training whereby response values are inferred on the gaps before explicitly training the model. We find this second approach to be greatly advantageous for the class of problems considered. Both of these novel approaches make extensive use of Kronecker matrix algebra to design massively scalable algorithms which have low memory requirements. We demonstrate exact GP inference for a spatial-temporal climate modelling problem with 3.7 million training points as well as a video reconstruction problem with 1 billion points.
△ Less
Submitted 9 August, 2018;
originally announced August 2018.
-
Scalable Gaussian Processes with Grid-Structured Eigenfunctions (GP-GRIEF)
Authors:
Trefor W. Evans,
Prasanth B. Nair
Abstract:
We introduce a kernel approximation strategy that enables computation of the Gaussian process log marginal likelihood and all hyperparameter derivatives in $\mathcal{O}(p)$ time. Our GRIEF kernel consists of $p$ eigenfunctions found using a Nystrom approximation from a dense Cartesian product grid of inducing points. By exploiting algebraic properties of Kronecker and Khatri-Rao tensor products, c…
▽ More
We introduce a kernel approximation strategy that enables computation of the Gaussian process log marginal likelihood and all hyperparameter derivatives in $\mathcal{O}(p)$ time. Our GRIEF kernel consists of $p$ eigenfunctions found using a Nystrom approximation from a dense Cartesian product grid of inducing points. By exploiting algebraic properties of Kronecker and Khatri-Rao tensor products, computational complexity of the training procedure can be practically independent of the number of inducing points. This allows us to use arbitrarily many inducing points to achieve a globally accurate kernel approximation, even in high-dimensional problems. The fast likelihood evaluation enables type-I or II Bayesian inference on large-scale datasets. We benchmark our algorithms on real-world problems with up to two-million training points and $10^{33}$ inducing points.
△ Less
Submitted 1 August, 2018; v1 submitted 5 July, 2018;
originally announced July 2018.
-
LISA: Increasing Internal Connectivity in DRAM for Fast Data Movement and Low Latency
Authors:
Kevin K. Chang,
Prashant J. Nair,
Saugata Ghose,
Donghyuk Lee,
Moinuddin K. Qureshi,
Onur Mutlu
Abstract:
This paper summarizes the idea of Low-Cost Interlinked Subarrays (LISA), which was published in HPCA 2016, and examines the work's significance and future potential. Contemporary systems perform bulk data movement movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel results in high latency a…
▽ More
This paper summarizes the idea of Low-Cost Interlinked Subarrays (LISA), which was published in HPCA 2016, and examines the work's significance and future potential. Contemporary systems perform bulk data movement movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel results in high latency and energy consumption. Prior work proposes to avoid these high costs by exploiting the existing wide internal DRAM bandwidth for bulk data movement, but the limited connectivity of wires within DRAM allows fast data movement within only a single DRAM subarray. Each subarray is only a few megabytes in size, greatly restricting the range over which fast bulk data movement can happen within DRAM.
Our HPCA 2016 paper proposes a new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost. LISA adds low-cost connections between adjacent subarrays. By using these connections to interconnect the existing internal wires (bitlines) of adjacent subarrays, LISA enables wide-bandwidth data transfer across multiple subarrays with little (only 0.8%) DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling a variety of new applications. We describe and evaluate three such applications in detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a DRAM architecture whose rows have heterogeneous access latencies, and (3) accelerated bitline precharging by linking multiple precharge units together. Our extensive evaluations show that each of LISA's three applications significantly improves performance and memory energy efficiency on a variety of workloads and system configurations.
△ Less
Submitted 8 May, 2018;
originally announced May 2018.
-
Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Gras** and Cross-Domain Image Matching
Authors:
Andy Zeng,
Shuran Song,
Kuan-Ting Yu,
Elliott Donlon,
Francois R. Hogan,
Maria Bauza,
Daolin Ma,
Orion Taylor,
Melody Liu,
Eudald Romo,
Nima Fazeli,
Ferran Alet,
Nikhil Chavan Dafle,
Rachel Holladay,
Isabella Morona,
Prem Qu Nair,
Druck Green,
Ian Taylor,
Weber Liu,
Thomas Funkhouser,
Alberto Rodriguez
Abstract:
This paper presents a robotic pick-and-place system that is capable of gras** and recognizing both known and novel objects in cluttered environments. The key new feature of the system is that it handles a wide range of object categories without needing any task-specific training data for novel objects. To achieve this, it first uses a category-agnostic affordance prediction algorithm to select a…
▽ More
This paper presents a robotic pick-and-place system that is capable of gras** and recognizing both known and novel objects in cluttered environments. The key new feature of the system is that it handles a wide range of object categories without needing any task-specific training data for novel objects. To achieve this, it first uses a category-agnostic affordance prediction algorithm to select and execute among four different gras** primitive behaviors. It then recognizes picked objects with a cross-domain image classification framework that matches observed images to product images. Since product images are readily available for a wide range of objects (e.g., from the web), the system works out-of-the-box for novel objects without requiring any additional training data. Exhaustive experimental results demonstrate that our multi-affordance gras** achieves high success rates for a wide variety of objects in clutter, and our recognition algorithm achieves high accuracy for both known and novel grasped objects. The approach was part of the MIT-Princeton Team system that took 1st place in the stowing task at the 2017 Amazon Robotics Challenge. All code, datasets, and pre-trained models are available online at http://arc.cs.princeton.edu
△ Less
Submitted 30 May, 2020; v1 submitted 3 October, 2017;
originally announced October 2017.
-
Architectural Techniques to Enable Reliable and Scalable Memory Systems
Authors:
Prashant J. Nair
Abstract:
High capacity and scalable memory systems play a vital role in enabling our desktops, smartphones, and pervasive technologies like Internet of Things (IoT). Unfortunately, memory systems are becoming increasingly prone to faults. This is because we rely on technology scaling to improve memory density, and at small feature sizes, memory cells tend to break easily. Today, memory reliability is seen…
▽ More
High capacity and scalable memory systems play a vital role in enabling our desktops, smartphones, and pervasive technologies like Internet of Things (IoT). Unfortunately, memory systems are becoming increasingly prone to faults. This is because we rely on technology scaling to improve memory density, and at small feature sizes, memory cells tend to break easily. Today, memory reliability is seen as the key impediment towards using high-density devices, adopting new technologies, and even building the next Exascale supercomputer. To ensure even a bare-minimum level of reliability, present-day solutions tend to have high performance, power and area overheads. Ideally, we would like memory systems to remain robust, scalable, and implementable while kee** the overheads to a minimum. This dissertation describes how simple cross-layer architectural techniques can provide orders of magnitude higher reliability and enable seamless scalability for memory systems while incurring negligible overheads.
△ Less
Submitted 13 April, 2017;
originally announced April 2017.