-
Accelerating Production LLMs with Combined Token/Embedding Speculators
Authors:
Davis Wertheimer,
Joshua Rosenkranz,
Thomas Parnell,
Sahil Suneja,
Pavithra Ranganathan,
Raghu Ganti,
Mudhakar Srivatsa
Abstract:
This technical report describes the design and training of novel speculative decoding draft models, for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating wall-clock inference speeds of highly optimized base model implementations by a factor of 2-3x. We explore these initial results and describe next steps for further improvements.
Submitted 6 June, 2024; v1 submitted 29 April, 2024;
originally announced April 2024.
-
TP-Aware Dequantization
Authors:
Adnan Hoque,
Mudhakar Srivatsa,
Chih-Chieh Yang,
Raghu Ganti
Abstract:
In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs). Our contribution is an optimized inference deployment scheme that addresses the current limitations of state-of-the-art quantization kernels when used in conjunction with Tensor Parallel (TP). Our method preserves data locality in GPU memory access patterns and exploits a priori knowledge of TP to reduce global communication. We demonstrate an up to 1.81x speedup over existing methods for Llama-70B and up to 1.78x speedup for IBM WatsonX's Granite-20B MLP layer problem sizes on A100 and H100 NVIDIA DGX Systems for a variety of TP settings.
Submitted 15 January, 2024;
originally announced February 2024.
-
SudokuSens: Enhancing Deep Learning Robustness for IoT Sensing Applications using a Generative Approach
Authors:
Tianshi Wang,
Jinyang Li,
Ruijie Wang,
Denizhan Kara,
Shengzhong Liu,
Davis Wertheimer,
Antoni Viros-i-Martin,
Raghu Ganti,
Mudhakar Srivatsa,
Tarek Abdelzaher
Abstract:
This paper introduces SudokuSens, a generative framework for automated generation of training data in machine-learning-based Internet-of-Things (IoT) applications, such that the generated synthetic data mimic experimental configurations not encountered during actual sensor data collection. The framework improves the robustness of resulting deep learning models, and is intended for IoT applications where data collection is expensive. The work is motivated by the fact that IoT time-series data entangle the signatures of observed objects with the confounding intrinsic properties of the surrounding environment and the dynamic environmental disturbances experienced. To incorporate sufficient diversity into the IoT training data, one therefore needs to consider a combinatorial explosion of training cases that are multiplicative in the number of objects considered and the possible environmental conditions in which such objects may be encountered. Our framework substantially reduces these multiplicative training needs. To decouple object signatures from environmental conditions, we employ a Conditional Variational Autoencoder (CVAE) that allows us to reduce data collection needs from multiplicative to (nearly) linear, while synthetically generating (data for) the missing conditions. To obtain robustness with respect to dynamic disturbances, a session-aware temporal contrastive learning approach is taken. Integrating the aforementioned two approaches, SudokuSens significantly improves the robustness of deep learning for IoT applications. We explore the degree to which SudokuSens benefits downstream inference tasks in different data sets and discuss conditions under which the approach is particularly effective.
Submitted 8 February, 2024; v1 submitted 3 February, 2024;
originally announced February 2024.
-
Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition
Authors:
Adnan Hoque,
Less Wright,
Chih-Chieh Yang,
Mudhakar Srivatsa,
Raghu Ganti
Abstract:
We propose an implementation of an efficient fused matrix multiplication kernel for W4A16 quantized inference, where we perform dequantization and GEMM in a fused kernel using a SplitK work decomposition. Our implementation shows improvement for the type of skinny matrix-matrix multiplications found in foundation model inference workloads. In particular, this paper surveys the type of matrix multiplication between a skinny activation matrix and a square weight matrix. Our results show an average of 65% speed improvement on A100, and an average of 124% speed improvement on H100 (with a peak of 295%) for a range of matrix dimensions including those found in a llama-style model, where m < n = k.
Submitted 22 February, 2024; v1 submitted 5 January, 2024;
originally announced February 2024.
-
Rethinking Data-driven Networking with Foundation Models: Challenges and Opportunities
Authors:
Franck Le,
Mudhakar Srivatsa,
Raghu Ganti,
Vyas Sekar
Abstract:
Foundation models have caused a paradigm shift in the way artificial intelligence (AI) systems are built. They have had a major impact in natural language processing (NLP) and several other domains, not only reducing the amount of required labeled data or even eliminating the need for it, but also significantly improving performance on a wide range of tasks. We argue that foundation models can have a similarly profound impact on network traffic analysis and management. More specifically, we show that network data shares several of the properties that are behind the success of foundation models in linguistics. For example, network data contains rich semantic content, and several networking tasks (e.g., traffic classification, generation of protocol implementations from specification text, anomaly detection) have close counterparts in NLP (e.g., sentiment analysis, translation from natural language to code, out-of-distribution detection). However, network settings also present unique characteristics and challenges that must be overcome. Our contribution is in highlighting the opportunities and challenges at the intersection of foundation models and networking.
Submitted 11 November, 2022;
originally announced November 2022.
-
State Action Separable Reinforcement Learning
Authors:
Ziyao Zhang,
Liang Ma,
Kin K. Leung,
Konstantinos Poularakis,
Mudhakar Srivatsa
Abstract:
Reinforcement Learning (RL) based methods have seen remarkable success in solving serial decision-making and control problems in recent years. For conventional RL formulations, the Markov Decision Process (MDP) and the state-action-value function are the basis for problem modeling and policy evaluation. However, several challenging issues still remain. Among the most cited issues, the enormity of the state/action space is an important factor that causes inefficiency in accurately approximating the state-action-value function. We observe that although actions directly define the agents' behaviors, for many problems the next state after a state transition matters more than the action taken in determining the return of such a state transition. In this regard, we propose a new learning paradigm, State Action Separable Reinforcement Learning (sasRL), wherein the action space is decoupled from the value function learning process for higher efficiency. Then, a light-weight transition model is learned to help the agent determine the action that triggers the associated state transition. In addition, our convergence analysis reveals that under certain conditions, the convergence time of sasRL is $O(T^{1/k})$, where $T$ is the convergence time for updating the value function in the MDP-based formulation and $k$ is a weighting factor. Experiments on several gaming scenarios show that sasRL outperforms state-of-the-art MDP-based RL algorithms by up to $75\%$.
Submitted 5 June, 2020;
originally announced June 2020.
-
Neural Network Tomography
Authors:
Liang Ma,
Ziyao Zhang,
Mudhakar Srivatsa
Abstract:
Network tomography, a classic research problem in the realm of network monitoring, refers to the methodology of inferring unmeasured network attributes using selected end-to-end path measurements. In the research community, network tomography is generally investigated under the assumptions of known network topology, correlated path measurements, bounded number of faulty nodes/links, or even special network protocol support. The applicability of network tomography is considerably constrained by these strong assumptions, which therefore frequently position it in the theoretical world. In this regard, we revisit network tomography from the practical perspective by establishing a generic framework that does not rely on any of these assumptions or the types of performance metrics. Given only the end-to-end path performance metrics of sampled node pairs, the proposed framework, NeuTomography, utilizes a deep neural network and data augmentation to predict the unmeasured performance metrics by learning non-linear relationships between node pairs and the underlying unknown topological/routing properties. In addition, NeuTomography can be employed to reconstruct the original network topology, which is critical to most network planning tasks. Extensive experiments using real network data show that, compared to baseline solutions, NeuTomography can predict network characteristics and reconstruct network topologies with significantly higher accuracy and robustness using only limited measurement data.
Submitted 9 January, 2020;
originally announced January 2020.
-
SENSE: Semantically Enhanced Node Sequence Embedding
Authors:
Swati Rallapalli,
Liang Ma,
Mudhakar Srivatsa,
Ananthram Swami,
Heesung Kwon,
Graham Bent,
Christopher Simpkin
Abstract:
Effectively capturing graph node sequences in the form of vector embeddings is critical to many applications. We achieve this by (i) first learning vector embeddings of single graph nodes and (ii) then composing them to compactly represent node sequences. Specifically, we propose SENSE-S (Semantically Enhanced Node Sequence Embedding - for Single nodes), a novel skip-gram-based embedding mechanism for single graph nodes that co-learns the graph structure as well as the nodes' textual descriptions. We demonstrate that SENSE-S vectors increase the accuracy of multi-label classification tasks by up to 50% and link-prediction tasks by up to 78% under a variety of scenarios using real datasets. Based on SENSE-S, we next propose generic SENSE to compute composite vectors that represent a sequence of nodes, where preserving the node order is important. We prove that this approach is efficient in embedding node sequences, and our experiments on real data confirm its high accuracy in node order decoding.
Submitted 7 November, 2019;
originally announced November 2019.
-
neuralRank: Searching and ranking ANN-based model repositories
Authors:
Nirmit Desai,
Linsong Chu,
Raghu K. Ganti,
Sebastian Stein,
Mudhakar Srivatsa
Abstract:
Widespread applications of deep learning have led to a plethora of pre-trained neural network models for common tasks. Such models are often adapted from other models via transfer learning. The models may have varying training sets, training algorithms, network architectures, and hyper-parameters. For a given application, what is the most suitable model in a model repository? This is a critical question for practical deployments, but it has not received much attention. This paper introduces the novel problem of searching and ranking models based on suitability relative to a target dataset and proposes a ranking algorithm called neuralRank. The key idea behind this algorithm is to base model suitability on the discriminating power of a model, using a novel metric to measure it. With experimental results on the MNIST, Fashion, and CIFAR10 datasets, we demonstrate that (1) neuralRank is independent of the domain, the training set, or the network architecture and (2) the models ranked highly by neuralRank tend to have higher model accuracy in practice.
Submitted 2 March, 2019;
originally announced March 2019.
-
Actor Conditioned Attention Maps for Video Action Detection
Authors:
Oytun Ulutan,
Swati Rallapalli,
Mudhakar Srivatsa,
Carlos Torres,
B. S. Manjunath
Abstract:
While observing complex events with multiple actors, humans do not assess each actor separately, but infer from the context. The surrounding context provides essential information for understanding actions. To this end, we propose to replace region of interest (RoI) pooling with an attention module, which ranks each spatio-temporal region's relevance to a detected actor instead of cropping. We refer to these as Actor-Conditioned Attention Maps (ACAM), which amplify/dampen the features extracted from the entire scene. The resulting actor-conditioned features focus the model on regions that are relevant to the conditioned actor. For actor localization, we leverage pre-trained object detectors, which transfer better. The proposed model is efficient and our action detection pipeline achieves near real-time performance. Experimental results on AVA 2.1 and JHMDB demonstrate the effectiveness of attention maps, with improvements of 7 mAP on AVA and 4 mAP on JHMDB.
Submitted 10 May, 2020; v1 submitted 30 December, 2018;
originally announced December 2018.
-
Object Localization and Size Estimation from RGB-D Images
Authors:
ShreeRanjani SrirangamSridharan,
Oytun Ulutan,
Shehzad Noor Taus Priyo,
Swati Rallapalli,
Mudhakar Srivatsa
Abstract:
Depth sensing cameras (e.g., Kinect sensor, Tango phone) can acquire color and depth images that are registered to a common viewpoint. This opens the possibility of developing algorithms that exploit the advantages of both sensing modalities. Traditionally, cues from color images have been used for object localization (e.g., YOLO). However, the addition of a depth image can be further used to segment images that might otherwise have identical color information. Further, the depth image can be used for object size (height/width) estimation (in real-world measurement units, such as meters) as opposed to image-based segmentation that would only support drawing bounding boxes around objects of interest. In this paper, we first collect color camera information along with depth information using a custom Android application on a Tango Phab2 phone. Second, we perform timing and spatial alignment between the two data sources. Finally, we evaluate several ways of measuring the height of the object of interest within the captured images under a variety of settings.
Submitted 1 August, 2018;
originally announced August 2018.
-
Beyond Spatial Auto-Regressive Models: Predicting Housing Prices with Satellite Imagery
Authors:
Archith J. Bency,
Swati Rallapalli,
Raghu K. Ganti,
Mudhakar Srivatsa,
B. S. Manjunath
Abstract:
When modeling geo-spatial data, it is critical to capture spatial correlations for achieving high accuracy. Spatial Auto-Regression (SAR) is a common tool used to model such data, where the spatial contiguity matrix (W) encodes the spatial correlations. However, the efficacy of SAR is limited by two factors. First, it depends on the choice of contiguity matrix, which is typically not learnt from data, but instead is assumed to be known a priori. Second, it assumes that the observations can be explained by linear models. In this paper, we propose a Convolutional Neural Network (CNN) framework to model geo-spatial data (specifically housing prices), to learn the spatial correlations automatically. We show that neighborhood information embedded in satellite imagery can be leveraged to achieve the desired spatial smoothing. An additional upside of our framework is the relaxation of the linear assumption on the data. Specific challenges we tackle while implementing our framework include: (i) how much of the neighborhood is relevant while estimating housing prices? (ii) what is the right approach to capture multiple resolutions of satellite imagery? and (iii) what other data sources can help improve the estimation of spatial correlations? We demonstrate a marked improvement of 57% on top of the SAR baseline through the use of features from deep neural networks for the cities of London, Birmingham and Liverpool.
Submitted 15 October, 2016;
originally announced October 2016.
-
Prediction-based Online Trajectory Compression
Authors:
Arlei Silva,
Ramya Raghavendra,
Mudhakar Srivatsa,
Ambuj K. Singh
Abstract:
Recent spatio-temporal data applications, such as car-sharing and smart cities, impose new challenges regarding the scalability and timeliness of data processing systems. Trajectory compression is a promising approach for scaling up spatio-temporal databases. However, existing techniques fail to address the online setting, in which a compressed version of a trajectory stream has to be maintained over time. In this paper, we introduce ONTRAC, a new framework for map-matched online trajectory compression. ONTRAC learns prediction models for suppressing updates to a trajectory database using training data. Two prediction schemes are proposed, one for road segments via a Markov model and another for travel-times by combining Quadratic Programming and Expectation Maximization. Experiments show that ONTRAC outperforms the state-of-the-art offline technique even when long update delays (4 minutes) are allowed and achieves up to 21 times higher compression ratio for travel-times. Moreover, our approach increases database scalability by up to one order of magnitude.
Submitted 15 February, 2016; v1 submitted 23 January, 2016;
originally announced January 2016.
-
Joint Source Selection and Data Extrapolation in Social Sensing for Disaster Response
Authors:
Mohammad Hosseini,
Nooreddin Nagibolhosseini,
Amotz Barnoy,
Peter Terlecky,
Hengchang Liu,
Shaohan Hu,
Shiguang Wang,
Tanvir Amin,
Lu Su,
Dong Wang,
Ramesh Govindan,
Raghu Ganti,
Mudhakar Srivatsa,
Charu Aggrawal,
Tarek Abdelzaher,
Siyu Gu,
Chenji Pan
Abstract:
This paper complements the large body of social sensing literature by developing means for augmenting sensing data with inference results that "fill-in" missing pieces. It specifically explores the synergy between (i) inference techniques used for filling-in missing pieces and (ii) source selection techniques used to determine which pieces to retrieve in order to improve inference results. We focus on prediction in disaster scenarios, where disruptive trend changes occur. We first discuss our previous conference study that compared a set of prediction heuristics and developed a hybrid prediction algorithm. We then enhance the prediction scheme by considering algorithms for sensor selection that improve inference quality. Our proposed source selection and extrapolation algorithms are tested using data collected during the New York City crisis in the aftermath of Hurricane Sandy in November 2012. The evaluation results show that consistently good predictions are achieved. The work is notable for addressing the bi-modal nature of damage propagation in complex systems subjected to stress, where periods of calm are interspersed with periods of severe change. It is novel in offering a new solution to the problem that jointly leverages source selection and extrapolation components thereby improving the results.
Submitted 1 December, 2015;
originally announced December 2015.
-
Picking vs. Guessing Secrets: A Game-Theoretic Analysis (Technical Report)
Authors:
MHR Khouzani,
Piotr Mardziel,
Carlos Cid,
Mudhakar Srivatsa
Abstract:
Choosing a hard-to-guess secret is a prerequisite in many security applications. Whether it is a password for user authentication or a secret key for a cryptographic primitive, picking it requires the user to trade off usability costs against resistance to an adversary: a simple password is easier to remember but is also easier to guess; likewise, a shorter cryptographic key may require fewer computational and storage resources but it is also easier to attack. A fundamental question is how one can optimally resolve this trade-off. A big challenge is the fact that an adversary can also utilize the knowledge of such usability vs. security trade-offs to strengthen its attack. In this paper, we propose a game-theoretic framework for analyzing the optimal trade-offs in the face of strategic adversaries. We consider two types of adversaries: those limited in their number of tries, and those that are ruled by the cost of making individual guesses. For each type, we derive the mutually-optimal decisions as Nash Equilibria, the strategically pessimistic decisions as maximin, and optimal commitments as Strong Stackelberg Equilibria of the game. We establish that when the adversaries are faced with a capped number of guesses, the user's optimal trade-off is a uniform randomization over a subset of the secret domain. On the other hand, when the attacker strategy is ruled by the cost of making individual guesses, Nash Equilibria may completely fail to provide the user with any level of security, signifying the crucial role of credible commitment for such cases. We illustrate our results using numerical examples based on real-world samples and discuss some policy implications of our work.
Submitted 9 May, 2015;
originally announced May 2015.
-
Quantifying Information Leakage in Finite Order Deterministic Programs
Authors:
Ji Zhu,
Mudhakar Srivatsa
Abstract:
Information flow analysis is a powerful technique for reasoning about the sensitive information exposed by a program during its execution. While past work has proposed information theoretic metrics (e.g., Shannon entropy, min-entropy, guessing entropy, etc.) to quantify such information leakage, we argue that some of these measures not only result in counter-intuitive measures of leakage, but also are inherently prone to conflicts when comparing two programs P1 and P2 -- say Shannon entropy predicts higher leakage for program P1, while guessing entropy predicts higher leakage for program P2. This paper presents the first attempt towards addressing such conflicts and derives solutions for conflict-free comparison of finite order deterministic programs.
Submitted 20 September, 2010;
originally announced September 2010.