-
One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
Authors:
Archit Patke,
Dhemath Reddy,
Saurabh Jha,
Haoran Qiu,
Christian Pinto,
Shengkun Cui,
Chandra Narayanaswami,
Zbigniew Kalbarczyk,
Ravishankar Iyer
Abstract:
Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to in production settings. However, existing LLM serving systems focus on optimization objectives such as request serving throughput or request execution latency rather than the end-to-end latency SLOs. Achieving end-to-end SLOs for latency-sensitive requests is challenging due to head-of-line (HOL) blocking in the request queue, which results from bursty arrival rates and insufficient resources.
To address the above challenge, we propose QLM, a multi-model queue management framework for LLM serving. QLM uses stochastic programming to orchestrate the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment. Specifically, QLM uses the following LSOs: model swapping, request eviction, GPU-CPU state swapping, load balancing, and warm model start. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems.
Submitted 5 June, 2024;
originally announced July 2024.
-
An Optimistic-Robust Approach for Dynamic Positioning of Omnichannel Inventories
Authors:
Pavithra Harsha,
Shivaram Subramanian,
Ali Koc,
Mahesh Ramakrishna,
Brian Quanz,
Dhruv Shah,
Chandra Narayanaswami
Abstract:
We introduce a new class of data-driven and distribution-free optimistic-robust bimodal inventory optimization (BIO) strategy to effectively allocate inventory across a retail chain to meet time-varying, uncertain omnichannel demand. While prior robust optimization (RO) methods emphasize the downside, i.e., worst-case adversarial demand, BIO also considers the upside to remain resilient like RO while also reaping the rewards of improved average-case performance by overcoming the presence of endogenous outliers. This bimodal strategy is particularly valuable for balancing the tradeoff between lost sales at the store and the costs of cross-channel e-commerce fulfillment, which is at the core of our inventory optimization model. These factors are asymmetric due to the heterogeneous behavior of the channels, with a bias towards the former in terms of lost-sales cost and a dependence on network effects for the latter. We provide structural insights about the BIO solution and how it can be tuned to achieve a preferred tradeoff between robustness and the average-case. Our experiments show that significant benefits can be achieved by rethinking traditional approaches to inventory management, which are siloed by channel and location. Using a real-world dataset from a large American omnichannel retail chain, a business value assessment during a peak period indicates over a 15% profitability gain for BIO over RO and other baselines while also preserving the (practical) worst-case performance.
Submitted 17 October, 2023;
originally announced October 2023.
-
Hierarchical Proxy Modeling for Improved HPO in Time Series Forecasting
Authors:
Arindam Jati,
Vijay Ekambaram,
Shaonli Pal,
Brian Quanz,
Wesley M. Gifford,
Pavithra Harsha,
Stuart Siegel,
Sumanta Mukherjee,
Chandra Narayanaswami
Abstract:
Selecting the right set of hyperparameters is crucial in time series forecasting. The classical temporal cross-validation framework for hyperparameter optimization (HPO) often leads to poor test performance because of a possible mismatch between validation and test periods. To address this test-validation mismatch, we propose a novel technique, H-Pro, to drive HPO via test proxies by exploiting data hierarchies often associated with time series datasets. Since higher-level aggregated time series often show less irregularity and better predictability as compared to the lowest-level time series, which can be sparse and intermittent, we optimize the hyperparameters of the lowest-level base-forecaster by leveraging the proxy forecasts for the test period generated from the forecasters at higher levels. H-Pro can be applied on any off-the-shelf machine learning model to perform HPO. We validate the efficacy of our technique with extensive empirical evaluation on five publicly available hierarchical forecasting datasets. Our approach outperforms existing state-of-the-art methods on the Tourism, Wiki, and Traffic datasets, and achieves competitive results on the Tourism-L dataset, without any model-specific enhancements. Moreover, our method outperforms the winning method of the M5 forecast accuracy competition.
Submitted 2 November, 2023; v1 submitted 28 November, 2022;
originally announced November 2022.
-
Enabling Multiple QR Codes in Close Proximity
Authors:
Mercan Topkara,
Thomas Erickson,
Umut Topkara,
Chandrasekhar Narayanaswami
Abstract:
Quick response codes - 2D patterns that can be scanned to access online resources - are being used in a variety of industrial and consumer applications. However, it is problematic to use multiple QR codes in close proximity: scans can fail or result in access to the wrong resource. While this problem is, strictly speaking, due to the design of the scanning software, the very large number of extant scanning applications makes changing the software a difficult logistical challenge. Instead, we describe the design of a new type of QR code that not only enables the use of multiple QR codes in close proximity, but also is compatible with existing scanning solutions. In an evaluation with 20 users, it was found that the new QR codes were as usable as traditional ones, and that they were superior for selecting one code from many. Users did have initial difficulty in discovering how to use the new QR code, so further work is required on that front. We conclude with a discussion of the pros and cons of pQR codes.
Submitted 28 October, 2015;
originally announced October 2015.