-
Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
Authors:
Amey Agrawal,
Anmol Agarwal,
Nitin Kedia,
Jayashree Mohan,
Souvik Kundu,
Nipun Kwatra,
Ramachandran Ramjee,
Alexey Tumanov
Abstract:
Serving large language models (LLMs) in production can incur substantial costs, which has prompted recent advances in inference system optimizations. Today, these systems are evaluated against conventional latency and throughput metrics (e.g., TTFT, TBT, Normalised Latency, and TPOT). However, these metrics fail to fully capture the nuances of LLM inference, leading to an incomplete assessment of user-facing performance crucial for real-time applications such as chat and translation. In this paper, we first identify the pitfalls of current performance metrics in evaluating LLM inference systems. We then propose Metron, a comprehensive performance evaluation framework that includes fluidity-index -- a novel metric designed to reflect the intricacies of the LLM inference process and its impact on real-time user experience. Finally, we evaluate various existing open-source platforms and model-as-a-service offerings using Metron, discussing their strengths and weaknesses. Metron is available at https://github.com/project-metron/metron.
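For readers unfamiliar with the conventional metrics named in this abstract, the following is a minimal sketch of how they are commonly computed from per-token timestamps. The definitions used here are the usual ones and are an assumption, not taken from the Metron paper; the `RequestTrace` type and function names are hypothetical, and the fluidity-index itself is not implemented because the abstract does not define it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one request: arrival time and the wall-clock
    time at which each output token was emitted (seconds)."""
    arrival: float
    token_times: List[float]


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay from arrival to the first output token."""
    return trace.token_times[0] - trace.arrival


def tbt(trace: RequestTrace) -> List[float]:
    """Time between tokens: gap between each pair of consecutive tokens."""
    times = trace.token_times
    return [later - earlier for earlier, later in zip(times, times[1:])]


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap over the decode phase."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def normalized_latency(trace: RequestTrace) -> float:
    """End-to-end latency divided by the number of output tokens."""
    return (trace.token_times[-1] - trace.arrival) / len(trace.token_times)


# Illustrative trace: three quick tokens, then a long stall before the fourth.
example = RequestTrace(arrival=0.0, token_times=[0.5, 0.6, 0.7, 1.5])
```

In this made-up trace the averaged metrics (`tpot`, `normalized_latency`) look moderate even though the last gap in `tbt(example)` is much larger than the others, which is one way an average can hide a stall that a user would notice.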
Submitted 9 July, 2024;
originally announced July 2024.
-
DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
Authors:
Aditya Annavajjala,
Alind Khare,
Animesh Agrawal,
Igor Fedorov,
Hugo Latapie,
Myung** Lee,
Alexey Tumanov
Abstract:
CNNs are increasingly deployed across different hardware, dynamic environments, and low-power embedded devices. This has led to the design and training of CNN architectures with the goal of maximizing accuracy subject to such variable deployment constraints. As the number of deployment scenarios grows, there is a need to find scalable solutions to design and train specialized CNNs. Once-for-all training has emerged as a scalable approach that jointly co-trains many models (subnets) at once with a constant training cost and finds specialized CNNs later. The scalability is achieved by training the full model and simultaneously reducing it to smaller subnets that share model weights (weight-shared shrinking). However, existing once-for-all training approaches incur huge training costs reaching 1200 GPU hours. We argue this is because they start the process of shrinking the full model either too early or too late. Hence, we propose Delayed $ε$-Shrinking (D$ε$pS), which starts the process of shrinking the full model when it is partially trained (~50%), leading to training cost improvement and better in-place knowledge distillation to smaller models. The proposed approach also consists of novel heuristics that dynamically adjust subnet learning rates incrementally (E), leading to improved weight-shared knowledge distillation from larger to smaller subnets as well. As a result, D$ε$pS outperforms state-of-the-art once-for-all training techniques across different datasets including CIFAR10/100, ImageNet-100, and ImageNet-1k on accuracy and cost. It achieves 1.83% higher ImageNet-1k top-1 accuracy or the same accuracy with a 1.3x reduction in FLOPs and a 2.5x drop in training cost (GPU*hrs).
Submitted 8 July, 2024;
originally announced July 2024.
-
Realizing Lie groups as automorphism groups of bounded domains
Authors:
George Shabat,
Alexander Tumanov
Abstract:
We consider the problem of whether a given Lie group can be realized as the group of all biholomorphic automorphisms of a bounded domain in ${\mathbb C}^n$. In an earlier paper of 1990, the authors proved the result for connected linear Lie groups. In this paper we give examples of non-linear groups for which the result still holds.
Submitted 13 June, 2024;
originally announced June 2024.
-
Vidur: A Large-Scale Simulation Framework For LLM Inference
Authors:
Amey Agrawal,
Nitin Kedia,
Jayashree Mohan,
Ashish Panwar,
Nipun Kwatra,
Bhargav Gulavani,
Ramachandran Ramjee,
Alexey Tumanov
Abstract:
Optimizing the deployment of large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring the large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur - a large-scale, high-fidelity, easily extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, in contrast to a deployment-based exploration which would require 42K GPU hours - costing ~218K dollars. Source code for Vidur is available at https://github.com/microsoft/vidur.
Submitted 21 May, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Authors:
Amey Agrawal,
Nitin Kedia,
Ashish Panwar,
Jayashree Mohan,
Nipun Kwatra,
Bhargav S. Gulavani,
Alexey Tumanov,
Ramachandran Ramjee
Abstract:
Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt and produces the first output token, and the second is decode, which generates the rest of the output tokens one at a time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations, which makes it challenging to achieve both high throughput and low latency.
We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills, which split a prefill request into near-equal-sized chunks, and creates stall-free schedules that add new requests to a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations, resulting in minimal pipeline bubbles.
Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.
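The chunked-prefill idea described in this abstract can be illustrated with a small sketch. This is not the Sarathi-Serve implementation: the per-iteration `token_budget`, the function name, and the assumption that the number of ongoing decodes stays constant while one prompt is being prefilled are all simplifications introduced here for illustration.

```python
from typing import List, Tuple


def stall_free_schedule(
    prompt_len: int, num_decodes: int, token_budget: int
) -> List[Tuple[int, int]]:
    """Return one (decode_tokens, prefill_chunk_tokens) pair per iteration.

    Every iteration first admits all ongoing decodes (one token each), so
    decodes are never paused; whatever budget is left over is filled with
    the next chunk of the new request's prompt.
    """
    if token_budget <= num_decodes:
        raise ValueError("token budget must leave room for a prefill chunk")

    schedule: List[Tuple[int, int]] = []
    remaining = prompt_len
    while remaining > 0:
        # Budget left after reserving one token for each ongoing decode.
        prefill_chunk = min(token_budget - num_decodes, remaining)
        schedule.append((num_decodes, prefill_chunk))
        remaining -= prefill_chunk
    return schedule


# Hypothetical numbers: a 1000-token prompt joins 28 ongoing decodes,
# with at most 256 tokens processed per iteration.
plan = stall_free_schedule(prompt_len=1000, num_decodes=28, token_budget=256)
```

With these illustrative numbers the prompt is spread over several iterations, each of which still carries all 28 decode tokens, instead of one long prefill-only iteration that would stall them.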
Submitted 17 June, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
Authors:
Alind Khare,
Dhruv Garg,
Sukrit Kalra,
Snigdha Grandhi,
Ion Stoica,
Alexey Tumanov
Abstract:
The increasing deployment of ML models on the critical path of production applications in both the datacenter and the edge requires ML inference serving systems to serve these models under unpredictable and bursty request arrival rates. Serving models under such conditions requires these systems to strike a careful balance between the latency and accuracy requirements of the application and the overall efficiency of utilization of scarce resources. State-of-the-art systems resolve this tension by either choosing a static point in the latency-accuracy tradeoff space to serve all requests or loading specific models on the critical path of request serving. In this work, we instead resolve this tension by simultaneously serving the entire range of models spanning the latency-accuracy tradeoff space. Our novel mechanism, SubNetAct, achieves this by carefully inserting specialized operators in weight-shared SuperNetworks. These operators enable SubNetAct to dynamically route requests through the network to meet a latency and accuracy target. SubNetAct requires up to 2.6x lower memory to serve a vastly higher number of models than prior state-of-the-art. In addition, SubNetAct's near-instantaneous actuation of models unlocks the design space of fine-grained, reactive scheduling policies. We explore the design of one such extremely effective policy, SlackFit, and instantiate both SubNetAct and SlackFit in a real system, SuperServe. SuperServe achieves 4.67% higher accuracy for the same SLO attainment and 2.85x higher SLO attainment for the same accuracy on a trace derived from the real-world Microsoft Azure Functions workload, and automatically yields the best trade-offs on a wide range of extremely bursty synthetic traces.
Submitted 27 December, 2023;
originally announced December 2023.
-
Signed Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off
Authors:
Sachit Kuhar,
Yash Jain,
Alexey Tumanov
Abstract:
Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential. Quantization and sparsity are key algorithmic techniques that translate to repetition and sparsity within tensors at the hardware-software interface. This paper introduces the concept of repetition-sparsity trade-off that helps explain computational efficiency during inference. We propose Signed Binarization, a unified co-design framework that synergistically integrates hardware-software systems, quantization functions, and representation learning techniques to address this trade-off. Our results demonstrate that Signed Binarization is more accurate than binarization with the same number of non-zero weights. Detailed analysis indicates that signed binarization generates a smaller distribution of effectual (non-zero) parameters nested within a larger distribution of total parameters, both of the same type, for a DNN block. Finally, our approach achieves a 26% speedup on real hardware, doubles energy efficiency, and reduces density by 2.8x compared to binary methods for ResNet 18, presenting an alternative solution for deploying efficient models in resource-limited environments.
Submitted 3 December, 2023;
originally announced December 2023.
-
ABKD: Graph Neural Network Compression with Attention-Based Knowledge Distillation
Authors:
Anshul Ahluwalia,
Rohit Das,
Payman Behnam,
Alind Khare,
Pan Li,
Alexey Tumanov
Abstract:
Graph Neural Networks (GNNs) have proven to be quite versatile for a variety of applications, including recommendation systems, fake news detection, drug discovery, and even computer vision. Due to the expanding size of graph-structured data, GNN models have also increased in complexity, leading to substantial latency issues. This is primarily attributed to the irregular structure of graph data and its access pattern into memory. The natural solution to reduce latency is to compress large GNNs into small GNNs. One way to do this is via knowledge distillation (KD). However, most KD approaches for GNNs only consider the outputs of the last layers and do not consider the outputs of the intermediate layers of the GNNs; these layers may contain important inductive biases indicated by the graph structure. To address this shortcoming, we propose a novel KD approach to GNN compression that we call Attention-Based Knowledge Distillation (ABKD). ABKD is a KD approach that uses attention to identify important intermediate teacher-student layer pairs and focuses on aligning their outputs. ABKD enables higher compression of GNNs with a smaller accuracy dropoff compared to existing KD approaches. On average, we achieve a 1.79% increase in accuracy with a 32.3x compression ratio on OGBN-Mag, a large graph dataset, compared to state-of-the-art approaches.
Submitted 24 October, 2023;
originally announced October 2023.
-
Wilson Loop Duality and OPE for Super Form Factors of Half-BPS Operators
Authors:
Benjamin Basso,
Alexander G. Tumanov
Abstract:
We propose a dual Wilson loop description for the MHV super form factors of half-BPS operators in planar $\mathcal{N}=4$ super-Yang-Mills theory. In this description, the local operators are represented by on-shell states, made out of zero-momentum particles, that are absorbed by a null periodic super Wilson loop. We present evidence for this duality at weak coupling, by performing an explicit calculation of the Wilson loop matrix elements through one loop. At tree level, the interactions localize at the cusps of the loop, revealing a simple connection between the super form factors and the $m=2$ tree amplituhedron. At loop level, we show that the Wilson loop calculation reproduces the known results for the super form factors. Inspired by this duality, we extend the OPE program developed for the form factors of the Lagrangian to the super form factors of the higher-charge operators. We introduce non-perturbative axioms and conjectures for the main building blocks that govern the exchange of the lightest flux-tube excitations. These blocks appear as simple refinements of the form factor transitions introduced in earlier OPE studies. They are expressed at any value of the 't Hooft coupling in terms of the tilted Beisert-Eden-Staudacher kernel. We carry out checks of our conjectures up to two loops at weak coupling for three- and four-point form factors of half-BPS operators of various lengths, finding perfect agreement with perturbative data.
Submitted 23 May, 2024; v1 submitted 16 August, 2023;
originally announced August 2023.
-
Ethosight: A Reasoning-Guided Iterative Learning System for Nuanced Perception based on Joint-Embedding & Contextual Label Affinity
Authors:
Hugo Latapie,
Shan Yu,
Patrick Hammer,
Kristinn R. Thorisson,
Vahagn Petrosyan,
Brandon Kynoch,
Alind Khare,
Payman Behnam,
Alexey Tumanov,
Aksheit Saxena,
Anish Aralikatti,
Hanning Chen,
Mohsen Imani,
Mike Archbold,
Tangrui Li,
Pei Wang,
Justin Hart
Abstract:
Traditional computer vision models often necessitate extensive data acquisition, annotation, and validation. These models frequently struggle in real-world applications, resulting in high false positive and negative rates, and exhibit poor adaptability to new scenarios, often requiring costly retraining. To address these issues, we present Ethosight, a flexible and adaptable zero-shot video analytics system. Ethosight begins from a clean slate based on user-defined video analytics, specified through natural language or keywords, and leverages joint embedding models and reasoning mechanisms informed by ontologies such as WordNet and ConceptNet. Ethosight operates effectively on low-cost edge devices and supports enhanced runtime adaptation, thereby offering a new approach to continuous learning without catastrophic forgetting. We provide empirical validation of Ethosight's promising effectiveness across diverse and complex use cases, while highlighting areas for further improvement. A significant contribution of this work is the release of all source code and datasets to enable full reproducibility and to foster further innovation in both the research and commercial domains.
Submitted 20 August, 2023; v1 submitted 20 July, 2023;
originally announced July 2023.
-
Pareto-Secure Machine Learning (PSML): Fingerprinting and Securing Inference Serving Systems
Authors:
Debopam Sanyal,
Jui-Tse Hung,
Manav Agrawal,
Prahlad Jasti,
Shahab Nikkhoo,
Somesh Jha,
Tianhao Wang,
Sibin Mohan,
Alexey Tumanov
Abstract:
Model-serving systems have become increasingly popular, especially in real-time web applications. In such systems, users send queries to the server and specify the desired performance metrics (e.g., desired accuracy, latency). The server maintains a set of models (model zoo) in the back-end and serves the queries based on the specified metrics. This paper examines the security, specifically robustness against model extraction attacks, of such systems. Existing black-box attacks assume a single model can be repeatedly selected for serving inference requests. Modern inference serving systems break this assumption. Thus, they cannot be directly applied to extract a victim model, as models are hidden behind a layer of abstraction exposed by the serving system. An attacker can no longer identify which model she is interacting with. To this end, we first propose a query-efficient fingerprinting algorithm to enable the attacker to trigger any desired model consistently. We show that by using our fingerprinting algorithm, model extraction can have fidelity and accuracy scores within $1\%$ of the scores obtained when attacking a single, explicitly specified model, as well as up to $14.6\%$ gain in accuracy and up to $7.7\%$ gain in fidelity compared to the naive attack. Second, we counter the proposed attack with a noise-based defense mechanism that thwarts fingerprinting by adding noise to the specified performance metrics. The proposed defense strategy reduces the attack's accuracy and fidelity by up to $9.8\%$ and $4.8\%$, respectively (on medium-sized model extraction). Third, we show that the proposed defense induces a fundamental trade-off between the level of protection and system goodput, achieving configurable and significant victim model extraction protection while maintaining acceptable goodput ($>80\%$). We implement the proposed defense in a real system with plans to open source.
Submitted 6 August, 2023; v1 submitted 3 July, 2023;
originally announced July 2023.
-
Subgraph Stationary Hardware-Software Inference Co-Design
Authors:
Payman Behnam,
Jianming Tong,
Alind Khare,
Yangyu Chen,
Yue Pan,
Pranav Gadikar,
Abhimanyu Rajeshkumar Bambhaniya,
Tushar Krishna,
Alexey Tumanov
Abstract:
A growing number of applications depend on Machine Learning (ML) functionality and benefit from both higher quality ML predictions and better timeliness (latency) at the same time. A growing body of research in computer architecture, ML, and systems software literature focuses on reaching better latency-accuracy tradeoffs for ML models. Efforts include compression, quantization, pruning, early-exit models, mixed DNN precision, as well as ML inference accelerator designs that minimize latency and energy, while preserving delivered accuracy. All of them, however, yield improvements for a single static point in the latency-accuracy tradeoff space. We make a case for applications that operate in dynamically changing deployment scenarios, where no single static point is optimal. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that uses (activates) different SubNets within this weight-shared construct. This creates an opportunity to exploit the inherent temporal locality with our proposed SubGraph Stationary (SGS) optimization. We take a hardware-software co-design approach with a real implementation of SGS in SushiAccel and the implementation of a software scheduler, SushiSched, controlling which SubNets to serve and what to cache in real time. Combined, they are vertically integrated into SUSHI, an inference serving stack. For the stream of queries, SUSHI yields up to a 25% improvement in latency and a 0.98% increase in served accuracy. SUSHI can achieve up to 78.7% off-chip energy savings.
Submitted 21 June, 2023;
originally announced June 2023.
-
DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization
Authors:
Amey Agrawal,
Sameer Reddy,
Satwik Bhattamishra,
Venkata Prabhakara Sarath Nookala,
Vidushi Vashishth,
Kexin Rong,
Alexey Tumanov
Abstract:
With the increase in the scale of Deep Learning (DL) training workloads in terms of compute resources and time consumption, the likelihood of encountering in-training failures rises substantially, leading to lost work and resource wastage. Such failures are typically offset by a checkpointing mechanism, which comes at the cost of storage and network bandwidth overhead. State-of-the-art approaches involve lossy model compression mechanisms, which induce a tradeoff between the resulting model quality (accuracy) and compression ratio. Delta compression is then used to further reduce the overhead by only storing the difference between consecutive checkpoints. We make a key enabling observation that the sensitivity of model weights to compression varies during training, and different weights benefit from different quantization levels (ranging from retaining full precision to pruning). We propose (1) a non-uniform quantization scheme that leverages this variation, (2) an efficient search mechanism that dynamically finds the best quantization configurations, and (3) a quantization-aware delta compression mechanism that rearranges weights to minimize checkpoint differences, thereby maximizing compression. We instantiate these contributions in DynaQuant - a framework for DL workload checkpoint compression. Our experiments show that DynaQuant consistently achieves a better tradeoff between accuracy and compression ratios compared to prior works, enabling a compression ratio up to 39x and withstanding up to 10 restores with negligible accuracy impact for fault-tolerant training. DynaQuant achieves at least an order of magnitude reduction in checkpoint storage overhead for training failure recovery as well as transfer learning use cases without any loss of accuracy.
Submitted 2 September, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference
Authors:
Alind Khare,
Animesh Agrawal,
Aditya Annavajjala,
Payman Behnam,
Myung** Lee,
Hugo Latapie,
Alexey Tumanov
Abstract:
Neural Architecture Search (NAS) for Federated Learning (FL) is an emerging field. It automates the design and training of Deep Neural Networks (DNNs) when data cannot be centralized due to privacy, communication costs, or regulatory restrictions. Recent federated NAS methods not only reduce manual effort but also help achieve higher accuracy than traditional FL methods like FedAvg. Despite this success, existing federated NAS methods still fall short in satisfying diverse deployment targets common in on-device inference, such as hardware, latency budgets, or variable battery levels. Most federated NAS methods search for only a limited range of neuro-architectural patterns and repeat them in a DNN, thereby restricting achievable performance. Moreover, these methods incur prohibitive training costs to satisfy deployment targets because they perform the training and search of DNN architectures repeatedly for each case. SuperFedNAS addresses these challenges by decoupling the training and search in federated NAS. SuperFedNAS co-trains a large number of diverse DNN architectures contained inside one supernet in the FL setting. Post-training, clients perform NAS locally to find specialized DNNs by extracting different parts of the trained supernet with no additional training. SuperFedNAS takes O(1) (instead of O(N)) cost to find specialized DNN architectures in FL for any N deployment targets. As part of SuperFedNAS, we introduce MaxNet - a novel FL training algorithm that performs multi-objective federated optimization of a large number of DNN architectures ($\approx 5*10^8$) under different client data distributions. Overall, SuperFedNAS achieves up to 37.7% higher accuracy for the same MACs or up to 8.13x reduction in MACs for the same accuracy than existing federated NAS methods.
Submitted 11 July, 2024; v1 submitted 25 January, 2023;
originally announced January 2023.
-
Hölder Regularity of the $\bar\partial-$equation on the Polydisc
Authors:
Yu Jun Loo,
Alexander Tumanov
Abstract:
In this note, we show that the canonical solution operator to the $\bar\partial-$equation in the polydisc preserves Hölder regularity. It is a well-known fact that such solution operators do not improve Hölder regularity, and as such, our solution operator is optimal in this regard.
Submitted 11 January, 2023;
originally announced January 2023.
-
Signed Binary Weight Networks
Authors:
Sachit Kuhar,
Alexey Tumanov,
Judy Hoffman
Abstract:
Efficient inference of Deep Neural Networks (DNNs) is essential to making AI ubiquitous. Two important algorithmic techniques have shown promise for enabling efficient inference - sparsity and binarization. These techniques translate into weight sparsity and weight repetition at the hardware-software level, enabling the deployment of DNNs with critically low power and latency requirements. We propose a new method called signed-binary networks to improve efficiency further (by exploiting both weight sparsity and weight repetition together) while maintaining similar accuracy. Our method achieves accuracy comparable to binary networks on the ImageNet and CIFAR10 datasets and can lead to 69% sparsity. We observe real speedup when deploying these models on general-purpose devices and show that this high percentage of unstructured sparsity can lead to a further reduction in energy consumption on ASICs.
Submitted 4 December, 2023; v1 submitted 24 November, 2022;
originally announced November 2022.
-
UnfoldML: Cost-Aware and Uncertainty-Based Dynamic 2D Prediction for Multi-Stage Classification
Authors:
Yanbo Xu,
Alind Khare,
Glenn Matlin,
Monish Ramadoss,
Rishikesan Kamaleswaran,
Chao Zhang,
Alexey Tumanov
Abstract:
Machine Learning (ML) research has focused on maximizing the accuracy of predictive tasks. ML models, however, are increasingly complex, resource intensive, and costlier to deploy in resource-constrained environments. These issues are exacerbated for prediction tasks with sequential classification on progressively transitioned stages with a "happens-before" relation between them. We argue that it is possible to "unfold" a monolithic single multi-class classifier, typically trained for all stages using all data, into a series of single-stage classifiers. Each single-stage classifier can be cascaded gradually from cheaper to more expensive binary classifiers that are trained using only the necessary data modalities or features required for that stage. UnfoldML is a cost-aware and uncertainty-based dynamic 2D prediction pipeline for multi-stage classification that enables (1) navigation of the accuracy/cost tradeoff space, (2) reducing the spatio-temporal cost of inference by orders of magnitude, and (3) early prediction on proceeding stages. UnfoldML achieves orders of magnitude better cost in clinical settings, while detecting multi-stage disease development in real time. It achieves accuracy within 0.1% of the highest-performing multi-class baseline, while saving close to 20X on spatio-temporal cost of inference and enabling earlier (3.5 hrs) disease onset prediction. We also show that UnfoldML generalizes to image classification, where it can predict different levels of labels (from coarse to fine) given different levels of abstraction of an image, saving close to 5X cost with as little as 0.4% accuracy reduction.
Submitted 27 October, 2022; v1 submitted 26 October, 2022;
originally announced October 2022.
-
Infinitesimal automorphisms of quadrics and second jet determination for CR mappings
Authors:
Alexander Tumanov
Abstract:
We consider the problem of whether a CR mapping of a generic manifold in complex space is uniquely determined by its finite jet at a point, which is referred to as finite jet determination. We derive the finite jet determination for CR mappings of smooth Levi nondegenerate manifolds of arbitrary codimension from the finite dimensionality of the algebras of infinitesimal automorphisms of the corresponding quadrics. Previously, this implication was known for real analytic manifolds. We prove a new 2-jet determination result that covers most affirmative results on this matter obtained so far.
Submitted 12 July, 2022;
originally announced July 2022.
-
Automatic Parallelization of Python Programs for Distributed Heterogeneous Computing
Authors:
Jun Shirako,
Akihiro Hayashi,
Sri Raj Paul,
Alexey Tumanov,
Vivek Sarkar
Abstract:
This paper introduces a novel approach to automatic ahead-of-time (AOT) parallelization and optimization of sequential Python programs for execution on distributed heterogeneous platforms. Our approach enables AOT source-to-source transformation of Python programs, driven by the inclusion of type hints for function parameters and return values. These hints can be supplied by the programmer or obtained by dynamic profiler tools; multi-version code generation guarantees the correctness of our AOT transformation in all cases.
Our compilation framework performs automatic parallelization and sophisticated high-level code optimizations for the target distributed heterogeneous hardware platform. It includes extensions to the polyhedral framework that unify user-written loops and implicit loops present in matrix/tensor operators, as well as automated selection of CPU vs. GPU code variants. Further, our polyhedral optimizations enable both intra-node and inter-node parallelism. Finally, the optimized output code is deployed using the Ray runtime for scheduling distributed tasks across multiple heterogeneous nodes in a cluster.
Our empirical evaluation shows significant performance improvements relative to sequential Python in both single-node and multi-node experiments, with a performance improvement of over 20,000$\times$ when using 24 nodes and 144 GPUs in the OLCF Summit supercomputer for the Space-Time Adaptive Processing (STAP) radar application.
Submitted 11 March, 2022;
originally announced March 2022.
-
An Operator Product Expansion for Form Factors III. Finite Coupling and Multi-Particle Contributions
Authors:
Amit Sever,
Alexander G. Tumanov,
Matthias Wilhelm
Abstract:
Form factors in planar $\mathcal{N}=4$ super-Yang-Mills theory have a dual description in terms of periodic Wilson loops. This duality maps the multi-collinear expansion of the former to an operator product expansion of the latter. The coefficients of this expansion are decomposed in terms of several elementary building blocks and can be determined at finite 't Hooft coupling using bootstrap and integrability techniques. Some of these blocks are known from an analogous expansion of scattering amplitudes. In addition to these, the expansion for form factors includes a new type of building block, called {\it form factor transitions}, that encode information about the local operator. In the present paper, we consider the form factor of the chiral part of the stress-tensor supermultiplet. We bootstrap the corresponding form factor transitions of two-particle flux-tube states and use them to predict the leading term in the collinear expansion at finite coupling. The transitions we find can be expressed in terms of a quantity that previously appeared in a seemingly unrelated context, namely the octagon kernel. Lastly, we use a factorized ansatz to determine the multi-particle form factor transitions at finite coupling, which we use to predict the first subleading term in the collinear expansion. A perfect match is found between our predictions and the available perturbative data.
Submitted 23 March, 2022; v1 submitted 20 December, 2021;
originally announced December 2021.
-
Mode I and Mode II stress intensity factors and dislocation density behaviour in strain gradient plasticity
Authors:
V. Shlyannikov,
E. Martínez-Pañeda,
A. Tumanov,
R. Khamidullin
Abstract:
In this study, we use the mechanism-based strain gradient plasticity theory to evaluate both crack tip dislocation density behaviour and the coupled effect of the material plastic properties and the intrinsic material length on non-linear amplitude factors. The two planar classical stress-strain states are examined, namely, plane strain and plane stress, both under pure mode I and pure mode II loading conditions. The constitutive relations are based on Taylor's dislocation model, which enables gaining insights into the role of the increased dislocation density associated with large gradients in plastic strain near cracks. The material model is implemented in a commercial finite element (FE) software package using a user subroutine, and the nonlinear stress intensity factors (SIF) are evaluated as a function of the intrinsic material length, characterising the scale at which gradient effects become significant. As a result of the FE calculations of dislocation density distributions, the effects of both the fracture mode and the stress-strain state are determined. In pure mode I, the geometrically necessary dislocation (GND) density is located symmetrically with respect to the blunted crack tip. On the contrary, under pure mode II, the GND density becomes concentrated in the blunted and sharp parts of the crack tip. In this case, fracture initiation is shown to be likely to occur near the blunted region of the crack tip, where both the stress triaxiality and the GND density are at their maximum. The relation between the equilibrium state of dislocation densities and the intrinsic material length as well as the plastic SIF as a function of the work hardening exponent is discussed.
Submitted 18 October, 2021;
originally announced October 2021.
-
An Operator Product Expansion for Form Factors II. Born level
Authors:
Amit Sever,
Alexander G. Tumanov,
Matthias Wilhelm
Abstract:
Form factors in planar N=4 Super-Yang-Mills theory admit a type of non-perturbative operator product expansion (OPE), as we have recently shown in arXiv:2009.11297. This expansion is based on a decomposition of the dual periodic Wilson loop into elementary building blocks: the known pentagon transitions and a new object that we call form factor transition, which encodes the information about the local operator. In this paper, we compute the two-particle form factor transitions for the chiral part of the stress-tensor supermultiplet at Born level; they yield the leading contribution to the OPE. To achieve this, we explicitly construct the Gubser-Klebanov-Polyakov two-particle singlet states. The resulting transitions are then used to test the OPE against known perturbative data and to make higher-loop predictions.
Submitted 1 June, 2021; v1 submitted 27 May, 2021;
originally announced May 2021.
-
CompOFA: Compound Once-For-All Networks for Faster Multi-Platform Deployment
Authors:
Manas Sahni,
Shreya Varshini,
Alind Khare,
Alexey Tumanov
Abstract:
The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize the accuracy under diverse hardware & latency constraints. To scale these resource-intensive tasks with an increasing number of deployment targets, Once-For-All (OFA) proposed an approach to jointly train several models at once with a constant training cost. However, this cost remains as high as 40-50 GPU days and also suffers from a combinatorial explosion of sub-optimal model configurations. We seek to reduce this search space -- and hence the training budget -- by constraining search to models close to the accuracy-latency Pareto frontier. We incorporate insights of compound relationships between model dimensions to build CompOFA, a design space smaller by several orders of magnitude. Through experiments on ImageNet, we demonstrate that even with simple heuristics we can achieve a 2x reduction in training time and 216x speedup in model search/extraction time compared to the state of the art, without loss of Pareto optimality! We also show that this smaller design space is dense enough to support equally accurate models for a similar diversity of hardware and latency targets, while also reducing the complexity of the training and subsequent extraction algorithms.
Submitted 26 April, 2021;
originally announced April 2021.
-
Crack tip fields and fracture resistance parameters based on strain gradient plasticity
Authors:
V. Shlyannikov,
E. Martínez-Pañeda,
A. Tumanov,
A. Tartygasheva
Abstract:
The crack tip mechanics of strain gradient plasticity solids is investigated analytically and numerically. A first-order mechanism-based strain gradient (MSG) plasticity theory based on Taylor's dislocation model is adopted and implemented in the commercial finite element package ANSYS by means of a user subroutine. Two boundary value problems are considered, a single edge tension specimen and a biaxially loaded plate. First, crack tip fields are characterized. Strain gradient effects associated with dislocation hardening mechanisms elevate crack tip stresses relative to conventional plasticity. A parametric study is conducted and differences with conventional plasticity predictions are quantified. Moreover, the asymptotic nature of the crack tip solution is investigated. The numerical results reveal that the singularity order predicted by the first-order MSG theory is equal to or higher than that of linear elastic solids. Also, the crack tip field appears not to have a separable solution. Moreover, contrary to what has been shown in the higher order version of MSG plasticity, the singularity order exhibits sensitivity to the plastic material properties. Second, analytical and numerical approaches are employed to formulate novel amplitude factors for strain gradient plasticity. A generalized J-integral is derived and used to characterize a nonlinear amplitude factor. A closed-form equation for the analytical stress intensity factor is obtained. Amplitude factors are also derived by decomposing the numerical solution for the crack tip stress field. Nonlinear amplitude factor solutions are determined across a wide range of values for the material length scale l and the strain hardening exponent N. The domains of strain gradient relevance are identified, setting the basis for the application of first-order MSG plasticity for fracture and damage assessment.
Submitted 15 October, 2020;
originally announced October 2020.
-
An Operator Product Expansion for Form Factors
Authors:
Amit Sever,
Alexander G. Tumanov,
Matthias Wilhelm
Abstract:
We propose an operator product expansion for planar form factors of local operators in $\mathcal{N}=4$ SYM theory. This expansion is based on the dual conformal symmetry of these objects or, equivalently, the conformal symmetry of their dual description in terms of periodic Wilson loops. A form factor is decomposed into a sequence of known pentagon transitions and a new universal object that we call the "form factor transition". This transition is subject to a set of non-trivial bootstrap constraints, which we expect to be sufficient to fully determine it. We evaluate the form factor transition for MHV form factors of the chiral half of the stress tensor supermultiplet at leading order in perturbation theory and use it to produce OPE predictions at any loop order. We match the one-loop and two-loop predictions with data available in the literature.
Submitted 20 June, 2021; v1 submitted 23 September, 2020;
originally announced September 2020.
-
HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units
Authors:
Shenda Hong,
Yanbo Xu,
Alind Khare,
Satria Priambada,
Kevin Maher,
Alaa Aljiffry,
Jimeng Sun,
Alexey Tumanov
Abstract:
Deep learning models have achieved expert-level performance in healthcare with an exclusive focus on training accurate models. However, in many clinical environments such as the intensive care unit (ICU), real-time model serving is equally if not more important than accuracy, because in the ICU patient care is simultaneously more urgent and more expensive. Clinical decisions and their timeliness, therefore, directly affect both the patient outcome and the cost of care. To make timely decisions, we argue the underlying serving system must be latency-aware. To compound the challenge, health analytic applications often require a combination of models instead of a single model, to better specialize individual models for different targets, multi-modal data, different prediction windows, and potentially personalized predictions. To address these challenges, we propose HOLMES, an online model ensemble serving framework for healthcare applications. HOLMES dynamically identifies the best-performing set of models to ensemble for highest accuracy, while also satisfying sub-second latency constraints on end-to-end prediction. We demonstrate that HOLMES is able to navigate the accuracy/latency tradeoff efficiently, compose the ensemble, and serve the model ensemble pipeline, scaling to simultaneously streaming data from 100 patients, each producing waveform data at 250 Hz. HOLMES outperforms the conventional offline batch-processed inference for the same clinical task in terms of accuracy and latency (by an order of magnitude). HOLMES is tested on a risk prediction task on pediatric cardio ICU data with above 95% prediction accuracy and sub-second latency on a 64-bed simulation.
Submitted 10 August, 2020;
originally announced August 2020.
-
A note on the closed range of $\bar\partial_b$ on q-convex manifolds
Authors:
Luca Baracco,
Alexander Tumanov
Abstract:
We prove that the tangential Cauchy-Riemann operator has closed range on Levi-pseudoconvex CR manifolds that are embedded in a q-convex complex manifold $X$. Our result generalizes the known case when $X$ is a Stein manifold.
Submitted 17 April, 2020;
originally announced April 2020.
-
Cloudburst: Stateful Functions-as-a-Service
Authors:
Vikram Sreekanti,
Chenggang Wu,
Xiayue Charles Lin,
Johann Schleier-Smith,
Jose M. Faleiro,
Joseph E. Gonzalez,
Joseph M. Hellerstein,
Alexey Tumanov
Abstract:
Function-as-a-Service (FaaS) platforms and "serverless" cloud computing are becoming increasingly popular. Current FaaS offerings are targeted at stateless functions that do minimal I/O and communication. We argue that the benefits of serverless computing can be extended to a broader range of applications and algorithms. We present the design and implementation of Cloudburst, a stateful FaaS platform that provides familiar Python programming with low-latency mutable state and communication, while maintaining the autoscaling benefits of serverless computing. Cloudburst accomplishes this by leveraging Anna, an autoscaling key-value store, for state sharing and overlay routing combined with mutable caches co-located with function executors for data locality. Performant cache consistency emerges as a key challenge in this architecture. To this end, Cloudburst provides a combination of lattice-encapsulated state and new definitions and protocols for distributed session consistency. Empirical results on benchmarks and diverse applications show that Cloudburst makes stateful functions practical, reducing the state-management overheads of current FaaS platforms by orders of magnitude while also improving the state of the art in serverless consistency.
Submitted 24 July, 2020; v1 submitted 13 January, 2020;
originally announced January 2020.
-
HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline
Authors:
Richard Liaw,
Romil Bhardwaj,
Lisa Dunlap,
Yitian Zou,
Joseph Gonzalez,
Ion Stoica,
Alexey Tumanov
Abstract:
Prior research in resource scheduling for machine learning training workloads has largely focused on minimizing job completion times. Commonly, these model training workloads collectively search over a large number of parameter values that control the learning process in a hyperparameter search. It is preferable to identify and maximally provision the best-performing hyperparameter configuration (trial) to achieve the highest accuracy result as soon as possible.
To optimally trade off evaluating multiple configurations against training the most promising ones by a fixed deadline, we design and build HyperSched -- a dynamic application-level resource scheduler to track, identify, and preferentially allocate resources to the best-performing trials to maximize accuracy by the deadline. HyperSched leverages three properties of a hyperparameter search workload overlooked in prior work - trial disposability, progressively identifiable rankings among different configurations, and space-time constraints - to outperform standard hyperparameter search algorithms across a variety of benchmarks.
Submitted 7 January, 2020;
originally announced January 2020.
-
Stationary discs and finite jet determination for CR mappings in higher codimension
Authors:
Alexander Tumanov
Abstract:
We discuss stationary discs for generic CR manifolds and apply them to the problem of finite jet determination for CR mappings. We prove that a CR diffeomorphism of two finitely smooth strictly pseudoconvex Levi generating CR manifolds is uniquely determined by its 2-jet at a given point. A new key element of the proof is the existence of non-defective stationary discs.
Submitted 8 June, 2020; v1 submitted 8 December, 2019;
originally announced December 2019.
-
The OoO VLIW JIT Compiler for GPU Inference
Authors:
Paras Jain,
Xiangxi Mo,
Ajay Jain,
Alexey Tumanov,
Joseph E. Gonzalez,
Ion Stoica
Abstract:
Current trends in Machine Learning (ML) inference on hardware accelerated devices (e.g., GPUs, TPUs) point to alarmingly low utilization. As ML inference is increasingly time-bounded by tight latency SLOs, increasing data parallelism is not an option. The need for better efficiency motivates GPU multiplexing. Furthermore, existing GPU programming abstractions force programmers to micro-manage GPU resources in an early-binding, context-free fashion. We propose a VLIW-inspired Out-of-Order (OoO) Just-in-Time (JIT) compiler that coalesces and reorders execution kernels at runtime for throughput-optimal device utilization while satisfying latency SLOs. We quantify the inefficiencies of space-only and time-only multiplexing alternatives and demonstrate an achievable 7.7x opportunity gap through spatial coalescing.
Submitted 30 January, 2019; v1 submitted 28 January, 2019;
originally announced January 2019.
-
Dynamic Space-Time Scheduling for GPU Inference
Authors:
Paras Jain,
Xiangxi Mo,
Ajay Jain,
Harikaran Subbaraj,
Rehan Sohail Durrani,
Alexey Tumanov,
Joseph Gonzalez,
Ion Stoica
Abstract:
Serving deep neural networks in latency-critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference result in poor GPU utilization, a potential performance gap which GPU resource sharing can address. In this paper, we explore several techniques to leverage both temporal and spatial multiplexing to improve GPU utilization for deep learning inference workloads. We evaluate the performance trade-offs of each approach with respect to resource-efficiency, latency predictability, and isolation when compared with conventional batched inference. Our experimental analysis suggests up to a 5x potential for improved utilization through the exploration of more advanced spatial and temporal multiplexing strategies. Our preliminary prototype of a dynamic space-time scheduler demonstrates a 3.23x floating-point throughput increase over space-only multiplexing and a 7.73x increase over time-only multiplexing for convolutions, while also providing better isolation and latency predictability.
Submitted 31 December, 2018;
originally announced January 2019.
-
Serverless Computing: One Step Forward, Two Steps Back
Authors:
Joseph M. Hellerstein,
Jose Faleiro,
Joseph E. Gonzalez,
Johann Schleier-Smith,
Vikram Sreekanti,
Alexey Tumanov,
Chenggang Wu
Abstract:
Serverless computing offers the potential to program the cloud in an autoscaling, pay-as-you-go manner. In this paper we address critical gaps in first-generation serverless computing, which place its autoscaling potential at odds with dominant trends in modern computing: notably data-centric and distributed computing, but also open source and custom hardware. Put together, these gaps make current serverless offerings a bad fit for cloud innovation and particularly bad for data systems innovation. In addition to pinpointing some of the main shortfalls of current serverless architectures, we raise a set of challenges we believe must be met to unlock the radical potential that the cloud, with its exabytes of storage and millions of cores, should offer to innovative developers.
Submitted 10 December, 2018;
originally announced December 2018.
-
InferLine: ML Prediction Pipeline Provisioning and Management for Tight Latency Objectives
Authors:
Daniel Crankshaw,
Gur-Eyal Sela,
Corey Zumar,
Xiangxi Mo,
Joseph E. Gonzalez,
Ion Stoica,
Alexey Tumanov
Abstract:
Serving ML prediction pipelines spanning multiple models and hardware accelerators is a key challenge in production machine learning. Optimally configuring these pipelines to meet tight end-to-end latency goals is complicated by the interaction between model batch size, the choice of hardware accelerator, and variation in the query arrival process.
In this paper we introduce InferLine, a system which provisions and manages the individual stages of prediction pipelines to meet end-to-end tail latency constraints while minimizing cost. InferLine consists of a low-frequency combinatorial planner and a high-frequency auto-scaling tuner. The low-frequency planner leverages stage-wise profiling, discrete event simulation, and constrained combinatorial search to automatically select hardware type, replication, and batching parameters for each stage in the pipeline. The high-frequency tuner uses network calculus to auto-scale each stage to meet tail latency goals in response to changes in the query arrival process. We demonstrate that InferLine outperforms existing approaches by up to 7.6x in cost while achieving up to 34.5x lower latency SLO miss rate on realistic workloads and generalizes across state-of-the-art model serving frameworks.
Submitted 3 August, 2020; v1 submitted 4 December, 2018;
originally announced December 2018.
-
Oscillations of the Critical Temperature in a (Fe/Cr/Fe)/V/Fe Heterostructure
Authors:
V. A. Tumanov,
Yu. V. Goryunov,
Yu. N. Proshin
Abstract:
The superconducting and magnetic properties of the (Fe/Cr/Fe)/V/Fe layered system with variable thickness of the chromium layer have been experimentally and theoretically studied. The magnetic properties of the system have been studied by the ferromagnetic resonance method, and the superconducting transition temperature has been measured from the jump in the magnetic susceptibility. A wide variety of magnetic states are observed in the system; in particular, the structure of small domains can arise in the iron layer placed between vanadium and chromium. It has been shown experimentally that the critical temperature $T_c$ of the superconducting transition undergoes nonmonotonic oscillations with a noticeable amplitude in the given system with the change in the thickness of the Cr layer. The proposed model based on the proximity effect theory makes it possible to relate these $T_c$ oscillations to the features of the magnetic structure of the samples.
Submitted 11 June, 2018;
originally announced June 2018.
-
Scattering Amplitudes -- Wilson Loops Duality for the First Non-planar Correction
Authors:
Roy Ben-Israel,
Alexander G. Tumanov,
Amit Sever
Abstract:
We study the first non-planar correction to gluon scattering amplitudes in ${\cal N}=4$ SYM theory. The correction takes the form of a double trace partial amplitude and is suppressed by one power of $1/N$ with respect to the leading single trace contribution. We extend the duality between planar scattering amplitudes and null polygonal Wilson loops to the double trace amplitude. The new duality relates the amplitude to the correlation function of two infinite null polygonal Wilson lines that are subject to a quantum periodicity constraint. We test the duality perturbatively at one-loop order and demonstrate it for the dual string in AdS. The duality allows us to extend the notion of the loop integrand beyond the planar limit and to determine it using recursion relations. It also allows one to apply the integrability-based pentagon operator product expansion approach to the first non-planar order.
Submitted 30 July, 2018; v1 submitted 26 February, 2018;
originally announced February 2018.
-
Ray: A Distributed Framework for Emerging AI Applications
Authors:
Philipp Moritz,
Robert Nishihara,
Stephanie Wang,
Alexey Tumanov,
Richard Liaw,
Eric Liang,
Melih Elibol,
Zongheng Yang,
William Paul,
Michael I. Jordan,
Ion Stoica
Abstract:
The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray---a distributed system to address them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system's control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.
Submitted 29 September, 2018; v1 submitted 15 December, 2017;
originally announced December 2017.
-
E11 and the non-linear dual graviton
Authors:
Alexander G. Tumanov,
Peter West
Abstract:
The non-linear duality relation between the gravity and dual gravity fields is found in E theory by carrying out $E_{11}$ variations of previously found duality relations. We also find the dual graviton equation of motion up to the addition of some very specific terms whose coefficients are not determined. Using the calculations in this paper, this ambiguity was resolved in reference [15], where the full non-linear dual gravity equation was found. As a result, the equations of motion in E theory have now been found at the full non-linear level up to, and including, level three, which contains the dual graviton field. When truncated to contain fields at levels three and less, and the spacetime is restricted to be the familiar eleven dimensional space time, the equations are equivalent to those of eleven dimensional supergravity.
Submitted 24 August, 2020; v1 submitted 30 October, 2017;
originally announced October 2017.
-
Scarcity of periodic orbits in outer billiards
Authors:
Alexander Tumanov
Abstract:
We give a simple proof of our previous result with V. Zharnitsky that the set of period 4 orbits in planar outer billiard with piecewise smooth convex boundary has empty interior, provided that no four corners of the boundary form a parallelogram. We also obtain results on period 5 and 6 orbits.
Submitted 24 December, 2017; v1 submitted 12 June, 2017;
originally announced June 2017.
-
IDK Cascades: Fast Deep Learning by Learning not to Overthink
Authors:
Xin Wang,
Yujia Luo,
Daniel Crankshaw,
Alexey Tumanov,
Fisher Yu,
Joseph E. Gonzalez
Abstract:
Advances in deep learning have led to substantial increases in prediction accuracy but have been accompanied by increases in the cost of rendering predictions. We conjecture that for a majority of real-world inputs, the recent advances in deep learning have created models that effectively "overthink" on simple inputs. In this paper, we revisit the classic question of building model cascades that primarily leverage class asymmetry to reduce cost. We introduce the "I Don't Know" (IDK) prediction cascades framework, a general framework to systematically compose a set of pre-trained models to accelerate inference without a loss in prediction accuracy. We propose two search-based methods for constructing cascades as well as a new cost-aware objective within this framework. The proposed IDK cascade framework can be easily adopted in the existing model serving systems without additional model re-training. We evaluate the proposed techniques on a range of benchmarks to demonstrate the effectiveness of the proposed framework.
Submitted 27 June, 2018; v1 submitted 2 June, 2017;
originally announced June 2017.
-
Real-Time Machine Learning: The Missing Pieces
Authors:
Robert Nishihara,
Philipp Moritz,
Stephanie Wang,
Alexey Tumanov,
William Paul,
Johann Schleier-Smith,
Richard Liaw,
Mehrdad Niknami,
Michael I. Jordan,
Ion Stoica
Abstract:
Machine learning applications are increasingly deployed not only to serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making. These applications pose a new set of requirements, none of which are difficult to achieve in isolation, but the combination of which creates a challenge for existing distributed execution frameworks: computation with millisecond latency at high throughput, adaptive construction of arbitrary task graphs, and execution of heterogeneous kernels over diverse sets of resources. We assert that a new distributed execution framework is needed for such ML applications and propose a candidate approach with a proof-of-concept architecture that achieves a 63x performance improvement over a state-of-the-art execution framework for a representative application.
Submitted 19 May, 2017; v1 submitted 11 March, 2017;
originally announced March 2017.
-
E11, Romans theory and higher level duality relations
Authors:
Alexander G. Tumanov,
Peter West
Abstract:
From the underlying non-linear realisation we compute the complete E11 invariant equations of motion in eleven dimensions, at the linearised level, up to and including level four in the fields. Thus we include the metric, the three and six forms, the dual graviton and three fields at level four. The fields are linked by a set of duality equations, which are first order in derivatives and transform into each other under the E11 symmetries. From these duality relations we deduce second order equations of motion, including those for the usual supergravity fields. As a result the on-shell degrees of freedom are those of the eleven dimensional supergravity. We also show that the level four fields provide an eleven dimensional origin of Romans theory and lead to a novel duality relation.
Submitted 9 January, 2017; v1 submitted 10 November, 2016;
originally announced November 2016.
-
E11 in 11D
Authors:
Alexander G. Tumanov,
Peter West
Abstract:
We construct the non-linear realisation of the semi-direct product of E11 and its vector representation in eleven dimensions and find the dynamical equations it predicts at low levels. These equations are completely determined by the non-linear realisation and when restricted to contain only the usual fields of supergravity and the usual space-time we find precisely the equations of motion of eleven dimensional supergravity. This paper extends the results announced in arXiv:1512.01644 and in particular it contains the contributions to the equations of motion that involve derivatives with respect to the level one generalised coordinates.
Submitted 19 February, 2016; v1 submitted 15 January, 2016;
originally announced January 2016.
-
E11 must be a symmetry of strings and branes
Authors:
Alexander G. Tumanov,
Peter West
Abstract:
We construct the non-linear realisation of the semi-direct product of E11 and its vector representation in five and eleven dimensions and find the dynamical equations it predicts at low levels. Restricting these results to contain only the usual fields of supergravity and the generalised space-time to be the usual space-time we find the equations of motion of the five and eleven dimensional maximal supergravity theories. Since this non-linear realisation contains effects that are beyond the supergravity approximation and are thought to be present in an underlying theory we conclude that the low energy effective action of string and branes must possess an E11 symmetry.
Submitted 22 January, 2016; v1 submitted 5 December, 2015;
originally announced December 2015.
-
E11 and exceptional field theory
Authors:
Alexander G. Tumanov,
Peter West
Abstract:
We demonstrate that exceptional field theory is a truncation of the non-linear realisation of the semi-direct product of E11 and its first fundamental as proposed in 2003. Evaluating the simple equations of the E11 approach, and using the commutators of the E11 algebra, we find the equations of exceptional field theory after making a radical truncation. This procedure does not respect any of the higher level E11 symmetries and so these are lost. We suggest that the need for the section condition in exceptional field theory could be a consequence of the truncation.
Submitted 31 July, 2015;
originally announced July 2015.
-
Pseudoholomorphic discs and symplectic structures in Hilbert space
Authors:
Alexandre Sukhov,
Alexander Tumanov
Abstract:
We develop the theory of $J$-holomorphic discs in Hilbert spaces with almost complex structures. As an application, we prove a version of Gromov's symplectic non-squeezing theorem for Hilbert spaces. It can be applied to short-time symplectic flows of a wide class of Hamiltonian PDEs.
Submitted 28 February, 2015; v1 submitted 14 November, 2014;
originally announced November 2014.
-
Symplectic non-squeezing in Hilbert space and discrete Schrödinger equations
Authors:
Alexandre Sukhov,
Alexander Tumanov
Abstract:
We prove a generalization of Gromov's symplectic non-squeezing theorem for the case of Hilbert spaces. Our approach is based on filling almost complex Hilbert spaces by complex discs partially extending Gromov's results on existence of $J$-complex curves. We apply our result to the flow of the discrete nonlinear Schrödinger equation.
Submitted 5 April, 2016; v1 submitted 14 November, 2014;
originally announced November 2014.
-
The Physics of the B Factories
Authors:
A. J. Bevan,
B. Golob,
Th. Mannel,
S. Prell,
B. D. Yabsley,
K. Abe,
H. Aihara,
F. Anulli,
N. Arnaud,
T. Aushev,
M. Beneke,
J. Beringer,
F. Bianchi,
I. I. Bigi,
M. Bona,
N. Brambilla,
J. Brodzicka,
P. Chang,
M. J. Charles,
C. H. Cheng,
H. -Y. Cheng,
R. Chistov,
P. Colangelo,
J. P. Coleman,
A. Drutskoy
, et al. (2009 additional authors not shown)
Abstract:
This work is on the Physics of the B Factories. Part A of this book contains a brief description of the SLAC and KEK B Factories as well as their detectors, BaBar and Belle, and data taking related issues. Part B discusses tools and methods used by the experiments in order to obtain results. The results themselves can be found in Part C.
Please note that version 3 on the archive is the auxiliary version of the Physics of the B Factories book. This uses the notation alpha, beta, gamma for the angles of the Unitarity Triangle. The nominal version uses the notation phi_1, phi_2 and phi_3. Please cite this work as Eur. Phys. J. C74 (2014) 3026.
Submitted 31 October, 2015; v1 submitted 24 June, 2014;
originally announced June 2014.
-
In situ diffraction study of catalytic hydrogenation of VO2: Stable phases and origins of metallicity
Authors:
Yaroslav Filinchuk,
Nikolay A. Tumanov,
Voraksmy Ban,
Heng Ji,
Jiang Wei,
Michael W. Swift,
Andriy H. Nevidomskyy,
Douglas Natelson
Abstract:
Controlling electronic population through chemical doping is one way to tip the balance between competing phases in materials with strong electronic correlations. Vanadium dioxide exhibits a first-order phase transition at around 338 K between a high temperature, tetragonal, metallic state (T) and a low temperature, monoclinic, insulating state (M1), driven by electron-electron and electron-lattice interactions. Intercalation of VO2 with atomic hydrogen has been demonstrated, with evidence that this doping suppresses the transition. However, the detailed effects of intercalated H on the crystal and electronic structure of the resulting hydride have not been previously reported. Here we present synchrotron and neutron diffraction studies of this material system, mapping out the structural phase diagram as a function of temperature and hydrogen content. In addition to the original T and M1 phases, we find two orthorhombic phases, O1 and O2, which are stabilized at higher hydrogen content. We present density functional calculations that confirm the metallicity of these states and discuss the physical basis by which hydrogen stabilizes conducting phases, in the context of the metal-insulator transition.
Submitted 10 June, 2014;
originally announced June 2014.
-
Commutators of singular integrals, the Bergman projection, and boundary regularity of elliptic equations in the plane
Authors:
Alexander Tumanov
Abstract:
We obtain estimates of commutators of singular integral operators in Lipschitz spaces and apply the results to boundary regularity of elliptic equations in the plane. We obtain an explicit asymptotic formula for the Bergman projection.
Submitted 27 February, 2016; v1 submitted 2 June, 2014;
originally announced June 2014.