-
Enabling Performant and Secure EDA as a Service in Public Clouds Using Confidential Containers
Authors:
Mengmei Ye,
Derren Dunn,
Daniele Buono,
Angelo Ruocco,
Claudio Carvalho,
Tobin Feldman-Fitzthum,
Hubertus Franke,
James Bottomley
Abstract:
Increasingly, business opportunities available to fabless design teams in the semiconductor industry far exceed those addressable with on-prem compute resources. An attractive option to capture these electronic design automation (EDA) design opportunities is through public cloud bursting. However, security concerns with public cloud bursting arise from having to protect process design kits, third party intellectual property, and new design data for semiconductor devices and chips. One way to address security concerns for public cloud bursting is to leverage confidential containers for EDA workloads. Confidential containers add zero trust computing elements to significantly reduce the probability of intellectual property escapes. A key concern that often follows security discussions is whether EDA workload performance will suffer with confidential computing. In this work we demonstrate a full set of EDA confidential containers and their deployment and characterize performance impacts of confidential elements of the flow including storage and networking. A complete end-to-end confidential container-based EDA workload exhibits 7.13% and 2.05% performance overheads over bare-metal container and VM based solutions, respectively.
Submitted 8 July, 2024;
originally announced July 2024.
-
HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models
Authors:
Rhea Sanjay Sukthanker,
Arber Zela,
Benedikt Staffler,
Aaron Klein,
Lennart Purucker,
Joerg K. H. Franke,
Frank Hutter
Abstract:
The increasing size of language models necessitates a thorough analysis across multiple dimensions to assess trade-offs among crucial hardware metrics such as latency, energy consumption, GPU memory usage, and performance. Identifying optimal model configurations under specific hardware constraints is becoming essential but remains challenging due to the computational load of exhaustive training and evaluation on multiple devices. To address this, we introduce HW-GPT-Bench, a hardware-aware benchmark that utilizes surrogate predictions to approximate various hardware metrics across 13 devices of architectures in the GPT-2 family, with architectures containing up to 774M parameters. Our surrogates, via calibrated predictions and reliable uncertainty estimates, faithfully model the heteroscedastic noise inherent in the energy and latency measurements. To estimate perplexity, we employ weight-sharing techniques from Neural Architecture Search (NAS), inheriting pretrained weights from the largest GPT-2 model. Finally, we demonstrate the utility of HW-GPT-Bench by simulating optimization trajectories of various multi-objective optimization algorithms in just a few seconds.
Submitted 21 June, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Authors:
Haoran Qiu,
Weichao Mao,
Archit Patke,
Shengkun Cui,
Saurabh Jha,
Chen Wang,
Hubertus Franke,
Zbigniew T. Kalbarczyk,
Tamer Başar,
Ravishankar K. Iyer
Abstract:
Large language models (LLMs) have been driving a new wave of interactive AI applications across numerous domains. However, efficiently serving LLM inference requests is challenging due to their unpredictable execution times originating from the autoregressive nature of generative models. Existing LLM serving systems exploit first-come-first-serve (FCFS) scheduling, suffering from head-of-line blocking issues. To address the non-deterministic nature of LLMs and enable efficient interactive LLM serving, we present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths. Our open-source SSJF implementation does not require changes to memory management or batching strategies. Evaluations on real-world datasets and production workload traces show that SSJF reduces average job completion times by 30.5-39.6% and increases throughput by 2.2-3.6x compared to FCFS schedulers, across no batching, dynamic batching, and continuous batching settings.
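The scheduling idea in this abstract can be illustrated with a toy single-server simulation. This is only a sketch under stated assumptions, not the authors' implementation: all jobs arrive at time zero, there is no batching, service time is taken to be proportional to output length, and the `predicted_len` values stand in for a hypothetical proxy-model estimate.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    true_len: int       # actual output length (not known to the scheduler)
    predicted_len: int  # hypothetical proxy-model estimate of the length


def mean_completion_time(ordered_jobs):
    """Mean job completion time when jobs run one at a time in the given order."""
    clock = 0
    total = 0
    for job in ordered_jobs:
        clock += job.true_len  # assumption: service time ~ output length
        total += clock
    return total / len(ordered_jobs)


def fcfs_order(jobs):
    """First-come-first-serve: keep arrival order."""
    return list(jobs)


def ssjf_order(jobs):
    """Speculative shortest-job-first: sort by the predicted length."""
    return sorted(jobs, key=lambda job: job.predicted_len)


# Made-up workload: a long request arrives first and blocks the short ones.
jobs = [
    Job("a", true_len=900, predicted_len=800),
    Job("b", true_len=50, predicted_len=70),
    Job("c", true_len=300, predicted_len=250),
    Job("d", true_len=20, predicted_len=40),
]

fcfs_jct = mean_completion_time(fcfs_order(jobs))
ssjf_jct = mean_completion_time(ssjf_order(jobs))
print(f"FCFS mean JCT: {fcfs_jct}, SSJF mean JCT: {ssjf_jct}")
```

The predictions only need to rank jobs correctly, not match the true lengths, for the ordering to avoid head-of-line blocking in this toy setting.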
Submitted 12 April, 2024;
originally announced April 2024.
-
$\widetilde{O}(T^{-1})$ Convergence to (Coarse) Correlated Equilibria in Full-Information General-Sum Markov Games
Authors:
Weichao Mao,
Haoran Qiu,
Chen Wang,
Hubertus Franke,
Zbigniew Kalbarczyk,
Tamer Başar
Abstract:
No-regret learning has a long history of being closely connected to game theory. Recent works have devised uncoupled no-regret learning dynamics that, when adopted by all the players in normal-form games, converge to various equilibrium solutions at a near-optimal rate of $\widetilde{O}(T^{-1})$, a significant improvement over the $O(1/\sqrt{T})$ rate of classic no-regret learners. However, analogous convergence results are scarce in Markov games, a more generic setting that lays the foundation for multi-agent reinforcement learning. In this work, we close this gap by showing that the optimistic-follow-the-regularized-leader (OFTRL) algorithm, together with appropriate value update procedures, can find $\widetilde{O}(T^{-1})$-approximate (coarse) correlated equilibria in full-information general-sum Markov games within $T$ iterations. Numerical results are also included to corroborate our theoretical findings.
Submitted 23 April, 2024; v1 submitted 2 February, 2024;
originally announced March 2024.
-
Rethinking Performance Measures of RNA Secondary Structure Problems
Authors:
Frederic Runge,
Jörg K. H. Franke,
Daniel Fertmann,
Frank Hutter
Abstract:
Accurate RNA secondary structure prediction is vital for understanding cellular regulation and disease mechanisms. Deep learning (DL) methods have surpassed traditional algorithms by predicting complex features like pseudoknots and multi-interacting base pairs. However, traditional distance measures can hardly deal with such tertiary interactions and the currently used evaluation measures (F1 score, MCC) have limitations. We propose the Weisfeiler-Lehman graph kernel (WL) as an alternative metric. Embracing graph-based metrics like WL enables fair and accurate evaluation of RNA structure prediction algorithms. Further, WL provides informative guidance, as demonstrated in an RNA design experiment.
Submitted 4 December, 2023;
originally announced January 2024.
-
BPF-oF: Storage Function Pushdown Over the Network
Authors:
Ioannis Zarkadas,
Tal Zussman,
Jeremy Carin,
Sheng Jiang,
Yuhong Zhong,
Jonas Pfefferle,
Hubertus Franke,
Junfeng Yang,
Kostis Kaffes,
Ryan Stutsman,
Asaf Cidon
Abstract:
Storage disaggregation, wherein storage is accessed over the network, is popular because it allows applications to independently scale storage capacity and bandwidth based on dynamic application demand. However, the added network processing introduced by disaggregation can consume significant CPU resources. In many storage systems, logical storage operations (e.g., lookups, aggregations) involve a series of simple but dependent I/O access patterns. Therefore, one way to reduce the network processing overhead is to execute dependent series of I/O accesses at the remote storage server, reducing the back-and-forth communication between the storage layer and the application. We refer to this approach as remote-storage pushdown. We present BPF-oF, a new remote-storage pushdown protocol built on top of NVMe-oF, which enables applications to safely push custom eBPF storage functions to a remote storage server.
The main challenge in integrating BPF-oF with storage systems is preserving the benefits of their client-based in-memory caches. We address this challenge by designing novel caching techniques for storage pushdown, including splitting queries into separate in-memory and remote-storage phases and periodically refreshing the client cache with sampled accesses from the remote storage device. We demonstrate the utility of BPF-oF by integrating it with three storage systems, including RocksDB, a popular persistent key-value store that has no existing storage pushdown capability. We show BPF-oF provides significant speedups in all three systems when accessed over the network, for example improving RocksDB's throughput by up to 2.8$\times$ and tail latency by up to 2.6$\times$.
Submitted 11 December, 2023;
originally announced December 2023.
-
Constrained Parameter Regularization
Authors:
Jörg K. H. Franke,
Michael Hefenbrock,
Gregor Koehler,
Frank Hutter
Abstract:
Regularization is a critical component in deep learning training, with weight decay being a commonly used approach. It applies a constant penalty coefficient uniformly across all parameters. This may be unnecessarily restrictive for some parameters, while insufficiently restricting others. To dynamically adjust penalty coefficients for different parameter groups, we present constrained parameter regularization (CPR) as an alternative to traditional weight decay. Instead of applying a single constant penalty to all parameters, we enforce an upper bound on a statistical measure (e.g., the L$_2$-norm) of parameter groups. Consequently, learning becomes a constraint optimization problem, which we address by an adaptation of the augmented Lagrangian method. CPR only requires two hyperparameters and incurs no measurable runtime overhead. Additionally, we propose a simple but efficient mechanism to adapt the upper bounds during the optimization. We provide empirical evidence of CPR's efficacy in experiments on the "grokking" phenomenon, computer vision, and language modeling tasks. Our results demonstrate that CPR counteracts the effects of grokking and consistently matches or outperforms traditional weight decay.
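The constrained formulation described in this abstract can be sketched on a toy problem. The code below is an assumption-laden illustration, not the paper's algorithm: it uses half the squared L2 norm as the measure R, a clipped multiplier update of the form lam = max(0, lam + mu * (R - kappa)), and plain gradient descent. The names `cpr_step`, `kappa`, `mu`, and `lr` are hypothetical, and the exact CPR update rule may differ.

```python
def cpr_step(theta, lam, grad_loss, kappa, mu, lr):
    """One illustrative update for a single parameter group.

    theta     : list of parameters in the group
    lam       : current Lagrange multiplier (penalty coefficient) for the group
    grad_loss : gradient of the training loss with respect to theta
    kappa     : upper bound on the measure R(theta) = 0.5 * ||theta||^2 (assumed)
    mu        : multiplier step size (assumed hyperparameter)
    lr        : learning rate
    """
    measure = 0.5 * sum(t * t for t in theta)
    # The multiplier grows while the bound is violated and is clipped at zero.
    lam = max(0.0, lam + mu * (measure - kappa))
    # The gradient of R(theta) is theta, so lam acts like an adaptive weight decay.
    theta = [t - lr * (g + lam * t) for t, g in zip(theta, grad_loss)]
    return theta, lam


# Toy loss 0.5 * ||theta - target||^2, whose unconstrained optimum has norm 5.
target = [3.0, 4.0]
theta, lam = [0.0, 0.0], 0.0
for _ in range(500):
    grad = [t - c for t, c in zip(theta, target)]
    theta, lam = cpr_step(theta, lam, grad, kappa=0.5, mu=1.0, lr=0.1)

norm = sum(t * t for t in theta) ** 0.5
print(f"final norm: {norm:.4f}, multiplier: {lam:.4f}")
```

In this toy run the penalty coefficient stays at zero until the bound is first exceeded, then settles at whatever value holds the group's norm on the bound, which is the contrast with a fixed weight-decay coefficient.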
Submitted 6 December, 2023; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Beyond Random Augmentations: Pretraining with Hard Views
Authors:
Fabio Ferreira,
Ivo Rapant,
Jörg K. H. Franke,
Frank Hutter
Abstract:
Many Self-Supervised Learning (SSL) methods aim for model invariance to different image augmentations known as views. To achieve this invariance, conventional approaches make use of random sampling operations within the image augmentation pipeline. We hypothesize that the efficacy of pretraining pipelines based on conventional random view sampling can be enhanced by explicitly selecting views that benefit the learning progress. A simple, yet effective approach is to select hard views that yield a higher loss. In this paper, we present Hard View Pretraining (HVP), a learning-free strategy that builds upon this hypothesis and extends random view generation. HVP exposes the model to harder, more challenging samples during SSL pretraining, which enhances downstream performance. It encompasses the following iterative steps: 1) randomly sample multiple views and forward each view through the pretrained model, 2) create pairs of two views and compute their loss, 3) adversarially select the pair yielding the highest loss depending on the current model state, and 4) run the backward pass with the selected pair. As a result, HVP achieves linear evaluation accuracy improvements of 1% on average on ImageNet for both 100 and 300 epoch pretraining and similar improvements on transfer tasks across DINO, SimSiam, iBOT, and SimCLR.
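Steps 1 to 3 of the procedure listed in this abstract can be sketched as a selection routine. This is a minimal illustration with hypothetical stand-ins: scalar "views", a random-offset augmentation, and a squared-difference `toy_loss` replace the real image augmentations and SSL objective, and the backward pass of step 4 is omitted.

```python
import itertools
import random


def sample_views(image, num_views, rng):
    """Step 1 (toy version): create several randomly augmented views."""
    return [image + rng.uniform(-1.0, 1.0) for _ in range(num_views)]


def select_hard_pair(views, pair_loss):
    """Steps 2 and 3: score every pair of views and keep the highest-loss pair."""
    return max(itertools.combinations(views, 2), key=lambda pair: pair_loss(*pair))


def toy_loss(view_a, view_b):
    """Hypothetical stand-in for the SSL loss under the current model state."""
    return (view_a - view_b) ** 2


rng = random.Random(0)
views = sample_views(image=0.0, num_views=4, rng=rng)
hard_a, hard_b = select_hard_pair(views, toy_loss)
# Step 4 would run the backward pass using only (hard_a, hard_b).
print(f"selected pair loss: {toy_loss(hard_a, hard_b):.4f}")
```

Because the selection only needs forward passes and a comparison of losses, no extra learned component is involved, which matches the abstract's description of the strategy as learning-free.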
Submitted 27 May, 2024; v1 submitted 5 October, 2023;
originally announced October 2023.
-
RecycleNet: Latent Feature Recycling Leads to Iterative Decision Refinement
Authors:
Gregor Koehler,
Tassilo Wald,
Constantin Ulrich,
David Zimmerer,
Paul F. Jaeger,
Jörg K. H. Franke,
Simon Kohl,
Fabian Isensee,
Klaus H. Maier-Hein
Abstract:
Despite the remarkable success of deep learning systems over the last decade, a key difference still remains between neural network and human decision-making: As humans, we cannot only form a decision on the spot, but also ponder, revisiting an initial guess from different angles, distilling relevant information, arriving at a better decision. Here, we propose RecycleNet, a latent feature recycling method, instilling the pondering capability for neural networks to refine initial decisions over a number of recycling steps, where outputs are fed back into earlier network layers in an iterative fashion. This approach makes minimal assumptions about the neural network architecture and thus can be implemented in a wide variety of contexts. Using medical image segmentation as the evaluation environment, we show that latent feature recycling enables the network to iteratively refine initial predictions even beyond the iterations seen during training, converging towards an improved decision. We evaluate this across a variety of segmentation benchmarks and show consistent improvements even compared with top-performing segmentation methods. This allows trading increased computation time for improved performance, which can be beneficial, especially for safety-critical applications.
Submitted 14 September, 2023;
originally announced September 2023.
-
Scalable Deep Learning for RNA Secondary Structure Prediction
Authors:
Jörg K. H. Franke,
Frederic Runge,
Frank Hutter
Abstract:
The field of RNA secondary structure prediction has made significant progress with the adoption of deep learning techniques. In this work, we present the RNAformer, a lean deep learning model using axial attention and recycling in the latent space. We gain performance improvements by designing the architecture for modeling the adjacency matrix directly in the latent space and by scaling the size of the model. Our approach achieves state-of-the-art performance on the popular TS0 benchmark dataset and even outperforms methods that use external information. Further, we show experimentally that the RNAformer can learn a biophysical model of the RNA folding process.
Submitted 14 July, 2023;
originally announced July 2023.
-
Towards Automated Design of Riboswitches
Authors:
Frederic Runge,
Jörg K. H. Franke,
Frank Hutter
Abstract:
Experimental screening and selection pipelines for the discovery of novel riboswitches are expensive, time-consuming, and inefficient. Using computational methods to reduce the number of candidates for the screen could drastically decrease these costs. However, existing computational approaches do not fully satisfy all requirements for the design of such initial screening libraries. In this work, we present a new method, libLEARNA, capable of providing RNA focus libraries of diverse variable-length qualified candidates. Our novel structure-based design approach considers global properties as well as desired sequence and structure features. We demonstrate the benefits of our method by designing theophylline riboswitch libraries, following a previously published protocol, and yielding 30% more unique high-quality candidates.
Submitted 17 July, 2023;
originally announced July 2023.
-
Remote attestation of SEV-SNP confidential VMs using e-vTPMs
Authors:
Vikram Narayanan,
Claudio Carvalho,
Angelo Ruocco,
Gheorghe Almási,
James Bottomley,
Mengmei Ye,
Tobin Feldman-Fitzthum,
Daniele Buono,
Hubertus Franke,
Anton Burtsev
Abstract:
Trying to address the security challenges of a cloud-centric software deployment paradigm, silicon and cloud vendors are introducing confidential computing - an umbrella term aimed at providing hardware and software mechanisms for protecting cloud workloads from the cloud provider and its software stack. Today, Intel SGX, AMD SEV, Intel TDX, etc., provide a way to shield cloud applications from the cloud provider through encryption of the application's memory below the hardware boundary of the CPU, hence requiring trust only in the CPU vendor. Unfortunately, existing hardware mechanisms do not automatically enable the guarantee that a protected system was not tampered with during configuration and boot time. Such a guarantee relies on a hardware RoT, i.e., an integrity-protected location that can store measurements in a trustworthy manner, extend them, and authenticate the measurement logs to the user.
In this work, we design and implement a virtual TPM that virtualizes the hardware RoT without requiring trust in the cloud provider. To ensure the security of a vTPM in a provider-controlled environment, we leverage unique isolation properties of the SEV-SNP hardware that allows us to execute secure services as part of the enclave environment protected from the cloud provider. We further develop a novel approach to vTPM state management where the vTPM state is not preserved across reboots. Specifically, we develop a stateless ephemeral vTPM that supports remote attestation without any persistent state on the host. This allows us to pair each confidential VM with a private instance of a vTPM completely isolated from the provider-controlled environment and other VMs. We built our prototype entirely on open-source components. Though our work is AMD-specific, a similar approach could be used to build remote attestation protocols on other trusted execution environments.
Submitted 25 June, 2023; v1 submitted 29 March, 2023;
originally announced March 2023.
-
Intel TDX Demystified: A Top-Down Approach
Authors:
Pau-Chen Cheng,
Wojciech Ozga,
Enriquillo Valdez,
Salman Ahmed,
Zhongshu Gu,
Hani Jamjoom,
Hubertus Franke,
James Bottomley
Abstract:
Intel Trust Domain Extensions (TDX) is a new architectural extension in the 4th Generation Intel Xeon Scalable Processor that supports confidential computing. TDX allows the deployment of virtual machines in the Secure-Arbitration Mode (SEAM) with encrypted CPU state and memory, integrity protection, and remote attestation. TDX aims to enforce hardware-assisted isolation for virtual machines and minimize the attack surface exposed to host platforms, which are considered to be untrustworthy or adversarial in the confidential computing's new threat model. TDX can be leveraged by regulated industries or sensitive data holders to outsource their computations and data with end-to-end protection in public cloud infrastructure.
This paper aims to provide a comprehensive understanding of TDX to potential adopters, domain experts, and security researchers looking to leverage the technology for their own purposes. We adopt a top-down approach, starting with high-level security principles and moving to low-level technical details of TDX. Our analysis is based on publicly available documentation and source code, offering insights from security researchers outside of Intel.
Submitted 27 March, 2023;
originally announced March 2023.
-
Programmable System Call Security with eBPF
Authors:
Jinghao Jia,
YiFei Zhu,
Dan Williams,
Andrea Arcangeli,
Claudio Canella,
Hubertus Franke,
Tobin Feldman-Fitzthum,
Dimitrios Skarlatos,
Daniel Gruss,
Tianyin Xu
Abstract:
System call filtering is a widely used security mechanism for protecting a shared OS kernel against untrusted user applications. However, existing system call filtering techniques either are too expensive due to the context switch overhead imposed by userspace agents, or lack sufficient programmability to express advanced policies. Seccomp, Linux's system call filtering module, is widely used by modern container technologies, mobile apps, and system management services. Despite the adoption of the classic BPF language (cBPF), security policies in Seccomp are mostly limited to static allow lists, primarily because cBPF does not support stateful policies. Consequently, many essential security features cannot be expressed precisely and/or require kernel modifications.
In this paper, we present a programmable system call filtering mechanism, which enables more advanced security policies to be expressed by leveraging the extended BPF language (eBPF). More specifically, we create a new Seccomp eBPF program type, exposing, modifying or creating new eBPF helper functions to safely manage filter state, access kernel and user state, and utilize synchronization primitives. Importantly, our system integrates with existing kernel privilege and capability mechanisms, enabling unprivileged users to install advanced filters safely. Our evaluation shows that our eBPF-based filtering can enhance existing policies (e.g., reducing the attack surface of early execution phase by up to 55.4% for temporal specialization), mitigate real-world vulnerabilities, and accelerate filters.
Submitted 20 February, 2023;
originally announced February 2023.
-
Partially Trusting the Service Mesh Control Plane
Authors:
Constantin Adam,
Abdulhamid Adebayo,
Hubertus Franke,
Edward Snible,
Tobin Feldman-Fitzthum,
James Cadden,
Nerla Jean-Louis
Abstract:
Zero Trust is a novel cybersecurity model that focuses on continually evaluating trust to prevent the initiation and horizontal spreading of attacks. A cloud-native Service Mesh is an example of Zero Trust Architecture that can filter out external threats. However, the Service Mesh does not shield the Application Owner from internal threats, such as a rogue administrator of the cluster where their application is deployed. In this work, we are enhancing the Service Mesh to allow the definition and reinforcement of a Verifiable Configuration that is defined and signed off by the Application Owner. Backed by automated digital signing solutions and confidential computing technologies, the Verifiable Configuration allows changing the trust model of the Service Mesh, from the data plane fully trusting the control plane to partially trusting it. This lets the application benefit from all the functions provided by the Service Mesh (resource discovery, traffic management, mutual authentication, access control, observability), while ensuring that the Cluster Administrator cannot change the state of the application in a way that was not intended by the Application Owner.
Submitted 23 October, 2022;
originally announced October 2022.
-
Probabilistic Transformer: Modelling Ambiguities and Distributions for RNA Folding and Molecule Design
Authors:
Jörg K. H. Franke,
Frederic Runge,
Frank Hutter
Abstract:
Our world is ambiguous and this is reflected in the data we use to train our algorithms. This is particularly true when we try to model natural processes where collected data is affected by noisy measurements and differences in measurement techniques. Sometimes, the process itself is ambiguous, such as in the case of RNA folding, where the same nucleotide sequence can fold into different structures. This suggests that a predictive model should have similar probabilistic characteristics to match the data it models. Therefore, we propose a hierarchical latent distribution to enhance one of the most successful deep learning models, the Transformer, to accommodate ambiguities and data distributions. We show the benefits of our approach (1) on a synthetic task that captures the ability to learn a hidden data distribution, (2) with state-of-the-art results in RNA folding that reveal advantages on highly ambiguous data, and (3) demonstrating its generative capabilities on property-based molecule design by implicitly learning the underlying distributions and outperforming existing work.
Submitted 14 November, 2022; v1 submitted 27 May, 2022;
originally announced May 2022.
-
HetSched: Quality-of-Mission Aware Scheduling for Autonomous Vehicle SoCs
Authors:
Aporva Amarnath,
Subhankar Pal,
Hiwot Kassa,
Augusto Vega,
Alper Buyuktosunoglu,
Hubertus Franke,
John-David Wellman,
Ronald Dreslinski,
Pradip Bose
Abstract:
Systems-on-Chips (SoCs) that power autonomous vehicles (AVs) must meet stringent performance and safety requirements prior to deployment. With increasing complexity in AV applications, the system needs to meet these real-time demands of multiple safety-critical applications simultaneously. A typical AV-SoC is a heterogeneous multiprocessor consisting of accelerators supported by general-purpose cores. Such heterogeneity, while needed for power-performance efficiency, complicates the art of task scheduling.
In this paper, we demonstrate that hardware heterogeneity impacts the scheduler's effectiveness and that optimizing for only the real-time aspect of applications is not sufficient in AVs. Therefore, a more holistic approach is required -- one that considers global Quality-of-Mission (QoM) metrics, as defined in the paper. We then propose HetSched, a multi-step scheduler that leverages dynamic runtime information about the underlying heterogeneous hardware platform, along with the applications' real-time constraints and the task traffic in the system to optimize overall mission performance. HetSched proposes two scheduling policies: MSstat and MSdyn and scheduling optimizations like task pruning, hybrid heterogeneous ranking and rank update. HetSched improves overall mission performance on average by 4.6x, 2.6x and 2.6x when compared against CPATH, ADS and 2lvl-EDF (state-of-the-art real-time schedulers built for heterogeneous systems), respectively, and achieves an average of 53.3% higher hardware utilization, while meeting 100% critical deadlines for real-world applications of autonomous vehicles. Furthermore, when used as part of an SoC design space exploration loop, in comparison to prior schedulers, HetSched reduces the number of processing elements required by an SoC to safely complete AV's missions by 35% on average while achieving 2.7x lower energy-mission time product.
Submitted 24 March, 2022;
originally announced March 2022.
-
Hyperparameter Transfer Across Developer Adjustments
Authors:
Danny Stoll,
Jörg K. H. Franke,
Diane Wagner,
Simon Selg,
Frank Hutter
Abstract:
After developer adjustments to a machine learning (ML) algorithm, how can the results of an old hyperparameter optimization (HPO) automatically be used to speed up a new HPO? This question poses a challenging problem, as developer adjustments can change which hyperparameter settings perform well, or even the hyperparameter search space itself. While many approaches exist that leverage knowledge obtained on previous tasks, so far, knowledge from previous development steps remains entirely untapped. In this work, we remedy this situation and propose a new research framework: hyperparameter transfer across adjustments (HT-AA). To lay a solid foundation for this research framework, we provide four simple HT-AA baseline algorithms and eight benchmarks changing various aspects of ML algorithms, their hyperparameter search spaces, and the neural architectures used. The best baseline, on average and depending on the budgets for the old and new HPO, reaches a given performance 1.2--2.6x faster than a prominent HPO algorithm without transfer. As HPO is a crucial step in ML development but requires extensive computational resources, this speedup would lead to faster development cycles, lower costs, and reduced environmental impacts. To make these benefits available to ML developers off-the-shelf and to facilitate future research on HT-AA, we provide Python packages for our baselines and benchmarks.
Submitted 25 October, 2020;
originally announced October 2020.
-
Sample-Efficient Automated Deep Reinforcement Learning
Authors:
Jörg K. H. Franke,
Gregor Köhler,
André Biedenkapp,
Frank Hutter
Abstract:
Despite significant progress in challenging problems across various domains, applying state-of-the-art deep reinforcement learning (RL) algorithms remains challenging due to their sensitivity to the choice of hyperparameters. This sensitivity can partly be attributed to the non-stationarity of the RL problem, potentially requiring different hyperparameter settings at various stages of the learning process. Additionally, in the RL setting, hyperparameter optimization (HPO) requires a large number of environment interactions, hindering the transfer of the successes in RL to real-world applications. In this work, we tackle the issues of sample-efficient and dynamic HPO in RL. We propose a population-based automated RL (AutoRL) framework to meta-optimize arbitrary off-policy RL algorithms. In this framework, we optimize the hyperparameters and also the neural architecture while simultaneously training the agent. By sharing the collected experience across the population, we substantially increase the sample efficiency of the meta-optimization. We demonstrate the capabilities of our sample-efficient AutoRL approach in a case study with the popular TD3 algorithm in the MuJoCo benchmark suite, where we reduce the number of environment interactions needed for meta-optimization by up to an order of magnitude compared to population-based training.
Submitted 17 March, 2021; v1 submitted 3 September, 2020;
originally announced September 2020.
-
STOMP: A Tool for Evaluation of Scheduling Policies in Heterogeneous Multi-Processors
Authors:
Augusto Vega,
Aporva Amarnath,
John-David Wellman,
Hiwot Kassa,
Subhankar Pal,
Hubertus Franke,
Alper Buyuktosunoglu,
Ronald Dreslinski,
Pradip Bose
Abstract:
The proliferation of heterogeneous chip multiprocessors in recent years has reached unprecedented levels. Traditional homogeneous platforms have shown fundamental limitations when it comes to enabling high-performance yet ultra-low-power computing, in particular in application domains with real-time execution deadlines or criticality constraints. By combining the right set of general-purpose cores and hardware accelerators, along with proper chip interconnects and memory technology, heterogeneous chip multiprocessors have become an effective high-performance and low-power computing alternative.
One of the challenges of heterogeneous architectures relates to efficient scheduling of application tasks (processes, threads) across the variety of options in the chip. As a result, it is key to provide tools to enable early-stage prototyping and evaluation of new scheduling policies for heterogeneous platforms. In this paper, we present STOMP (Scheduling Techniques Optimization in heterogeneous Multi-Processors), a simulator for fast implementation and evaluation of task scheduling policies in multi-core/multi-processor systems with a convenient interface for "plugging" in new scheduling policies in a simple manner. Thorough validation of STOMP exhibits small relative errors when compared against closed-form equivalent models during steady-state analysis.
Submitted 28 July, 2020;
originally announced July 2020.
-
Neural Architecture Evolution in Deep Reinforcement Learning for Continuous Control
Authors:
Jörg K. H. Franke,
Gregor Köhler,
Noor Awad,
Frank Hutter
Abstract:
Current Deep Reinforcement Learning algorithms still heavily rely on handcrafted neural network architectures. We propose a novel approach to automatically find strong topologies for continuous control tasks while only adding a minor overhead in terms of interactions in the environment. To achieve this, we combine Neuroevolution techniques with off-policy training and propose a novel architecture mutation operator. Experiments on five continuous control benchmarks show that the proposed Actor-Critic Neuroevolution algorithm often outperforms the strong Actor-Critic baseline and is capable of automatically finding topologies in a sample-efficient manner which would otherwise have to be found by expensive architecture search.
Submitted 27 February, 2020; v1 submitted 28 October, 2019;
originally announced October 2019.
-
Coexistence of strong and weak coupling in ZnO nanowire cavities
Authors:
Tom Michalsky,
Helena Franke,
Robert Buschlinger,
Ulf Peschel,
Marius Grundmann,
Rüdiger Schmidt-Grund
Abstract:
We present a high-quality two-dimensional cavity structure based on ZnO nanowires coated with concentric Bragg reflectors. The spatial mode distribution leads to the simultaneous appearance of the weak and strong coupling regimes even at room temperature. Photoluminescence measurements agree with FDTD simulations. Furthermore, the ZnO core nanowires allow for the observation of middle polariton branches between the A- and B-exciton ground state resonances. In addition, lasing emission up to room temperature is detected in excitation-dependent photoluminescence measurements.
Submitted 22 February, 2016;
originally announced February 2016.
-
Disaggregated and optically interconnected memory: when will it be cost effective?
Authors:
Bulent Abali,
Richard J. Eickemeyer,
Hubertus Franke,
Chung-Sheng Li,
Marc A. Taubenblatt
Abstract:
The "Disaggregated Server" concept has been proposed for datacenters where server resources of the same type are aggregated in their respective pools, for example a compute pool, memory pool, network pool, and a storage pool. Each server is constructed dynamically by allocating the right amount of resources from these pools according to the workload's requirements. Modularity, higher packaging and cooling efficiencies, and higher resource utilization are among the suggested benefits. With the emergence of very large datacenters, "clouds" containing tens of thousands of servers, datacenter efficiency has become an important topic. A few computer chip and systems vendors are working on and making frequent announcements on silicon photonics and disaggregated memory systems.
In this paper we study the trade-off between cost and performance of building a disaggregated memory system where DRAM modules in the datacenter are pooled, for example in memory-only chassis and racks. The compute pool and the memory pool are interconnected by an optical interconnect to overcome the distance and bandwidth issues of electrical fabrics. We construct a simple cost model that includes the cost of latency, cost of bandwidth and the savings expected from a disaggregated memory system. We then identify the level at which a disaggregated memory system becomes cost competitive with a traditional direct attached memory system.
Our analysis shows that a rack-scale disaggregated memory system will have a non-trivial performance penalty, and at the datacenter scale the penalty is impractically high, and the optical interconnect costs are at least a factor of 10 more expensive than where they should be when compared to the traditional direct attached memory systems.
Submitted 3 March, 2015;
originally announced March 2015.
-
Discrete relaxation of exciton-polaritons in an inhomogeneous potential
Authors:
T. Michalsky,
H. Franke,
C. Sturm,
M. Grundmann,
R. Schmidt-Grund
Abstract:
We present indications that the wave function-stiffness condition during energy-relaxation, as observed in single-phase state quantum systems, also manifests in a single particle ensemble. This is demonstrated for exciton-polaritons in the strong coupling regime in a ZnO-based microcavity at T = 10 K for non-resonant excitation. It is well known that the pump-induced spatially inhomogeneous background potential leads to nearly equally spaced energy levels in the k-space distribution for propagating polariton Bose-Einstein condensates. Surprisingly, this particular pattern is also observable for uncondensed exciton-polaritons.
Submitted 12 January, 2015;
originally announced January 2015.
-
Cavity Polariton Condensate in a Disordered Environment
Authors:
Martin Thunert,
Alexander Janot,
Helena Franke,
Chris Sturm,
Tom Michalsky,
María Dolores Martín,
Luis Viña,
Bernd Rosenow,
Marius Grundmann,
Rüdiger Schmidt-Grund
Abstract:
We report on the influence of disorder on an exciton-polariton condensate in a ZnO-based bulk planar microcavity and compare experimental results with a theoretical model for a non-equilibrium condensate. Experimentally, we detect intensity fluctuations within the far-field emission pattern even at high condensate densities, which indicates a significant impact of disorder. We show that these effects rely on the driven dissipative nature of the condensate and argue that they can be accounted for by spatial phase inhomogeneities induced by disorder, which occur even for increasing condensate densities realized in the regime of high excitation power. Thus, non-equilibrium effects strongly suppress the stabilization of the condensate against disorder, contrary to what is expected for equilibrium condensates in the high density limit. Numerical simulations based on our theoretical model reproduce the experimental data.
Submitted 16 November, 2015; v1 submitted 30 December, 2014;
originally announced December 2014.