-
Ontology Development Kit: a toolkit for building, maintaining, and standardising biomedical ontologies
Authors:
Nicolas Matentzoglu,
Damien Goutte-Gattat,
Shawn Zheng Kai Tan,
James P. Balhoff,
Seth Carbon,
Anita R. Caron,
William D. Duncan,
Joe E. Flack,
Melissa Haendel,
Nomi L. Harris,
William R Hogan,
Charles Tapley Hoyt,
Rebecca C. Jackson,
HyeongSik Kim,
Huseyin Kir,
Martin Larralde,
Julie A. McMurry,
James A. Overton,
Bjoern Peters,
Clare Pilgrim,
Ray Stefancsik,
Sofia MC Robb,
Sabrina Toro,
Nicole A Vasilevsky,
Ramona Walls,
et al. (2 additional authors not shown)
Abstract:
Similar to managing software packages, managing the ontology life cycle involves multiple complex workflows such as preparing releases, continuous quality control checking, and dependency management. To manage these processes, a diverse set of tools is required, from command line utilities to powerful ontology engineering environments such as ROBOT. Particularly in the biomedical domain, which has developed a set of highly diverse yet inter-dependent ontologies, standardising release practices and metadata, and establishing shared quality standards, are crucial to enable interoperability. The Ontology Development Kit (ODK) provides a set of standardised, customisable, and automatically executable workflows, and packages all required tooling in a single Docker image. In this paper, we provide an overview of how the ODK works, show how it is used in practice, and describe how we envision it driving standardisation efforts in our community.
Submitted 5 July, 2022;
originally announced July 2022.
-
Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads
Authors:
Guin Gilman,
Robert J. Walls
Abstract:
We investigate the performance of the concurrency mechanisms available on NVIDIA's new Ampere GPU microarchitecture under deep learning training and inference workloads. In contrast to previous studies that treat the GPU as a black box, we examine scheduling at the microarchitectural level. We find that the lack of fine-grained preemption mechanisms, robust task prioritization options, and contention-aware thread block placement policies limits the effectiveness of NVIDIA's concurrency mechanisms. In summary, the sequential nature of deep learning workloads and their fluctuating resource requirements and kernel runtimes make executing such workloads while maintaining consistently high utilization and low, predictable turnaround times difficult on current NVIDIA hardware.
Submitted 1 October, 2021;
originally announced October 2021.
-
Memory-Efficient Deep Learning Inference in Trusted Execution Environments
Authors:
Jean-Baptiste Truong,
William Gallagher,
Tian Guo,
Robert J. Walls
Abstract:
This study identifies and proposes techniques to alleviate two key bottlenecks to executing deep neural networks in trusted execution environments (TEEs): page thrashing during the execution of convolutional layers and the decryption of large weight matrices in fully-connected layers. For the former, we propose a novel partitioning scheme, y-plane partitioning, designed to (i) provide consistent execution time when the layer output is large compared to the TEE secure memory; and (ii) significantly reduce the memory footprint of convolutional layers. For the latter, we leverage quantization and compression. In our evaluation, the proposed optimizations incurred latency overheads ranging from 1.09X to 2X the baseline for a wide range of TEE sizes; in contrast, an unmodified implementation incurred latencies of up to 26X the baseline when running inside the TEE.
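The abstract does not spell out how y-plane partitioning works, so the following is only a generic illustration of the underlying idea of tiling a convolution along the y axis: the output is computed in horizontal bands, and each band touches only the input rows it needs, which bounds the working set. The function names, the band size, and the toy image and kernel are assumptions made for this sketch, not the authors' implementation.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation (no kernel flip, no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + dy][x + dx] * kernel[dy][dx]
                for dy in range(kh) for dx in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def conv2d_row_bands(image, kernel, band_rows):
    """Compute the same result band by band along the y axis.

    Each band of `band_rows` output rows needs only
    `band_rows + kernel_height - 1` input rows, so the number of rows
    held at once no longer grows with the full image height.
    Returns (output, peak_input_rows_held_at_once).
    """
    kh = len(kernel)
    out_h = len(image) - kh + 1
    output = []
    peak_rows = 0
    for start in range(0, out_h, band_rows):
        stop = min(start + band_rows, out_h)
        # Input rows start .. stop + kh - 2 cover output rows start .. stop - 1.
        band_input = image[start:stop + kh - 1]
        peak_rows = max(peak_rows, len(band_input))
        output.extend(conv2d_valid(band_input, kernel))
    return output, peak_rows


# Hypothetical 6x5 input and 3x3 kernel, chosen only for illustration.
image = [[(3 * r + c) % 7 for c in range(5)] for r in range(6)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

full = conv2d_valid(image, kernel)
banded, peak_rows = conv2d_row_bands(image, kernel, band_rows=2)
```

In this toy setting the banded version reproduces the full result while holding at most four input rows at a time instead of all six. A real TEE implementation would also have to account for channels, strides, and page granularity, none of which are modelled here.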
Submitted 30 September, 2021; v1 submitted 30 April, 2021;
originally announced April 2021.
-
Data-Free Model Extraction
Authors:
Jean-Baptiste Truong,
Pratyush Maini,
Robert J. Walls,
Nicolas Papernot
Abstract:
Current model extraction attacks assume that the adversary has access to a surrogate dataset with characteristics similar to the proprietary data used to train the victim model. This requirement precludes the use of existing model extraction techniques on valuable models, such as those trained on rare or hard-to-acquire datasets. In contrast, we propose data-free model extraction methods that do not require a surrogate dataset. Our approach adapts techniques from the area of data-free knowledge transfer for model extraction. As part of our study, we identify that the choice of loss is critical to ensuring that the extracted model is an accurate replica of the victim model. Furthermore, we address difficulties arising from the adversary's limited access to the victim model in a black-box setting. For example, we recover the model's logits from its probability predictions to approximate gradients. We find that the proposed data-free model extraction approach achieves high accuracy with reasonable query complexity: 0.99x and 0.92x the victim model accuracy on the SVHN and CIFAR-10 datasets given 2M and 20M queries, respectively.
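The step of recovering logits from probability predictions can be sketched as follows. Because log p_i = z_i - logsumexp(z) for a softmax output, subtracting the mean log-probability recovers the logits up to an additive constant. The gradient estimator shown here is a plain coordinate-wise forward difference, used as a simple stand-in; the abstract does not say which estimator the authors use. `toy_victim` and all numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recover_centered_logits(probs, floor=1e-12):
    """Recover logits from softmax probabilities, up to an additive constant.

    Since log p_i = z_i - logsumexp(z), subtracting the mean of the
    log-probabilities yields z_i - mean(z).
    """
    logs = [math.log(max(p, floor)) for p in probs]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def forward_difference_gradient(query, x, step=1e-5):
    """Approximate the gradient of a black-box scalar function with one
    extra query per input coordinate (plain forward differences)."""
    base = query(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += step
        grad.append((query(shifted) - base) / step)
    return grad


def toy_victim(x):
    """Hypothetical black-box model: only probabilities are exposed."""
    hidden_logits = [2.0 * x[0] - x[1], 0.5 * x[1], -x[0]]
    return softmax(hidden_logits)


def first_recovered_logit(x):
    """Scalar objective built purely from the victim's probability output."""
    return recover_centered_logits(toy_victim(x))[0]


true_logits = [1.5, -0.3, 0.8]
recovered = recover_centered_logits(softmax(true_logits))
grad_estimate = forward_difference_gradient(first_recovered_logit, [0.2, -0.4])
```

Here `recovered` matches `true_logits` after subtracting their mean, and `grad_estimate` approximates the gradient of the first centered logit using queries alone. Coordinate-wise differences cost one query per input dimension, which is why practical attacks on image-sized inputs tend to rely on cheaper randomized estimators.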
Submitted 31 March, 2021; v1 submitted 30 November, 2020;
originally announced November 2020.
-
Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers
Authors:
Shijian Li,
Robert J. Walls,
Tian Guo
Abstract:
Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration---e.g., server type and number---for different training workloads while balancing the trade-offs in training time, cost, and model accuracy. Adding to the complexity is the potential to reduce the monetary cost by using cheaper, but revocable, transient GPU servers.
In this work, we analyze distributed training performance under diverse cluster configurations using CM-DARE, a cloud-based measurement and training framework. Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers. We also demonstrate the feasibility of predicting training speed and overhead using regression-based models. Finally, we discuss potential use cases of our performance modeling such as detecting and mitigating performance bottlenecks.
Submitted 6 April, 2020;
originally announced April 2020.
-
DRAB-LOCUS: An Area-Efficient AES Architecture for Hardware Accelerator Co-Location on FPGAs
Authors:
Jacob T. Grycel,
Robert J. Walls
Abstract:
Advanced Encryption Standard (AES) implementations on Field Programmable Gate Arrays (FPGA) commonly focus on maximizing throughput at the cost of utilizing high volumes of FPGA slice logic. High resource usage limits systems' abilities to implement other functions (such as video processing or machine learning) that may want to share the same FPGA resources. In this paper, we address the shared resource challenge by proposing and evaluating a low-area, but high-throughput, AES architecture. In contrast to existing work, our DSP/RAM-Based Low-CLB Usage (DRAB-LOCUS) architecture leverages block RAM tiles and Digital Signal Processing (DSP) slices to implement the AES Sub Bytes, Mix Columns, and Add Round Key sub-round transformations, reducing resource usage by a factor of 3 over traditional approaches. To achieve area-efficiency, we built an inner-pipelined architecture using the internal registers of block RAM tiles and DSP slices. Our DRAB-LOCUS architecture features a 12-stage pipeline capable of producing 7.055 Gbps of interleaved encrypted or decrypted data, and only uses 909 Look Up tables, 593 Flip Flops, 16 block RAMs, and 18 DSP slices in the target device.
Submitted 11 November, 2019;
originally announced November 2019.
-
Silhouette: Efficient Protected Shadow Stacks for Embedded Systems
Authors:
Jie Zhou,
Yufei Du,
Zhuojia Shen,
Lele Ma,
John Criswell,
Robert J. Walls
Abstract:
Microcontroller-based embedded systems are increasingly used for applications that can have serious and immediate consequences if compromised---including automobile control systems, smart locks, drones, and implantable medical devices. Due to resource and execution-time constraints, C is the primary language used for programming these devices. Unfortunately, C is neither type-safe nor memory-safe, and control-flow hijacking remains a prevalent threat.
This paper presents Silhouette: a compiler-based defense that efficiently guarantees the integrity of return addresses, significantly reducing the attack surface for control-flow hijacking. Silhouette combines an incorruptible shadow stack for return addresses with checks on forward control flow and memory protection to ensure that all functions return to the correct dynamic caller. To protect its shadow stack, Silhouette uses store hardening, an efficient intra-address space isolation technique targeting various ARM architectures that leverages special store instructions found on ARM processors.
We implemented Silhouette for the ARMv7-M architecture, but our techniques are applicable to other common embedded ARM architectures. Our evaluation shows that Silhouette incurs a geometric mean of 1.3% and 3.4% performance overhead on two benchmark suites. Furthermore, we prototyped Silhouette-Invert, an alternative implementation of Silhouette, which incurs just 0.3% and 1.9% performance overhead, at the cost of a minor hardware change.
Submitted 25 June, 2020; v1 submitted 26 October, 2019;
originally announced October 2019.
-
Confidential Deep Learning: Executing Proprietary Models on Untrusted Devices
Authors:
Peter M. VanNostrand,
Ioannis Kyriazis,
Michelle Cheng,
Tian Guo,
Robert J. Walls
Abstract:
Performing deep learning on end-user devices provides fast offline inference results and can help protect the user's privacy. However, running models on untrusted client devices reveals model information which may be proprietary, i.e., the operating system or other applications on end-user devices may be manipulated to copy and redistribute this information, infringing on the model provider's intellectual property. We propose the use of ARM TrustZone, a hardware-based security feature present in most phones, to confidentially run a proprietary model on an untrusted end-user device. We explore the limitations and design challenges of using TrustZone and examine potential approaches for confidential deep learning within this environment. Of particular interest is providing robust protection of proprietary model information while minimizing total performance overhead.
Submitted 28 August, 2019;
originally announced August 2019.
-
ERHARD-RNG: A Random Number Generator Built from Repurposed Hardware in Embedded Systems
Authors:
Jacob Grycel,
Robert J. Walls
Abstract:
Quality randomness is fundamental to cryptographic operations but on embedded systems good sources are (seemingly) hard to find. Rather than use expensive custom hardware, our ERHARD-RNG Pseudo-Random Number Generator (PRNG) utilizes entropy sources that are already common in a range of low-cost embedded platforms. We empirically evaluate the entropy provided by three sources---SRAM startup state, oscillator jitter, and device temperature---and integrate those sources into a full Pseudo-Random Number Generator implementation based on Fortuna. Our system addresses a number of fundamental challenges affecting random number generation on embedded systems. For instance, we propose SRAM startup state as a means to efficiently generate the initial seed---even for systems that do not have writeable storage. Further, the system's use of oscillator jitter allows for the continuous collection of entropy-generating events---even for systems that do not have the user-generated events that are commonly used in general-purpose systems for entropy, e.g., key presses or network events.
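To make the Fortuna-based design more concrete, the sketch below shows a Fortuna-style entropy accumulator: events from each source are spread round-robin over 32 pools, and pool i contributes to a reseed only when 2^i divides the reseed number. This is a simplified illustration, not the ERHARD-RNG implementation. The source identifiers and event bytes are hypothetical, and the output stage uses SHA-256 in counter mode as a stand-in for the AES-based generator of real Fortuna, so it should not be treated as a vetted cryptographic construction.

```python
import hashlib

NUM_POOLS = 32  # Fortuna uses 32 entropy pools


class FortunaStyleAccumulator:
    """Simplified Fortuna-style entropy pooling and reseeding."""

    def __init__(self):
        self.pools = [hashlib.sha256() for _ in range(NUM_POOLS)]
        self.next_pool = {}        # per-source round-robin position
        self.reseed_count = 0
        self.key = bytes(32)
        self.counter = 0

    def add_event(self, source_id, data):
        """Mix one entropy event into the next pool for this source."""
        if not 1 <= len(data) <= 32:
            raise ValueError("event data must be 1..32 bytes")
        index = self.next_pool.get(source_id, 0)
        self.pools[index].update(bytes([source_id, len(data)]) + data)
        self.next_pool[source_id] = (index + 1) % NUM_POOLS

    @staticmethod
    def pools_for_reseed(reseed_number):
        """Pool i is used when 2**i divides the reseed number."""
        return [i for i in range(NUM_POOLS) if reseed_number % (2 ** i) == 0]

    def reseed(self):
        """Fold the selected pools into the generator key and reset them."""
        self.reseed_count += 1
        material = self.key
        for i in self.pools_for_reseed(self.reseed_count):
            material += self.pools[i].digest()
            self.pools[i] = hashlib.sha256()
        self.key = hashlib.sha256(material).digest()

    def random_bytes(self, n):
        """Hash-counter output stage (stand-in for Fortuna's AES-CTR)."""
        out = b""
        while len(out) < n:
            self.counter += 1
            block = self.key + self.counter.to_bytes(16, "little")
            out += hashlib.sha256(block).digest()
        # Change the key after each request so earlier output cannot be
        # recomputed from a later key compromise.
        self.key = hashlib.sha256(self.key + b"rekey").digest()
        return out[:n]


# Hypothetical source identifiers and sample readings.
SRAM_STARTUP, OSCILLATOR_JITTER, TEMPERATURE = 0, 1, 2

rng = FortunaStyleAccumulator()
rng.add_event(SRAM_STARTUP, bytes.fromhex("5ac3910f"))
rng.add_event(OSCILLATOR_JITTER, bytes.fromhex("07"))
rng.add_event(TEMPERATURE, bytes.fromhex("1b2e"))
rng.reseed()
sample = rng.random_bytes(16)
```

The pool schedule means pool 0 is drained on every reseed while higher-numbered pools accumulate entropy over progressively longer intervals, which lets the generator recover even if an attacker can predict some of the event sources.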
Submitted 11 November, 2019; v1 submitted 22 March, 2019;
originally announced March 2019.
-
Speeding up Deep Learning with Transient Servers
Authors:
Shijian Li,
Robert J. Walls,
Lijie Xu,
Tian Guo
Abstract:
Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable---e.g., for rapidly evaluating new model designs---they often come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high costs.
We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers with a speedup of 7.7X with more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest the need for frameworks to dynamically change cluster configurations to best take advantage of current conditions.
Submitted 5 May, 2019; v1 submitted 28 February, 2019;
originally announced March 2019.