Search | arXiv e-print repository

Providing High-Performance Execution with a Sequential Contract for Cryptographic Programs

Authors: Ali Hajiabadi, Trevor E. Carlson

Abstract: Constant-time programming is a widely deployed approach to harden cryptographic programs against side channel attacks. However, modern processors violate the underlying assumptions of constant-time policies by speculatively executing unintended paths of the program. In this work, we propose Cassandra, a novel hardware-software mechanism to protect constant-time cryptographic code against specula… ▽ More Constant-time programming is a widely deployed approach to harden cryptographic programs against side channel attacks. However, modern processors violate the underlying assumptions of constant-time policies by speculatively executing unintended paths of the program. In this work, we propose Cassandra, a novel hardware-software mechanism to protect constant-time cryptographic code against speculative control flow based attacks. Cassandra explores the radical design point of disabling the branch predictor and recording-and-replaying sequential control flow of the program. Two key insights that enable our design are that (1) the sequential control flow of a constant-time program is constant over different runs, and (2) cryptographic programs are highly looped and their control flow patterns repeat in a highly compressible way. These insights allow us to perform an offline branch analysis that significantly compresses control flow traces. We add a small component to a typical processor design, the Branch Trace Unit, to store compressed traces and determine fetch redirections according to the sequential model of the program. Moreover, we provide a formal security analysis and prove that our methodology adheres to a strong security contract by design. Despite providing a higher security guarantee, Cassandra counter-intuitively improves performance by 1.77% by eliminating branch misprediction penalties. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: 17 pages, 7 figures, 4 tables

arXiv:2405.12513 [pdf, other]

Fully Randomized Pointers

Authors: Gregory J. Duck, Sai Dhawal Phaye, Roland H. C. Yap, Trevor E. Carlson

Abstract: Software security continues to be a critical concern for programs implemented in low-level programming languages such as C and C++. Many defenses have been proposed in the current literature, each with different trade-offs including performance, compatibility, and attack resistance. One general class of defense is pointer randomization or authentication, where invalid object access (e.g., memory e… ▽ More Software security continues to be a critical concern for programs implemented in low-level programming languages such as C and C++. Many defenses have been proposed in the current literature, each with different trade-offs including performance, compatibility, and attack resistance. One general class of defense is pointer randomization or authentication, where invalid object access (e.g., memory errors) is obfuscated or denied. Many defenses rely on the program termination (e.g., crashing) to abort attacks, with the implicit assumption that an adversary cannot "brute force" the defense with multiple attack attempts. However, such assumptions do not always hold, such as hardware speculative execution attacks or network servers configured to restart on error. In such cases, we argue that most existing defenses provide only weak effective security. In this paper, we propose Fully Randomized Pointers (FRP) as a stronger memory error defense that is resistant to even brute force attacks. The key idea is to fully randomize pointer bits -- as much as possible while also preserving binary compatibility -- rendering the relationships between pointers highly unpredictable. Furthermore, the very high degree of randomization renders brute force attacks impractical -- providing strong effective security compared to existing work. We design a new FRP encoding that is: (1) compatible with existing binary code (without recompilation); (2) decoupled from the underlying object layout; and (3) can be efficiently decoded on-the-fly to the underlying memory address. We prototype FRP in the form of a software implementation (BlueFat) to test security and compatibility, and a proof-of-concept hardware implementation (GreenFat) to evaluate performance. We show that FRP is secure, practical, and compatible at the binary level, while a hardware implementation can achieve low performance overheads (<10%). △ Less

Submitted 21 May, 2024; originally announced May 2024.

Comments: 24 pages, 3 figures

arXiv:2310.17089 [pdf, other]

Pac-Sim: Simulation of Multi-threaded Workloads using Intelligent, Live Sampling

Authors: Changxi Liu, Alen Sabu, Akanksha Chaudhari, Qingxuan Kang, Trevor E. Carlson

Abstract: High-performance, multi-core processors are the key to accelerating workloads in several application domains. To continue to scale performance at the limit of Moore's Law and Dennard scaling, software and hardware designers have turned to dynamic solutions that adapt to the needs of applications in a transparent, automatic way. For example, modern hardware improves its performance and power effici… ▽ More High-performance, multi-core processors are the key to accelerating workloads in several application domains. To continue to scale performance at the limit of Moore's Law and Dennard scaling, software and hardware designers have turned to dynamic solutions that adapt to the needs of applications in a transparent, automatic way. For example, modern hardware improves its performance and power efficiency by changing the hardware configuration, like the frequency and voltage of cores, according to a number of parameters such as the technology used, the workload running, etc. With this level of dynamism, it is essential to simulate next-generation multi-core processors in a way that can both respond to system changes and accurately determine system performance metrics. Currently, no sampled simulation platform can achieve these goals of dynamic, fast, and accurate simulation of multi-threaded workloads. In this work, we propose a solution that allows for fast, accurate simulation in the presence of both hardware and software dynamism. To accomplish this goal, we present Pac-Sim, a novel sampled simulation methodology for fast, accurate sampled simulation that requires no upfront analysis of the workload. With our proposed methodology, it is now possible to simulate long-running dynamically scheduled multi-threaded programs with significant simulation speedups even in the presence of dynamic hardware events. We evaluate Pac-Sim using the multi-threaded SPEC CPU2017, NPB, and PARSEC benchmarks with both static and dynamic thread scheduling. The experimental results show that Pac-Sim achieves a very low sampling error of 1.63% and 3.81% on average for statically and dynamically scheduled benchmarks, respectively. Pac-Sim also demonstrates significant simulation speedups as high as 523.5$\times$ (210.3$\times$ on average) for the train input set of SPEC CPU2017. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: 14 pages, 14 figures

arXiv:2306.11291 [pdf, other]

Mitigating Speculation-based Attacks through Configurable Hardware/Software Co-design

Authors: Ali Hajiabadi, Archit Agarwal, Andreas Diavastos, Trevor E. Carlson

Abstract: New speculation-based attacks that affect large numbers of modern systems are disclosed regularly. Currently, CPU vendors regularly fall back to heavy-handed mitigations like using barriers or enforcing strict programming guidelines resulting in significant performance overhead. What is missing is a solution that allows for efficient mitigation and is flexible enough to address both current and fu… ▽ More New speculation-based attacks that affect large numbers of modern systems are disclosed regularly. Currently, CPU vendors regularly fall back to heavy-handed mitigations like using barriers or enforcing strict programming guidelines resulting in significant performance overhead. What is missing is a solution that allows for efficient mitigation and is flexible enough to address both current and future speculation vulnerabilities, without additional hardware changes. In this work, we present SpecControl, a novel hardware/software co-design, that enables new levels of security while reducing the performance overhead that has been demonstrated by state-of-the-art methodologies. SpecControl introduces a communication interface that allows compilers and application developers to inform the hardware about true branch dependencies, confidential control-flow instructions, and fine-grained instruction constraints in order to apply restrictions only when necessary. We evaluate SpecControl against known speculative execution attacks and in addition, present a new speculative fetch attack variant on the Pattern History Table (PHT) in branch predictors that shows how similar previously reported vulnerabilities are more dangerous by enabling unprivileged attacks, especially with the state-of-the-art branch predictors. SpecControl provides stronger security guarantees compared to the existing defenses while reducing the performance overhead of two state-of-the-art defenses from 51% and 43% to just 23%. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Comments: 13 pages, 15 figures

arXiv:2306.11195 [pdf, other]

New Cross-Core Cache-Agnostic and Prefetcher-based Side-Channels and Covert-Channels

Authors: Yun Chen, Ali Hajiabadi, Lingfeng Pei, Trevor E. Carlson

Abstract: In this paper, we reveal the existence of a new class of prefetcher, the XPT prefetcher, in the modern Intel processors which has never been officially documented. It speculatively issues a load, bypassing last-level cache (LLC) lookups, when it predicts that a load request will result in an LLC miss. We demonstrate that XPT prefetcher is shared among different cores, which enables an attacker to… ▽ More In this paper, we reveal the existence of a new class of prefetcher, the XPT prefetcher, in the modern Intel processors which has never been officially documented. It speculatively issues a load, bypassing last-level cache (LLC) lookups, when it predicts that a load request will result in an LLC miss. We demonstrate that XPT prefetcher is shared among different cores, which enables an attacker to build cross-core side-channel and covert-channel attacks. We propose PrefetchX, a cross-core attack mechanism, to leak users' sensitive data and activities. We empirically demonstrate that PrefetchX can be used to extract private keys of real-world RSA applications. Furthermore, we show that PrefetchX can enable side-channel attacks that can monitor keystrokes and network traffic patterns of users. Our two cross-core covert-channel attacks also see a low error rate and a 1.7MB/s maximum channel capacity. Due to the cache-independent feature of PrefetchX, current cache-based mitigations are not effective against our attacks. Overall, our work uncovers a significant vulnerability in the XPT prefetcher, which can be exploited to compromise the confidentiality of sensitive information in both crypto and non-crypto-related applications among processor cores. △ Less

Submitted 19 June, 2023; originally announced June 2023.

Comments: 12 pages, 12 figures

arXiv:2302.13863 [pdf, other]

Capstone: A Capability-based Foundation for Trustless Secure Memory Access (Extended Version)

Authors: Jason Zhi**gcheng Yu, Conrad Watt, Aditya Badole, Trevor E. Carlson, Prateek Saxena

Abstract: Capability-based memory isolation is a promising new architectural primitive. Software can access low-level memory only via capability handles rather than raw pointers, which provides a natural interface to enforce security restrictions. Existing architectural capability designs such as CHERI provide spatial safety, but fail to extend to other memory models that security-sensitive software designs… ▽ More Capability-based memory isolation is a promising new architectural primitive. Software can access low-level memory only via capability handles rather than raw pointers, which provides a natural interface to enforce security restrictions. Existing architectural capability designs such as CHERI provide spatial safety, but fail to extend to other memory models that security-sensitive software designs may desire. In this paper, we propose Capstone, a more expressive architectural capability design that supports multiple existing memory isolation models in a trustless setup, i.e., without relying on trusted software components. We show how Capstone is well-suited for environments where privilege boundaries are fluid (dynamically extensible), memory sharing/delegation are desired both temporally and spatially, and where such needs are to be balanced with availability concerns. Capstone can also be implemented efficiently. We present an implementation sketch and through evaluation show that its overhead is below 50% in common use cases. We also prototype a functional emulator for Capstone and use it to demonstrate the runnable implementations of six real-world memory models without trusted software components: three types of enclave-based TEEs, a thread scheduler, a memory allocator, and Rust-style memory safety -- all within the interface of Capstone. △ Less

Submitted 9 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

Comments: 31 pages, 10 figures. This is an extended version of a paper to appear at 32nd USENIX Security Symposium, August 2023; acknowledgments updated

arXiv:2204.09797 [pdf, other]

Multiply-and-Fire (MNF): An Event-driven Sparse Neural Network Accelerator

Authors: Miao Yu, Tingting Xiang, Venkata Pavan Kumar Miriyala, Trevor E. Carlson

Abstract: Machine learning, particularly deep neural network inference, has become a vital workload for many computing systems, from data centers and HPC systems to edge-based computing. As advances in sparsity have helped improve the efficiency of AI acceleration, there is a continued need for improved system efficiency for both high-performance and system-level acceleration. This work takes a unique loo… ▽ More Machine learning, particularly deep neural network inference, has become a vital workload for many computing systems, from data centers and HPC systems to edge-based computing. As advances in sparsity have helped improve the efficiency of AI acceleration, there is a continued need for improved system efficiency for both high-performance and system-level acceleration. This work takes a unique look at sparsity with an event (or activation-driven) approach to ANN acceleration that aims to minimize useless work, improve utilization, and increase performance and energy efficiency. Our analytical and experimental results show that this event-driven solution presents a new direction to enable highly efficient AI inference for both CNN and MLP workloads. This work demonstrates state-of-the-art energy efficiency and performance centring on activation-based sparsity and a highly-parallel dataflow method that improves the overall functional unit utilization (at 30 fps). This work enhances energy efficiency over a state-of-the-art solution by 1.46$\times$. Taken together, this methodology presents a novel, new direction to achieve high-efficiency, high-performance designs for next-generation AI acceleration platforms. △ Less

Submitted 20 April, 2022; originally announced April 2022.

Comments: 12 pages, 9 figures and 5 tables

arXiv:2109.03112 [pdf, other]

Efficient Instruction Scheduling using Real-time Load Delay Tracking

Authors: Andreas Diavastos, Trevor E. Carlson

Abstract: Many hardware structures in today's high-performance out-of-order processors do not scale in an efficient way. To address this, different solutions have been proposed that build execution schedules in an energy-efficient manner. Issue time prediction processors are one such solution that use data-flow dependencies and predefined instruction latencies to predict issue times of repeated instructions… ▽ More Many hardware structures in today's high-performance out-of-order processors do not scale in an efficient way. To address this, different solutions have been proposed that build execution schedules in an energy-efficient manner. Issue time prediction processors are one such solution that use data-flow dependencies and predefined instruction latencies to predict issue times of repeated instructions. In this work, we aim to improve their accuracy, and consequently their performance, in an energy efficient way. We accomplish this by taking advantage of two key observations. First, memory accesses often take additional time to arrive than the static, predefined access latency that is used to describe these systems. Second, we find that these memory access delays often repeat across iterations of the same code. This, in turn, allows us to predict the arrival time of these accesses. In this work, we introduce a new processor microarchitecture, that replaces a complex reservation-station-based scheduler with an efficient, scalable alternative. Our proposed scheduling technique tracks real-time delays of loads to accurately predict instruction issue times, and uses a reordering mechanism to prioritize instructions based on that prediction, achieving close-to-out-of-order processor performance. To accomplish this in an energy-efficient manner we introduce: (1) an instruction delay learning mechanism that monitors repeated load instructions and learns their latest delay, (2) an issue time predictor that uses learned delays and data-flow dependencies to predict instruction issue times and (3) priority queues that reorder instructions based on their issue time prediction. Together, our processor achieves 86.2% of the performance of a traditional out-of-order processor, higher than previous efficient scheduler proposals, while still consuming 30% less power. △ Less

Submitted 7 September, 2021; originally announced September 2021.

Comments: 13 pages, 11 figures, 4 tables

arXiv:2109.00474 [pdf, other]

Leaking Control Flow Information via the Hardware Prefetcher

Authors: Yun Chen, Lingfeng Pei, Trevor E. Carlson

Abstract: Modern processor designs use a variety of microarchitectural methods to achieve high performance. Unfortunately, new side-channels have often been uncovered that exploit these enhanced designs. One area that has received little attention from a security perspective is the processor's hard-ware prefetcher, a critical component used to mitigate DRAM latency in today's systems. Prefetchers, like bran… ▽ More Modern processor designs use a variety of microarchitectural methods to achieve high performance. Unfortunately, new side-channels have often been uncovered that exploit these enhanced designs. One area that has received little attention from a security perspective is the processor's hard-ware prefetcher, a critical component used to mitigate DRAM latency in today's systems. Prefetchers, like branch predictors, hold critical state related to the execution of the application, and have the potential to leak secret information. But up to now, there has not been a demonstration of a generic prefetcher side-channel that could be actively exploited in today's hardware. In this paper, we present AfterImage, a new side-channel that exploits the Intel Instruction Pointer-based stride prefetcher. We observe that, when the execution of the processor switches between different private domains, the prefetcher trained by one domain can be triggered in another. To the best of our knowledge, this work is the first to publicly demonstrate a methodology that is both algorithm-agnostic and also able to leak kernel data into userspace. AfterImage is different from previous works, as it leaks data on the non-speculative path of execution. Because of this, a large class of work that has focused on protecting transient, branch-outcome-based data will be unable to block this side-channel. By reverse-engineering the IP-stride prefetcher in modern Intel processors, we have successfully developed three variants of AfterImage to leak control flow information across code regions, processes and the user-kernel boundary. We find a high level of accuracy in leaking information with our methodology (from 91%, up to 99%), and propose two mitigation techniques to block this side-channel, one of which can be used on hardware systems today. △ Less

Submitted 1 September, 2021; originally announced September 2021.

Comments: 16 pages, 14 figures, 8 listings

arXiv:2107.11336 [pdf, other]

Mitigating Power Attacks through Fine-Grained Instruction Reordering

Authors: Yun Chen, Ali Hajiabadi, Romain Poussier, Andreas Diavastos, Shivam Bhasin, Trevor E. Carlson

Abstract: Side-channel attacks are a security exploit that take advantage of information leakage. They use measurement and analysis of physical parameters to reverse engineer and extract secrets from a system. Power analysis attacks in particular, collect a set of power traces from a computing device and use statistical techniques to correlate this information with the attacked application data and source c… ▽ More Side-channel attacks are a security exploit that take advantage of information leakage. They use measurement and analysis of physical parameters to reverse engineer and extract secrets from a system. Power analysis attacks in particular, collect a set of power traces from a computing device and use statistical techniques to correlate this information with the attacked application data and source code. Counter measures like just-in-time compilation, random code injection and instruction descheduling obfuscate the execution of instructions to reduce the security risk. Unfortunately, due to the randomness and excess instructions executed by these solutions, they introduce large overheads in performance, power and area. In this work we propose a scheduling algorithm that dynamically reorders instructions in an out-of-order processor to provide obfuscated execution and mitigate power analysis attacks with little-to-no effect on the performance, power or area of the processor. We exploit the time between operand availability of critical instructions (slack) to create high-performance random schedules without requiring additional instructions or static prescheduling. Further, we perform an extended security analysis using different attacks. We highlight the dangers of using incorrect adversarial assumptions, which can often lead to a false sense of security. In that regard, our advanced security metric demonstrates improvements of 34$\times$, while our basic security evaluation shows results up to 261$\times$. Moreover, our system achieves performance within 96% on average, of the baseline unprotected processor. △ Less

Submitted 23 July, 2021; originally announced July 2021.

Comments: 13 pages, 12 figures

arXiv:2010.08440 [pdf, other]

Elasticlave: An Efficient Memory Model for Enclaves

Authors: Zhi**gcheng Yu, Shweta Shinde, Trevor E. Carlson, Prateek Saxena

Abstract: Trusted-execution environments (TEE), like Intel SGX, isolate user-space applications into secure enclaves without trusting the OS. Thus, TEEs reduce the trusted computing base, but add one to two orders of magnitude slow-down. The performance cost stems from a strict memory model, which we call the spatial isolation model, where enclaves cannot share memory regions with each other. In this work,… ▽ More Trusted-execution environments (TEE), like Intel SGX, isolate user-space applications into secure enclaves without trusting the OS. Thus, TEEs reduce the trusted computing base, but add one to two orders of magnitude slow-down. The performance cost stems from a strict memory model, which we call the spatial isolation model, where enclaves cannot share memory regions with each other. In this work, we present Elasticlave---a new TEE memory model that allows enclaves to selectively and temporarily share memory with other enclaves and the OS. Elasticlave eliminates the need for expensive data copy operations, while offering the same level of application-desired security as possible with the spatial model. We prototype Elasticlave design on an RTL-designed cycle-level RISC-V core and observe 1 to 2 orders of magnitude performance improvements over the spatial model implemented with the same processor configuration. Elasticlave has a small TCB. We find that its performance characteristics and hardware area footprint scale well with the number of shared memory regions it is configured to support. △ Less

Submitted 16 October, 2020; originally announced October 2020.

arXiv:2008.07171 [pdf, other]

CARGO : Context Augmented Critical Region Offload for Network-bound datacenter Workloads

Authors: Siddharth Rai, Trevor E. Carlson

Abstract: Network bound applications, like a database server executing OLTP queries or a caching server storing objects for a dynamic web applications, are essential services that consumers and businesses use daily. These services run on a large datacenters and are required to meet predefined Service Level Objectives (SLO), or latency targets, with high probability. Thus, efficient datacenter applications s… ▽ More Network bound applications, like a database server executing OLTP queries or a caching server storing objects for a dynamic web applications, are essential services that consumers and businesses use daily. These services run on a large datacenters and are required to meet predefined Service Level Objectives (SLO), or latency targets, with high probability. Thus, efficient datacenter applications should optimize their execution in terms of power and performance. However, to support large scale data storage, these workloads make heavy use of pointer connected data structures (e.g., hash table, large fan-out tree, trie) and exhibit poor instruction and memory level parallelism. Our experiments show that due to long memory access latency, these workloads occupy processor resources (e.g., ROB entries, RS buffers, LS queue entries etc.) for a prolonged period of time that delay the processing of subsequent requests. Delayed execution not only increases request processing latency, but also severely effects an application throughput and power-efficiency. To overcome this limitation, we present CARGO, a novel mechanism to overlap queuing latency and request processing by executing select instructions on an application critical path at the network interface card (NIC) while requests wait for processor resources to become available. Our mechanism dynamically identifies the critical instructions and includes the register state needed to compute the long latency memory accesses. This context-augmented critical region is often executed at the NIC well before execution begins at the core, effectively prefetching the data ahead of time. Across a variety of interactive datacenter applications, our proposal improves latency, throughput, and power efficiency by 2.7X, 2.7X, and 1.5X, respectively, while incurring a modest amount storage overhead. △ Less

Submitted 17 August, 2020; originally announced August 2020.

arXiv:2007.12934 [pdf, other]

SOTERIA: In Search of Efficient Neural Networks for Private Inference

Authors: Anshul Aggarwal, Trevor E. Carlson, Reza Shokri, Shruti Tople

Abstract: ML-as-a-service is gaining popularity where a cloud server hosts a trained model and offers prediction (inference) service to users. In this setting, our objective is to protect the confidentiality of both the users' input queries as well as the model parameters at the server, with modest computation and communication overhead. Prior solutions primarily propose fine-tuning cryptographic methods to… ▽ More ML-as-a-service is gaining popularity where a cloud server hosts a trained model and offers prediction (inference) service to users. In this setting, our objective is to protect the confidentiality of both the users' input queries as well as the model parameters at the server, with modest computation and communication overhead. Prior solutions primarily propose fine-tuning cryptographic methods to make them efficient for known fixed model architectures. The drawback with this line of approach is that the model itself is never designed to operate with existing efficient cryptographic computations. We observe that the network architecture, internal functions, and parameters of a model, which are all chosen during training, significantly influence the computation and communication overhead of a cryptographic method, during inference. Based on this observation, we propose SOTERIA -- a training method to construct model architectures that are by-design efficient for private inference. We use neural architecture search algorithms with the dual objective of optimizing the accuracy of the model and the overhead of using cryptographic primitives for secure inference. Given the flexibility of modifying a model during training, we find accurate models that are also efficient for private computation. We select garbled circuits as our underlying cryptographic primitive, due to their expressiveness and efficiency, but this approach can be extended to hybrid multi-party computation settings. We empirically evaluate SOTERIA on MNIST and CIFAR10 datasets, to compare with the prior work. Our results confirm that SOTERIA is indeed effective in balancing performance and accuracy. △ Less

Submitted 25 July, 2020; originally announced July 2020.

arXiv:2006.09982 [pdf, other]

You Only Spike Once: Improving Energy-Efficient Neuromorphic Inference to ANN-Level Accuracy

Authors: Srivatsa P, Kyle Timothy Ng Chu, Burin Amornpaisannon, Yaswanth Tavva, Venkata Pavan Kumar Miriyala, Jibin Wu, Malu Zhang, Haizhou Li, Trevor E. Carlson

Abstract: In the past decade, advances in Artificial Neural Networks (ANNs) have allowed them to perform extremely well for a wide range of tasks. In fact, they have reached human parity when performing image recognition, for example. Unfortunately, the accuracy of these ANNs comes at the expense of a large number of cache and/or memory accesses and compute operations. Spiking Neural Networks (SNNs), a type… ▽ More In the past decade, advances in Artificial Neural Networks (ANNs) have allowed them to perform extremely well for a wide range of tasks. In fact, they have reached human parity when performing image recognition, for example. Unfortunately, the accuracy of these ANNs comes at the expense of a large number of cache and/or memory accesses and compute operations. Spiking Neural Networks (SNNs), a type of neuromorphic, or brain-inspired network, have recently gained significant interest as power-efficient alternatives to ANNs, because they are sparse, accessing very few weights, and typically only use addition operations instead of the more power-intensive multiply-and-accumulate (MAC) operations. The vast majority of neuromorphic hardware designs support rate-encoded SNNs, where the information is encoded in spike rates. Rate-encoded SNNs could be seen as inefficient as an encoding scheme because it involves the transmission of a large number of spikes. A more efficient encoding scheme, Time-To-First-Spike (TTFS) encoding, encodes information in the relative time of arrival of spikes. While TTFS-encoded SNNs are more efficient than rate-encoded SNNs, they have, up to now, performed poorly in terms of accuracy compared to previous methods. Hence, in this work, we aim to overcome the limitations of TTFS-encoded neuromorphic systems. To accomplish this, we propose: (1) a novel optimization algorithm for TTFS-encoded SNNs converted from ANNs and (2) a novel hardware accelerator for TTFS-encoded SNNs, with a scalable and low-power design. Overall, our work in TTFS encoding and training improves the accuracy of SNNs to achieve state-of-the-art results on MNIST MLPs, while reducing power consumption by 1.46$\times$ over the state-of-the-art neuromorphic hardware. △ Less

Submitted 8 November, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

Comments: 10 pages, 4 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. This work is an extended version of the paper accepted to the 2nd Workshop on Accelerated Machine Learning (AccML 2020)

arXiv:2003.11837 [pdf, other]

Rectified Linear Postsynaptic Potential Function for Backpropagation in Deep Spiking Neural Networks

Authors: Malu Zhang, Jiadong Wang, Burin Amornpaisannon, Zhixuan Zhang, VPK Miriyala, Ammar Belatreche, Hong Qu, Jibin Wu, Yansong Chua, Trevor E. Carlson, Haizhou Li

Abstract: Spiking Neural Networks (SNNs) use spatio-temporal spike patterns to represent and transmit information, which is not only biologically realistic but also suitable for ultra-low-power event-driven neuromorphic implementation. Motivated by the success of deep learning, the study of Deep Spiking Neural Networks (DeepSNNs) provides promising directions for artificial intelligence applications. Howeve… ▽ More Spiking Neural Networks (SNNs) use spatio-temporal spike patterns to represent and transmit information, which is not only biologically realistic but also suitable for ultra-low-power event-driven neuromorphic implementation. Motivated by the success of deep learning, the study of Deep Spiking Neural Networks (DeepSNNs) provides promising directions for artificial intelligence applications. However, training of DeepSNNs is not straightforward because the well-studied error back-propagation (BP) algorithm is not directly applicable. In this paper, we first establish an understanding as to why error back-propagation does not work well in DeepSNNs. To address this problem, we propose a simple yet efficient Rectified Linear Postsynaptic Potential function (ReL-PSP) for spiking neurons and propose a Spike-Timing-Dependent Back-Propagation (STDBP) learning algorithm for DeepSNNs. In STDBP algorithm, the timing of individual spikes is used to convey information (temporal coding), and learning (back-propagation) is performed based on spike timing in an event-driven manner. Our experimental results show that the proposed learning algorithm achieves state-of-the-art classification accuracy in single spike time based learning algorithms of DeepSNNs. Furthermore, by utilizing the trained model parameters obtained from the proposed STDBP learning algorithm, we demonstrate the ultra-low-power inference operations on a recently proposed neuromorphic inference accelerator. Experimental results show that the neuromorphic hardware consumes 0.751~mW of the total power consumption and achieves a low latency of 47.71~ms to classify an image from the MNIST dataset. Overall, this work investigates the contribution of spike timing dynamics to information encoding, synaptic plasticity and decision making, providing a new perspective to design of future DeepSNNs and neuromorphic hardware systems. △ Less

Submitted 3 November, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

Comments: This work has been submitted to the IEEE for possible publication. Copyrightmay be transferred without notice, after which this version may no longer beaccessible

arXiv:1903.05488 [pdf, other]

doi 10.1016/j.future.2017.10.044

Power-Performance Tradeoffs in Data Center Servers: DVFS, CPU pinning, Horizontal, and Vertical Scaling

Authors: Jakub Krzywda, Ahmed Ali-Eldin, Trevor E. Carlson, Per-Olov Östberg, Erik Elmroth

Abstract: Dynamic Voltage and Frequency Scaling (DVFS), CPU pinning, horizontal, and vertical scaling, are four techniques that have been proposed as actuators to control the performance and energy consumption on data center servers. This work investigates the utility of these four actuators, and quantifies the power-performance tradeoffs associated with them. Using replicas of the German Wikipedia running… ▽ More Dynamic Voltage and Frequency Scaling (DVFS), CPU pinning, horizontal, and vertical scaling, are four techniques that have been proposed as actuators to control the performance and energy consumption on data center servers. This work investigates the utility of these four actuators, and quantifies the power-performance tradeoffs associated with them. Using replicas of the German Wikipedia running on our local testbed, we perform a set of experiments to quantify the influence of DVFS, vertical and horizontal scaling, and CPU pinning on end-to-end response time (average and tail), throughput, and power consumption with different workloads. Results of the experiments show that DVFS rarely reduces the power consumption of underloaded servers by more than 5%, but it can be used to limit the maximal power consumption of a saturated server by up to 20% (at a cost of performance degradation). CPU pinning reduces the power consumption of underloaded server (by up to 7%) at the cost of performance degradation, which can be limited by choosing an appropriate CPU pinning scheme. Horizontal and vertical scaling improves both the average and tail response time, but the improvement is not proportional to the amount of resources added. The load balancing strategy has a big impact on the tail response time of horizontally scaled applications. △ Less

Submitted 13 March, 2019; originally announced March 2019.

Comments: 31 pages

Journal ref: Future Generation Computer Systems, Elsevier, Vol. 81, pp. 114-128, 2018

Showing 1–16 of 16 results for author: Carlson, T E