LiveMind: Low-latency Large Language Models with Simultaneous Inference

Authors: Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li

Abstract: In this paper, we introduce a novel low-latency inference framework for large language models (LLMs) inference which enables LLMs to perform inferences with incomplete prompts. By reallocating computational processes to prompt input phase, we achieve a substantial reduction in latency, thereby significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the visibility of the streaming prompt to the model, allowing it to infer from incomplete prompts or await additional prompts. Compared with traditional inference methods that utilize complete prompts, our approach demonstrates an average reduction of 59% in response latency on the MMLU-Pro dataset, while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing an LLM for inference and a small language model (SLM) for output, we achieve an average 68% reduction in response latency, alongside a 5.5% improvement in accuracy on the MMLU-Pro dataset compared with the SLM baseline. For long prompts exceeding 20 sentences, the response latency can be reduced by up to 93%.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.08413 [pdf, other]

Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference

Authors: Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura

Abstract: Large language models (LLMs) have recently transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations. This development necessitates speed, efficiency, and accessibility in LLM inference as the computational and memory requirements of these systems grow exponentially. Meanwhile, advancements in computing and memory capabilities are lagging behind, exacerbated by the discontinuation of Moore's law. With LLMs exceeding the capacity of single GPUs, they require complex, expert-level configurations for parallel processing. Memory accesses become significantly more expensive than computation, posing a challenge for efficient scaling, known as the memory wall. Here, compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by directly performing analog computations in memory, potentially reducing latency and power consumption. By closely integrating memory and compute elements, CIM eliminates the von Neumann bottleneck, reducing data movement and improving energy efficiency. This survey paper provides an overview and analysis of transformer-based models, reviewing various CIM architectures and exploring how they can address the imminent challenges of modern AI computing systems. We discuss transformer-related operators and their hardware acceleration schemes and highlight challenges, trends, and insights in corresponding CIM designs.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2402.18595 [pdf, other]

EncodingNet: A Novel Encoding-based MAC Design for Efficient Neural Network Acceleration

Authors: Bo Liu, Grace Li Zhang, Xunzhao Yin, Ulf Schlichtmann, Bing Li

Abstract: Deep neural networks (DNNs) have achieved great breakthroughs in many fields such as image classification and natural language processing. However, the execution of DNNs needs to conduct massive numbers of multiply-accumulate (MAC) operations on hardware and thus incurs a large power consumption. To address this challenge, we propose a novel digital MAC design based on encoding. In this new design, the multipliers are replaced by simple logic gates to project the results onto a wide bit representation. These bits carry individual position weights, which can be trained for specific neural networks to enhance inference accuracy. The outputs of the new multipliers are added by bit-wise weighted accumulation and the accumulation results are compatible with existing computing platforms accelerating neural networks with either uniform or non-uniform quantization. Since the multiplication function is replaced by simple logic projection, the critical paths in the resulting circuits become much shorter. Correspondingly, pipelining stages in the MAC array can be reduced, leading to a significantly smaller area as well as a better power efficiency. The proposed design has been synthesized and verified by ResNet18-Cifar10, ResNet20-Cifar100 and ResNet50-ImageNet. The experimental results confirmed the reduction of circuit area by up to 79.63% and the reduction of power consumption of executing DNNs by up to 70.18%, while the accuracy of the neural networks can still be well maintained.

Submitted 25 February, 2024; originally announced February 2024.

arXiv:2309.10510 [pdf, other]

Logic Design of Neural Networks for High-Throughput and Low-Power Applications

Authors: Kangwei Xu, Grace Li Zhang, Ulf Schlichtmann, Bing Li

Abstract: Neural networks (NNs) have been successfully deployed in various fields. In NNs, a large number of multiply-accumulate (MAC) operations need to be performed. Most existing digital hardware platforms rely on parallel MAC units to accelerate these MAC operations. However, under a given area constraint, the number of MAC units in such platforms is limited, so MAC units have to be reused to perform MAC operations in a neural network. Accordingly, the throughput in generating classification results is not high, which prevents the application of traditional hardware platforms in extreme-throughput scenarios. Besides, the power consumption of such platforms is also high, mainly due to data movement. To overcome this challenge, in this paper, we propose to flatten and implement all the operations at neurons, e.g., MAC and ReLU, in a neural network with their corresponding logic circuits. To improve the throughput and reduce the power consumption of such logic designs, the weight values are embedded into the MAC units to simplify the logic, which can reduce the delay of the MAC units and the power consumption incurred by weight movement. The retiming technique is further used to improve the throughput of the logic circuits for neural networks. In addition, we propose a hardware-aware training method to reduce the area of logic designs of neural networks. Experimental results demonstrate that the proposed logic designs can achieve high throughput and low power consumption for several high-throughput applications.

Submitted 19 September, 2023; originally announced September 2023.

Comments: accepted by ASPDAC 2024

arXiv:2306.08951 [pdf, other]

MLonMCU: TinyML Benchmarking with Fast Retargeting

Authors: Philipp van Kempen, Rafael Stahl, Daniel Mueller-Gritschneder, Ulf Schlichtmann

Abstract: While there exist many ways to deploy machine learning models on microcontrollers, it is non-trivial to choose the optimal combination of frameworks and targets for a given application. Thus, automating the end-to-end benchmarking flow is of high relevance nowadays. A tool called MLonMCU is proposed in this paper and demonstrated by benchmarking the state-of-the-art TinyML frameworks TFLite for Microcontrollers and TVM effortlessly with a large number of configurations in a low amount of time.

Submitted 15 June, 2023; originally announced June 2023.

Comments: CODAI 2022 Workshop - Embedded System Week (ESWeek)

arXiv:2306.07294 [pdf, other]

Computational and Storage Efficient Quadratic Neurons for Deep Neural Networks

Authors: Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li

Abstract: Deep neural networks (DNNs) have been widely deployed across diverse domains such as computer vision and natural language processing. However, the impressive accomplishments of DNNs have been realized alongside extensive computational demands, thereby impeding their applicability on resource-constrained devices. To address this challenge, many researchers have been focusing on basic neuron structures, the fundamental building blocks of neural networks, to alleviate the computational and storage cost. In this work, an efficient quadratic neuron architecture distinguished by its enhanced utilization of second-order computational information is introduced. By virtue of their better expressivity, DNNs employing the proposed quadratic neurons can attain similar accuracy with fewer neurons and computational cost. Experimental results have demonstrated that the proposed quadratic neuron structure exhibits superior computational and storage efficiency across various tasks when compared with both linear and non-linear neurons in prior work.

Submitted 27 November, 2023; v1 submitted 10 June, 2023; originally announced June 2023.

Comments: Accepted by Design Automation and Test in Europe (DATE) 2024

arXiv:2303.17878 [pdf, other]

Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural Network Inference

Authors: Rafael Stahl, Daniel Mueller-Gritschneder, Ulf Schlichtmann

Abstract: Memory optimization for deep neural network (DNN) inference gains high relevance with the emergence of TinyML, which refers to the deployment of DNN inference tasks on tiny, low-power microcontrollers. Applications such as audio keyword detection or radar-based gesture recognition are heavily constrained by the limited memory on such tiny devices because DNN inference requires large intermediate run-time buffers to store activations and other intermediate data, which leads to high memory usage. In this paper, we propose a new Fused Depthwise Tiling (FDT) method for the memory optimization of DNNs, which, compared to existing tiling methods, reduces memory usage without inducing any run time overhead. FDT applies to a larger variety of network layers than existing tiling methods that focus on convolutions. It improves TinyML memory optimization significantly by reducing memory of models where this was not possible before and additionally providing alternative design points for models that show high run time overhead with existing methods. In order to identify the best tiling configuration, an end-to-end flow with a new path discovery method is proposed, which applies FDT and existing tiling methods in a fully automated way, including the scheduling of the operations and planning of the layout of buffers in memory. Out of seven evaluated models, FDT achieved significant memory reduction for two models by 76.2% and 18.1% where existing tiling methods could not be applied. Two other models showed a significant run time overhead with existing methods and FDT provided alternative design points with no overhead but reduced memory savings.

Submitted 31 March, 2023; originally announced March 2023.

Comments: Accepted as a full paper by the TinyML Research Symposium 2023

ACM Class: F.2.2; I.2.8

arXiv:2303.13997 [pdf, other]

PowerPruning: Selecting Weights and Activations for Power-Efficient Neural Network Acceleration

Authors: Richard Petri, Grace Li Zhang, Yiran Chen, Ulf Schlichtmann, Bing Li

Abstract: Deep neural networks (DNNs) have been successfully applied in various fields. A major challenge of deploying DNNs, especially on edge devices, is power consumption, due to the large number of multiply-and-accumulate (MAC) operations. To address this challenge, we propose PowerPruning, a novel method to reduce power consumption in digital neural network accelerators by selecting weights that lead to less power consumption in MAC operations. In addition, the timing characteristics of the selected weights together with all activation transitions are evaluated. The weights and activations that lead to small delays are further selected. Consequently, the maximum delay of the sensitized circuit paths in the MAC units is reduced even without modifying MAC units, which thus allows a flexible scaling of supply voltage to reduce power consumption further. Together with retraining, the proposed method can reduce power consumption of DNNs on hardware by up to 78.3% with only a slight accuracy loss.

Submitted 27 November, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

Comments: accepted by Design Automation Conference (DAC) 2023

arXiv:2212.14337 [pdf, other]

Biologically Plausible Learning on Neuromorphic Hardware Architectures

Authors: Christopher Wolters, Brady Taylor, Edward Hanson, Xiaoxuan Yang, Ulf Schlichtmann, Yiran Chen

Abstract: With an ever-growing number of parameters defining increasingly complex networks, Deep Learning has led to several breakthroughs surpassing human performance. As a result, data movement for these millions of model parameters causes a growing imbalance known as the memory wall. Neuromorphic computing is an emerging paradigm that confronts this imbalance by performing computations directly in analog memories. On the software side, the sequential Backpropagation algorithm prevents efficient parallelization and thus fast convergence. A novel method, Direct Feedback Alignment, resolves inherent layer dependencies by directly passing the error from the output to each layer. At the intersection of hardware/software co-design, there is a demand for developing algorithms that are tolerable to hardware nonidealities. Therefore, this work explores the interrelationship of implementing bio-plausible learning in-situ on neuromorphic hardware, emphasizing energy, area, and latency constraints. Using the benchmarking framework DNN+NeuroSim, we investigate the impact of hardware nonidealities and quantization on algorithm performance, as well as how network topologies and algorithm-level design choices can scale latency, energy and area consumption of a chip. To the best of our knowledge, this work is the first to compare the impact of different learning algorithms on Compute-In-Memory-based hardware and vice versa. The best results achieved for accuracy remain Backpropagation-based, notably when facing hardware imperfections. Direct Feedback Alignment, on the other hand, allows for significant speedup due to parallelization, reducing training time by a factor approaching N for N-layered networks.

Submitted 11 April, 2023; v1 submitted 29 December, 2022; originally announced December 2022.

arXiv:2211.14928 [pdf, ps, other]

Class-based Quantization for Neural Networks

Authors: Wenhao Sun, Grace Li Zhang, Huaxi Gu, Bing Li, Ulf Schlichtmann

Abstract: In deep neural networks (DNNs), there are a huge number of weights and multiply-and-accumulate (MAC) operations. Accordingly, it is challenging to apply DNNs on resource-constrained platforms, e.g., mobile phones. Quantization is a method to reduce the size and the computational complexity of DNNs. Existing quantization methods either require hardware overhead to achieve a non-uniform quantization or focus on model-wise and layer-wise uniform quantization, which are not as fine-grained as filter-wise quantization. In this paper, we propose a class-based quantization method to determine the minimum number of quantization bits for each filter or neuron in DNNs individually. In the proposed method, the importance score of each filter or neuron with respect to the number of classes in the dataset is first evaluated. The larger the score is, the more important the filter or neuron is and thus the larger the number of quantization bits should be. Afterwards, a search algorithm is adopted to exploit the different importance of filters and neurons to determine the number of quantization bits of each filter or neuron. Experimental results demonstrate that the proposed method can maintain the inference accuracy with low bit-width quantization. Given the same number of quantization bits, the proposed method can also achieve a better inference accuracy than the existing methods.

Submitted 27 November, 2022; originally announced November 2022.

Comments: accepted by DATE2023 (Design, Automation and Test in Europe)

arXiv:2211.14926 [pdf, other]

SteppingNet: A Stepping Neural Network with Incremental Accuracy Enhancement

Authors: Wenhao Sun, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Huaxi Gu, Bing Li, Ulf Schlichtmann

Abstract: Deep neural networks (DNNs) have successfully been applied in many fields in the past decades. However, the increasing number of multiply-and-accumulate (MAC) operations in DNNs prevents their application in resource-constrained and resource-varying platforms, e.g., mobile phones and autonomous vehicles. In such platforms, neural networks need to provide acceptable results quickly and the accuracy of the results should be able to be enhanced dynamically according to the computational resources available in the computing system. To address these challenges, we propose a design framework called SteppingNet. SteppingNet constructs a series of subnets whose accuracy is incrementally enhanced as more MAC operations become available. Therefore, this design allows a trade-off between accuracy and latency. In addition, the larger subnets in SteppingNet are built upon smaller subnets, so that the results of the latter can directly be reused in the former without recomputation. This property allows SteppingNet to decide on-the-fly whether to enhance the inference accuracy by executing further MAC operations. Experimental results demonstrate that SteppingNet provides an effective incremental accuracy improvement and its inference accuracy consistently outperforms the state-of-the-art work under the same limit of computational resources.

Submitted 27 November, 2022; originally announced November 2022.

Comments: accepted by DATE2023 (Design, Automation and Test in Europe)

arXiv:2211.14917 [pdf, other]

CorrectNet: Robustness Enhancement of Analog In-Memory Computing for Neural Networks by Error Suppression and Compensation

Authors: Amro Eldebiky, Grace Li Zhang, Georg Boecherer, Bing Li, Ulf Schlichtmann

Abstract: The last decade has witnessed the breakthrough of deep neural networks (DNNs) in many fields. With the increasing depth of DNNs, hundreds of millions of multiply-and-accumulate (MAC) operations need to be executed. To accelerate such operations efficiently, analog in-memory computing platforms based on emerging devices, e.g., resistive RAM (RRAM), have been introduced. These acceleration platforms rely on analog properties of the devices and thus suffer from process variations and noise. Consequently, weights in neural networks configured into these platforms can deviate from the expected values, which may lead to feature errors and a significant degradation of inference accuracy. To address this issue, in this paper, we propose a framework to enhance the robustness of neural networks under variations and noise. First, a modified Lipschitz constant regularization is proposed during neural network training to suppress the amplification of errors propagated through network layers. Afterwards, error compensation is introduced at necessary locations determined by reinforcement learning to rescue the feature maps with remaining errors. Experimental results demonstrate that inference accuracy of neural networks can be recovered from as low as 1.69% under variations and noise back to more than 95% of their original accuracy, while the training and hardware cost are negligible.

Submitted 27 November, 2022; originally announced November 2022.

Comments: Accepted by DATE 2023 (Design, Automation and Test in Europe)

arXiv:2203.05516 [pdf, other]

VirtualSync+: Timing Optimization with Virtual Synchronization

Authors: Grace Li Zhang, Bing Li, Xing Huang, Xunzhao Yin, Cheng Zhuo, Masanori Hashimoto, Ulf Schlichtmann

Abstract: In digital circuit designs, sequential components such as flip-flops are used to synchronize signal propagations. Logic computations are aligned at and thus isolated by flip-flop stages. Although this fully synchronous style can reduce design efforts significantly, it may affect circuit performance negatively, because sequential components can only introduce delays into signal propagations but never accelerate them. In this paper, we propose a new timing model, VirtualSync+, in which signals, specially those along critical paths, are allowed to propagate through several sequential stages without flip-flops. Timing constraints are still satisfied at the boundary of the optimized circuit to maintain a consistent interface with existing designs. By removing clock-to-q delays and setup time requirements of flip-flops on critical paths, the performance of a circuit can be pushed even beyond the limit of traditional sequential designs. In addition, we further enhance the optimization with VirtualSync+ by fine-tuning with commercial design tools, e.g., Design Compiler from Synopsys, to achieve more accurate result. Experimental results demonstrate that circuit performance can be improved by up to 4% (average 1.5%) compared with that after extreme retiming and sizing, while the increase of area is still negligible. This timing performance is enhanced beyond the limit of traditional sequential designs. It also demonstrates that compared with those after retiming and sizing, the circuits with VirtualSync+ can achieve better timing performance under the same area cost or smaller area cost under the same clock period, respectively.

Submitted 10 March, 2022; originally announced March 2022.

arXiv:2109.09502 [pdf, other]

doi 10.1109/ASP-DAC52403.2022.9712595

Differentially Evolving Memory Ensembles: Pareto Optimization based on Computational Intelligence for Embedded Memories on a System Level

Authors: Felix Last, Ceren Yeni, Ulf Schlichtmann

Abstract: As the relative power, performance, and area (PPA) impact of embedded memories continues to grow, proper parameterization of each of the thousands of memories on a chip is essential. When the parameters of all memories of a product are optimized together as part of a single system, better trade-offs may be achieved than if the same memories were optimized in isolation. However, challenges such as a sparse solution space, conflicting objectives, and computationally expensive PPA estimation impede the application of common optimization heuristics. We show how the memory system optimization problem can be solved through computational intelligence. We apply a Pareto-based Differential Evolution to ensure unbiased optimization of multiple PPA objectives. To ensure efficient exploration of a sparse solution space, we repair individuals to yield feasible parameterizations. PPA is estimated efficiently in large batches by pre-trained regression neural networks. Our framework enables the system optimization of thousands of memories while keeping a small resource footprint. Evaluating our method on a tractable system, we find that our method finds diverse solutions which exhibit less than 0.5% distance from known global optima.

Submitted 20 September, 2021; originally announced September 2021.

Comments: Accepted as part of ASP-DAC 2022 special session

arXiv:2003.03269 [pdf, other]

doi 10.1145/3385262

Predicting Memory Compiler Performance Outputs using Feed-Forward Neural Networks

Authors: Felix Last, Max Haeberlein, Ulf Schlichtmann

Abstract: Typical semiconductor chips include thousands of mostly small memories. As memories contribute an estimated 25% to 40% to the overall power, performance, and area (PPA) of a chip, memories must be designed carefully to meet the system's requirements. Memory arrays are highly uniform and can be described by approximately 10 parameters depending mostly on the complexity of the periphery. Thus, to improve PPA utilization, memories are typically generated by memory compilers. A key task in the design flow of a chip is to find optimal memory compiler parametrizations which on the one hand fulfill system requirements while on the other hand optimize PPA. Although most compiler vendors also provide optimizers for this task, these are often slow or inaccurate. To enable efficient optimization in spite of long compiler run times, we propose training fully connected feed-forward neural networks to predict PPA outputs given a memory compiler parametrization. Using an exhaustive search-based optimizer framework which obtains neural network predictions, PPA-optimal parametrizations are found within seconds after chip designers have specified their requirements. Average model prediction errors of less than 3%, a decision reliability of over 99% and productive usage of the optimizer for successful, large volume chip design projects illustrate the effectiveness of the approach.

Submitted 5 March, 2020; originally announced March 2020.

Comments: 23 pages, 8 figures, 4 tables; accepted for publication in the ACM TODAES special issue on machine learning for CAD (ML-CAD)

Journal ref: ACM Trans. Des. Autom. Electron. Syst. 25, 5 (2020) 39

arXiv:2003.00862 [pdf, other]

doi 10.1109/TCAD.2020.2974338

TimingCamouflage+: Netlist Security Enhancement with Unconventional Timing (with Appendix)

Authors: Grace Li Zhang, Bing Li, Meng Li, Bei Yu, David Z. Pan, Michaela Brunner, Georg Sigl, Ulf Schlichtmann

Abstract: With recent advances in reverse engineering, attackers can reconstruct a netlist to counterfeit chips by opening the die and scanning all layers of authentic chips. This relatively easy counterfeiting is made possible by the use of the standard simple clocking scheme, where all combinational blocks function within one clock period, so that a netlist of combinational logic gates and flip-flops is sufficient to duplicate a design. In this paper, we propose to invalidate the assumption that a netlist completely represents the function of a circuit with unconventional timing. With the introduced wave-pipelining paths, attackers have to capture gate and interconnect delays during reverse engineering, or to test a huge number of combinational paths to identify the wave-pipelining paths. To hinder the test-based attack, we construct false paths with wave-pipelining to increase the counterfeiting challenge. Experimental results confirm that wave-pipelining true paths and false paths can be constructed in benchmark circuits successfully with only a negligible cost, thus thwarting the potential attack techniques.

Submitted 2 March, 2020; originally announced March 2020.

arXiv:1705.04998 [pdf, other]

Transport or Store? Synthesizing Flow-based Microfluidic Biochips using Distributed Channel Storage

Authors: Chunfeng Liu, Bing Li, Hailong Yao, Paul Pop, Tsung-Yi Ho, Ulf Schlichtmann

Abstract: Flow-based microfluidic biochips have attracted much attention in the EDA community due to their miniaturized size and execution efficiency. Previous research, however, still follows the traditional computing model with a dedicated storage unit, which actually becomes a bottleneck of the performance of biochips. In this paper, we propose the first architectural synthesis framework considering distributed storage constructed temporarily from transportation channels to cache fluid samples. Since distributed storage can be accessed more efficiently than a dedicated storage unit and channels can switch between the roles of transportation and storage easily, biochips with this distributed computing architecture can achieve a higher execution efficiency even with fewer resources. Experimental results confirm that the execution efficiency of a bioassay can be improved by up to 28% while the number of valves in the biochip can be reduced effectively.

Submitted 14 May, 2017; originally announced May 2017.

Comments: ACM/IEEE Design Automation Conference (DAC), June 2017

arXiv:1705.04996 [pdf, other]

Testing Microfluidic Fully Programmable Valve Arrays (FPVAs)

Authors: Chunfeng Liu, Bing Li, Bhargab B. Bhattacharya, Krishnendu Chakrabarty, Tsung-Yi Ho, Ulf Schlichtmann

Abstract: Fully Programmable Valve Array (FPVA) has emerged as a new architecture for the next-generation flow-based microfluidic biochips. This 2D-array consists of regularly-arranged valves, which can be dynamically configured by users to realize microfluidic devices of different shapes and sizes as well as interconnections. Additionally, the regularity of the underlying structure renders FPVAs easier to integrate on a tiny chip. However, these arrays may suffer from various manufacturing defects such as blockage and leakage in control and flow channels. Unfortunately, no efficient method is yet known for testing such a general-purpose architecture. In this paper, we present a novel formulation using the concept of flow paths and cut-sets, and describe an ILP-based hierarchical strategy for generating compact test sets that can detect multiple faults in FPVAs. Simulation results demonstrate the efficacy of the proposed method in detecting manufacturing faults with only a small number of test vectors.

Submitted 14 May, 2017; originally announced May 2017.

Comments: Design, Automation and Test in Europe (DATE), March 2017

arXiv:1705.04995 [pdf, other]

doi 10.1109/TCAD.2017.2702632

Design-Phase Buffer Allocation for Post-Silicon Clock Binning by Iterative Learning

Authors: Li Zhang, Bing Li, Jinglan Liu, Yiyu Shi, Ulf Schlichtmann

Abstract: At submicron manufacturing technology nodes, process variations affect circuit performance significantly. To counter these variations, engineers are reserving more timing margin to maintain yield, leading to an unaffordable overdesign. Most of these margins, however, are wasted after manufacturing, because process variations cause only some chips to be really slow, while other chips can easily meet given timing specifications. To reduce this pessimism, we can reserve less timing margin and tune failed chips after manufacturing with clock buffers to make them meet timing specifications. With this post-silicon clock tuning, critical paths can be balanced with neighboring paths in each chip specifically to counter the effect of process variations. Consequently, chips with timing failures can be rescued and the yield can thus be improved. This is especially useful in high-performance designs, e.g., high-end CPUs, where clock binning makes chips with higher performance much more profitable. In this paper, we propose a method to determine where to insert post-silicon tuning buffers during the design phase to improve the overall profit with clock binning. This method learns the buffer locations with a Sobol sequence iteratively and reduces the buffer ranges afterwards with tuning concentration and buffer grouping. Experimental results demonstrate that the proposed method can achieve a profit improvement of about 14% on average and up to 26%, with only a small number of tuning buffers inserted into the circuit.

Submitted 14 May, 2017; originally announced May 2017.

Comments: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017

arXiv:1705.04993 [pdf, other]

doi 10.1145/2966986.2967064

PieceTimer: A Holistic Timing Analysis Framework Considering Setup/Hold Time Interdependency Using A Piecewise Model

Authors: Grace Li Zhang, Bing Li, Ulf Schlichtmann

Abstract: In static timing analysis, clock-to-q delays of flip-flops are considered as constants. Setup times and hold times are characterized separately and also used as constants. The characterized delays, setup times and hold times, are applied in timing analysis independently to verify the performance of circuits. In reality, however, clock-to-q delays of flip-flops depend on both setup and hold times. Instead of being constants, these delays change with respect to different setup/hold time combinations. Consequently, the simple abstraction of setup/hold times and constant clock-to-q delays introduces inaccuracy in timing analysis. In this paper, we propose a holistic method to consider the relation between clock-to-q delays and setup/hold time combinations with a piecewise linear model. The result is more accurate than that of traditional timing analysis, and the incorporation of the interdependency between clock-to-q delays, setup times and hold times may also improve circuit performance.

Submitted 14 May, 2017; originally announced May 2017.

Comments: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2016

arXiv:1705.04992 [pdf, other]

doi 10.1145/2897937.2898017

EffiTest: Efficient Delay Test and Statistical Prediction for Configuring Post-silicon Tunable Buffers

Authors: Grace Li Zhang, Bing Li, Ulf Schlichtmann

Abstract: At nanometer manufacturing technology nodes, process variations significantly affect circuit performance. To combat them, post-silicon clock tuning buffers can be deployed to balance timing budgets of critical paths for each individual chip after manufacturing. The challenge of this method is that path delays should be measured for each chip to configure the tuning buffers properly. Current methods for this delay measurement rely on path-wise frequency stepping. This strategy, however, requires too much time from expensive testers. In this paper, we propose an efficient delay test framework (EffiTest) to solve the post-silicon testing problem by aligning path delays using the already-existing tuning buffers in the circuit. In addition, we only test representative paths and the delays of other paths are estimated by statistical delay prediction. Experimental results demonstrate that the proposed method can reduce the number of frequency stepping iterations by more than 94% with only a slight yield loss.

Submitted 14 May, 2017; originally announced May 2017.

Comments: ACM/IEEE Design Automation Conference (DAC), June 2016

arXiv:1705.04991 [pdf, other]

doi 10.1145/2897937.2898052

Novel CMOS RFIC Layout Generation with Concurrent Device Placement and Fixed-Length Microstrip Routing

Authors: Tsun-Ming Tseng, Bing Li, Ching-Feng Yeh, Hsiang-Chieh Jhan, Zuo-Ming Tsai, Mark Po-Hung Lin, Ulf Schlichtmann

Abstract: With advancing process technologies and booming IoT markets, millimeter-wave CMOS RFICs have been widely developed in recent years. Since the performance of CMOS RFICs is very sensitive to the precision of the layout, precise placement of devices and precisely matched microstrip lengths to given values have been a labor-intensive and time-consuming task, and thus become a major bottleneck for time to market. This paper introduces a progressive integer-linear-programming-based method to generate high-quality RFIC layouts satisfying very stringent routing requirements of microstrip lines, including spacing/non-crossing rules, precise length, and bend number minimization, within a given layout area. The resulting RFIC layouts excel in both performance and area with much fewer bends compared with the simulation-tuning based manual layout, while the layout generation time is significantly reduced from weeks to half an hour.

Submitted 14 May, 2017; originally announced May 2017.

Comments: ACM/IEEE Design Automation Conference (DAC), 2016

arXiv:1705.04990 [pdf, other]

doi 10.3850/9783981537079_0250

Sampling-based Buffer Insertion for Post-Silicon Yield Improvement under Process Variability

Authors: Grace Li Zhang, Bing Li, Ulf Schlichtmann

Abstract: At submicron manufacturing technology nodes process variations affect circuit performance significantly. This trend leads to a large timing margin and thus overdesign to maintain yield. To combat this pessimism, post-silicon clock tuning buffers can be inserted into circuits to balance timing budgets of critical paths with their neighbors. After manufacturing, these clock buffers can be configured for each chip individually so that chips with timing failures may be rescued to improve yield. In this paper, we propose a sampling-based method to determine the proper locations of these buffers. The goal of this buffer insertion is to reduce the number of buffers and their ranges, while still maintaining a good yield improvement. Experimental results demonstrate that our algorithm can achieve a significant yield improvement (up to 35%) with only a small number of buffers.

Submitted 14 May, 2017; originally announced May 2017.

Comments: Design, Automation and Test in Europe (DATE), 2016

arXiv:1705.04988 [pdf, other]

doi 10.1109/MDAT.2015.2492473

Storage and Caching: Synthesis of Flow-based Microfluidic Biochips

Authors: Tsun-Ming Tseng, Bing Li, Tsung-Yi Ho, Ulf Schlichtmann

Abstract: Flow-based microfluidic biochips are widely used in lab-on-a-chip experiments. In these chips, devices such as mixers and detectors connected by micro-channels execute specific operations. Intermediate fluid samples are saved in storage temporarily until target devices become available. However, if the storage unit does not have enough capacity, fluid samples must wait in devices, reducing their efficiency and thus increasing the overall execution time. Consequently, storage and caching of fluid samples in such microfluidic chips must be considered during synthesis to balance execution efficiency and chip area.

Submitted 14 May, 2017; originally announced May 2017.

Comments: IEEE Design and Test, December 2015

arXiv:1705.04986 [pdf, other]

doi 10.1109/TCAD.2015.2432143

Statistical Timing Analysis and Criticality Computation for Circuits with Post-Silicon Clock Tuning Elements

Authors: Bing Li, Ulf Schlichtmann

Abstract: Post-silicon clock tuning elements are widely used in high-performance designs to mitigate the effects of process variations and aging. Located on clock paths to flip-flops, these tuning elements can be configured through the scan chain so that clock skews to these flip-flops can be adjusted after manufacturing. Owing to the delay compensation across consecutive register stages enabled by the clock tuning elements, higher yield and enhanced robustness can be achieved. These benefits are, nonetheless, attained by increasing die area due to the inserted clock tuning elements. For balancing performance improvement and area cost, an efficient timing analysis algorithm is needed to evaluate the performance of such a circuit. So far this evaluation is only possible by Monte Carlo simulation, which is very time-consuming. In this paper, we propose an alternative method using graph transformation, which computes a parametric minimum clock period and is more than 10^4 times faster than Monte Carlo simulation while maintaining a good accuracy. This method also identifies the gates that are critical to circuit performance, so that a fast analysis-optimization flow becomes possible.

Submitted 14 May, 2017; originally announced May 2017.

Comments: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, November 2015

arXiv:1705.04984 [pdf, other]

doi 10.1109/TCAD.2015.2402657

ILP-based Alleviation of Dense Meander Segments with Prioritized Shifting and Progressive Fixing in PCB Routing

Authors: Tsun-Ming Tseng, Bing Li, Tsung-Yi Ho, Ulf Schlichtmann

Abstract: Length-matching is an important technique to balance delays of bus signals in high-performance PCB routing. Existing routers, however, may generate very dense meander segments. Signals propagating along these meander segments exhibit a speedup effect due to crosstalk between the segments of the same wire, thus leading to mismatch of arrival times even under the same physical wire length. In this paper, we present a post-processing method to enlarge the width and the distance of meander segments and hence distribute them more evenly on the board so that crosstalk can be reduced. In the proposed framework, we model the sharing of available routing areas after removing dense meander segments from the initial routing, as well as the generation of relaxed meander segments and their groups for wire length compensation. This model is transformed into an ILP problem and solved for a balanced distribution of wire patterns. In addition, we adjust the locations of long wire segments according to wire priorities to swap free spaces toward critical wires that need much length compensation. To reduce the problem space of the ILP model, we also introduce a progressive fixing technique so that wire patterns are grown gradually from the edge of the routing toward the center area. Experimental results show that the proposed method can expand meander segments significantly even under very tight area constraints, so that the speedup effect can be alleviated effectively in high-performance PCB designs.

Submitted 14 May, 2017; originally announced May 2017.

Journal ref: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34(6), 1000-1013, June 2015

arXiv:1705.04983 [pdf, other]

doi 10.1109/ICCAD.2013.6691193

Post-Route Alleviation of Dense Meander Segments in High-Performance Printed Circuit Boards

Authors: Tsun-Ming Tseng, Bing Li, Tsung-Yi Ho, Ulf Schlichtmann

Abstract: Length-matching is an important technique to balance delays of bus signals in high-performance PCB routing. Existing routers, however, may generate dense meander segments with small distance. Signals propagating across these meander segments exhibit a speedup effect due to crosstalks between the segments of the same wire, thus leading to mismatch of arrival times even with the same physical wire length. In this paper, we propose a post-processing method to enlarge the width and the distance of meander segments and distribute them more evenly on the board so that the crosstalks can be reduced. In the proposed framework, we model the sharing combinations of available routing areas after removing dense meander segments from the initial routing, as well as the generation of relaxed meander segments and their groups in subareas. Thereafter, this model is transformed into an ILP problem and solved efficiently. Experimental results show that the proposed method can extend the width and the distance of meander segments about two times even under very tight area constraints, so that the crosstalks and thus the speedup effect can be alleviated effectively in high-performance PCB designs.

Submitted 14 May, 2017; originally announced May 2017.

Comments: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2013

arXiv:1705.04982 [pdf, other]

doi 10.1145/2483028.2483123

Post-Route Refinement for High-Frequency PCBs Considering Meander Segment Alleviation

Authors: Tsun-Ming Tseng, Bing Li, Tsung-Yi Ho, Ulf Schlichtmann

Abstract: In this paper, we propose a post-processing framework which iteratively refines the routing results from an existing PCB router by removing dense meander segments. By swapping and detouring dense meander segments the proposed method can effectively alleviate accumulating crosstalk noise, while respecting pre-defined area constraints. Experimental results show more than 85% reduction of the meander segments and hence the noise cost.

Submitted 14 May, 2017; originally announced May 2017.

Comments: ACM Great Lakes Symposium on VLSI (GLSVLSI), 2013

arXiv:1705.04981 [pdf, other]

doi 10.1109/TCAD.2012.2228305

On Timing Model Extraction and Hierarchical Statistical Timing Analysis

Authors: Bing Li, Ning Chen, Yang Xu, Ulf Schlichtmann

Abstract: In this paper, we investigate the challenges to apply Statistical Static Timing Analysis (SSTA) in hierarchical design flow, where modules supplied by IP vendors are used to hide design details for IP protection and to reduce the complexity of design and verification. For the three basic circuit types, combinational, flip-flop-based and latch-controlled, we propose methods to extract timing models which contain interfacing as well as compressed internal constraints. Using these compact timing models the runtime of full-chip timing analysis can be reduced, while circuit details from IP vendors are not exposed. We also propose a method to reconstruct the correlation between modules during full-chip timing analysis. This correlation can not be incorporated into timing models because it depends on the layout of the corresponding modules in the chip. In addition, we investigate how to apply the extracted timing models with the reconstructed correlation to evaluate the performance of the complete design. Experiments demonstrate that using the extracted timing models and reconstructed correlation full-chip timing analysis can be several times faster than applying the flattened circuit directly, while the accuracy of statistical timing analysis is still well maintained.

Submitted 14 May, 2017; originally announced May 2017.

Journal ref: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32(3), 367-380, March 2013

arXiv:1705.04980 [pdf, other]

doi 10.1109/TCAD.2012.2202393

Statistical Timing Analysis for Latch-Controlled Circuits with Reduced Iterations and Graph Transformations

Authors: Bing Li, Ning Chen, Ulf Schlichtmann

Abstract: Level-sensitive latches are widely used in high-performance designs. For such circuits efficient statistical timing analysis algorithms are needed to take increasing process variations into account. But existing methods solving this problem are still computationally expensive and can only provide the yield at a given clock period. In this paper we propose a method combining reduced iterations and graph transformations. The reduced iterations extract setup time constraints and identify a subgraph for the following graph transformations handling the constraints from nonpositive loops. The combined algorithms are very efficient, more than 10 times faster than other existing methods, and result in a parametric minimum clock period, which together with the hold time constraints can be used to compute the yield at any given clock period very easily.

Submitted 14 May, 2017; originally announced May 2017.

Journal ref: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31(11), 1670-1683, November 2012

arXiv:1705.04979 [pdf, other]

doi 10.1109/ICCAD.2011.6105314

Fast Statistical Timing Analysis for Circuits with Post-Silicon Tunable Clock Buffers

Authors: Bing Li, Ning Chen, Ulf Schlichtmann

Abstract: Post-Silicon Tunable (PST) clock buffers are widely used in high performance designs to counter process variations. By allowing delay compensation between consecutive register stages, PST buffers can effectively improve the yield of digital circuits. To date, the evaluation of manufacturing yield in the presence of PST buffers is only possible using Monte Carlo simulation. In this paper, we propose an alternative method based on graph transformations, which is much faster, more than 1000 times, and computes a parametric minimum clock period. It also identifies the gates which are most critical to the circuit performance, therefore enabling a fast analysis-optimization flow.

Submitted 14 May, 2017; originally announced May 2017.

Comments: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2011

arXiv:1705.04976 [pdf, other]

doi 10.1145/1687399.1687463

Timing Model Extraction for Sequential Circuits Considering Process Variations

Authors: Bing Li, Ning Chen, Ulf Schlichtmann

Abstract: As semiconductor devices continue to scale down, process variations become more relevant for circuit design. Facing such variations, statistical static timing analysis is introduced to model variations more accurately so that the pessimism in traditional worst case timing analysis is reduced. Because all delays are modeled using correlated random variables, most statistical timing methods are much slower than corner based timing analysis. To speed up statistical timing analysis, we propose a method to extract timing models for flip-flop and latch based sequential circuits respectively. When such a circuit is used as a module in a hierarchical design, the timing model instead of the original circuit is used for timing analysis. The extracted timing models are much smaller than the original circuits. Experiments show that using extracted timing models accelerates timing verification by orders of magnitude compared to previous approaches using flat netlists directly. Accuracy is maintained, however, with the mean and standard deviation of the clock period both showing usually less than 1% error compared to Monte Carlo simulation on a number of benchmark circuits.

Submitted 14 May, 2017; originally announced May 2017.

Comments: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2009

arXiv:1705.04975 [pdf, other]

doi 10.1109/DATE.2009.5090869

On Hierarchical Statistical Static Timing Analysis

Authors: Bing Li, Ning Chen, Manuel Schmidt, Walter Schneider, Ulf Schlichtmann

Abstract: Statistical static timing analysis deals with the increasing variations in manufacturing processes to reduce the pessimism in the worst case timing analysis. Because of the correlation between delays of circuit components, timing model generation and hierarchical timing analysis face more challenges than in static timing analysis. In this paper, a novel method to generate timing models for combinational circuits considering variations is proposed. The resulting timing models have accurate input-output delays and are about 80% smaller than the original circuits. Additionally, an accurate hierarchical timing analysis method at design level using pre-characterized timing models is proposed. This method incorporates the correlation between modules by replacing independent random variables to improve timing accuracy. Experimental results show that the correlation between modules strongly affects the delay distribution of the hierarchical design and the proposed method has good accuracy compared with Monte Carlo simulation, but is faster by three orders of magnitude.

Submitted 14 May, 2017; originally announced May 2017.

Comments: Design, Automation and Test in Europe (DATE) 2009

arXiv:1705.02610 [pdf, other]

doi 10.1007/978-3-540-95948-9_16

Static Timing Model Extraction for Combinational Circuits

Authors: Bing Li, Christoph Knoth, Walter Schneider, Manuel Schmidt, Ulf Schlichtmann

Abstract: For large circuits, static timing analysis (STA) needs to be performed in a hierarchical manner to achieve higher performance in arrival time propagation. In hierarchical STA, efficient and accurate timing models of sub-modules need to be created. We propose a timing model extraction method that significantly reduces the size of timing models without losing any accuracy by removing redundant timing information. Circuit components which do not contribute to the delay of any input to output pair are removed. The proposed method is deterministic. Compared to the original models, the numbers of edges and vertices of the resulting timing models are reduced by 84% and 85% on average, respectively, which are significantly more than the results achieved by other methods.

Submitted 7 May, 2017; originally announced May 2017.

Comments: International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), 2008

MSC Class: 68W35 VLSI algorithms ACM Class: B.7

arXiv:1405.2909 [pdf]

Emulated ASIC Power and Temperature Monitor System for FPGA Prototyping of an Invasive MPSoC Computing Architecture

Authors: Elisabeth Glocker, Qingqing Chen, Asheque M. Zaidi, Ulf Schlichtmann, Doris Schmitt-Landsiedel

Abstract: In this contribution the emulation of an ASIC temperature and power monitoring system (TPMon) for FPGA prototyping is presented and tested to control processor temperatures under different control targets and operating strategies. The approach for emulating the power monitor is based on an instruction-level energy model. For emulating the temperature monitor, a thermal RC model is used. The monitoring system supplies an invasive MPSoC computing architecture with hardware status information (power and temperature data of the processors within the system). These data are required for resource-aware load distribution. As a proof of concept different operating strategies and control targets were evaluated for a 2-tile invasive MPSoC computing system.

Submitted 12 May, 2014; originally announced May 2014.

Comments: Presented at 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014) (arXiv:1405.2281)

Report number: Racing/2014/03

Showing 1–35 of 35 results for author: Schlichtmann, U