-
LiveMind: Low-latency Large Language Models with Simultaneous Inference
Authors:
Chuangtao Chen,
Grace Li Zhang,
Xunzhao Yin,
Cheng Zhuo,
Ulf Schlichtmann,
Bing Li
Abstract:
In this paper, we introduce a novel low-latency inference framework for large language models (LLMs) that enables them to perform inference with incomplete prompts. By reallocating computational processes to the prompt input phase, we achieve a substantial reduction in latency, thereby significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the visibility of the streaming prompt to the model, allowing it to infer from incomplete prompts or await additional prompts. Compared with traditional inference methods that utilize complete prompts, our approach demonstrates an average reduction of 59% in response latency on the MMLU-Pro dataset, while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing an LLM for inference and a small language model (SLM) for output, we achieve an average 68% reduction in response latency, alongside a 5.5% improvement in accuracy on the MMLU-Pro dataset compared with the SLM baseline. For long prompts exceeding 20 sentences, the response latency can be reduced by up to 93%.
Submitted 20 June, 2024;
originally announced June 2024.
-
Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference
Authors:
Christopher Wolters,
Xiaoxuan Yang,
Ulf Schlichtmann,
Toyotaro Suzumura
Abstract:
Large language models (LLMs) have recently transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations. This development necessitates speed, efficiency, and accessibility in LLM inference as the computational and memory requirements of these systems grow exponentially. Meanwhile, advancements in computing and memory capabilities are lagging behind, exacerbated by the discontinuation of Moore's law. With LLMs exceeding the capacity of single GPUs, they require complex, expert-level configurations for parallel processing. Memory accesses become significantly more expensive than computation, posing a challenge for efficient scaling, known as the memory wall. Here, compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by directly performing analog computations in memory, potentially reducing latency and power consumption. By closely integrating memory and compute elements, CIM eliminates the von Neumann bottleneck, reducing data movement and improving energy efficiency. This survey paper provides an overview and analysis of transformer-based models, reviewing various CIM architectures and exploring how they can address the imminent challenges of modern AI computing systems. We discuss transformer-related operators and their hardware acceleration schemes and highlight challenges, trends, and insights in corresponding CIM designs.
Submitted 12 June, 2024;
originally announced June 2024.
-
EncodingNet: A Novel Encoding-based MAC Design for Efficient Neural Network Acceleration
Authors:
Bo Liu,
Grace Li Zhang,
Xunzhao Yin,
Ulf Schlichtmann,
Bing Li
Abstract:
Deep neural networks (DNNs) have achieved great breakthroughs in many fields such as image classification and natural language processing. However, the execution of DNNs needs to conduct massive numbers of multiply-accumulate (MAC) operations on hardware and thus incurs a large power consumption. To address this challenge, we propose a novel digital MAC design based on encoding. In this new design, the multipliers are replaced by simple logic gates to project the results onto a wide bit representation. These bits carry individual position weights, which can be trained for specific neural networks to enhance inference accuracy. The outputs of the new multipliers are added by bit-wise weighted accumulation and the accumulation results are compatible with existing computing platforms accelerating neural networks with either uniform or non-uniform quantization. Since the multiplication function is replaced by simple logic projection, the critical paths in the resulting circuits become much shorter. Correspondingly, pipelining stages in the MAC array can be reduced, leading to a significantly smaller area as well as a better power efficiency. The proposed design has been synthesized and verified by ResNet18-Cifar10, ResNet20-Cifar100 and ResNet50-ImageNet. The experimental results confirmed the reduction of circuit area by up to 79.63% and the reduction of power consumption of executing DNNs by up to 70.18%, while the accuracy of the neural networks can still be well maintained.
Submitted 25 February, 2024;
originally announced February 2024.
-
Logic Design of Neural Networks for High-Throughput and Low-Power Applications
Authors:
Kangwei Xu,
Grace Li Zhang,
Ulf Schlichtmann,
Bing Li
Abstract:
Neural networks (NNs) have been successfully deployed in various fields. In NNs, a large number of multiply-accumulate (MAC) operations need to be performed. Most existing digital hardware platforms rely on parallel MAC units to accelerate these MAC operations. However, under a given area constraint, the number of MAC units in such platforms is limited, so MAC units have to be reused to perform MAC operations in a neural network. Accordingly, the throughput in generating classification results is not high, which prevents the application of traditional hardware platforms in extreme-throughput scenarios. Besides, the power consumption of such platforms is also high, mainly due to data movement. To overcome this challenge, in this paper, we propose to flatten and implement all the operations at neurons, e.g., MAC and ReLU, in a neural network with their corresponding logic circuits. To improve the throughput and reduce the power consumption of such logic designs, the weight values are embedded into the MAC units to simplify the logic, which can reduce the delay of the MAC units and the power consumption incurred by weight movement. The retiming technique is further used to improve the throughput of the logic circuits for neural networks. In addition, we propose a hardware-aware training method to reduce the area of logic designs of neural networks. Experimental results demonstrate that the proposed logic designs can achieve high throughput and low power consumption for several high-throughput applications.
Submitted 19 September, 2023;
originally announced September 2023.
-
MLonMCU: TinyML Benchmarking with Fast Retargeting
Authors:
Philipp van Kempen,
Rafael Stahl,
Daniel Mueller-Gritschneder,
Ulf Schlichtmann
Abstract:
While there exist many ways to deploy machine learning models on microcontrollers, it is non-trivial to choose the optimal combination of frameworks and targets for a given application. Thus, automating the end-to-end benchmarking flow is of high relevance nowadays. A tool called MLonMCU is proposed in this paper and demonstrated by benchmarking the state-of-the-art TinyML frameworks TFLite for Microcontrollers and TVM effortlessly with a large number of configurations in a low amount of time.
Submitted 15 June, 2023;
originally announced June 2023.
-
Computational and Storage Efficient Quadratic Neurons for Deep Neural Networks
Authors:
Chuangtao Chen,
Grace Li Zhang,
Xunzhao Yin,
Cheng Zhuo,
Ulf Schlichtmann,
Bing Li
Abstract:
Deep neural networks (DNNs) have been widely deployed across diverse domains such as computer vision and natural language processing. However, the impressive accomplishments of DNNs have been realized alongside extensive computational demands, thereby impeding their applicability on resource-constrained devices. To address this challenge, many researchers have been focusing on basic neuron structures, the fundamental building blocks of neural networks, to alleviate the computational and storage cost. In this work, an efficient quadratic neuron architecture distinguished by its enhanced utilization of second-order computational information is introduced. By virtue of their better expressivity, DNNs employing the proposed quadratic neurons can attain similar accuracy with fewer neurons and computational cost. Experimental results have demonstrated that the proposed quadratic neuron structure exhibits superior computational and storage efficiency across various tasks when compared with both linear and non-linear neurons in prior work.
Submitted 27 November, 2023; v1 submitted 10 June, 2023;
originally announced June 2023.
-
Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural Network Inference
Authors:
Rafael Stahl,
Daniel Mueller-Gritschneder,
Ulf Schlichtmann
Abstract:
Memory optimization for deep neural network (DNN) inference gains high relevance with the emergence of TinyML, which refers to the deployment of DNN inference tasks on tiny, low-power microcontrollers. Applications such as audio keyword detection or radar-based gesture recognition are heavily constrained by the limited memory on such tiny devices because DNN inference requires large intermediate run-time buffers to store activations and other intermediate data, which leads to high memory usage. In this paper, we propose a new Fused Depthwise Tiling (FDT) method for the memory optimization of DNNs, which, compared to existing tiling methods, reduces memory usage without inducing any run time overhead. FDT applies to a larger variety of network layers than existing tiling methods that focus on convolutions. It improves TinyML memory optimization significantly by reducing memory of models where this was not possible before and additionally providing alternative design points for models that show high run time overhead with existing methods. In order to identify the best tiling configuration, an end-to-end flow with a new path discovery method is proposed, which applies FDT and existing tiling methods in a fully automated way, including the scheduling of the operations and planning of the layout of buffers in memory. Out of seven evaluated models, FDT achieved significant memory reduction for two models by 76.2% and 18.1% where existing tiling methods could not be applied. Two other models showed a significant run time overhead with existing methods and FDT provided alternative design points with no overhead but reduced memory savings.
Submitted 31 March, 2023;
originally announced March 2023.
-
PowerPruning: Selecting Weights and Activations for Power-Efficient Neural Network Acceleration
Authors:
Richard Petri,
Grace Li Zhang,
Yiran Chen,
Ulf Schlichtmann,
Bing Li
Abstract:
Deep neural networks (DNNs) have been successfully applied in various fields. A major challenge of deploying DNNs, especially on edge devices, is power consumption, due to the large number of multiply-and-accumulate (MAC) operations. To address this challenge, we propose PowerPruning, a novel method to reduce power consumption in digital neural network accelerators by selecting weights that lead to less power consumption in MAC operations. In addition, the timing characteristics of the selected weights together with all activation transitions are evaluated. The weights and activations that lead to small delays are further selected. Consequently, the maximum delay of the sensitized circuit paths in the MAC units is reduced even without modifying MAC units, which thus allows a flexible scaling of supply voltage to reduce power consumption further. Together with retraining, the proposed method can reduce power consumption of DNNs on hardware by up to 78.3% with only a slight accuracy loss.
Submitted 27 November, 2023; v1 submitted 24 March, 2023;
originally announced March 2023.
-
Biologically Plausible Learning on Neuromorphic Hardware Architectures
Authors:
Christopher Wolters,
Brady Taylor,
Edward Hanson,
Xiaoxuan Yang,
Ulf Schlichtmann,
Yiran Chen
Abstract:
With an ever-growing number of parameters defining increasingly complex networks, Deep Learning has led to several breakthroughs surpassing human performance. As a result, data movement for these millions of model parameters causes a growing imbalance known as the memory wall. Neuromorphic computing is an emerging paradigm that confronts this imbalance by performing computations directly in analog memories. On the software side, the sequential Backpropagation algorithm prevents efficient parallelization and thus fast convergence. A novel method, Direct Feedback Alignment, resolves inherent layer dependencies by directly passing the error from the output to each layer. At the intersection of hardware/software co-design, there is a demand for developing algorithms that are tolerant of hardware nonidealities. Therefore, this work explores the interrelationship of implementing bio-plausible learning in-situ on neuromorphic hardware, emphasizing energy, area, and latency constraints. Using the benchmarking framework DNN+NeuroSim, we investigate the impact of hardware nonidealities and quantization on algorithm performance, as well as how network topologies and algorithm-level design choices can scale latency, energy and area consumption of a chip. To the best of our knowledge, this work is the first to compare the impact of different learning algorithms on Compute-In-Memory-based hardware and vice versa. The best results achieved for accuracy remain Backpropagation-based, notably when facing hardware imperfections. Direct Feedback Alignment, on the other hand, allows for significant speedup due to parallelization, reducing training time by a factor approaching N for N-layered networks.
Submitted 11 April, 2023; v1 submitted 29 December, 2022;
originally announced December 2022.
-
Class-based Quantization for Neural Networks
Authors:
Wenhao Sun,
Grace Li Zhang,
Huaxi Gu,
Bing Li,
Ulf Schlichtmann
Abstract:
In deep neural networks (DNNs), there are a huge number of weights and multiply-and-accumulate (MAC) operations. Accordingly, it is challenging to apply DNNs on resource-constrained platforms, e.g., mobile phones. Quantization is a method to reduce the size and the computational complexity of DNNs. Existing quantization methods either require hardware overhead to achieve a non-uniform quantization or focus on model-wise and layer-wise uniform quantization, which are not as fine-grained as filter-wise quantization. In this paper, we propose a class-based quantization method to determine the minimum number of quantization bits for each filter or neuron in DNNs individually. In the proposed method, the importance score of each filter or neuron with respect to the number of classes in the dataset is first evaluated. The larger the score is, the more important the filter or neuron is and thus the larger the number of quantization bits should be. Afterwards, a search algorithm is adopted to exploit the different importance of filters and neurons to determine the number of quantization bits of each filter or neuron. Experimental results demonstrate that the proposed method can maintain the inference accuracy with low bit-width quantization. Given the same number of quantization bits, the proposed method can also achieve a better inference accuracy than the existing methods.
Submitted 27 November, 2022;
originally announced November 2022.
-
SteppingNet: A Stepping Neural Network with Incremental Accuracy Enhancement
Authors:
Wenhao Sun,
Grace Li Zhang,
Xunzhao Yin,
Cheng Zhuo,
Huaxi Gu,
Bing Li,
Ulf Schlichtmann
Abstract:
Deep neural networks (DNNs) have successfully been applied in many fields in the past decades. However, the increasing number of multiply-and-accumulate (MAC) operations in DNNs prevents their application in resource-constrained and resource-varying platforms, e.g., mobile phones and autonomous vehicles. In such platforms, neural networks need to provide acceptable results quickly and the accuracy of the results should be able to be enhanced dynamically according to the computational resources available in the computing system. To address these challenges, we propose a design framework called SteppingNet. SteppingNet constructs a series of subnets whose accuracy is incrementally enhanced as more MAC operations become available. Therefore, this design allows a trade-off between accuracy and latency. In addition, the larger subnets in SteppingNet are built upon smaller subnets, so that the results of the latter can directly be reused in the former without recomputation. This property allows SteppingNet to decide on-the-fly whether to enhance the inference accuracy by executing further MAC operations. Experimental results demonstrate that SteppingNet provides an effective incremental accuracy improvement and its inference accuracy consistently outperforms the state-of-the-art work under the same limit of computational resources.
Submitted 27 November, 2022;
originally announced November 2022.
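The reuse property described in the abstract above (larger subnets building on the results of smaller ones) can be pictured with a toy example. This is only a sketch under strong simplifying assumptions, not the authors' design: it models a single neuron whose dot product is accumulated in stages, and the function name `stepped_outputs`, the weights, the inputs, and the stage boundaries are all hypothetical.

```python
from typing import List, Sequence


def stepped_outputs(weights: Sequence[float],
                    inputs: Sequence[float],
                    stage_ends: Sequence[int]) -> List[float]:
    """Return the accumulated dot product after each stage.

    Each stage extends the previous partial sum instead of recomputing it,
    so stage k only pays for the MAC operations it adds.
    """
    acc = 0.0
    start = 0
    results = []
    for end in stage_ends:
        for i in range(start, end):
            acc += weights[i] * inputs[i]  # one MAC operation
        results.append(acc)
        start = end
    return results


# Hypothetical values: the small "subnet" uses the first two weights,
# the larger one uses all four and reuses the first partial sum.
weights = [0.5, -1.0, 2.0, 0.25]
inputs = [2.0, 1.0, 0.5, 4.0]
partial_results = stepped_outputs(weights, inputs, stage_ends=[2, 4])
```

A caller could stop after the first stage when resources are scarce and continue to the second stage later, which mirrors the accuracy/latency trade-off the abstract mentions only in spirit.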
-
CorrectNet: Robustness Enhancement of Analog In-Memory Computing for Neural Networks by Error Suppression and Compensation
Authors:
Amro Eldebiky,
Grace Li Zhang,
Georg Boecherer,
Bing Li,
Ulf Schlichtmann
Abstract:
The last decade has witnessed the breakthrough of deep neural networks (DNNs) in many fields. With the increasing depth of DNNs, hundreds of millions of multiply-and-accumulate (MAC) operations need to be executed. To accelerate such operations efficiently, analog in-memory computing platforms based on emerging devices, e.g., resistive RAM (RRAM), have been introduced. These acceleration platforms rely on analog properties of the devices and thus suffer from process variations and noise. Consequently, weights in neural networks configured into these platforms can deviate from the expected values, which may lead to feature errors and a significant degradation of inference accuracy. To address this issue, in this paper, we propose a framework to enhance the robustness of neural networks under variations and noise. First, a modified Lipschitz constant regularization is proposed during neural network training to suppress the amplification of errors propagated through network layers. Afterwards, error compensation is introduced at necessary locations determined by reinforcement learning to rescue the feature maps with remaining errors. Experimental results demonstrate that inference accuracy of neural networks can be recovered from as low as 1.69% under variations and noise back to more than 95% of their original accuracy, while the training and hardware cost are negligible.
Submitted 27 November, 2022;
originally announced November 2022.
-
VirtualSync+: Timing Optimization with Virtual Synchronization
Authors:
Grace Li Zhang,
Bing Li,
Xing Huang,
Xunzhao Yin,
Cheng Zhuo,
Masanori Hashimoto,
Ulf Schlichtmann
Abstract:
In digital circuit designs, sequential components such as flip-flops are used to synchronize signal propagations. Logic computations are aligned at and thus isolated by flip-flop stages. Although this fully synchronous style can reduce design efforts significantly, it may affect circuit performance negatively, because sequential components can only introduce delays into signal propagations but never accelerate them. In this paper, we propose a new timing model, VirtualSync+, in which signals, especially those along critical paths, are allowed to propagate through several sequential stages without flip-flops. Timing constraints are still satisfied at the boundary of the optimized circuit to maintain a consistent interface with existing designs. By removing clock-to-q delays and setup time requirements of flip-flops on critical paths, the performance of a circuit can be pushed even beyond the limit of traditional sequential designs. In addition, we further enhance the optimization with VirtualSync+ by fine-tuning with commercial design tools, e.g., Design Compiler from Synopsys, to achieve more accurate results. Experimental results demonstrate that circuit performance can be improved by up to 4% (average 1.5%) compared with that after extreme retiming and sizing, while the increase of area is still negligible. This timing performance is enhanced beyond the limit of traditional sequential designs. The results also demonstrate that, compared with circuits after retiming and sizing, the circuits with VirtualSync+ can achieve better timing performance under the same area cost or smaller area cost under the same clock period.
Submitted 10 March, 2022;
originally announced March 2022.
-
Differentially Evolving Memory Ensembles: Pareto Optimization based on Computational Intelligence for Embedded Memories on a System Level
Authors:
Felix Last,
Ceren Yeni,
Ulf Schlichtmann
Abstract:
As the relative power, performance, and area (PPA) impact of embedded memories continues to grow, proper parameterization of each of the thousands of memories on a chip is essential. When the parameters of all memories of a product are optimized together as part of a single system, better trade-offs may be achieved than if the same memories were optimized in isolation. However, challenges such as a sparse solution space, conflicting objectives, and computationally expensive PPA estimation impede the application of common optimization heuristics. We show how the memory system optimization problem can be solved through computational intelligence. We apply a Pareto-based Differential Evolution to ensure unbiased optimization of multiple PPA objectives. To ensure efficient exploration of a sparse solution space, we repair individuals to yield feasible parameterizations. PPA is estimated efficiently in large batches by pre-trained regression neural networks. Our framework enables the system optimization of thousands of memories while keeping a small resource footprint. Evaluating our method on a tractable system, we find that our method finds diverse solutions which exhibit less than 0.5% distance from known global optima.
Submitted 20 September, 2021;
originally announced September 2021.
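The Pareto-based selection mentioned in the abstract above rests on a dominance test between candidate solutions. The following is a minimal sketch of that test and of filtering a non-dominated set, assuming every objective is minimized. The function names and the (power, delay, area) tuples are made up for illustration and are not taken from the paper; the Differential Evolution and repair steps are not shown.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, in input order."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]


# Hypothetical (power, delay, area) estimates for four configurations.
candidates = [
    (1.0, 5.0, 3.0),
    (2.0, 4.0, 3.0),
    (2.0, 6.0, 4.0),  # dominated by the first candidate
    (1.5, 4.5, 2.5),
]
front = pareto_front(candidates)
```

Keeping the whole non-dominated set, rather than collapsing the objectives into one weighted score, is what avoids biasing the search toward a single PPA objective.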
-
Predicting Memory Compiler Performance Outputs using Feed-Forward Neural Networks
Authors:
Felix Last,
Max Haeberlein,
Ulf Schlichtmann
Abstract:
Typical semiconductor chips include thousands of mostly small memories. As memories contribute an estimated 25% to 40% to the overall power, performance, and area (PPA) of a chip, memories must be designed carefully to meet the system's requirements. Memory arrays are highly uniform and can be described by approximately 10 parameters depending mostly on the complexity of the periphery. Thus, to improve PPA utilization, memories are typically generated by memory compilers. A key task in the design flow of a chip is to find optimal memory compiler parametrizations which on the one hand fulfill system requirements while on the other hand optimize PPA. Although most compiler vendors also provide optimizers for this task, these are often slow or inaccurate. To enable efficient optimization in spite of long compiler run times, we propose training fully connected feed-forward neural networks to predict PPA outputs given a memory compiler parametrization. Using an exhaustive search-based optimizer framework which obtains neural network predictions, PPA-optimal parametrizations are found within seconds after chip designers have specified their requirements. Average model prediction errors of less than 3%, a decision reliability of over 99% and productive usage of the optimizer for successful, large volume chip design projects illustrate the effectiveness of the approach.
Submitted 5 March, 2020;
originally announced March 2020.
-
TimingCamouflage+: Netlist Security Enhancement with Unconventional Timing (with Appendix)
Authors:
Grace Li Zhang,
Bing Li,
Meng Li,
Bei Yu,
David Z. Pan,
Michaela Brunner,
Georg Sigl,
Ulf Schlichtmann
Abstract:
With recent advances in reverse engineering, attackers can reconstruct a netlist to counterfeit chips by opening the die and scanning all layers of authentic chips. This relatively easy counterfeiting is made possible by the use of the standard simple clocking scheme, where all combinational blocks function within one clock period, so that a netlist of combinational logic gates and flip-flops is sufficient to duplicate a design. In this paper, we propose to invalidate the assumption that a netlist completely represents the function of a circuit with unconventional timing. With the introduced wave-pipelining paths, attackers have to capture gate and interconnect delays during reverse engineering, or to test a huge number of combinational paths to identify the wave-pipelining paths. To hinder the test-based attack, we construct false paths with wave-pipelining to increase the counterfeiting challenge. Experimental results confirm that wave-pipelining true paths and false paths can be constructed in benchmark circuits successfully with only a negligible cost, thus thwarting the potential attack techniques.
Submitted 2 March, 2020;
originally announced March 2020.
-
Transport or Store? Synthesizing Flow-based Microfluidic Biochips using Distributed Channel Storage
Authors:
Chunfeng Liu,
Bing Li,
Hailong Yao,
Paul Pop,
Tsung-Yi Ho,
Ulf Schlichtmann
Abstract:
Flow-based microfluidic biochips have attracted much attention in the EDA community due to their miniaturized size and execution efficiency. Previous research, however, still follows the traditional computing model with a dedicated storage unit, which actually becomes a bottleneck of the performance of biochips. In this paper, we propose the first architectural synthesis framework considering distributed storage constructed temporarily from transportation channels to cache fluid samples. Since distributed storage can be accessed more efficiently than a dedicated storage unit and channels can switch between the roles of transportation and storage easily, biochips with this distributed computing architecture can achieve a higher execution efficiency even with fewer resources. Experimental results confirm that the execution efficiency of a bioassay can be improved by up to 28% while the number of valves in the biochip can be reduced effectively.
Submitted 14 May, 2017;
originally announced May 2017.
-
Testing Microfluidic Fully Programmable Valve Arrays (FPVAs)
Authors:
Chunfeng Liu,
Bing Li,
Bhargab B. Bhattacharya,
Krishnendu Chakrabarty,
Tsung-Yi Ho,
Ulf Schlichtmann
Abstract:
Fully Programmable Valve Array (FPVA) has emerged as a new architecture for the next-generation flow-based microfluidic biochips. This 2D-array consists of regularly-arranged valves, which can be dynamically configured by users to realize microfluidic devices of different shapes and sizes as well as interconnections. Additionally, the regularity of the underlying structure renders FPVAs easier to integrate on a tiny chip. However, these arrays may suffer from various manufacturing defects such as blockage and leakage in control and flow channels. Unfortunately, no efficient method is yet known for testing such a general-purpose architecture. In this paper, we present a novel formulation using the concept of flow paths and cut-sets, and describe an ILP-based hierarchical strategy for generating compact test sets that can detect multiple faults in FPVAs. Simulation results demonstrate the efficacy of the proposed method in detecting manufacturing faults with only a small number of test vectors.
Submitted 14 May, 2017;
originally announced May 2017.
-
Design-Phase Buffer Allocation for Post-Silicon Clock Binning by Iterative Learning
Authors:
Li Zhang,
Bing Li,
Jinglan Liu,
Yiyu Shi,
Ulf Schlichtmann
Abstract:
At submicron manufacturing technology nodes, process variations affect circuit performance significantly. To counter these variations, engineers are reserving more timing margin to maintain yield, leading to an unaffordable overdesign. Most of these margins, however, are wasted after manufacturing, because process variations cause only some chips to be really slow, while other chips can easily meet given timing specifications. To reduce this pessimism, we can reserve less timing margin and tune failed chips after manufacturing with clock buffers to make them meet timing specifications. With this post-silicon clock tuning, critical paths can be balanced with neighboring paths in each chip specifically to counter the effect of process variations. Consequently, chips with timing failures can be rescued and the yield can thus be improved. This is especially useful in high-performance designs, e.g., high-end CPUs, where clock binning makes chips with higher performance much more profitable. In this paper, we propose a method to determine where to insert post-silicon tuning buffers during the design phase to improve the overall profit with clock binning. This method learns the buffer locations with a Sobol sequence iteratively and reduces the buffer ranges afterwards with tuning concentration and buffer grouping. Experimental results demonstrate that the proposed method can achieve a profit improvement of about 14% on average and up to 26%, with only a small number of tuning buffers inserted into the circuit.
Submitted 14 May, 2017;
originally announced May 2017.
-
PieceTimer: A Holistic Timing Analysis Framework Considering Setup/Hold Time Interdependency Using A Piecewise Model
Authors:
Grace Li Zhang,
Bing Li,
Ulf Schlichtmann
Abstract:
In static timing analysis, clock-to-q delays of flip-flops are considered as constants. Setup times and hold times are characterized separately and also used as constants. The characterized delays, setup times and hold times, are applied in timing analysis independently to verify the performance of circuits. In reality, however, clock-to-q delays of flip-flops depend on both setup and hold times. Instead of being constants, these delays change with respect to different setup/hold time combinations. Consequently, the simple abstraction of setup/hold times and constant clock-to-q delays introduces inaccuracy in timing analysis. In this paper, we propose a holistic method to consider the relation between clock-to-q delays and setup/hold time combinations with a piecewise linear model. The result is more accurate than that of traditional timing analysis, and the incorporation of the interdependency between clock-to-q delays, setup times and hold times may also improve circuit performance.
Submitted 14 May, 2017;
originally announced May 2017.
-
EffiTest: Efficient Delay Test and Statistical Prediction for Configuring Post-silicon Tunable Buffers
Authors:
Grace Li Zhang,
Bing Li,
Ulf Schlichtmann
Abstract:
At nanometer manufacturing technology nodes, process variations significantly affect circuit performance. To combat them, post-silicon clock tuning buffers can be deployed to balance timing budgets of critical paths for each individual chip after manufacturing. The challenge of this method is that path delays should be measured for each chip to configure the tuning buffers properly. Current methods for this delay measurement rely on path-wise frequency stepping. This strategy, however, requires too much time from expensive testers. In this paper, we propose an efficient delay test framework (EffiTest) to solve the post-silicon testing problem by aligning path delays using the already-existing tuning buffers in the circuit. In addition, we only test representative paths and the delays of other paths are estimated by statistical delay prediction. Experimental results demonstrate that the proposed method can reduce the number of frequency stepping iterations by more than 94% with only a slight yield loss.
Submitted 14 May, 2017;
originally announced May 2017.
-
Novel CMOS RFIC Layout Generation with Concurrent Device Placement and Fixed-Length Microstrip Routing
Authors:
Tsun-Ming Tseng,
Bing Li,
Ching-Feng Yeh,
Hsiang-Chieh Jhan,
Zuo-Ming Tsai,
Mark Po-Hung Lin,
Ulf Schlichtmann
Abstract:
With advancing process technologies and booming IoT markets, millimeter-wave CMOS RFICs have been widely developed in recent years. Since the performance of CMOS RFICs is very sensitive to the precision of the layout, precise placement of devices and precisely matched microstrip lengths to given values have been a labor-intensive and time-consuming task, and thus become a major bottleneck for time to market. This paper introduces a progressive integer-linear-programming-based method to generate high-quality RFIC layouts satisfying very stringent routing requirements of microstrip lines, including spacing/non-crossing rules, precise length, and bend number minimization, within a given layout area. The resulting RFIC layouts excel in both performance and area with far fewer bends compared with the simulation-tuning based manual layout, while the layout generation time is significantly reduced from weeks to half an hour.
Submitted 14 May, 2017;
originally announced May 2017.
-
Sampling-based Buffer Insertion for Post-Silicon Yield Improvement under Process Variability
Authors:
Grace Li Zhang,
Bing Li,
Ulf Schlichtmann
Abstract:
At submicron manufacturing technology nodes, process variations affect circuit performance significantly. This trend leads to a large timing margin and thus overdesign to maintain yield. To combat this pessimism, post-silicon clock tuning buffers can be inserted into circuits to balance timing budgets of critical paths with their neighbors. After manufacturing, these clock buffers can be configured for each chip individually so that chips with timing failures may be rescued to improve yield. In this paper, we propose a sampling-based method to determine the proper locations of these buffers. The goal of this buffer insertion is to reduce the number of buffers and their ranges, while still maintaining a good yield improvement. Experimental results demonstrate that our algorithm can achieve a significant yield improvement (up to 35%) with only a small number of buffers.
Submitted 14 May, 2017;
originally announced May 2017.
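Illustrative sketch (not taken from the paper): the abstract above describes configuring a clock tuning buffer per chip so that a critical path can borrow timing budget from its neighbor. The toy Monte Carlo below shows that idea on a hypothetical two-stage pipeline with one tunable buffer at the middle flip-flop. The function names, the Gaussian delay model, and all numeric values are assumptions made for illustration, and setup and hold times are ignored.

```python
import random


def chip_passes(d12, d23, period, tuning_range):
    """Return True if some buffer shift x in [-tuning_range, tuning_range]
    satisfies both (simplified) setup constraints:
        d12 <= period + x   and   d23 <= period - x
    """
    lowest_x = max(-tuning_range, d12 - period)
    highest_x = min(tuning_range, period - d23)
    return lowest_x <= highest_x


def estimate_yield(period, tuning_range, samples=20000, seed=0):
    """Estimate yield by sampling the two stage delays (assumed Gaussian)."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(samples):
        d12 = rng.gauss(1.0, 0.1)  # hypothetical delay of stage 1
        d23 = rng.gauss(1.0, 0.1)  # hypothetical delay of stage 2
        if chip_passes(d12, d23, period, tuning_range):
            passed += 1
    return passed / samples


# Same seed, so both runs see the same sampled chips.
yield_without = estimate_yield(period=1.05, tuning_range=0.0)
yield_with = estimate_yield(period=1.05, tuning_range=0.1)
print(f"yield without buffer: {yield_without:.3f}")
print(f"yield with buffer:    {yield_with:.3f}")
```

With a tuning range of zero the check reduces to the ordinary per-stage constraint, so any chip that passes without the buffer also passes with it; the buffer can only rescue additional chips in this simplified model.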
-
Storage and Caching: Synthesis of Flow-based Microfluidic Biochips
Authors:
Tsun-Ming Tseng,
Bing Li,
Tsung-Yi Ho,
Ulf Schlichtmann
Abstract:
Flow-based microfluidic biochips are widely used in lab-on-a-chip experiments. In these chips, devices such as mixers and detectors connected by micro-channels execute specific operations. Intermediate fluid samples are saved in storage temporarily until target devices become available. However, if the storage unit does not have enough capacity, fluid samples must wait in devices, reducing their efficiency and thus increasing the overall execution time. Consequently, storage and caching of fluid samples in such microfluidic chips must be considered during synthesis to balance execution efficiency and chip area.
Submitted 14 May, 2017;
originally announced May 2017.
-
Statistical Timing Analysis and Criticality Computation for Circuits with Post-Silicon Clock Tuning Elements
Authors:
Bing Li,
Ulf Schlichtmann
Abstract:
Post-silicon clock tuning elements are widely used in high-performance designs to mitigate the effects of process variations and aging. Located on clock paths to flip-flops, these tuning elements can be configured through the scan chain so that clock skews to these flip-flops can be adjusted after manufacturing. Owing to the delay compensation across consecutive register stages enabled by the clock tuning elements, higher yield and enhanced robustness can be achieved. These benefits are, nonetheless, attained by increasing die area due to the inserted clock tuning elements. For balancing performance improvement and area cost, an efficient timing analysis algorithm is needed to evaluate the performance of such a circuit. So far this evaluation is only possible by Monte Carlo simulation, which is very time-consuming. In this paper, we propose an alternative method using graph transformation, which computes a parametric minimum clock period and is more than $10^4$ times faster than Monte Carlo simulation while maintaining a good accuracy. This method also identifies the gates that are critical to circuit performance, so that a fast analysis-optimization flow becomes possible.
Submitted 14 May, 2017;
originally announced May 2017.
-
ILP-based Alleviation of Dense Meander Segments with Prioritized Shifting and Progressive Fixing in PCB Routing
Authors:
Tsun-Ming Tseng,
Bing Li,
Tsung-Yi Ho,
Ulf Schlichtmann
Abstract:
Length-matching is an important technique to balance delays of bus signals in high-performance PCB routing. Existing routers, however, may generate very dense meander segments. Signals propagating along these meander segments exhibit a speedup effect due to crosstalk between the segments of the same wire, thus leading to mismatch of arrival times even under the same physical wire length. In this paper, we present a post-processing method to enlarge the width and the distance of meander segments and hence distribute them more evenly on the board so that crosstalk can be reduced. In the proposed framework, we model the sharing of available routing areas after removing dense meander segments from the initial routing, as well as the generation of relaxed meander segments and their groups for wire length compensation. This model is transformed into an ILP problem and solved for a balanced distribution of wire patterns. In addition, we adjust the locations of long wire segments according to wire priorities to swap free spaces toward critical wires that need much length compensation. To reduce the problem space of the ILP model, we also introduce a progressive fixing technique so that wire patterns are grown gradually from the edge of the routing toward the center area. Experimental results show that the proposed method can expand meander segments significantly even under very tight area constraints, so that the speedup effect can be alleviated effectively in high-performance PCB designs.
Submitted 14 May, 2017;
originally announced May 2017.
-
Post-Route Alleviation of Dense Meander Segments in High-Performance Printed Circuit Boards
Authors:
Tsun-Ming Tseng,
Bing Li,
Tsung-Yi Ho,
Ulf Schlichtmann
Abstract:
Length-matching is an important technique to balance delays of bus signals in high-performance PCB routing. Existing routers, however, may generate dense meander segments with small distance. Signals propagating across these meander segments exhibit a speedup effect due to crosstalk between the segments of the same wire, thus leading to mismatch of arrival times even with the same physical wire length. In this paper, we propose a post-processing method to enlarge the width and the distance of meander segments and distribute them more evenly on the board so that the crosstalk can be reduced. In the proposed framework, we model the sharing combinations of available routing areas after removing dense meander segments from the initial routing, as well as the generation of relaxed meander segments and their groups in subareas. Thereafter, this model is transformed into an ILP problem and solved efficiently. Experimental results show that the proposed method can extend the width and the distance of meander segments about two times even under very tight area constraints, so that the crosstalk and thus the speedup effect can be alleviated effectively in high-performance PCB designs.
Submitted 14 May, 2017;
originally announced May 2017.
-
Post-Route Refinement for High-Frequency PCBs Considering Meander Segment Alleviation
Authors:
Tsun-Ming Tseng,
Bing Li,
Tsung-Yi Ho,
Ulf Schlichtmann
Abstract:
In this paper, we propose a post-processing framework which iteratively refines the routing results from an existing PCB router by removing dense meander segments. By swapping and detouring dense meander segments, the proposed method can effectively alleviate accumulating crosstalk noise, while respecting pre-defined area constraints. Experimental results show more than 85% reduction of the meander segments and hence the noise cost.
Submitted 14 May, 2017;
originally announced May 2017.
-
On Timing Model Extraction and Hierarchical Statistical Timing Analysis
Authors:
Bing Li,
Ning Chen,
Yang Xu,
Ulf Schlichtmann
Abstract:
In this paper, we investigate the challenges of applying Statistical Static Timing Analysis (SSTA) in a hierarchical design flow, where modules supplied by IP vendors are used to hide design details for IP protection and to reduce the complexity of design and verification. For the three basic circuit types, combinational, flip-flop-based and latch-controlled, we propose methods to extract timing models which contain interfacing as well as compressed internal constraints. Using these compact timing models, the runtime of full-chip timing analysis can be reduced, while circuit details from IP vendors are not exposed. We also propose a method to reconstruct the correlation between modules during full-chip timing analysis. This correlation cannot be incorporated into timing models because it depends on the layout of the corresponding modules in the chip. In addition, we investigate how to apply the extracted timing models with the reconstructed correlation to evaluate the performance of the complete design. Experiments demonstrate that, using the extracted timing models and reconstructed correlation, full-chip timing analysis can be several times faster than applying the flattened circuit directly, while the accuracy of statistical timing analysis is still well maintained.
Submitted 14 May, 2017;
originally announced May 2017.
-
Statistical Timing Analysis for Latch-Controlled Circuits with Reduced Iterations and Graph Transformations
Authors:
Bing Li,
Ning Chen,
Ulf Schlichtmann
Abstract:
Level-sensitive latches are widely used in high-performance designs. For such circuits, efficient statistical timing analysis algorithms are needed to take increasing process variations into account. But existing methods solving this problem are still computationally expensive and can only provide the yield at a given clock period. In this paper, we propose a method combining reduced iterations and graph transformations. The reduced iterations extract setup time constraints and identify a subgraph for the following graph transformations handling the constraints from nonpositive loops. The combined algorithms are very efficient, more than 10 times faster than other existing methods, and result in a parametric minimum clock period, which together with the hold time constraints can be used to compute the yield at any given clock period very easily.
Submitted 14 May, 2017;
originally announced May 2017.
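Illustrative sketch (not taken from the paper): the abstract above states that a parametric minimum clock period can be used to compute the yield at any given clock period. As a minimal example of that last step, suppose the minimum clock period of a chip is summarized as a Gaussian random variable; this is an assumption made here, and the hold time constraints mentioned in the abstract are ignored. The yield at a target period is then the Gaussian CDF evaluated at that period. The function name and the numbers are hypothetical.

```python
import math


def yield_at_period(period, mean_tmin, std_tmin):
    """Probability that the minimum clock period does not exceed `period`,
    assuming it is Gaussian with the given mean and standard deviation."""
    z = (period - mean_tmin) / std_tmin
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical distribution: mean 1.00 ns, standard deviation 0.05 ns.
for target in (0.95, 1.00, 1.05, 1.10):
    print(f"T = {target:.2f} ns -> yield {yield_at_period(target, 1.00, 0.05):.3f}")
```

Because the distribution is computed once, evaluating the yield at another clock period is a single function call rather than a new analysis run.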
-
Fast Statistical Timing Analysis for Circuits with Post-Silicon Tunable Clock Buffers
Authors:
Bing Li,
Ning Chen,
Ulf Schlichtmann
Abstract:
Post-Silicon Tunable (PST) clock buffers are widely used in high performance designs to counter process variations. By allowing delay compensation between consecutive register stages, PST buffers can effectively improve the yield of digital circuits. To date, the evaluation of manufacturing yield in the presence of PST buffers is only possible using Monte Carlo simulation. In this paper, we propose an alternative method based on graph transformations, which is much faster, more than 1000 times, and computes a parametric minimum clock period. It also identifies the gates which are most critical to the circuit performance, therefore enabling a fast analysis-optimization flow.
Submitted 14 May, 2017;
originally announced May 2017.
-
Timing Model Extraction for Sequential Circuits Considering Process Variations
Authors:
Bing Li,
Ning Chen,
Ulf Schlichtmann
Abstract:
As semiconductor devices continue to scale down, process variations become more relevant for circuit design. Facing such variations, statistical static timing analysis is introduced to model variations more accurately so that the pessimism in traditional worst case timing analysis is reduced. Because all delays are modeled using correlated random variables, most statistical timing methods are much slower than corner based timing analysis. To speed up statistical timing analysis, we propose a method to extract timing models for flip-flop and latch based sequential circuits respectively. When such a circuit is used as a module in a hierarchical design, the timing model instead of the original circuit is used for timing analysis. The extracted timing models are much smaller than the original circuits. Experiments show that using extracted timing models accelerates timing verification by orders of magnitude compared to previous approaches using flat netlists directly. Accuracy is maintained, however, with the mean and standard deviation of the clock period both showing usually less than 1% error compared to Monte Carlo simulation on a number of benchmark circuits.
Submitted 14 May, 2017;
originally announced May 2017.
-
On Hierarchical Statistical Static Timing Analysis
Authors:
Bing Li,
Ning Chen,
Manuel Schmidt,
Walter Schneider,
Ulf Schlichtmann
Abstract:
Statistical static timing analysis deals with the increasing variations in manufacturing processes to reduce the pessimism in the worst case timing analysis. Because of the correlation between delays of circuit components, timing model generation and hierarchical timing analysis face more challenges than in static timing analysis. In this paper, a novel method to generate timing models for combinational circuits considering variations is proposed. The resulting timing models have accurate input-output delays and are about 80% smaller than the original circuits. Additionally, an accurate hierarchical timing analysis method at design level using pre-characterized timing models is proposed. This method incorporates the correlation between modules by replacing independent random variables to improve timing accuracy. Experimental results show that the correlation between modules strongly affects the delay distribution of the hierarchical design and the proposed method has good accuracy compared with Monte Carlo simulation, but is faster by three orders of magnitude.
Submitted 14 May, 2017;
originally announced May 2017.
-
Static Timing Model Extraction for Combinational Circuits
Authors:
Bing Li,
Christoph Knoth,
Walter Schneider,
Manuel Schmidt,
Ulf Schlichtmann
Abstract:
For large circuits, static timing analysis (STA) needs to be performed in a hierarchical manner to achieve higher performance in arrival time propagation. In hierarchical STA, efficient and accurate timing models of sub-modules need to be created. We propose a timing model extraction method that significantly reduces the size of timing models without losing any accuracy by removing redundant timing information. Circuit components which do not contribute to the delay of any input to output pair are removed. The proposed method is deterministic. Compared to the original models, the numbers of edges and vertices of the resulting timing models are reduced by 84% and 85% on average, respectively, which are significantly more than the results achieved by other methods.
Submitted 7 May, 2017;
originally announced May 2017.
-
Emulated ASIC Power and Temperature Monitor System for FPGA Prototyping of an Invasive MPSoC Computing Architecture
Authors:
Elisabeth Glocker,
Qingqing Chen,
Asheque M. Zaidi,
Ulf Schlichtmann,
Doris Schmitt-Landsiedel
Abstract:
In this contribution, the emulation of an ASIC temperature and power monitoring system (TPMon) for FPGA prototyping is presented and tested to control processor temperatures under different control targets and operating strategies. The approach for emulating the power monitor is based on an instruction-level energy model. For emulating the temperature monitor, a thermal RC model is used. The monitoring system supplies an invasive MPSoC computing architecture with hardware status information (power and temperature data of the processors within the system). These data are required for resource-aware load distribution. As a proof of concept, different operating strategies and control targets were evaluated for a 2-tile invasive MPSoC computing system.
Submitted 12 May, 2014;
originally announced May 2014.
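Illustrative sketch (not taken from the paper): the abstract above says the temperature monitor is emulated with a thermal RC model. A common first-order form, assumed here, treats the processor as one thermal capacitance C connected to ambient through one thermal resistance R, so that C * dT/dt = P - (T - T_ambient) / R. The code integrates this with forward Euler steps. The function name, the single-node model, and all parameter values are hypothetical and are not the parameters used by TPMon.

```python
def simulate_temperature(power_trace, r_th, c_th, t_ambient, dt):
    """Forward-Euler integration of C * dT/dt = P - (T - T_ambient) / R.

    power_trace: power samples in watts, one per time step of length dt (s)
    r_th: thermal resistance in K/W, c_th: thermal capacitance in J/K
    Returns the temperature (deg C) after each step.
    """
    temperature = t_ambient
    history = []
    for power in power_trace:
        heat_flow_out = (temperature - t_ambient) / r_th
        temperature += dt / c_th * (power - heat_flow_out)
        history.append(temperature)
    return history


# Constant 2 W load for 50 s; the time constant R * C is 5 s.
trace = simulate_temperature([2.0] * 5000, r_th=10.0, c_th=0.5,
                             t_ambient=25.0, dt=0.01)
print(f"temperature after 50 s: {trace[-1]:.2f} deg C")
```

Under a constant load the model settles at T_ambient + P * R, which is 45 deg C for the values above; in an emulated monitor the power samples would instead come from the power model.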