-
LLVM Static Analysis for Program Characterization and Memory Reuse Profile Estimation
Authors:
Atanu Barai,
Nandakishore Santhi,
Abdur Razzak,
Stephan Eidenbenz,
Abdel-Hameed A. Badawy
Abstract:
Profiling various application characteristics, including the number of different arithmetic operations performed, memory footprint, etc., dynamically is time- and space-consuming. On the other hand, static analysis methods, although fast, can be less accurate. This paper presents an LLVM-based probabilistic static analysis method that accurately predicts different program characteristics and estimates the reuse distance profile of a program by analyzing the LLVM IR file in constant time, regardless of program input size. We generate the basic-block-level control flow graph of the target application kernel and determine basic-block execution counts by solving the linear balance equation involving the adjacent basic blocks' transition probabilities. Finally, we represent the kernel memory accesses in a bracketed format and employ a recursive algorithm to calculate the reuse distance profile. The results show that our approach can predict application characteristics accurately compared to another LLVM-based dynamic code analysis tool, Byfl.
Submitted 20 November, 2023;
originally announced November 2023.
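The balance-equation step described in the abstract above can be illustrated with a minimal sketch. It assumes a control flow graph given as a matrix of transition probabilities and an entry block that executes exactly once; the function names `solve_linear` and `block_counts` and the three-block example are hypothetical and are not taken from the paper.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def block_counts(probs, entry=0):
    """Expected execution count of each basic block.

    probs[i][j] is the assumed probability of jumping from block i to
    block j. The balance equation for block i is
        n_i = [i == entry] + sum_j n_j * probs[j][i].
    """
    n = len(probs)
    a = [[(1.0 if i == j else 0.0) - probs[j][i] for j in range(n)]
         for i in range(n)]
    b = [1.0 if i == entry else 0.0 for i in range(n)]
    return solve_linear(a, b)


# Hypothetical kernel: entry -> loop body (repeats with p = 0.9) -> exit.
cfg = [
    [0.0, 1.0, 0.0],  # entry always falls into the loop body
    [0.0, 0.9, 0.1],  # loop body repeats or exits
    [0.0, 0.0, 0.0],  # exit block has no successors
]
counts = block_counts(cfg)
```

With a back-edge probability of 0.9, the loop body is expected to run about ten times for a single entry, independent of how long a dynamic trace would be.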
-
Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis
Authors:
Hamdy Abdelkhalik,
Yehia Arafa,
Nandakishore Santhi,
Abdel-Hameed Badawy
Abstract:
Graphics processing units (GPUs) are now considered the leading hardware to accelerate general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitecture features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and build more efficient workloads and applications. Many works have studied the recent Nvidia architectures, such as Volta and Turing, comparing them to their successor, Ampere. However, some microarchitecture features, such as the clock cycles for the different instructions, have not been extensively studied for the Ampere architecture. In this paper, we study the clock cycles per instruction for the various data types found in the instruction-set architecture (ISA) of Nvidia GPUs. Using microbenchmarks, we measure the clock cycles for PTX ISA instructions and their SASS ISA counterparts. We further calculate the clock cycles needed to access each memory unit. We also demystify the new version of the tensor core unit found in the Ampere architecture by using the WMMA API and measuring its clock cycles per instruction and throughput for the different data types and input shapes. The results found in this work should guide software developers and hardware architects. Furthermore, clock cycles per instruction are widely used by performance modeling simulators and tools to model and predict the performance of the hardware.
Submitted 23 August, 2022;
originally announced August 2022.
-
BB-ML: Basic Block Performance Prediction using Machine Learning Techniques
Authors:
Hamdy Abdelkhalik,
Shamminuj Aktar,
Yehia Arafa,
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Nishant Panda,
Nirmal Prajapati,
Nazmul Haque Turja,
Stephan Eidenbenz,
Abdel-Hameed Badawy
Abstract:
Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the level of Basic Blocks (BBs), which are single-entry, single-exit code blocks that compilers use to break down large code into manageable pieces. We extrapolate the basic block execution counts of GPU applications and use them for predicting the performance for large input sizes from the counts of smaller input sizes. We train a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets and an accuracy of 97.7% in predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications. We use a variety of metrics for evaluation, including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability of predicting the performance of large datasets with an average error rate of 0.85% and 0.17% for global and shared memory requests, respectively. Additionally, to address the utilization of the main functional units in Ampere architecture GPUs, we calculate the active cycles for tensor cores, ALU, FMA, and FP64 units and achieve an average error of 2.3% and 10.66% for ALU and FMA units, respectively, while the maximum observed error across all tested applications and units reaches 18.5%.
Submitted 11 November, 2023; v1 submitted 15 February, 2022;
originally announced February 2022.
-
PPT-Multicore: Performance Prediction of OpenMP applications using Reuse Profiles and Analytical Modeling
Authors:
Atanu Barai,
Yehia Arafa,
Abdel-Hameed Badawy,
Gopinath Chennupati,
Nandakishore Santhi,
Stephan Eidenbenz
Abstract:
We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) to predict the performance of parallel applications running on a multicore processor. PPT-Multicore builds upon our previous work towards a multicore cache model. We extract an LLVM basic-block-labeled memory trace using an architecture-independent LLVM-based instrumentation tool only once in an application's lifetime. The model uses the memory trace and other parameters from an instrumented, sequentially executed binary. We use a probabilistic and computationally efficient reuse profile to predict the cache hit rates and runtimes of OpenMP programs' parallel sections. We model Intel's Broadwell, Haswell, and AMD's Zen2 architectures and validate our framework using different applications from the PolyBench and PARSEC benchmark suites. The results show that PPT-Multicore can predict cache hit rates with an overall average error rate of 1.23% while predicting the runtime with an error rate of 9.08%.
Submitted 11 April, 2021;
originally announced April 2021.
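As a rough illustration of how a reuse profile relates to cache hit rates, the sketch below computes exact reuse distances from a short address trace and derives the hit rate of a fully associative LRU cache. This is the textbook exact calculation, not the probabilistic model of PPT-Multicore; the function names and the sample trace are hypothetical.

```python
from collections import Counter


def reuse_distances(trace):
    """Reuse distance of every access; None marks a first-time access.

    The reuse distance is the number of distinct addresses touched
    between two consecutive accesses to the same address.
    """
    stack = []  # LRU stack, most recently used address last
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def reuse_profile(trace):
    """Map each reuse distance to the fraction of accesses that have it."""
    distances = reuse_distances(trace)
    total = len(distances)
    return {d: c / total for d, c in Counter(distances).items()}


def lru_hit_rate(profile, cache_lines):
    """Hit rate of a fully associative LRU cache with `cache_lines` lines.

    An access hits exactly when its reuse distance is smaller than the
    number of lines; first-time accesses always miss.
    """
    return sum(p for d, p in profile.items()
               if d is not None and d < cache_lines)


trace = ["a", "b", "c", "a", "b", "c"]  # hypothetical cache-line addresses
profile = reuse_profile(trace)
hit_rate_3 = lru_hit_rate(profile, 3)
hit_rate_2 = lru_hit_rate(profile, 2)
```

Because the profile depends only on the access pattern, the same profile can be evaluated against several cache sizes without re-running the program. Set-associative caches would need an additional model on top of this.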
-
PPT-SASMM: Scalable Analytical Shared Memory Model: Predicting the Performance of Multicore Caches from a Single-Threaded Execution Trace
Authors:
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Abdel-Hameed Badawy,
Yehia Arafa,
Stephan Eidenbenz
Abstract:
Performance modeling of parallel applications on multicore processors remains a challenge in computational co-design due to multicore processors' complex design. Multicores include complex private and shared memory hierarchies. We present a Scalable Analytical Shared Memory Model (SASMM). SASMM can predict the performance of parallel applications running on a multicore. SASMM uses a probabilistic and computationally efficient method to predict the reuse distance profiles of caches in multicores. SASMM relies on a stochastic, static basic block-level analysis of reuse profiles. The profiles are calculated from the memory traces of applications that run sequentially rather than from multi-threaded traces. The experiments show that our model can predict private L1 cache hit rates with an error rate of 2.12% and shared L2 cache hit rates with an error rate of about 1.50%.
Submitted 19 March, 2021;
originally announced March 2021.
-
Machine Learning Enabled Scalable Performance Prediction of Scientific Codes
Authors:
Gopinath Chennupati,
Nandakishore Santhi,
Phill Romero,
Stephan Eidenbenz
Abstract:
We present the Analytical Memory Model with Pipelines (AMMP) of the Performance Prediction Toolkit (PPT). PPT-AMMP takes high-level source code and hardware architecture parameters as input and predicts the runtime of that code on the target hardware platform, which is defined in the input parameters. PPT-AMMP transforms the code to an (architecture-independent) intermediate representation, then (i) analyzes the basic block structure of the code, (ii) processes architecture-independent virtual memory access patterns that it uses to build memory reuse distance distribution models for each basic block, and (iii) runs detailed basic-block level simulations to determine hardware pipeline usage.
PPT-AMMP uses machine learning and regression techniques to build the prediction models based on small instances of the input code, then integrates them into a higher-order discrete-event simulation model of PPT running on the Simian PDES engine. We validate PPT-AMMP on four standard computational physics benchmarks and finally present a use case of hardware parameter sensitivity analysis to identify bottleneck hardware resources on different code inputs. We further extend PPT-AMMP to predict the performance of a scientific (radiation transport) application, SNAP. We analyze the application of multi-variate regression models that accurately predict the reuse profiles and the basic block counts. The predicted runtimes of SNAP are accurate when compared to the actual runtimes.
Submitted 12 November, 2020; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs
Authors:
Yehia Arafa,
Ammar ElWazir,
Abdelrahman ElKanishy,
Youssef Aly,
Ayatelrahman Elsayed,
Abdel-Hameed Badawy,
Gopinath Chennupati,
Stephan Eidenbenz,
Nandakishore Santhi
Abstract:
GPUs are prevalent in modern computing systems at all scales. They consume a significant fraction of the energy in these systems. However, vendors do not publish the actual cost of the power/energy overhead of their internal microarchitecture. In this paper, we accurately measure the energy consumption of various PTX instructions found in modern NVIDIA GPUs. We provide an exhaustive comparison of more than 40 instructions for four high-end NVIDIA GPUs from four different generations (Maxwell, Pascal, Volta, and Turing). Furthermore, we show the effect of the CUDA compiler optimizations on the energy consumption of each instruction. We use three different software techniques, all based on NVIDIA's NVML API, to read the GPU on-chip power sensors, and we provide an in-depth comparison between these techniques. Additionally, we verify the software measurement techniques against a custom-designed hardware power measurement. The results show that Volta GPUs have the best energy efficiency among all the generations for the different categories of instructions. This work should aid in understanding NVIDIA GPUs' microarchitecture. It should also make energy measurements of any GPU kernel both efficient and accurate.
Submitted 2 June, 2020; v1 submitted 18 February, 2020;
originally announced February 2020.
-
Modeling Shared Cache Performance of OpenMP Programs using Reuse Distance
Authors:
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Abdel-Hameed A. Badawy,
Stephan Eidenbenz
Abstract:
Performance modeling of parallel applications on multicore computers remains a challenge in computational co-design due to the complex design of multicore processors, including private and shared memory hierarchies. We present a Scalable Analytical Shared Memory Model to predict the performance of parallel applications that run on a multicore computer and share the same level of cache in the hierarchy. This model uses a computationally efficient, probabilistic method to predict the reuse distance profiles, where reuse distance is a hardware architecture-independent measure of the patterns of virtual memory accesses. It relies on a stochastic, static basic block-level analysis of reuse profiles measured from the memory traces of applications run sequentially on small instances rather than using a multi-threaded trace. The results indicate that the hit-rate predictions on the shared cache are accurate.
Submitted 29 July, 2019;
originally announced July 2019.
-
Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs
Authors:
Yehia Arafa,
Abdel-Hameed Badawy,
Gopinath Chennupati,
Nandakishore Santhi,
Stephan Eidenbenz
Abstract:
The last decade has seen a shift in the computer systems industry where heterogeneous computing has become prevalent. Graphics Processing Units (GPUs) are now present in everything from supercomputers to mobile phones and tablets. GPUs are used for graphics operations as well as general-purpose computing (GPGPUs) to boost the performance of compute-intensive applications. However, the percentage of undisclosed characteristics beyond what vendors provide is not small. In this paper, we introduce a very low overhead and portable analysis for exposing the latency of each instruction executing in the GPU pipeline(s) and the access overhead of the various memory hierarchies found in GPUs at the micro-architecture level. Furthermore, we show the impact of the various optimizations the CUDA compiler can perform on these latencies. We perform our evaluation on seven different high-end NVIDIA GPUs from five different generations/architectures: Kepler, Maxwell, Pascal, Volta, and Turing. The results in this paper can help architects obtain an accurate characterization of the latencies of these GPUs, which will help in modeling the hardware accurately. Software developers can also make informed optimizations to their applications.
Submitted 1 September, 2019; v1 submitted 21 May, 2019;
originally announced May 2019.
-
Quantum Algorithm Implementations for Beginners
Authors:
Abhijith J.,
Adetokunbo Adedoyin,
John Ambrosiano,
Petr Anisimov,
William Casper,
Gopinath Chennupati,
Carleton Coffrin,
Hristo Djidjev,
David Gunter,
Satish Karra,
Nathan Lemons,
Shizeng Lin,
Alexander Malyzhenkov,
David Mascarenas,
Susan Mniszewski,
Balu Nadiga,
Daniel O'Malley,
Diane Oyen,
Scott Pakin,
Lakshman Prasad,
Randy Roberts,
Phillip Romero,
Nandakishore Santhi,
Nikolai Sinitsyn,
Pieter J. Swart
, et al. (9 additional authors not shown)
Abstract:
As quantum computers become available to the general public, the need has arisen to train a cohort of quantum programmers, many of whom have been developing classical computer programs for most of their careers. While currently available quantum computers have fewer than 100 qubits, quantum computing hardware is widely expected to grow in terms of qubit count, quality, and connectivity. This review aims to explain the principles of quantum programming, which are quite different from classical programming, with straightforward algebra that makes understanding of the underlying fascinating quantum mechanical principles optional. We give an introduction to quantum computing algorithms and their implementation on real quantum hardware. We survey 20 different quantum algorithms, attempting to describe each in a succinct and self-contained fashion. We show how these algorithms can be implemented on IBM's quantum computer, and in each case, we discuss the results of the implementation with respect to differences between the simulator and the actual hardware runs. This article introduces computer scientists, physicists, and engineers to quantum algorithms and provides a blueprint for their implementations.
Submitted 26 June, 2022; v1 submitted 10 April, 2018;
originally announced April 2018.
-
Accelerator Codesign as Non-Linear Optimization
Authors:
Nirmal Prajapati,
Sanjay Rajopadhye,
Hristo Djidjev,
Nandkishore Santhi,
Tobias Grosser,
Rumen Andonov
Abstract:
We propose an optimization approach for determining both hardware and software parameters for the efficient implementation of a (family of) applications called dense stencil computations on programmable GPGPUs. We first introduce a simple, analytical model for the silicon area usage of accelerator architectures and a workload characterization of stencil computations. We combine this characterization with a parametric execution time model and formulate a mathematical optimization problem. That problem seeks to maximize a common objective function of 'all the hardware and software parameters'. The solution to this problem therefore "solves" the codesign problem: simultaneously choosing software-hardware parameters to optimize total performance.
We validate this approach by proposing architectural variants of the NVIDIA Maxwell GTX-980 (respectively, Titan X) specifically tuned to a predetermined workload of four common 2D stencils (Heat, Jacobi, Laplacian, and Gradient) and two 3D ones (Heat and Laplacian). Our model predicts that performance would potentially improve by 28% (respectively, 33%) with simple tweaks to the hardware parameters such as adapting coarse and fine-grained parallelism by changing the number of streaming multiprocessors and the number of compute cores each contains. We propose a set of Pareto-optimal design points to exploit the trade-off between performance and silicon area and show that by additionally eliminating GPU caches, we can get a further 2-fold improvement.
Submitted 13 December, 2017;
originally announced December 2017.
-
Understanding Cascading Failures in Power Grids
Authors:
Sachin Kadloor,
Nandakishore Santhi
Abstract:
In the past, we have observed several large blackouts, i.e. loss of power to large areas. It has been noted by several researchers that these large blackouts are a result of a cascade of failures of various components. As a power grid is made up of several thousands or even millions of components (relays, breakers, transformers, etc.), it is quite plausible that a few of these components do not perform their function as desired. Their failure/misbehavior puts additional burden on the working components causing them to misbehave, and thus leading to a cascade of failures.
The complexity of the entire power grid makes it difficult to model each and every individual component and study the stability of the entire system. For this reason, it is often the case that abstract models of the working of the power grid are constructed and then analyzed. These models need to be computationally tractable while serving as a reasonable model for the entire system. In this work, we construct one such model for the power grid, and analyze it.
Submitted 17 November, 2010;
originally announced November 2010.
-
On Algebraic Decoding of $q$-ary Reed-Muller and Product-Reed-Solomon Codes
Authors:
Nandakishore Santhi
Abstract:
We consider a list decoding algorithm recently proposed by Pellikaan-Wu \cite{PW2005} for $q$-ary Reed-Muller codes $\mathcal{RM}_q(\ell, m, n)$ of length $n \leq q^m$ when $\ell \leq q$. A simple and easily accessible correctness proof is given which shows that this algorithm achieves a relative error-correction radius of $\tau \leq (1 - \sqrt{\ell q^{m-1}/{n}})$. This is an improvement over the proof using one-point Algebraic-Geometric codes given in \cite{PW2005}. The described algorithm can be adapted to decode Product-Reed-Solomon codes.
We then propose a new low complexity recursive algebraic decoding algorithm for Reed-Muller and Product-Reed-Solomon codes. Our algorithm achieves a relative error correction radius of $\tau \leq \prod_{i=1}^m (1 - \sqrt{k_i/q})$. This technique is then proved to outperform the Pellikaan-Wu method in both complexity and error correction radius over a wide range of code rates.
Submitted 21 April, 2007;
originally announced April 2007.
-
On an Improvement over Rényi's Equivocation Bound
Authors:
Nandakishore Santhi,
Alexander Vardy
Abstract:
We consider the problem of estimating the probability of error in multi-hypothesis testing when the MAP criterion is used. This probability, which is also known as the Bayes risk, is an important measure in many communication and information theory problems. In general, the exact Bayes risk can be difficult to obtain. Many upper and lower bounds are known in the literature. One such upper bound is the equivocation bound due to Rényi, which is of great philosophical interest because it connects the Bayes risk to conditional entropy. Here we give a simple derivation for an improved equivocation bound.
We then give some typical examples of problems where these bounds can be of use. We first consider a binary hypothesis testing problem for which the exact Bayes risk is difficult to derive. In such problems bounds are of interest. Furthermore, using the bounds on Bayes risk derived in the paper and a random coding argument, we prove a lower bound on equivocation valid for most random codes over memoryless channels.
Submitted 22 August, 2006;
originally announced August 2006.
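The two quantities that an equivocation bound relates can be computed directly for a small example. The sketch below evaluates the MAP Bayes risk and the conditional entropy H(X|Y) from a joint distribution; it does not reproduce the improved bound from the paper, and the function names and the binary symmetric channel example are assumptions chosen for illustration.

```python
import math


def map_bayes_risk(joint):
    """Error probability of the MAP rule.

    joint[x][y] is P(X = x, Y = y). For each observation y the MAP rule
    picks the hypothesis x with the largest joint probability.
    """
    n_obs = len(joint[0])
    p_correct = sum(max(row[y] for row in joint) for y in range(n_obs))
    return 1.0 - p_correct


def equivocation_bits(joint):
    """Conditional entropy H(X|Y) in bits."""
    n_obs = len(joint[0])
    h = 0.0
    for y in range(n_obs):
        p_y = sum(row[y] for row in joint)
        for row in joint:
            if row[y] > 0:
                h -= row[y] * math.log2(row[y] / p_y)
    return h


# Hypothetical example: uniform binary source observed through a
# binary symmetric channel with crossover probability 0.1.
joint = [
    [0.45, 0.05],
    [0.05, 0.45],
]
risk = map_bayes_risk(joint)
equivocation = equivocation_bits(joint)
```

For this channel the Bayes risk equals the crossover probability, 0.1, and the equivocation is the binary entropy of 0.1, roughly 0.469 bits.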
-
Analog Codes on Graphs
Authors:
Nandakishore Santhi,
Alexander Vardy
Abstract:
We consider the problem of transmitting a sequence of real data produced by a Nyquist-sampled band-limited analog source over a band-limited analog channel, which introduces additive white Gaussian noise. An analog coding scheme is described, which can achieve a mean-squared error distortion proportional to $(1+SNR)^{-B}$ for a bandwidth expansion factor of $B/R$, where $0 < R < 1$ is the rate of individual component binary codes used in the construction and $B \geq 1$ is an integer. Thus, over a wide range of SNR values, the proposed code performs much better than any single previously known analog coding system.
Submitted 22 August, 2006;
originally announced August 2006.
-
A Quadratic Time-Space Tradeoff for Unrestricted Deterministic Decision Branching Programs
Authors:
Nandakishore Santhi,
Alexander Vardy
Abstract:
For a decision problem from coding theory, we prove a quadratic expected time-space tradeoff of the form $\eT\eS=\Omega(\tfrac{n^2}{q})$ for $q$-way deterministic decision branching programs, where $q\geq 2$. Here $\eT$ is the expected computation time and $\eS$ is the expected space, when all inputs are equally likely. This bound is, to our knowledge, the first to show an exponential size requirement whenever $\eT = O(n^2)$. Previous exponential size tradeoffs for Boolean decision branching programs were valid for time-restricted models with $T=o(n\log_2{n})$. Proving quadratic time-space tradeoffs for unrestricted time decision branching programs has been a major goal of recent research; this goal was already achieved for multiple-output branching programs two decades ago. We also show the first quadratic time-space tradeoffs for Boolean decision branching programs verifying circular convolution, matrix-vector multiplication, and the discrete Fourier transform. Furthermore, we demonstrate a constructive Boolean decision function which has a quadratic expected time-space tradeoff in the Boolean deterministic decision branching program model. When $q$ is a constant, the tradeoff results derived here for decision functions verifying various functions are order-comparable to previously known tradeoff bounds for calculating the corresponding multiple-output functions.
Submitted 17 November, 2010; v1 submitted 22 August, 2006;
originally announced August 2006.