-
Neural Combinatorial Optimization Algorithms for Solving Vehicle Routing Problems: A Comprehensive Survey with Perspectives
Authors:
Xuan Wu,
Di Wang,
Lijie Wen,
Yubin Xiao,
Chunguo Wu,
Yuesong Wu,
Chaoyu Yu,
Douglas L. Maskell,
You Zhou
Abstract:
Although several surveys on Neural Combinatorial Optimization (NCO) solvers specifically designed to solve Vehicle Routing Problems (VRPs) have been conducted, these existing surveys did not cover the state-of-the-art (SOTA) NCO solvers that emerged recently. More importantly, to provide a comprehensive taxonomy of NCO solvers with up-to-date coverage, based on our thorough review of relevant publications and preprints, we divide all NCO solvers into four distinct categories, namely Learning to Construct, Learning to Improve, Learning to Predict-Once, and Learning to Predict-Multiplicity solvers. Subsequently, we present the inadequacies of the SOTA solvers, including poor generalization, incapability to solve large-scale VRPs, inability to address most types of VRP variants simultaneously, and difficulty in comparing these NCO solvers with the conventional Operations Research algorithms. Simultaneously, we propose promising and viable directions to overcome these inadequacies. In addition, we compare the performance of representative NCO solvers from the Reinforcement, Supervised, and Unsupervised Learning paradigms across both small- and large-scale VRPs. Finally, following the proposed taxonomy, we provide an accompanying web page as a live repository for NCO solvers. Through this survey and the live repository, we hope to help the research community of NCO solvers for VRPs thrive.
Submitted 1 June, 2024;
originally announced June 2024.
-
Fault-Tolerant Design Approach Based on Approximate Computing
Authors:
P Balasubramanian,
D L Maskell
Abstract:
Triple Modular Redundancy (TMR) has traditionally been used to ensure complete tolerance to a single fault or a faulty processing unit, where the processing unit may be a circuit or a system. However, TMR incurs more than 200% overhead in terms of area and power compared to a single processing unit. Hence, alternative redundancy approaches were proposed in the literature to mitigate the design overheads associated with TMR, but they provide only partial or moderate fault tolerance. This research presents a new fault-tolerant design approach based on approximate computing, called FAC, that has the same fault tolerance as TMR and achieves significant reductions in the design metrics for physical implementation. FAC is suited for a plethora of error-tolerant applications. Here, the performance of TMR and FAC has been evaluated for a digital image processing application. The image processing results obtained confirm the usefulness of FAC. When an example processing unit was implemented using a 28-nm CMOS technology, FAC achieved a 15.3% reduction in delay, a 19.5% reduction in area, and a 24.7% reduction in power compared to TMR.
Submitted 1 November, 2023;
originally announced November 2023.
-
Gate-Level Static Approximate Adders
Authors:
P Balasubramanian,
R Nayar,
D L Maskell
Abstract:
This work compares and analyzes static approximate adders that are suitable for FPGA and ASIC type implementations. We consider many static approximate adders and evaluate their performance with respect to a digital image processing application using standard figures of merit such as peak signal-to-noise ratio and structural similarity index metric. We provide the error metrics of approximate adders, and the design metrics of accurate and approximate adders corresponding to FPGA and ASIC type implementations. For the FPGA implementation, we considered a Xilinx Artix-7 FPGA, and for an ASIC type implementation, we considered a 32-28 nm CMOS standard digital cell library. The inferences from this work could serve as a useful reference for determining an optimum static approximate adder for a practical application; in particular, we found the approximate adders HOAANED, HERLOA and M-HERLOA to be preferable.
Submitted 17 December, 2021;
originally announced December 2021.
-
Area Optimized Quasi Delay Insensitive Majority Voter for TMR Applications
Authors:
P Balasubramanian,
D L Maskell,
N E Mastorakis
Abstract:
Mission-critical and safety-critical applications generally tend to incorporate triple modular redundancy (TMR) to embed fault tolerance in their physical implementations. In a TMR realization, an original function block, which may be a circuit or a system, and two exact copies of the function block are used to successfully overcome any temporary fault or permanent failure of an arbitrary function block during the routine operation. The corresponding outputs of the function blocks are majority voted using 3-input majority voters whose outputs define the outputs of a TMR realization. Hence, a 3-input majority voter forms an important component of a TMR realization. Many synchronous majority voters and an asynchronous non-delay insensitive majority voter have been presented in the literature. Recently, quasi delay insensitive (QDI) asynchronous majority voters for TMR applications were also discussed in the literature. In this regard, this paper presents a new QDI asynchronous majority voter for TMR applications, which is better optimized in area compared to the existing QDI majority voters. The proposed QDI majority voter requires 30.2% less area compared to the best of the existing QDI majority voters, and this could be useful for resource-constrained fault tolerance applications. The example QDI TMR circuits were implemented using a 32/28nm complementary metal oxide semiconductor (CMOS) process. The delay insensitive dual rail code was used for data encoding, and 4-phase return-to-zero and return-to-one handshake protocols were used for data communication.
Submitted 13 August, 2020;
originally announced August 2020.
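The voting step described in the abstract above can be illustrated with a small behavioural model. This is only a sketch: it models the Boolean function of a 3-input majority voter in software and says nothing about the QDI circuit, dual-rail encoding, or handshaking proposed in the paper. The names `majority3` and `tmr` and the fault-injection argument are hypothetical, introduced here for illustration.

```python
def majority3(a: int, b: int, c: int) -> int:
    """Bitwise 3-input majority: an output bit is 1 when at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def tmr(block, x, fault=None):
    """Evaluate three copies of `block` on x and majority-vote their outputs.

    `fault` is an optional (copy_index, corrupt_fn) pair that corrupts the
    output of one copy, to mimic a single faulty function block.
    """
    outputs = [block(x) for _ in range(3)]
    if fault is not None:
        index, corrupt = fault
        outputs[index] = corrupt(outputs[index])
    return majority3(*outputs)


if __name__ == "__main__":
    block = lambda v: (v * 3) & 0xFF           # hypothetical 8-bit function block
    good = tmr(block, 5)                        # no fault
    masked = tmr(block, 5, fault=(1, lambda y: y ^ 0xFF))  # copy 1 fully inverted
    print(good, masked)
```

With a single corrupted copy, the two fault-free copies still agree on every bit, so the voted output matches the fault-free result.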
-
Indicating Asynchronous Multipliers
Authors:
P Balasubramanian,
D L Maskell,
N E Mastorakis
Abstract:
Multiplication is a basic arithmetic operation that is encountered in almost all general-purpose microprocessing and digital signal processing applications, and multiplication is physically realized using a multiplier. This paper discusses the physical implementation of indicating asynchronous multipliers, which are inherently elastic and modular and are robust to timing, process, and parametric variations. We consider the physical implementation of many weak-indication asynchronous multipliers using a 32/28-nm CMOS technology by adopting the array multiplier architecture. The multipliers are synthesized in a semi-custom ASIC-design style. The 4-phase return-to-zero (RTZ) and the 4-phase return-to-one (RTO) handshake protocols are considered for the data communication. The multipliers are realized using strong-indication or weak-indication full adders. A strong-indication 2-input AND function is used to generate the partial products in the case of both RTZ and RTO handshaking. The full adders considered are derived from different indicating asynchronous logic design methods. Among the multipliers considered, a weak-indication asynchronous multiplier utilizing the biased weak-indication full adder is found to be efficient in terms of the cycle time and the power-cycle time product with respect to both RTZ and RTO handshaking. Also, the 4-phase RTO handshake protocol is found to be preferable to the 4-phase RTZ handshake protocol for achieving enhanced optimizations in the design metrics.
Submitted 23 May, 2019;
originally announced May 2019.
-
Indicating Asynchronous Array Multipliers
Authors:
P Balasubramanian,
D L Maskell
Abstract:
Multiplication is an important arithmetic operation that is frequently encountered in microprocessing and digital signal processing applications, and multiplication is physically realized using a multiplier. This paper discusses the physical implementation of many indicating asynchronous array multipliers, which are inherently elastic and modular and are robust to timing, process and parametric variations. We consider the physical realization of many indicating asynchronous array multipliers using a 32/28nm CMOS technology. The weak-indication array multipliers comprise strong-indication or weak-indication full adders, and strong-indication 2-input AND functions to realize the partial products. The multipliers were synthesized in a semi-custom ASIC design style using standard library cells including a custom-designed 2-input C-element. 4x4 and 8x8 multiplication operations were considered for the physical implementations. The 4-phase return-to-zero (RTZ) and the 4-phase return-to-one (RTO) handshake protocols were utilized for data communication, and the delay-insensitive dual-rail code was used for data encoding. Among several weak-indication array multipliers, a weak-indication array multiplier utilizing a biased weak-indication full adder and the strong-indication 2-input AND function is found to have reduced cycle time and power-cycle time product with respect to RTZ and RTO handshaking for 4x4 and 8x8 multiplications. Further, the 4-phase RTO handshaking is found to be preferable to the 4-phase RTZ handshaking for achieving enhanced optimizations of the design metrics.
Submitted 14 May, 2019;
originally announced May 2019.
-
Speed and Energy Optimised Quasi-Delay-Insensitive Block Carry Lookahead Adder
Authors:
P. Balasubramanian,
D. L. Maskell,
N. E. Mastorakis
Abstract:
We present a new asynchronous quasi-delay-insensitive (QDI) block carry lookahead adder with redundancy carry (BCLARC) realized using delay-insensitive dual-rail data encoding and 4-phase return-to-zero (RTZ) and 4-phase return-to-one (RTO) handshaking. The proposed QDI BCLARC is found to be faster and more energy-efficient than the existing asynchronous adders, both QDI and non-QDI (i.e., relative-timed). Compared to existing asynchronous adders corresponding to various architectures such as ripple carry adder (RCA), conventional carry lookahead adder (CCLA), carry select adder (CSLA), BCLARC, and hybrid BCLARC-RCA, the proposed BCLARC is found to be faster and more energy-optimised. The cycle time (CT), which is the sum of forward and reverse latencies, governs the speed; and the product of average power dissipation and cycle time, viz. the power-cycle time product (PCTP), defines the low power/energy efficiency. For a 32-bit addition, the proposed QDI BCLARC achieves the following average reductions in design metrics over its counterparts when considering RTZ and RTO handshaking: i) 20.5% and 19.6% reductions in CT and PCTP respectively compared to an optimum QDI early output RCA, ii) 16.5% and 15.8% reductions in CT and PCTP respectively compared to an optimum relative-timed RCA, iii) 32.9% and 35.9% reductions in CT and PCTP respectively compared to an optimum uniform input-partitioned QDI early output CSLA, iv) 47.5% and 47.2% reductions in CT and PCTP respectively compared to an optimum QDI early output CCLA, v) 14.2% and 27.3% reductions in CT and PCTP respectively compared to an optimum QDI early output BCLARC, and vi) 12.2% and 11.6% reductions in CT and PCTP respectively compared to an optimum QDI early output hybrid BCLARC-RCA. The adders were implemented using a 32/28nm CMOS technology.
Submitted 22 March, 2019;
originally announced March 2019.
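The two figures of merit defined in the abstract above, cycle time (sum of forward and reverse latencies) and power-cycle time product (average power times cycle time), can be written out as a short calculation. This is a minimal sketch: the function names, the units, and every number in the usage lines are hypothetical placeholders, not values from the paper.

```python
def cycle_time(forward_latency_ns: float, reverse_latency_ns: float) -> float:
    """CT: sum of forward and reverse latencies (here in nanoseconds)."""
    return forward_latency_ns + reverse_latency_ns


def power_cycle_time_product(avg_power_uw: float, cycle_time_ns: float) -> float:
    """PCTP: average power times cycle time (microwatts x nanoseconds = femtojoules)."""
    return avg_power_uw * cycle_time_ns


def percent_reduction(baseline: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


if __name__ == "__main__":
    # Hypothetical latencies and power figures, chosen only to exercise the formulas.
    ct_base = cycle_time(6.0, 4.0)
    ct_new = cycle_time(5.0, 3.0)
    pctp_base = power_cycle_time_product(100.0, ct_base)
    pctp_new = power_cycle_time_product(100.0, ct_new)
    print(percent_reduction(ct_base, ct_new), percent_reduction(pctp_base, pctp_new))
```

Because PCTP multiplies power by CT, a design that shortens the cycle time at equal average power reduces both metrics by the same percentage; the differing CT and PCTP percentages quoted in the abstract therefore imply that average power also differs between the compared adders.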
-
Majority and Minority Voted Redundancy for Safety-Critical Applications
Authors:
P Balasubramanian,
D L Maskell,
N E Mastorakis
Abstract:
A new majority and minority voted redundancy (MMR) scheme is proposed that can provide the same degree of fault tolerance as N-modular redundancy (NMR) but with fewer function units and a less sophisticated voting logic. Example NMR and MMR circuits were implemented using a 32/28nm CMOS process and compared. The results show that MMR circuits dissipate less power, occupy less area, and encounter less critical path delay than the corresponding NMR circuits while providing the same degree of fault tolerance. Hence the MMR is a promising alternative to the NMR to efficiently implement high levels of redundancy in safety-critical applications.
Submitted 26 January, 2019;
originally announced January 2019.
-
Asynchronous Early Output Block Carry Lookahead Adder with Improved Quality of Results
Authors:
P Balasubramanian,
D L Maskell,
N E Mastorakis
Abstract:
A new asynchronous early output block carry lookahead adder (BCLA) incorporating redundant carries is proposed. Compared to the best of the existing semi-custom asynchronous carry lookahead adders (CLAs) that feature redundant carries, employ delay-insensitive data encoding, and follow 4-phase handshaking, the proposed BCLA with redundant carries achieves a 13% reduction in forward latency and a 14.8% reduction in cycle time with no area or power penalty. A hybrid variant involving a ripple carry adder (RCA) in the least significant stages, i.e., BCLA-RCA, is also considered; it achieves a further 4% reduction in forward latency and a 2.4% reduction in cycle time compared to the proposed BCLA featuring redundant carries, without area or power penalties.
Submitted 26 January, 2019;
originally announced January 2019.
-
Approximate Ripple Carry and Carry Lookahead Adders - A Comparative Analysis
Authors:
P Balasubramanian,
C Dang,
D L Maskell,
K Prasad
Abstract:
Approximate ripple carry adders (RCAs) and carry lookahead adders (CLAs) are presented and compared with accurate RCAs and CLAs for performing a 32-bit addition. The accurate and approximate RCAs and CLAs are implemented using a 32/28nm CMOS process. Approximations ranging from 4 to 20 bits are considered for the less significant adder bit positions. The simulation results show that approximate RCAs report reductions in the power-delay product (PDP) ranging from 19.5% to 82% compared to the accurate RCA for approximation sizes varying from 4 to 20 bits. Also, approximate CLAs report reductions in PDP ranging from 16.7% to 74.2% compared to the accurate CLA for approximation sizes varying from 4 to 20 bits. On average, for the approximation sizes considered, it is observed that approximate CLAs achieve a 46.5% reduction in PDP compared to the approximate RCAs. Hence, approximate CLAs are preferable to approximate RCAs for the low power implementation of approximate computer arithmetic.
Submitted 15 October, 2017;
originally announced October 2017.
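The idea of approximating only the less significant bit positions can be sketched in software. The abstract above does not state which approximation the authors apply, so the sketch below swaps in a simple lower-part OR scheme purely as an assumption: the low `k` bits are formed by a bitwise OR with no carry into the upper part, and the remaining bits are added exactly. The function name `approx_add` and the operand values are hypothetical.

```python
def approx_add(a: int, b: int, k: int) -> int:
    """Add two non-negative integers, approximating the k least significant bits.

    Assumed scheme (not taken from the paper): the low k bits are a bitwise OR
    of the operands and no carry propagates from them into the exact upper part.
    """
    mask = (1 << k) - 1
    low = (a | b) & mask                  # approximate lower part
    high = ((a >> k) + (b >> k)) << k     # exact upper part
    return high | low


if __name__ == "__main__":
    a, b = 0x1234_5678, 0x0FED_CBA9       # hypothetical 32-bit operands
    for k in (4, 8, 12, 16, 20):          # approximation sizes within the 4 to 20 bit range
        error = (a + b) - approx_add(a, b, k)
        print(k, error)
```

Under this assumed scheme the result never exceeds the exact sum, and the error stays below 2^k, which is why widening the approximated part trades accuracy for the shorter carry chain.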
-
Asynchronous Early Output Section-Carry Based Carry Lookahead Adder with Alias Carry Logic
Authors:
P Balasubramanian,
C Dang,
D L Maskell,
K Prasad
Abstract:
A new asynchronous early output section-carry based carry lookahead adder (SCBCLA) with alias carry output logic is presented in this paper. To evaluate the proposed SCBCLA with alias carry logic and to make a comparison with other CLAs, a 32-bit addition operation is considered. Compared to the weak-indication SCBCLA with alias logic, the proposed early output SCBCLA with alias logic reports a 13% reduction in area without any increases in latency and power dissipation. On the other hand, in comparison with the early output recursive CLA (RCLA), the proposed early output SCBCLA with alias logic reports a 16% reduction in latency while occupying almost the same area and dissipating almost the same average power. All the asynchronous CLAs are quasi-delay-insensitive designs which incorporate the delay-insensitive dual-rail data encoding and adhere to the 4-phase return-to-zero handshaking. The adders were realized and the simulations were performed based on a 32/28nm CMOS process.
Submitted 15 October, 2017;
originally announced October 2017.
-
Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays
Authors:
Abhishek Kumar Jain,
Douglas L. Maskell,
Suhaib A. Fahmy
Abstract:
FPGA vendors have recently started focusing on OpenCL for FPGAs because of its ability to leverage the parallelism inherent to heterogeneous computing platforms. OpenCL allows programs running on a host computer to launch accelerator kernels which can be compiled at run-time for a specific architecture, thus enabling portability. However, the prohibitive compilation times (specifically the FPGA place and route times) are a major stumbling block when using OpenCL tools from FPGA vendors. The long compilation times mean that the tools cannot effectively use just-in-time (JIT) compilation or runtime performance scaling. Coarse-grained overlays represent a possible solution by virtue of their coarse granularity and fast compilation. In this paper, we present a methodology for run-time compilation of OpenCL kernels to a DSP block based coarse-grained overlay, rather than directly to the fine-grained FPGA fabric. The proposed methodology allows JIT compilation and on-demand resource-aware kernel replication to better utilize available overlay resources, raising the abstraction level while reducing compile times significantly. We further demonstrate that this approach can even be used for run-time compilation of OpenCL kernels on the ARM processor of the embedded heterogeneous Zynq device.
Submitted 7 May, 2017;
originally announced May 2017.
-
An Area-Efficient FPGA Overlay using DSP Block based Time-multiplexed Functional Units
Authors:
Xiangwei Li,
Abhishek Jain,
Douglas Maskell,
Suhaib A. Fahmy
Abstract:
Coarse-grained overlay architectures improve FPGA design productivity by providing fast compilation and software-like programmability. Throughput-oriented spatially configurable overlays typically suffer from area overheads due to the requirement of one functional unit (FU) for each compute kernel operation. Hence, these overlays have often been of limited size, supporting only relatively small compute kernels while consuming considerable FPGA resources. This paper examines the possibility of sharing the functional units among kernel operations to reduce area overheads. We propose a linear interconnected array of time-multiplexed FUs as an overlay architecture with reduced instruction storage and interconnect resource requirements, which uses a fully-pipelined, architecture-aware FU design supporting a fast context switching time. The results presented show a reduction of up to 85% in FPGA resource requirements compared to existing throughput-oriented overlay architectures, with an operating frequency that approaches the theoretical limit for the FPGA device.
Submitted 21 June, 2016;
originally announced June 2016.