Search | arXiv e-print repository

doi 10.1109/ISVLSI61997.2024.00080

Scaling Analog Photonic Accelerators for Byte-Size, Integer General Matrix Multiply (GEMM) Kernels

Authors: Oluwaseun Adewunmi Alo, Sairam Sri Vatsavai, Ishan Thakkar

Abstract: Deep Neural Networks (DNNs) predominantly rely on General Matrix Multiply (GEMM) kernels, which are often accelerated using specialized hardware architectures. Recently, analog photonic GEMM accelerators have emerged as a promising alternative, offering vastly superior speed and energy efficiency compared to traditional electronic accelerators. However, these photonic cannot support wider than 4-b… ▽ More Deep Neural Networks (DNNs) predominantly rely on General Matrix Multiply (GEMM) kernels, which are often accelerated using specialized hardware architectures. Recently, analog photonic GEMM accelerators have emerged as a promising alternative, offering vastly superior speed and energy efficiency compared to traditional electronic accelerators. However, these photonic cannot support wider than 4-bit integer operands due to their inherent trade-offs between analog dynamic range and parallelism. This is often inadequate for DNN training as at least 8-bit wide operands are deemed necessary to prevent significant accuracy drops. To address these limitations, we introduce a scalable photonic GEMM accelerator named SPOGA. SPOGA utilizes enhanced features such as analog summation of homodyne optical signals and in-transduction positional weighting of operands. By employing an extended optical-analog dataflow that minimizes overheads associated with bit-sliced integer arithmetic, SPOGA supports byte-size integer GEMM kernels, achieving significant improvements in throughput, latency, and energy efficiency. Specifically, SPOGA demonstrates up to 14.4$\times$, 2$\times$, and 28.5$\times$ improvements in frames-per-second (FPS), FPS/Watt, and FPS/Watt/mm$^2$ respectively, compared to existing state-of-the-art photonic solutions. △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: Presented at the IEEE ISVLSI 2024

arXiv:2402.11047 [pdf, other]

A Low-Dissipation and Scalable GEMM Accelerator with Silicon Nitride Photonics

Authors: Venkata Sai Praneeth Karempudi, Sairam Sri Vatsavai, Ishan Thakkar, Oluwaseun Adewunmi Alo, Jeffrey Todd Hastings, Justin Scott Woods

Abstract: Over the past few years, several microring resonator (MRR)-based analog photonic architectures have been proposed to accelerate general matrix-matrix multiplications (GEMMs), which are found in abundance in deep learning workloads.These architectures have dramatically grown in popularity because they offer exceptional throughput and energy efficiency compared to their electronic counterparts. Howe… ▽ More Over the past few years, several microring resonator (MRR)-based analog photonic architectures have been proposed to accelerate general matrix-matrix multiplications (GEMMs), which are found in abundance in deep learning workloads.These architectures have dramatically grown in popularity because they offer exceptional throughput and energy efficiency compared to their electronic counterparts. However, such architectures, due to their traditional realization based on the silicon-on-insulator (SOI) material platform, face two shortcomings. First, the high-index contrast of the SOI platform incurs high scattering losses, which mandates the provisioning of high optical input power.Second, SOI waveguides are susceptible to two-photon absorption, which can incur substantial optical signal losses at moderate-to-high signal fan-in. These shortcomings have severely detrimental effects on the achievable parallelism, throughput, and energy efficiency of SOI MRR-based GEMM accelerators. To address these shortcomings, we present a novel Silicon Nitride (SiN)-Based Photonic GEMM Accelerator called SiNPhAR. SiNPhAR architecture employs SiN-based active and passive devices to implement analog GEMM functions. Since the SiN material exhibits lower index contrast and no TPA, the optical signal losses in our SiNPhAR architecture are very low. This advantage significantly enhances the achievable processing parallelism, throughput, and energy efficiency of SiNPhAR architecture, compared to SOI-based photonic GEMM accelerators from prior work. We quantify and compare these benefits of SiNPhAR architecture via our cross-layer evaluation for a benchmark workload comprising four modern deep neural network models. From the system-level performance analysis, SiNPhAR demonstrates at least 1.7x better throughput FPS while consuming at least 2.8x better energy efficiency (FPS/W) than prior SOI-based GEMM accelerators. △ Less

Submitted 16 February, 2024; originally announced February 2024.

Comments: To Appear at ISQED 2024

arXiv:2402.03247 [pdf, other]

HEANA: A Hybrid Time-Amplitude Analog Optical Accelerator with Flexible Dataflows for Energy-Efficient CNN Inference

Authors: Sairam Sri Vatsavai, Venkata Sai Praneeth Karempudi, Ishan Thakkar

Abstract: Several photonic microring resonators (MRRs) based analog accelerators have been proposed to accelerate the inference of integer-quantized CNNs with remarkably higher throughput and energy efficiency compared to their electronic counterparts. However, the existing analog photonic accelerators suffer from three shortcomings: (i) severe hampering of wavelength parallelism due to various crosstalk ef… ▽ More Several photonic microring resonators (MRRs) based analog accelerators have been proposed to accelerate the inference of integer-quantized CNNs with remarkably higher throughput and energy efficiency compared to their electronic counterparts. However, the existing analog photonic accelerators suffer from three shortcomings: (i) severe hampering of wavelength parallelism due to various crosstalk effects, (ii) inflexibility of supporting various dataflows other than the weight-stationary dataflow, and (iii) failure in fully leveraging the ability of photodetectors to perform in-situ accumulations. These shortcomings collectively hamper the performance and energy efficiency of prior accelerators. To tackle these shortcomings, we present a novel Hybrid timE Amplitude aNalog optical Accelerator, called HEANA. HEANA employs hybrid time-amplitude analog optical multipliers (TAOMs) that increase the flexibility of HEANA to support multiple dataflows. A spectrally hitless arrangement of TAOMs significantly reduces the crosstalk effects, thereby increasing the wavelength parallelism in HEANA. Moreover, HEANA employs our invented balanced photo-charge accumulators (BPCAs) that enable buffer-less, in-situ, temporal accumulations to eliminate the need to use reduction networks in HEANA, relieving it from related latency and energy overheads. Our evaluation for the inference of four modern CNNs indicates that HEANA provides improvements of atleast 66x and 84x in frames-per-second (FPS) and FPS/W (energy-efficiency), respectively, for equal-area comparisons, on gmean over two MRR-based analog CNN accelerators from prior work. △ Less

Submitted 9 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: The paper is under review at ACM TODAES

arXiv:2402.03149 [pdf, other]

A Comparative Analysis of Microrings Based Incoherent Photonic GEMM Accelerators

Authors: Sairam Sri Vatsavai, Venkata Sai Praneeth Karempudi, Oluwaseun Adewunmi Alo, Ishan Thakkar

Abstract: Several microring resonator (MRR) based analog photonic architectures have been proposed to accelerate general matrix-matrix multiplications (GEMMs) in deep neural networks with exceptional throughput and energy efficiency. To implement GEMM functions, these MRR-based architectures, in general, manipulate optical signals in five different ways: (i) Splitting (copying) of multiple optical signals t… ▽ More Several microring resonator (MRR) based analog photonic architectures have been proposed to accelerate general matrix-matrix multiplications (GEMMs) in deep neural networks with exceptional throughput and energy efficiency. To implement GEMM functions, these MRR-based architectures, in general, manipulate optical signals in five different ways: (i) Splitting (copying) of multiple optical signals to achieve a certain fan-out, (ii) Aggregation (multiplexing) of multiple optical signals to achieve a certain fan-in, (iii) Modulation of optical signals to imprint input values onto analog signal amplitude, (iv) Weighting of modulated optical signals to achieve analog input-weight multiplication, (v) Summation of optical signals. The MRR-based GEMM accelerators undertake the first four ways of signal manipulation in an arbitrary order ignoring the possible impact of the order of these manipulations on their performance. In this paper, we conduct a detailed analysis of accelerator organizations with three different orders of these manipulations: (1) Modulation-Aggregation-Splitting-Weighting (MASW), (2) Aggregation-Splitting-Modulation-Weighting (ASMW), and (3) Splitting-Modulation-Weighting-Aggregation (SMWA). We show that these organizations affect the crosstalk noise and optical signal losses in different magnitudes, which renders these organizations with different levels of processing parallelism at the circuit level, and different magnitudes of throughput and energy-area efficiency at the system level. Our evaluation results for four CNN models show that SMWA organization achieves up to 4.4$\times$, 5$\times$, and 5.2$\times$ better throughput, energy efficiency, and area-energy efficiency, respectively, compared to ASMW and MASW organizations on average. △ Less

Submitted 16 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: To Appear at ISQED 2024

arXiv:2306.07241 [pdf, other]

An Analysis of Various Design Pathways Towards Multi-Terabit Photonic On-Interposer Interconnects

Authors: Venkata Sai Praneeth Karempudi, Janibul Bashir, Ishan G Thakkar

Abstract: In the wake of dwindling Moore's Law, to address the rapidly increasing complexity and cost of fabricating large-scale, monolithic systems-on-chip (SoCs), the industry has adopted dis-aggregation as a solution, wherein a large monolithic SoC is partitioned into multiple smaller chiplets that are then assembled into a large system-in-package (SiP) using advanced packaging substrates such as silicon… ▽ More In the wake of dwindling Moore's Law, to address the rapidly increasing complexity and cost of fabricating large-scale, monolithic systems-on-chip (SoCs), the industry has adopted dis-aggregation as a solution, wherein a large monolithic SoC is partitioned into multiple smaller chiplets that are then assembled into a large system-in-package (SiP) using advanced packaging substrates such as silicon interposer. For such interposer-based SiPs, there is a push to realize on-interposer inter-chiplet communication bandwidth of multi-Tb/s and end-to-end communication latency of no more than 10ns. This push comes as the natural progression from some recent prior works on SiP design, and is driven by the proliferating bandwidth demand of modern data-intensive workloads. To meet this bandwidth and latency goal, prior works have focused on a potential solution of using the silicon photonic interposer (SiPhI) for integrating and interconnecting a large number of chiplets into an SiP. Despite the early promise, the existing designs of on-SiPhI interconnects still have to evolve by leaps and bounds to meet the goal of multi-Tb/s bandwidth. However, the possible design pathways, upon which such an evolution can be achieved, have not been explored in any prior works yet. In this paper, we have identified several design pathways that can help evolve on-SiPhI interconnects to achieve multi-Tb/s aggregate bandwidth. We perform an extensive link-level and system-level analysis in which we explore these design pathways in isolation and in different combinations of each other. From our link-level analysis, we have observed that the design pathways that simultaneously enhance the spectral range and optical power budget available for wavelength multiplexing can render aggregate bandwidth of up to 4Tb/s per on-SiPhI link. △ Less

Submitted 12 June, 2023; originally announced June 2023.

Comments: Under review (ACM JETC)

arXiv:2306.07238 [pdf, other]

A Silicon Nitride Microring Modulator for High-Performance Photonic Integrated Circuits

Authors: Venkata Sai Praneeth Karempudi, Ishan G Thakkar, Jeffrey Todd Hastings

Abstract: The use of the Silicon-on-Insulator (SOI) platform has been prominent for realizing CMOS-compatible, high-performance photonic integrated circuits (PICs). But in recent years, the silicon-nitride-on-silicon-dioxide (SiN-on-SiO$_2$) platform has garnered increasing interest as an alternative, because of its several beneficial properties over the SOI platform, such as low optical losses, high thermo… ▽ More The use of the Silicon-on-Insulator (SOI) platform has been prominent for realizing CMOS-compatible, high-performance photonic integrated circuits (PICs). But in recent years, the silicon-nitride-on-silicon-dioxide (SiN-on-SiO$_2$) platform has garnered increasing interest as an alternative, because of its several beneficial properties over the SOI platform, such as low optical losses, high thermo-optic stability, broader wavelength transparency range, and high tolerance to fabrication-process variations. However, SiN-on-SiO$_2$ based active devices, such as modulators, are scarce and lack in desired performance due to the absence of free-carrier-based activity in the SiN material and the complexity of integrating other active materials with SiN-on-SiO$_2$ platform. This shortcoming hinders the SiN-on-SiO$_2$ platform for realizing active PICs. To address this shortcoming, in this article, we demonstrate a SiN-on-SiO$_2$ microring resonator (MRR) based active modulator. Our designed MRR modulator employs an Indium-Tin-Oxide (ITO)-SiO$_2$-ITO thin-film stack as the active upper cladding and leverages the free-carrier assisted, high-amplitude refractive index change in the ITO films to affect a large electro-refractive optical modulation in the device. Based on the electrostatic, transient, and finite difference time domain (FDTD) simulations, conducted using photonics foundry-validated tools, we show that our modulator achieves 450 pm/V resonance modulation efficiency, $\sim$46.2 GHz 3-dB modulation bandwidth, 18 nm free-spectral range (FSR), 0.24 dB insertion loss, and 8.2 dB extinction ratio for optical on-off-keying (OOK) modulation at 30 Gb/s. △ Less

Submitted 12 June, 2023; originally announced June 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2212.06326

arXiv:2304.07608 [pdf, other]

High-Speed and Energy-Efficient Non-Binary Computing with Polymorphic Electro-Optic Circuits and Architectures

Authors: Ishan Thakkar, Sairam Sri Vatsavai, Venkata Sai Praneeth Karempudi

Abstract: In this paper, we present microring resonator (MRR) based polymorphic E-O circuits and architectures that can be employed for high-speed and energy-efficient non-binary reconfigurable computing. Our polymorphic E-O circuits can be dynamically programmed to implement different logic and arithmetic functions at different times. They can provide compactness and polymorphism to consequently improve op… ▽ More In this paper, we present microring resonator (MRR) based polymorphic E-O circuits and architectures that can be employed for high-speed and energy-efficient non-binary reconfigurable computing. Our polymorphic E-O circuits can be dynamically programmed to implement different logic and arithmetic functions at different times. They can provide compactness and polymorphism to consequently improve operand handling, reduce idle time, and increase amortization of area and static power overheads. When combined with flexible photodetectors with the innate ability to accumulate a high number of optical pulses in situ, our circuits can support energy-efficient processing of data in non-binary formats such as stochastic/unary and high-dimensional reservoir formats. Furthermore, our polymorphic E-O circuits enable configurable E-O computing accelerator architectures for processing binarized and integer quantized convolutional neural networks (CNNs). We compare our designed polymorphic E-O circuits and architectures to several circuits and architectures from prior works in terms of area, latency, and energy consumption. △ Less

Submitted 15 April, 2023; originally announced April 2023.

arXiv:2302.08324 [pdf, other]

A Bit-Parallel Deterministic Stochastic Multiplier

Authors: Sairam Sri Vatsavai, Ishan Thakkar

Abstract: This paper presents a novel bit-parallel deterministic stochastic multiplier, which improves the area-energy-latency product by up to 10.6$\times$10$^4$, while improving the computational error by 32.2\%, compared to three prior stochastic multipliers. This paper presents a novel bit-parallel deterministic stochastic multiplier, which improves the area-energy-latency product by up to 10.6$\times$10$^4$, while improving the computational error by 32.2\%, compared to three prior stochastic multipliers. △ Less

Submitted 14 February, 2023; originally announced February 2023.

Comments: To Appear at IEEE ISQED 2023

arXiv:2302.07746 [pdf, other]

AGNI: In-Situ, Iso-Latency Stochastic-to-Binary Number Conversion for In-DRAM Deep Learning

Authors: Supreeth Mysore Shivanandamurthy, Sairam Sri Vatsavai, Ishan Thakkar, Sayed Ahmad Salehi

Abstract: Recent years have seen a rapid increase in research activity in the field of DRAM-based Processing-In-Memory (PIM) accelerators, where the analog computing capability of DRAM is employed by minimally changing the inherent structure of DRAM peripherals to accelerate various data-centric applications. Several DRAM-based PIM accelerators for Convolutional Neural Networks (CNNs) have also been reporte… ▽ More Recent years have seen a rapid increase in research activity in the field of DRAM-based Processing-In-Memory (PIM) accelerators, where the analog computing capability of DRAM is employed by minimally changing the inherent structure of DRAM peripherals to accelerate various data-centric applications. Several DRAM-based PIM accelerators for Convolutional Neural Networks (CNNs) have also been reported. Among these, the accelerators leveraging in-DRAM stochastic arithmetic have shown manifold improvements in processing latency and throughput, due to the ability of stochastic arithmetic to convert multiplications into simple bit-wise logical AND operations. However,the use of in-DRAM stochastic arithmetic for CNN acceleration requires frequent stochastic to binary number conversions. For that, prior works employ full adder-based or serial counter based in-DRAM circuits. These circuits consume large area and incur long latency. Their in-DRAM implementations also require heavy modifications in DRAM peripherals, which significantly diminishes the benefits of using stochastic arithmetic in these accelerators. To address these shortcomings, this paper presents a new substrate for in-DRAM stochastic-to-binary number conversion called AGNI. AGNI makes minor modifications in DRAM peripherals using pass transistors, capacitors, encoders, and charge pumps, and re-purposes the sense amplifiers as voltage comparators, to enable in-situ binary conversion of input statistic operands of different sizes with iso latency. △ Less

Submitted 11 February, 2023; originally announced February 2023.

Comments: (Preprint) To Appear at ISQED 2023

arXiv:2302.07036 [pdf, other]

SCONNA: A Stochastic Computing Based Optical Accelerator for Ultra-Fast, Energy-Efficient Inference of Integer-Quantized CNNs

Authors: Sairam Sri Vatsavai, Venkata Sai Praneeth Karempudi, Ishan Thakkar, Ahmad Salehi, Todd Hastings

Abstract: The acceleration of a CNN inference task uses convolution operations that are typically transformed into vector-dot-product (VDP) operations. Several photonic microring resonators (MRRs) based hardware architectures have been proposed to accelerate integer-quantized CNNs with remarkably higher throughput and energy efficiency compared to their electronic counterparts. However, the existing photoni… ▽ More The acceleration of a CNN inference task uses convolution operations that are typically transformed into vector-dot-product (VDP) operations. Several photonic microring resonators (MRRs) based hardware architectures have been proposed to accelerate integer-quantized CNNs with remarkably higher throughput and energy efficiency compared to their electronic counterparts. However, the existing photonic MRR-based analog accelerators exhibit a very strong trade-off between the achievable input/weight precision and VDP operation size, which severely restricts their achievable VDP operation size for the quantized input/weight precision of 4 bits and higher. The restricted VDP operation size ultimately suppresses computing throughput to severely diminish the achievable performance benefits. To address this shortcoming, we for the first time present a merger of stochastic computing and MRR-based CNN accelerators. To leverage the innate precision flexibility of stochastic computing, we invent an MRR-based optical stochastic multiplier (OSM). We employ multiple OSMs in a cascaded manner using dense wavelength division multiplexing, to forge a novel Stochastic Computing based Optical Neural Network Accelerator (SCONNA). SCONNA achieves significantly high throughput and energy efficiency for accelerating inferences of high-precision quantized CNNs. Our evaluation for the inference of four modern CNNs at 8-bit input/weight precision indicates that SCONNA provides improvements of up to 66.5x, 90x, and 91x in frames-per-second (FPS), FPS/W and FPS/W/mm2, respectively, on average over two photonic MRR-based analog CNN accelerators from prior work, with Top-1 accuracy drop of only up to 0.4% for large CNNs and up to 1.5% for small CNNs. We developed a transaction-level, event-driven python-based simulator for the evaluation of SCONNA and other accelerators (https://github.com/uky-UCAT/SC_ONN_SIM.git). △ Less

Submitted 14 February, 2023; originally announced February 2023.

Comments: To Appear at IPDPS 2023

arXiv:2302.06405 [pdf, other]

An Optical XNOR-Bitcount Based Accelerator for Efficient Inference of Binary Neural Networks

Authors: Sairam Sri Vatsavai, Venkata Sai Praneeth Karempudi, Ishan Thakkar

Abstract: Binary Neural Networks (BNNs) are increasingly preferred over full-precision Convolutional Neural Networks(CNNs) to reduce the memory and computational requirements of inference processing with minimal accuracy drop. BNNs convert CNN model parameters to 1-bit precision, allowing inference of BNNs to be processed with simple XNOR and bitcount operations. This makes BNNs amenable to hardware acceler… ▽ More Binary Neural Networks (BNNs) are increasingly preferred over full-precision Convolutional Neural Networks(CNNs) to reduce the memory and computational requirements of inference processing with minimal accuracy drop. BNNs convert CNN model parameters to 1-bit precision, allowing inference of BNNs to be processed with simple XNOR and bitcount operations. This makes BNNs amenable to hardware acceleration. Several photonic integrated circuits (PICs) based BNN accelerators have been proposed. Although these accelerators provide remarkably higher throughput and energy efficiency than their electronic counterparts, the utilized XNOR and bitcount circuits in these accelerators need to be further enhanced to improve their area, energy efficiency, and throughput. This paper aims to fulfill this need. For that, we invent a single-MRR-based optical XNOR gate (OXG). Moreover, we present a novel design of bitcount circuit which we refer to as Photo-Charge Accumulator (PCA). We employ multiple OXGs in a cascaded manner using dense wavelength division multiplexing (DWDM) and connect them to the PCA, to forge a novel Optical XNOR-Bitcount based Binary Neural Network Accelerator (OXBNN). Our evaluation for the inference of four modern BNNs indicates that OXBNN provides improvements of up to 62x and 7.6x in frames-per-second (FPS) and FPS/W (energy efficiency), respectively, on geometric mean over two PIC-based BNN accelerators from prior work. We developed a transaction-level, event-driven python-based simulator for evaluation of accelerators (https://github.com/uky-UCAT/B_ONN_SIM). △ Less

Submitted 19 March, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

Comments: To Appear at IEEE ISQED 2023

arXiv:2301.13626 [pdf, other]

A Polymorphic Electro-Optic Logic Gate for High-Speed Reconfigurable Computing Circuits

Authors: Venkata Sai Praneeth Karempudi, Sairam Sri Vatsavai, Ishan Thakkar, Jeffrey Todd Hastings

Abstract: In the wake of dwindling Moore's law, integrated electro-optic (E-O) computing circuits have shown revolutionary potential to provide progressively faster and more efficient hardware for computing. The E-O circuits for computing from the literature can operate with minimal latency at high bit-rates. However, they face shortcomings due to their operand handling complexity, non-amortizable high area… ▽ More In the wake of dwindling Moore's law, integrated electro-optic (E-O) computing circuits have shown revolutionary potential to provide progressively faster and more efficient hardware for computing. The E-O circuits for computing from the literature can operate with minimal latency at high bit-rates. However, they face shortcomings due to their operand handling complexity, non-amortizable high area and static power overheads, and general unsuitability for large-scale integration on reticle-limited chips. To alleviate these shortcomings, in this paper, we present a microring resonator (MRR) based polymorphic E-O logic gate (MRR-PEOLG) that can be dynamically programmed to implement different logic functions at different times. Our MRR-PEOLG can provide compactness and polymorphism to E-O circuits, to consequently improve their operand handling and amortization of area and static power overheads. We model our MRR-PEOLG using photonics foundry-validated tools to perform frequency and time-domain analysis of its polymorphic logic functions. Our evaluation shows that the use of our MRR-PEOLG in two E-O circuits from prior works can reduce their area-energy-delay product by up to 82.6$\times$. A tutorial on the modeling and simulation of our MRR-PEOLG, along with related codes and files, is available on https://github.com/uky-UCAT/MRR-PEOLG. △ Less

Submitted 30 January, 2023; originally announced January 2023.

arXiv:2212.06326 [pdf, other]

A Silicon Nitride Microring Based High-Speed, Tuning-Efficient, Electro-Refractive Modulator

Authors: Venkata Sai Praneeth Karempudi, Ishan G Thakkar, Jeffrey Todd Hastings

Abstract: The use of the Silicon-on-Insulator (SOI) platform has been prominent for realizing CMOS-compatible, high-performance photonic integrated circuits (PICs). But in recent years, the silicon-nitride-on-silicon-dioxide (SiN-on-SiO$_2$) platform has garnered increasing interest as an alternative to the SOI platform for realizing high-performance PICs. This is because of its several beneficial propertie… ▽ More The use of the Silicon-on-Insulator (SOI) platform has been prominent for realizing CMOS-compatible, high-performance photonic integrated circuits (PICs). But in recent years, the silicon-nitride-on-silicon-dioxide (SiN-on-SiO$_2$) platform has garnered increasing interest as an alternative to the SOI platform for realizing high-performance PICs. This is because of its several beneficial properties over the SOI platform, such as low optical losses, high thermo-optic stability, broader wavelength transparency range, and high tolerance to fabrication-process variations. However, SiN-on-SiO$_2$ based active devices such as modulators are scarce and lack in desired performance, due to the absence of free-carrier based activity in the SiN material and the complexity of integrating other active materials with SiN-on-SiO$_2$ platform. This shortcoming hinders the SiN-on-SiO$_2$ platform for realizing active PICs. To address this shortcoming, we demonstrate a SiN-on-SiO$_2$ microring resonator (MRR) based active modulator in this article. Our designed MRR modulator employs an Indium-Tin-Oxide (ITO)-SiN-ITO thin-film stack, in which the ITO thin films act as the upper and lower claddings of the SiN MRR. The ITO-SiN-ITO thin-film stack leverages the free-carrier assisted, high-amplitude refractive index change in the ITO films to effect a large electro-refractive optical modulation in the device. Based on the electrostatic, transient, and finite difference time domain (FDTD) simulations, conducted using photonics foundry-validated tools, we show that our modulator achieves 280 pm/V resonance modulation efficiency, 67.8 GHz 3-dB modulation bandwidth, $\sim$19 nm free-spectral range (FSR), $\sim$0.23 dB insertion loss, and 10.31 dB extinction ratio for optical on-off-keying (OOK) modulation at 30 Gb/s. △ Less

Submitted 12 December, 2022; originally announced December 2022.

arXiv:2207.05278 [pdf, other]

Photonic Reconfigurable Accelerators for Efficient Inference of CNNs with Mixed-Sized Tensors

Authors: Sairam Sri Vatsavai, Ishan G Thakkar

Abstract: Photonic Microring Resonator (MRR) based hardware accelerators have been shown to provide disruptive speedup and energy-efficiency improvements for processing deep Convolutional Neural Networks (CNNs). However, previous MRR-based CNN accelerators fail to provide efficient adaptability for CNNs with mixed-sized tensors. One example of such CNNs is depthwise separable CNNs. Performing inferences of… ▽ More Photonic Microring Resonator (MRR) based hardware accelerators have been shown to provide disruptive speedup and energy-efficiency improvements for processing deep Convolutional Neural Networks (CNNs). However, previous MRR-based CNN accelerators fail to provide efficient adaptability for CNNs with mixed-sized tensors. One example of such CNNs is depthwise separable CNNs. Performing inferences of CNNs with mixed-sized tensors on such inflexible accelerators often leads to low hardware utilization, which diminishes the achievable performance and energy efficiency from the accelerators. In this paper, we present a novel way of introducing reconfigurability in the MRR-based CNN accelerators, to enable dynamic maximization of the size compatibility between the accelerator hardware components and the CNN tensors that are processed using the hardware components. We classify the state-of-the-art MRR-based CNN accelerators from prior works into two categories, based on the layout and relative placements of the utilized hardware components in the accelerators. We then use our method to introduce reconfigurability in accelerators from these two classes, to consequently improve their parallelism, the flexibility of efficiently map** tensors of different sizes, speed, and overall energy efficiency. We evaluate our reconfigurable accelerators against three prior works for the area proportionate outlook (equal hardware area for all accelerators). Our evaluation for the inference of four modern CNNs indicates that our designed reconfigurable CNN accelerators provide improvements of up to 1.8x in Frames-Per-Second (FPS) and up to 1.5x in FPS/W, compared to an MRR-based accelerator from prior work. △ Less

Submitted 11 July, 2022; originally announced July 2022.

Comments: Paper accepted at CASES (ESWEEK) 2022

arXiv:2110.06105 [pdf]

Photonic Networks-on-Chip Employing Multilevel Signaling: A Cross-Layer Comparative Study

Authors: Venkata Sai Praneeth Karempudi, Febin Sunny, Ishan G Thakkar, Sai Vineel Reddy Chittamuru, Mahdi Nikdast, Sudeep Pasricha

Abstract: Photonic network-on-chip (PNoC) architectures employ photonic links with dense wavelength-division multiplexing (DWDM) to enable high throughput on-chip transfers. Unfortunately, increasing the DWDM degree (i.e., using a larger number of wavelengths) to achieve higher aggregated datarate in photonic links, and hence higher throughput in PNoCs, requires sophisticated and costly laser sources along… ▽ More Photonic network-on-chip (PNoC) architectures employ photonic links with dense wavelength-division multiplexing (DWDM) to enable high throughput on-chip transfers. Unfortunately, increasing the DWDM degree (i.e., using a larger number of wavelengths) to achieve higher aggregated datarate in photonic links, and hence higher throughput in PNoCs, requires sophisticated and costly laser sources along with extra photonic hardware. This extra hardware can introduce undesired noise to the photonic link and increase the bit-error-rate (BER), power, and area consumption of PNoCs. To mitigate these issues, the use of 4-pulse amplitude modulation (4-PAM) signaling, instead of the conventional on-off keying (OOK) signaling, can halve the wavelength signals utilized in photonic links for achieving the target aggregate datarate while reducing the overhead of crosstalk noise, BER, and photonic hardware. There are various designs of 4- PAM modulators reported in the literature. For example, the signal superposition (SS), electrical digital-to-analog converter (EDAC), and optical digital-to-analog converter (ODAC) based designs of 4-PAM modulators have been reports. However, it is yet to be explored how these SS, EDAC, and ODAC based 4-PAM modulators can be utilized to design DWDM-based photonic links and PNoC architectures. In this paper, we provide an extensive link-level and system-level of the SS, EDAC, and ODAC types of 4-PAM modulators from prior work with regards to their applicability and utilization overheads. From our link-level and PNoC-level evaluation, we have observed that the 4-PAM EDAC based variants of photonic links and PNoCs exhibit better performance and energy-efficiency compared to the OOK, 4-PAM SS, and 4-PAM ODAC based links and PNoCs. △ Less

Submitted 12 October, 2021; originally announced October 2021.

Comments: Submitted and Accepted to publish in ACM Journal on Emerging Technologies in Computing Systems

arXiv:2106.09308 [pdf]

Characterization and Mitigation of Electromigration Effects in TSV-Based Power Delivery Network Enabled 3D-Stacked DRAMs

Authors: Bobby Bose, Ishan Thakkar

Abstract: With 3D-stacked DRAM architectures becoming more prevalent, it has become important to find ways to characterize and mitigate the adverse effects that can hinder their inherent access parallelism and throughput. One example of such adversities is the electromigration (EM) effects in the through-silicon vias (TSVs) of the power delivery network (PDN) of 3D-stacked DRAM architectures. Several prior… ▽ More With 3D-stacked DRAM architectures becoming more prevalent, it has become important to find ways to characterize and mitigate the adverse effects that can hinder their inherent access parallelism and throughput. One example of such adversities is the electromigration (EM) effects in the through-silicon vias (TSVs) of the power delivery network (PDN) of 3D-stacked DRAM architectures. Several prior works have addressed the effects of EM in TSVs of 3D integrated circuits. However, no prior work has addressed the effects of EM in the PDN TSVs on the performance and lifetime of 3D-stacked DRAMs. In this paper, we characterize the effects of EM in PDN TSVs on a Hybrid Memory Cube (HMC) architecture employing the conventional PDN design with clustered layout of power and ground TSVs. We then present a new PDN design with a distributed layout of power and ground TSVs and show that it can mitigate the adverse effects of EM on the HMC architecture performance without requiring additional power and ground pins. Our benchmark-driven simulation-based analysis shows that compared to the clustered PDN layout, our proposed distributed PDN layout improves the EM-affected lifetime of the HMC architecture by up to 10 years. During this useful lifetime, the HMC architecture yields up to 1.51 times less energy-delay product (EDP). △ Less

Submitted 17 June, 2021; originally announced June 2021.

Comments: To appear at IEEE/ACM GLSVLSI 2021

arXiv:2105.12781 [pdf]

ATRIA: A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-DRAM CNN Processing

Authors: Supreeth Mysore Shivanandamurthy, Ishan. G. Thakkar, Sayed Ahmad Salehi

Abstract: With the rapidly growing use of Convolutional Neural Networks (CNNs) in real-world applications related to machine learning and Artificial Intelligence (AI), several hardware accelerator designs for CNN inference and training have been proposed recently. In this paper, we present ATRIA, a novel bit-pArallel sTochastic aRithmetic based In-DRAM Accelerator for energy-efficient and high-speed inferen… ▽ More With the rapidly growing use of Convolutional Neural Networks (CNNs) in real-world applications related to machine learning and Artificial Intelligence (AI), several hardware accelerator designs for CNN inference and training have been proposed recently. In this paper, we present ATRIA, a novel bit-pArallel sTochastic aRithmetic based In-DRAM Accelerator for energy-efficient and high-speed inference of CNNs. ATRIA employs light-weight modifications in DRAM cell arrays to implement bit-parallel stochastic arithmetic based acceleration of multiply-accumulate (MAC) operations inside DRAM. ATRIA significantly improves the latency, throughput, and efficiency of processing CNN inferences by performing 16 MAC operations in only five consecutive memory operation cycles. We mapped the inference tasks of four benchmark CNNs on ATRIA to compare its performance with five state-of-the-art in-DRAM CNN accelerators from prior work. The results of our analysis show that ATRIA exhibits only 3.5% drop in CNN inference accuracy and still achieves improvements of up to 3.2x in frames-per-second (FPS) and up to 10x in efficiency (FPS/W/mm2), compared to the best-performing in-DRAM accelerator from prior work. △ Less

Submitted 26 May, 2021; originally announced May 2021.

Comments: Preprint accepted in ISVLSI 2021

arXiv:2103.08828 [pdf]

ARXON: A Framework for Approximate Communication over Photonic Networks-on-Chip

Authors: Febin Sunny, Asif Mirza, Ishan Thakkar, Mahdi Nikdast, Sudeep Pasricha

Abstract: The approximate computing paradigm advocates for relaxing accuracy goals in applications to improve energy-efficiency and performance. Recently, this paradigm has been explored to improve the energy-efficiency of silicon photonic networks-on-chip (PNoCs). Silicon photonic interconnects suffer from high power dissipation because of laser sources, which generate carrier wavelengths, and tuning power… ▽ More The approximate computing paradigm advocates for relaxing accuracy goals in applications to improve energy-efficiency and performance. Recently, this paradigm has been explored to improve the energy-efficiency of silicon photonic networks-on-chip (PNoCs). Silicon photonic interconnects suffer from high power dissipation because of laser sources, which generate carrier wavelengths, and tuning power required for regulating photonic devices under different uncertainties. In this paper, we propose a framework called ARXON to reduce such power dissipation overhead by enabling intelligent and aggressive approximation during communication over silicon photonic links in PNoCs. Our framework reduces laser and tuning-power overhead while intelligently approximating communication, such that application output quality is not distorted beyond an acceptable limit. Simulation results show that our framework can achieve up to 56.4% lower laser power consumption and up to 23.8% better energy-efficiency than the best-known prior work on approximate communication with silicon photonic interconnects and for the same application output quality. △ Less

Submitted 15 March, 2021; originally announced March 2021.

Comments: arXiv admin note: text overlap with arXiv:2002.11289

arXiv:2103.03953 [pdf]

ODIN: A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-Situ Neural Network Processing in Phase Change RAM

Authors: Supreeth Mysore Shivanandamurthy, Ishan. G. Thakkar, Sayed Ahmad Salehi

Abstract: Due to the very rapidly growing use of Artificial Neural Networks (ANNs) in real-world applications related to machine learning and Artificial Intelligence (AI), several hardware accelerator de-signs for ANNs have been proposed recently. In this paper, we present a novel processing-in-memory (PIM) engine called ODIN that employs hybrid binary-stochastic bit-parallel arithmetic in-side phase change… ▽ More Due to the very rapidly growing use of Artificial Neural Networks (ANNs) in real-world applications related to machine learning and Artificial Intelligence (AI), several hardware accelerator de-signs for ANNs have been proposed recently. In this paper, we present a novel processing-in-memory (PIM) engine called ODIN that employs hybrid binary-stochastic bit-parallel arithmetic in-side phase change RAM (PCRAM) to enable a low-overhead in-situ acceleration of all essential ANN functions such as multiply-accumulate (MAC), nonlinear activation, and pooling. We mapped four ANN benchmark applications on ODIN to compare its performance with a conventional processor-centric design and a crossbar-based in-situ ANN accelerator from prior work. The results of our analysis for the considered ANN topologies indicate that our ODIN accelerator can be at least 5.8x faster and 23.2x more energy-efficient, and up to 90.8x faster and 1554x more energy-efficient, compared to the crossbar-based in-situ ANN accelerator from prior work. △ Less

Submitted 5 March, 2021; originally announced March 2021.

Comments: 6 pages, 6 Figures, 4 Tables

arXiv:2101.00557 [pdf]

Silicon Photonic Microring Based Chip-Scale Accelerator for Delayed Feedback Reservoir Computing

Authors: Sairam Sri Vatsavai, Ishan Thakkar

Abstract: To perform temporal and sequential machine learning tasks, the use of conventional Recurrent Neural Networks (RNNs) has been dwindling due to the training complexities of RNNs. To this end, accelerators for delayed feedback reservoir computing (DFRC) have attracted attention in lieu of RNNs, due to their simple hardware implementations. A typical implementation of a DFRC accelerator consists of a… ▽ More To perform temporal and sequential machine learning tasks, the use of conventional Recurrent Neural Networks (RNNs) has been dwindling due to the training complexities of RNNs. To this end, accelerators for delayed feedback reservoir computing (DFRC) have attracted attention in lieu of RNNs, due to their simple hardware implementations. A typical implementation of a DFRC accelerator consists of a delay loop and a single nonlinear neuron, together acting as multiple virtual nodes for computing. In prior work, photonic DFRC accelerators have shown an undisputed advantage of fast computation over their electronic counterparts. In this paper, we propose a more energy-efficient chip-scale DFRC accelerator that employs a silicon photonic microring (MR) based nonlinear neuron along with on-chip photonic waveguides-based delayed feedback loop. Our evaluations show that, compared to a well-known photonic DFRC accelerator from prior work, our proposed MR-based DFRC accelerator achieves 35% and 98.7% lower normalized root mean square error (NRMSE), respectively, for the prediction tasks of NARMA10 and Santa Fe time series. In addition, our MR-based DFRC accelerator achieves 58.8% lower symbol error rate (SER) for the Non-Linear Channel Equalization task. Moreover, our MR-based DFRC accelerator has 98% and 93% faster training time, respectively, compared to an electronic and a photonic DFRC accelerators from prior work. △ Less

Submitted 2 January, 2021; originally announced January 2021.

Comments: Paper accepted at VLSID 2021

arXiv:2008.11367 [pdf]

Mitigating the Latency-Area Tradeoffs for DRAM Design with Coarse-Grained Monolithic 3D (M3D) Integration

Authors: Chao-Hsuan Huang, Ishan G Thakkar

Abstract: Over the years, the DRAM latency has not scaled proportionally with its density due to the cost-centric mindset of the DRAM industry. Prior work has shown that this shortcoming can be overcome by reducing the critical length of DRAM access path. However, doing so decreases DRAM area-efficiency, exacerbating the latency-area tradeoffs for DRAM design. In this paper, we show that reorganizing DRAM c… ▽ More Over the years, the DRAM latency has not scaled proportionally with its density due to the cost-centric mindset of the DRAM industry. Prior work has shown that this shortcoming can be overcome by reducing the critical length of DRAM access path. However, doing so decreases DRAM area-efficiency, exacerbating the latency-area tradeoffs for DRAM design. In this paper, we show that reorganizing DRAM cell-arrays using the emerging monolithic 3D (M3D) integration technology can mitigate these fundamental latency-area tradeoffs. Based on our evaluation results for PARSEC benchmarks, our designed M3D DRAM cell-array organizations can yield up to 9.56% less latency, up to 4.96% less power consumption, and up to 21.21% less energy-delay product (EDP), with up to 14% less DRAM die area, com-pared to the conventional 2D DDR4 DRAM. △ Less

Submitted 25 August, 2020; originally announced August 2020.

Comments: Accepted in ICCD 2020

arXiv:2008.07566 [pdf]

PROTEUS: Rule-Based Self-Adaptation in Photonic NoCs for Loss-Aware Co-Management of Laser Power and Performance

Authors: Sairam Sri Vatsavai, Venkata Sai Praneeth Karempudi, Ishan Thakkar

Abstract: The performance of on-chip communication in the state-of-the-art multi-core processors that use the traditional electron-ic NoCs has already become severely energy-constrained. To that end, emerging photonic NoCs (PNoC) are seen as a po-tential solution to improve the energy-efficiency (performance per watt) of on-chip communication. However, existing PNoC designs cannot realize their full potenti… ▽ More The performance of on-chip communication in the state-of-the-art multi-core processors that use the traditional electron-ic NoCs has already become severely energy-constrained. To that end, emerging photonic NoCs (PNoC) are seen as a po-tential solution to improve the energy-efficiency (performance per watt) of on-chip communication. However, existing PNoC designs cannot realize their full potential due to their exces-sive laser power consumption. Prior works that attempt to improve laser power efficiency in PNoCs do not consider all key factors that affect the laser power requirement of PNoCs. Therefore, they cannot yield the desired balance between the reduction in laser power, achieved performance and energy-efficiency in PNoCs. In this paper, we present PROTEUS framework that employs rule-based self-adaptation in PNoCs. Our approach not only reduces the laser power consumption, but also minimizes the average packet latency by opportunis-tically increasing the communication data rate in PNoCs, and thus, yields the desired balance between the laser power re-duction, performance, and energy-efficiency in PNoCs. Our evaluation with PARSEC benchmarks shows that our PROTEUS framework can achieve up to 24.5% less laser power consumption, up to 31% less average packet latency, and up to 20% less energy-per-bit, compared to another laser power management technique from prior work. △ Less

Submitted 19 August, 2020; v1 submitted 17 August, 2020; originally announced August 2020.

Comments: Submitted and Accepted at NOCs 2020

arXiv:2007.10454 [pdf]

Exploiting Process Variations to Secure Photonic NoC Architectures from Snoo** Attacks

Authors: Sai Vineel Reddy Chittamuru, Ishan G Thakkar, Sudeep Pasricha, Sairam Sri Vatsavai, Varun Bhat

Abstract: The compact size and high wavelength-selectivity of microring resonators (MRs) enable photonic networks-on-chip (PNoCs) to utilize dense-wavelength-division-multiplexing (DWDM) in their photonic waveguides, and as a result, attain high bandwidth on-chip data transfers. Unfortunately, a Hardware Trojan in a PNoC can manipulate the electrical driving circuit of its MRs to cause the MRs to snoop data… ▽ More The compact size and high wavelength-selectivity of microring resonators (MRs) enable photonic networks-on-chip (PNoCs) to utilize dense-wavelength-division-multiplexing (DWDM) in their photonic waveguides, and as a result, attain high bandwidth on-chip data transfers. Unfortunately, a Hardware Trojan in a PNoC can manipulate the electrical driving circuit of its MRs to cause the MRs to snoop data from the neighboring wavelength channels in a shared photonic waveguide, which introduces a serious security threat. This paper presents a framework that utilizes process variation-based authentication signatures along with architecture-level enhancements to protect against data-snoo** Hardware Trojans during unicast as well as multicast transfers in PNoCs. Evaluation results indicate that our framework can improve hardware security across various PNoC architectures with minimal overheads of up to 14.2% in average latency and of up to 14.6% in energy-delay-product (EDP). △ Less

Submitted 20 July, 2020; originally announced July 2020.

Comments: Pre-Print: Accepted in IEEE TCAD Journal on July 16, 2020

arXiv:2003.11895 [pdf]

Redesigning Photonic Interconnects with Silicon-on-Sapphire Device Platform for Ultra-Low-Energy On-Chip Communication

Authors: Venkata Sai Praneeth Karempudi, Sairam Sri Vatsavai, Ishan Thakkar

Abstract: Traditional silicon-on-insulator (SOI) platform based on-chip photonic interconnects have limited energy-bandwidth scalability due to the optical non-linearity induced power constraints of the constituent photonic devices. In this paper, we propose to break this scalability barrier using a new silicon-on-sapphire (SOS) based photonic device platform. Our physical-layer characterization results sho… ▽ More Traditional silicon-on-insulator (SOI) platform based on-chip photonic interconnects have limited energy-bandwidth scalability due to the optical non-linearity induced power constraints of the constituent photonic devices. In this paper, we propose to break this scalability barrier using a new silicon-on-sapphire (SOS) based photonic device platform. Our physical-layer characterization results show that SOS-based photonic devices have negligible optical non-linearity effects in the mid-infrared region near 4μm, which drastically alleviates their power constraints. Our link-level analysis shows that SOS-based photonic devices can be used to realize photonic links with aggregated data rate of more than 1 Tb/s, which recently has been deemed unattainable for the SOI-based photonic on-chip links. We also show that such high-throughput SOS-based photonic links can significantly improve the energy-efficiency of on-chip photonic communication architectures. Our system-level analysis results position SOS-based photonic interconnects to pave the way for realizing ultra-low-energy (< 1 pJ/bit) on-chip data transfers. △ Less

Submitted 28 February, 2020; originally announced March 2020.

Comments: Submitted and Accepted at GLSVLSI 2020

arXiv:2002.11289 [pdf]

LORAX: Loss-Aware Approximations for Energy-Efficient Silicon Photonic Networks-on-Chip

Authors: Febin Sunny, Asif Mirza, Ishan Thakkar, Sudeep Pasricha, Nikdast Mahdi

Abstract: The approximate computing paradigm advocates for relaxing accuracy goals in applications to improve energy-efficiency and performance. Recently, this paradigm has been explored to improve the energy efficiency of silicon photonic networks-on-chip (PNoCs). In this paper, we propose a novel framework (LORAX) to enable more aggressive approximation during communication over silicon photonic links in… ▽ More The approximate computing paradigm advocates for relaxing accuracy goals in applications to improve energy-efficiency and performance. Recently, this paradigm has been explored to improve the energy efficiency of silicon photonic networks-on-chip (PNoCs). In this paper, we propose a novel framework (LORAX) to enable more aggressive approximation during communication over silicon photonic links in PNoCs. Given that silicon photonic interconnects have significant power dissipation due to the laser sources that generate the wavelengths for photonic communication, our framework attempts to reduce laser power overheads while intelligently approximating communication such that application output quality is not distorted beyond an acceptable limit. To the best of our knowledge, this is the first work that considers loss-aware laser power management and multilevel signaling to enable effective data approximation and energy-efficiency in PNoCs. Simulation results show that our framework can achieve up to 31.4% lower laser power consumption and up to 12.2% better energy efficiency than the best known prior work on approximate communication with silicon photonic interconnects, for the same application output quality △ Less

Submitted 25 February, 2020; originally announced February 2020.

Comments: Submitted and accepted at GLSVLSI 2020

Showing 1–25 of 25 results for author: Thakkar, I