-
Single Neuromorphic Memristor closely Emulates Multiple Synaptic Mechanisms for Energy Efficient Neural Networks
Authors:
Christoph Weilenmann,
Alexandros Ziogas,
Till Zellweger,
Kevin Portner,
Marko Mladenović,
Manasa Kaniselvan,
Timoleon Moraitis,
Mathieu Luisier,
Alexandros Emboras
Abstract:
Biological neural networks do not only include long-term memory and weight multiplication capabilities, as commonly assumed in artificial neural networks, but also more complex functions such as short-term memory, short-term plasticity, and meta-plasticity - all collocated within each synapse. Here, we demonstrate memristive nano-devices based on SrTiO3 that inherently emulate all these synaptic functions. These memristors operate in a non-filamentary, low conductance regime, which enables stable and energy efficient operation. They can act as multi-functional hardware synapses in a class of bio-inspired deep neural networks (DNN) that make use of both long- and short-term synaptic dynamics and are capable of meta-learning or "learning-to-learn". The resulting bio-inspired DNN is then trained to play the video game Atari Pong, a complex reinforcement learning task in a dynamic environment. Our analysis shows that the energy consumption of the DNN with multi-functional memristive synapses decreases by about two orders of magnitude as compared to a pure GPU implementation. Based on this finding, we infer that memristive devices with a better emulation of the synaptic functionalities do not only broaden the applicability of neuromorphic computing, but could also improve the performance and energy costs of certain artificial intelligence applications.
Submitted 26 February, 2024;
originally announced February 2024.
-
Invariant subspaces and PCA in nearly matrix multiplication time
Authors:
Aleksandros Sobczyk,
Marko Mladenović,
Mathieu Luisier
Abstract:
Approximating invariant subspaces of generalized eigenvalue problems (GEPs) is a fundamental computational problem at the core of machine learning and scientific computing. It is, for example, the root of Principal Component Analysis (PCA) for dimensionality reduction, data visualization, and noise filtering, and of Density Functional Theory (DFT), arguably the most popular method to calculate the electronic structure of materials. For a Hermitian definite GEP $HC=SCΛ$, let $Π_k$ be the true spectral projector on the invariant subspace that is associated with the $k$ smallest (or largest) eigenvalues. Given $H$, $S$, an integer $k$, and accuracy $\varepsilon\in(0,1)$, we show that we can compute a matrix $\widetilde{Π}_k$ such that $\lVert Π_k-\widetilde{Π}_k\rVert_2\leq \varepsilon$, in $O\left( n^{ω+η}\,\mathrm{polylog}(n,\varepsilon^{-1},κ(S),\mathrm{gap}_k^{-1}) \right)$ bit operations in the floating point model with probability $1-1/n$. Here, $η>0$ is arbitrarily small, $ω\lesssim 2.372$ is the matrix multiplication exponent, $κ(S)=\lVert S\rVert_2\lVert S^{-1}\rVert_2$, and $\mathrm{gap}_k$ is the gap between eigenvalues $k$ and $k+1$. To the best of our knowledge, this is the first end-to-end analysis achieving such "forward-error" approximation guarantees with nearly $O(n^{ω+η})$ bit complexity, improving classical $\widetilde{O}(n^3)$ eigensolvers, even for the regular case $(S=I)$. Our methods rely on a new $O(n^{ω+η})$ stability analysis for the Cholesky factorization, and a new smoothed analysis for computing spectral gaps, which can be of independent interest. Ultimately, we obtain new matrix multiplication-type bit complexity upper bounds for PCA problems, including classical PCA and (randomized) low-rank approximation.
Submitted 24 May, 2024; v1 submitted 17 November, 2023;
originally announced November 2023.
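As background for the abstract above, a Hermitian definite GEP is commonly reduced to a standard Hermitian eigenproblem through a Cholesky factorization of $S$. The derivation below is the textbook reduction, not necessarily the algorithm analyzed in the paper; the symbols $L$, $\widetilde{H}$, and $Y$ are introduced here for illustration only.

```latex
\begin{align*}
S &= L L^{*} && \text{(Cholesky factor, $L$ lower triangular)} \\
\widetilde{H} &= L^{-1} H L^{-*} && \text{(Hermitian, same eigenvalues as the GEP)} \\
\widetilde{H} Y &= Y Λ, \quad Y^{*} Y = I && \text{(standard eigenproblem)} \\
C &= L^{-*} Y && \Rightarrow\; H C = S C Λ, \quad C^{*} S C = I
\end{align*}
```

Because the reduction applies $L^{-1}$, its numerical accuracy depends on the conditioning of $S$, which is why a condition number such as $κ(S)$ typically enters error bounds for this approach.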
-
Approximate Euclidean lengths and distances beyond Johnson-Lindenstrauss
Authors:
Aleksandros Sobczyk,
Mathieu Luisier
Abstract:
A classical result of Johnson and Lindenstrauss states that a set of $n$ high dimensional data points can be projected down to $O(\log n/ε^2)$ dimensions such that the square of their pairwise distances is preserved up to a small distortion $ε\in(0,1)$. It has been proved that the JL lemma is optimal for the general case; therefore, improvements can only be explored for special cases. This work aims to improve the $ε^{-2}$ dependency based on techniques inspired by the Hutch++ Algorithm, which reduces $ε^{-2}$ to $ε^{-1}$ for the related problem of implicit matrix trace estimation. We first present an algorithm to estimate the Euclidean lengths of the rows of a matrix. We prove element-wise probabilistic bounds for it that are at least as good as standard JL approximations in the worst case, but are asymptotically better for matrices with decaying spectrum. Moreover, for any matrix, regardless of its spectrum, the algorithm achieves $ε$-accuracy for the total, Frobenius norm-wise relative error using only $O(ε^{-1})$ queries. This is a quadratic improvement over the norm-wise error of standard JL approximations. We also show how these results can be extended to estimate (i) the Euclidean distances between data points and (ii) the statistical leverage scores of tall-and-skinny data matrices, which are ubiquitous for many applications, with analogous theoretical improvements. Proof-of-concept numerical experiments are presented to validate the theoretical analysis.
Submitted 8 May, 2023; v1 submitted 24 May, 2022;
originally announced May 2022.
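To make the baseline in the abstract above concrete, here is a minimal sketch of the standard JL-style estimate of row lengths using a random Gaussian projection. It is not the Hutch++-inspired algorithm of the paper, only the classical approach it is compared against; the function name `jl_row_norms`, the sketch size `k`, and the example data are assumptions made for illustration.

```python
import math
import random


def jl_row_norms(rows, k, seed=0):
    """Estimate the Euclidean length of each row from a k-column Gaussian sketch.

    Standard JL baseline (an illustrative sketch, not the paper's algorithm).
    """
    rng = random.Random(seed)
    d = len(rows[0])
    # G is d x k with i.i.d. N(0, 1/k) entries, so E[||x G||^2] = ||x||^2.
    scale = 1.0 / math.sqrt(k)
    G = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]
    estimates = []
    for x in rows:
        # Project the row to k dimensions, then measure its length there.
        sketch = [sum(x[i] * G[i][j] for i in range(d)) for j in range(k)]
        estimates.append(math.sqrt(sum(s * s for s in sketch)))
    return estimates


# Hypothetical data: 3 rows in 20 dimensions (not taken from the paper).
data_rng = random.Random(42)
data = [[data_rng.uniform(-1.0, 1.0) for _ in range(20)] for _ in range(3)]
exact = [math.sqrt(sum(v * v for v in row)) for row in data]
approx = jl_row_norms(data, k=200, seed=1)
for e, a in zip(exact, approx):
    print(f"exact={e:.4f}  estimate={a:.4f}")
```

The relative error of this baseline shrinks roughly like $1/\sqrt{k}$, which corresponds to the $ε^{-2}$ dependency that the abstract sets out to improve.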
-
A back-end, CMOS compatible ferroelectric Field Effect Transistor for synaptic weights
Authors:
Mattia Halter,
Laura Bégon-Lours,
Valeria Bragaglia,
Marilyne Sousa,
Bert Jan Offrein,
Stefan Abel,
Mathieu Luisier,
Jean Fompeyrine
Abstract:
Neuromorphic computing architectures enable the dense co-location of memory and processing elements within a single circuit. This co-location removes the communication bottleneck of transferring data between separate memory and computing units as in standard von Neumann architectures for data-critical applications including machine learning. The essential building blocks of neuromorphic systems are non-volatile synaptic elements such as memristors. Key memristor properties include a suitable non-volatile resistance range, continuous linear resistance modulation and symmetric switching. In this work, we demonstrate voltage-controlled, symmetric and analog potentiation and depression of a ferroelectric Hf$_{57}$Zr$_{43}$O$_{2}$ (HZO) field effect transistor (FeFET) with good linearity. Our FeFET operates with a low writing energy (fJ) and fast programming time (40 ns). Retention measurements have been done over a 4-bit depth with low noise (1%) in the tungsten oxide (WO$_{x}$) readout channel. By adjusting the channel thickness from 15 nm to 8 nm, the on/off ratio of the FeFET can be engineered from 1% to 200% with an on-resistance ideally >100 kOhm, depending on the channel geometry. The device concept uses earth-abundant materials and is compatible with back-end-of-line (BEOL) integration into complementary metal-oxide-semiconductor (CMOS) processes. It therefore has great potential for the fabrication of high-density, large-scale integrated arrays of artificial analog synapses.
Submitted 17 January, 2020;
originally announced January 2020.
-
A Data-Centric Approach to Extreme-Scale Ab initio Dissipative Quantum Transport Simulations
Authors:
Alexandros Nikolaos Ziogas,
Tal Ben-Nun,
Guillermo Indalecio Fernández,
Timo Schneider,
Mathieu Luisier,
Torsten Hoefler
Abstract:
The computational efficiency of a state-of-the-art ab initio quantum transport (QT) solver, capable of revealing the coupled electro-thermal properties of atomically-resolved nano-transistors, has been improved by up to two orders of magnitude through a data-centric reorganization of the application. The approach yields coarse- and fine-grained data-movement characteristics that can be used for performance and communication modeling, communication-avoidance, and dataflow transformations. The resulting code has been tuned for two top-6 hybrid supercomputers, reaching a sustained performance of 85.45 Pflop/s on 4,560 nodes of Summit (42.55% of the peak) in double precision, and 90.89 Pflop/s in mixed precision. These computational achievements enable the restructured QT simulator to treat realistic nanoelectronic devices made of more than 10,000 atoms within a 14$\times$ shorter duration than the original code needs to handle a system with 1,000 atoms, on the same number of CPUs/GPUs and with the same physical accuracy.
Submitted 18 December, 2019;
originally announced December 2019.
-
Optimizing the Data Movement in Quantum Transport Simulations via Data-Centric Parallel Programming
Authors:
Alexandros Nikolaos Ziogas,
Tal Ben-Nun,
Guillermo Indalecio Fernández,
Timo Schneider,
Mathieu Luisier,
Torsten Hoefler
Abstract:
Designing efficient cooling systems for integrated circuits (ICs) relies on a deep understanding of the electro-thermal properties of transistors. To shed light on this issue in currently fabricated FinFETs, a quantum mechanical solver capable of revealing atomically-resolved electron and phonon transport phenomena from first principles is required. In this paper, we consider a global, data-centric view of a state-of-the-art quantum transport simulator to optimize its execution on supercomputers. The approach yields coarse- and fine-grained data-movement characteristics, which are used for performance and communication modeling, communication-avoidance, and data-layout transformations. The transformations are tuned for the Piz Daint and Summit supercomputers, where each platform requires different caching and fusion strategies to perform optimally. The presented results bring ab initio device simulation into a new era, where nanostructures composed of over 10,000 atoms can be investigated at an unprecedented level of accuracy, paving the way for better heat management in next-generation ICs.
Submitted 18 December, 2019;
originally announced December 2019.
-
COUNTDOWN Slack: a Run-time Library to Reduce Energy Footprint in Large-scale MPI Applications
Authors:
Daniele Cesarini,
Andrea Bartolini,
Andrea Borghesi,
Carlo Cavazzoni,
Mathieu Luisier,
Luca Benini
Abstract:
The power consumption of supercomputers is a major challenge for system owners, users, and society. It limits the capacity of system installations, it requires large cooling infrastructures, and it is the cause of a large carbon footprint. Reducing power during application execution without changing the application source code or increasing time-to-completion is highly desirable in real-life high-performance computing scenarios. The power management run-time frameworks proposed in the last decade are based on the assumption that the duration of communication and application phases in an MPI application can be predicted and used at run-time to trade off communication slack with power consumption. In this manuscript, we first show that this assumption is too general and leads to mispredictions, slowing down applications, thereby jeopardizing the claimed benefits. We then propose a new approach based on (i) the separation of communication phases and slack during MPI calls and (ii) a timeout algorithm to cope with the hardware power management latency, which jointly makes it possible to achieve performance-neutral power saving in MPI applications without requiring labor-intensive and risky application source code modifications. We validate our approach in a tier-1 production environment with widely adopted scientific applications. Our approach has a time-to-completion overhead lower than 1%, while it successfully exploits slack in communication phases to achieve an average energy saving of 10%. If we focus on large-scale application runs, the proposed approach achieves 22% energy saving with an overhead of only 0.4%. With respect to state-of-the-art approaches, COUNTDOWN Slack is the only one that always leads to an energy saving with negligible overhead (<3%).
Submitted 27 September, 2019;
originally announced September 2019.
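The timeout idea mentioned in the abstract above can be illustrated with a small model. This is a sketch under stated assumptions, not the COUNTDOWN Slack implementation: it assumes the core is slowed down only after an MPI call has already lasted `timeout` seconds and returns to full speed when the call completes, so short calls never trigger a power-state change. The function name `low_power_seconds`, the timeout value, and the trace are hypothetical.

```python
def low_power_seconds(mpi_call_durations, timeout):
    """Total time spent in the low-power state under a simple timeout policy.

    Assumption: a call shorter than `timeout` never enters low power; a longer
    call spends (duration - timeout) seconds in low power.
    """
    if timeout < 0:
        raise ValueError("timeout must be non-negative")
    return sum(max(0.0, d - timeout) for d in mpi_call_durations)


# Hypothetical trace of MPI call durations in seconds (not measured data).
trace = [0.0002, 0.0001, 0.0500, 0.0003, 0.2000]
saved = low_power_seconds(trace, timeout=0.0005)
print(f"time in low-power state: {saved:.4f} s of {sum(trace):.4f} s in MPI")
```

In this model, a larger timeout skips more short calls and so avoids paying the hardware switching latency on them, at the cost of less time spent in the low-power state.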