-
ARMAN: A Reconfigurable Monolithic 3D Accelerator Architecture for Convolutional Neural Networks
Authors:
Ali Sedaghatgoo,
Amir M. Hajisadeghi,
Mahmoud Momtazpour,
Nader Bagherzadeh
Abstract:
The Convolutional Neural Network (CNN) has emerged as a powerful and versatile tool for artificial intelligence (AI) applications. Conventional computing architectures face challenges in meeting the demanding processing requirements of compute-intensive CNN applications, as they suffer from limited throughput and low utilization. To this end, specialized accelerators have been developed to speed up CNN computations. However, as we demonstrate in this paper via extensive design space exploration, different neural network models have different characteristics, which calls for different accelerator architectures and configurations to match their computing demand. We show that a one-size-fits-all fixed architecture does not guarantee optimal power/energy/performance trade-off. To overcome this challenge, this paper proposes ARMAN, a novel reconfigurable systolic-array-based accelerator architecture based on Monolithic 3D (M3D) technology for CNN inference. The proposed accelerator offers the flexibility to reconfigure among different scale-up or scale-out arrangements depending on the neural network structure, providing the optimal trade-off across power, energy, and performance for various neural network models. We demonstrate the effectiveness of our approach through evaluations of multiple benchmarks. The results demonstrate that the proposed accelerator exhibits up to 2x, 2.24x, 1.48x, and 2x improvements in terms of execution cycles, power, energy, and EDP respectively, over the non-configurable architecture.
Submitted 6 February, 2024;
originally announced February 2024.
-
Support for Stock Trend Prediction Using Transformers and Sentiment Analysis
Authors:
Harsimrat Kaeley,
Ye Qiao,
Nader Bagherzadeh
Abstract:
Stock trend analysis has been an influential time-series prediction topic due to its lucrative and inherently chaotic nature. Many models looking to accurately predict the trend of stocks have been based on Recurrent Neural Networks (RNNs). However, RNNs suffer from limitations such as vanishing gradients and the loss of long-term dependencies as sequence length increases. In this paper we therefore develop a Transformer-based model that uses technical stock data and sentiment analysis to conduct accurate stock trend prediction over long time windows. This paper also introduces a novel dataset containing daily technical stock data and top news headline data spanning almost three years. Stock prediction based solely on technical data can suffer from lag caused by the inability of stock indicators to effectively factor in breaking market news. The use of sentiment analysis on top headlines can help account for unforeseen shifts in market conditions caused by news coverage. We measure the performance of our model against RNNs over sequence lengths spanning 5 business days to 30 business days to mimic trading strategies of different lengths. This reveals an improvement in directional accuracy over RNNs as sequence length is increased, with the largest improvement being close to 18.63% at 30 business days.
Submitted 17 May, 2023;
originally announced May 2023.
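The abstract above reports results as directional accuracy but does not define the metric. A minimal sketch of one common definition follows, assuming that a prediction counts as correct when it points in the same direction (up versus not up) as the actual move from the previous actual price. The function name `directional_accuracy` and the sample prices are hypothetical and are not taken from the paper.

```python
def directional_accuracy(actual, predicted):
    """Fraction of steps where the predicted move has the same direction
    (up vs. not up) as the actual move, both measured from the previous
    actual price. This is an assumed definition, not the paper's."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    hits = 0
    for i in range(1, len(actual)):
        actual_move = actual[i] - actual[i - 1]
        predicted_move = predicted[i] - actual[i - 1]
        if (actual_move > 0) == (predicted_move > 0):
            hits += 1
    return hits / (len(actual) - 1)


# Hypothetical prices: the prediction gets the first and third moves right.
actual_prices = [10.0, 11.0, 10.0, 12.0]
predicted_prices = [10.0, 12.0, 12.0, 11.0]
score = directional_accuracy(actual_prices, predicted_prices)
```

Under this definition a flat prediction counts as "not up", which is a design choice; a three-way (up, flat, down) comparison would be an equally reasonable variant.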
-
Stock Trend Prediction: A Semantic Segmentation Approach
Authors:
Shima Nabiee,
Nader Bagherzadeh
Abstract:
Market financial forecasting is a trending area in deep learning. Deep learning models are capable of tackling the classic challenges in stock market data, such as its extremely complicated dynamics as well as long-term temporal correlation. To capture the temporal relationship among these time series, recurrent neural networks are employed. However, it is difficult for recurrent models to learn to keep track of long-term information. Convolutional Neural Networks have been utilized to better capture the dynamics and extract features for both short- and long-term forecasting. However, semantic segmentation and its well-designed fully convolutional networks have never been studied for time-series dense classification. We present a novel approach to predict long-term daily stock price change trends with fully 2D-convolutional encoder-decoders. We generate input frames with daily prices for a time-frame of T days. The aim is to predict future trends by pixel-wise classification of the current price frame. We propose a hierarchical CNN structure that encodes multiple price frames into multiscale latent representations in parallel using Atrous Spatial Pyramid Pooling blocks and takes those temporal coarse feature stacks into account in the decoding stages. This hierarchical structure of CNNs captures both long- and short-term temporal relationships effectively. The effect of increasing the input time horizon via incrementing parallel encoders has been studied with interesting and substantial changes in the output segmentation masks. We achieve an overall accuracy of 78.18% and an AUC of 0.88 for joint trend prediction over the next 20 days, surpassing other semantic segmentation approaches. We compared our proposed model with several deep models specifically designed for technical analysis and found that for different output horizons, our proposed models outperformed other models.
Submitted 8 March, 2023;
originally announced March 2023.
-
A Two-Stage Efficient 3-D CNN Framework for EEG Based Emotion Recognition
Authors:
Ye Qiao,
Mohammed Alnemari,
Nader Bagherzadeh
Abstract:
This paper proposes a novel two-stage framework for emotion recognition using EEG data that outperforms state-of-the-art models while keeping the model size small and computationally efficient. The framework consists of two stages; the first stage involves constructing efficient models named EEGNet, which are inspired by the state-of-the-art efficient architecture and employ inverted-residual blocks that contain depthwise separable convolutional layers. The EEGNet models on both valence and arousal labels achieve the average classification accuracy of 90%, 96.6%, and 99.5% with only 6.4k, 14k, and 25k parameters, respectively. In terms of accuracy and storage cost, these models outperform the previous state-of-the-art result by up to 9%. In the second stage, we binarize these models to further compress them and deploy them easily on edge devices. Binary Neural Networks (BNNs) typically degrade model accuracy. We improve the EEGNet binarized models in this paper by introducing three novel methods and achieving a 20% improvement over the baseline binary models. The proposed binarized EEGNet models achieve accuracies of 81%, 95%, and 99% with storage costs of 0.11Mbits, 0.28Mbits, and 0.46Mbits, respectively. These models help deploy a precise human emotion recognition system in edge environments.
Submitted 26 July, 2022;
originally announced August 2022.
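The small parameter counts quoted in the abstract above are attributed in part to depthwise separable convolutional layers. As a rough illustration of why such layers are compact, the sketch below compares the weight count of a standard 2D convolution with that of a depthwise convolution followed by a 1x1 pointwise convolution. Biases are ignored, and the layer sizes (32 input channels, 64 output channels, 3x3 kernel) are hypothetical, not taken from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise convolution (one filter per input
    channel) plus a 1 x 1 pointwise convolution (bias omitted)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
dense = standard_conv_params(32, 64, 3)
separable = depthwise_separable_params(32, 64, 3)
reduction = dense / separable
```

For these sizes the separable layer needs 2,336 weights instead of 18,432, a reduction of roughly 7.9x; the saving grows with the number of output channels and the kernel size.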
-
PLAM: a Posit Logarithm-Approximate Multiplier
Authors:
Raul Murillo,
Alberto A. Del Barrio,
Guillermo Botella,
Min Soo Kim,
Hyunjin Kim,
Nader Bagherzadeh
Abstract:
The Posit Number System was introduced in 2017 as a replacement for floating-point numbers. Since then, the community has explored its application in Neural Network related tasks and produced some unit designs which are still far from being competitive with their floating-point counterparts. This paper proposes a Posit Logarithm-Approximate Multiplication (PLAM) scheme to significantly reduce the complexity of posit multipliers, the most power-hungry units within Deep Neural Network architectures. Compared with state-of-the-art posit multipliers, experiments show that the proposed technique reduces the area, power, and delay of hardware multipliers by up to 72.86%, 81.79%, and 17.01%, respectively, without accuracy degradation.
Submitted 7 September, 2021; v1 submitted 18 February, 2021;
originally announced February 2021.
-
The Effects of Approximate Multiplication on Convolutional Neural Networks
Authors:
Min Soo Kim,
Alberto A. Del Barrio,
Hyunjin Kim,
Nader Bagherzadeh
Abstract:
This paper analyzes the effects of approximate multiplication when performing inferences on deep convolutional neural networks (CNNs). The approximate multiplication can reduce the cost of the underlying circuits so that CNN inferences can be performed more efficiently in hardware accelerators. The study identifies the critical factors in the convolution, fully-connected, and batch normalization layers that allow more accurate CNN predictions despite the errors from approximate multiplication. The same factors also provide an arithmetic explanation of why bfloat16 multiplication performs well on CNNs. The experiments are performed with recognized network architectures to show that the approximate multipliers can produce predictions that are nearly as accurate as the FP32 references, without additional training. For example, the ResNet and Inception-v4 models with Mitch-$w$6 multiplication produce Top-5 errors that are within 0.2% of the FP32 references. A brief cost comparison of Mitch-$w$6 against bfloat16 is presented, where a MAC operation saves up to 80% of energy compared to the bfloat16 arithmetic. The most far-reaching contribution of this paper is the analytical justification that multiplications can be approximated while additions need to be exact in CNN MAC operations.
Submitted 9 January, 2021; v1 submitted 20 July, 2020;
originally announced July 2020.
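The abstract above does not describe the Mitch-$w$6 multiplier itself, but the name suggests a design in the family of Mitchell's logarithmic approximation. The sketch below shows the classic Mitchell scheme only, as background: each operand is written as 2^k (1 + x) with 0 <= x < 1, log2 is approximated by k + x, the two approximate logs are added, and the antilog is approximated the same way. It is a floating-point software model, not the paper's hardware design, and the function name `mitchell_multiply` is hypothetical.

```python
import math


def mitchell_multiply(a, b):
    """Approximate a * b with Mitchell's logarithmic method (software model)."""
    if a == 0.0 or b == 0.0:
        return 0.0
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    # frexp gives abs(a) = m * 2**e with 0.5 <= m < 1.
    ma, ea = math.frexp(abs(a))
    mb, eb = math.frexp(abs(b))
    # Rewrite as abs(a) = 2**k * (1 + x) with 0 <= x < 1.
    ka, xa = ea - 1, 2.0 * ma - 1.0
    kb, xb = eb - 1, 2.0 * mb - 1.0
    frac = xa + xb  # sum of the approximated log mantissas
    if frac < 1.0:
        # No carry into the characteristic: 2**(ka+kb) * (1 + frac).
        magnitude = math.ldexp(1.0 + frac, ka + kb)
    else:
        # Carry: characteristic ka+kb+1 and mantissa frac-1,
        # so the antilog is 2**(ka+kb+1) * frac.
        magnitude = math.ldexp(frac, ka + kb + 1)
    return sign * magnitude


exact_case = mitchell_multiply(8.0, 8.0)   # powers of two are exact
worst_case = mitchell_multiply(3.0, 3.0)   # 8.0 instead of 9.0
```

The method never overestimates the magnitude of the product, and its relative error peaks at 1/9 (about 11.1%) when both fractional parts equal 0.5, as in the 3 x 3 case.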
-
Reliable and Energy Efficient MLC STT-RAM Buffer for CNN Accelerators
Authors:
Masoomeh Jasemi,
Shaahin Hessabi,
Nader Bagherzadeh
Abstract:
We propose a lightweight scheme where the formation of a data block is changed in such a way that it can tolerate soft errors significantly better than the baseline. The key insight behind our work is that CNN weights are normalized between -1 and 1 after each convolutional layer, and this leaves one bit unused in half-precision floating-point representation. By taking advantage of the unused bit, we create a backup for the most significant bit to protect it against soft errors. Also, in MLC STT-RAMs the cost of memory operations (read and write) and the reliability of a cell are content-dependent: some bit patterns require larger current and longer time, and are also more susceptible to soft errors. We therefore rearrange the data block to minimize the number of costly bit patterns. Combining these two techniques provides the same level of accuracy compared to an error-free baseline while improving the read and write energy by 9% and 6%, respectively.
Submitted 14 January, 2020;
originally announced January 2020.
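The abstract above states that weights in [-1, 1] leave one bit of the half-precision format unused, without naming the bit. A minimal sketch of that observation follows, under the assumption that the unused bit is the top exponent bit (bit 14), which is always 0 in IEEE 754 half precision for magnitudes below 2. Copying the sign bit into it is shown as one possible backup layout; the helper names and the layout are hypothetical, and the paper's actual block format may differ.

```python
import struct

SIGN_BIT = 0x8000   # bit 15, the most significant bit of an fp16 word
SPARE_BIT = 0x4000  # bit 14, top exponent bit (assumed to be the unused bit)


def fp16_bits(w):
    """Return the 16-bit IEEE 754 half-precision pattern of w as an int."""
    return int.from_bytes(struct.pack("<e", w), "little")


def protect(bits):
    """Copy the sign bit into the spare bit (hypothetical backup layout)."""
    if bits & SPARE_BIT:
        raise ValueError("spare bit already set: magnitude is 2 or larger")
    return bits | SPARE_BIT if bits & SIGN_BIT else bits


def signs_agree(bits):
    """True when the sign bit and its backup copy hold the same value."""
    return bool(bits & SIGN_BIT) == bool(bits & SPARE_BIT)


# The spare bit is 0 for every sample weight in [-1, 1].
samples = [-1.0, -0.5, 0.0, 0.25, 1.0]
spare_is_free = all(fp16_bits(w) & SPARE_BIT == 0 for w in samples)

stored = protect(fp16_bits(-1.0))   # 0xBC00 becomes 0xFC00
flipped = stored ^ SIGN_BIT         # model a soft error on the sign bit
```

With a single backup copy, a disagreement between the two bits reveals that one of them flipped but not which one; how the scheme in the paper resolves that case is not stated in the abstract.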
-
Immunity of nanoscale magnetic tunnel junctions to ionizing radiation
Authors:
Eric Arturo Montoya,
Jen-Ru Chen,
Randy Ngelale,
Han Kyu Lee,
Hsin-Wei Tseng,
Lei Wan,
En Yang,
Patrick Braganca,
Ozdal Boyraz,
Nader Bagherzadeh,
Mikael Nilsson,
Ilya N. Krivorotov
Abstract:
Spin transfer torque magnetic random access memory (STT-MRAM) is a promising candidate for next generation memory as it is non-volatile, fast, and has unlimited endurance. Another important aspect of STT-MRAM is that its core component, the nanoscale magnetic tunneling junction (MTJ), is thought to be radiation hard, making it attractive for space and nuclear technology applications. However, studies of the effects of high doses of ionizing radiation on the STT-MRAM writing process are lacking. Here we report measurements of the impact of high doses of gamma and neutron radiation on nanoscale MTJs with perpendicular magnetic anisotropy used in STT-MRAM. We characterize the tunneling magnetoresistance, the magnetic field switching, and the current-induced switching before and after irradiation. Our results demonstrate that all these key properties of nanoscale MTJs relevant to STT-MRAM applications are robust against ionizing radiation. Additionally, we perform experiments on thermally driven stochastic switching in the gamma ray environment. These results indicate that nanoscale MTJs are promising building blocks for radiation-hard non-von Neumann computing.
Submitted 25 September, 2019;
originally announced September 2019.
-
Partition Pruning: Parallelization-Aware Pruning for Deep Neural Networks
Authors:
Sina Shahhosseini,
Ahmad Albaqsami,
Masoomeh Jasemi,
Nader Bagherzadeh
Abstract:
Parameters of recent neural networks require a huge amount of memory. These parameters are used by neural networks to perform machine learning tasks when processing inputs. To speed up inference, we develop Partition Pruning, an innovative scheme to reduce the parameters used while taking into consideration parallelization. We evaluated the performance and energy consumption of parallel inference of partitioned models, which showed a 7.72x speedup and a 2.73x reduction in the energy used for computing pruned layers of TinyVGG16 in comparison to running the unpruned model on a single accelerator. In addition, our method showed only a limited reduction in accuracy while partitioning fully connected layers.
Submitted 27 February, 2019; v1 submitted 21 January, 2019;
originally announced January 2019.