Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals

Aruna Mohan1, Danne Elbers1,2, Or Zilbershot1, Fatemeh Afghah3, David Vorchheimer4 1A. Mohan, D. Elbers, and O. Zilbershot are with Walkky, Boston, MA, USA. {aruna,danne,or}@walkky.com2D. Elbers is with the Chobanian & Avedisian School of Medicine, Boston University, Boston, MA 02118, USA.3F. Afghah is with the Department of Electrical & Computer Engineering, Clemson University, Clemson, SC 29631, USA. [email protected]4D. Vorchheimer is with Northwell Health Physician Partners, Cardiology, 158 East 84th Street, New York, NY 10028, and Lenox Hill Hospital, 100 E 77th St, New York, NY 10075, USA. [email protected]
Abstract

Remote patient monitoring based on wearable single-lead electrocardiogram (ECG) devices has significant potential for enabling the early detection of heart disease, especially in combination with artificial intelligence (AI) approaches for automated heart disease detection. There have been prior studies applying AI approaches based on deep learning for heart disease detection. However, these models are yet to be widely accepted as a reliable aid for clinical diagnostics, in part due to the current black-box perception surrounding many AI algorithms. In particular, there is a need to identify the key features of the ECG signal that contribute toward making an accurate diagnosis, thereby enhancing the interpretability of the model. The present study develops a vision transformer approach to identify atrial fibrillation based on single-lead ECG data. A residual network (ResNet) approach is also developed for comparison with the vision transformer approach. These models are applied to the Chapman–Shaoxing dataset to classify atrial fibrillation, as well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm heartbeats. The models enable the identification of the key regions of the heartbeat that determine the resulting classification, and highlight the importance of P-waves and T-waves, as well as heartbeat duration and signal amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and sinus bradycardia.

I INTRODUCTION

Atrial fibrillation (AFIB) is one of the most common and clinically significant forms of cardiac arrhythmia. It is characterized by rapid and irregular electrical impulses in the atria, leading to an irregular and often rapid heart rhythm. In the United States, AFIB is estimated to affect 5.2 million individuals, and this number is projected to increase to 12.1 million by 2030. AFIB poses significant risks, including an increased risk of stroke, heart failure, and other cardiovascular complications [1]. The burden on society is estimated to be significant, with AFIB-associated hospitalization and treatment costs amounting to billions of dollars annually [2].

AI algorithms based on deep learning can effectively analyze and interpret complex ECG waveforms, allowing for the detection and classification of abnormalities such as AFIB. In recent years, researchers have made significant advances in the application of deep neural networks for ECG signal analysis and AFIB detection [3, 4, 5, 6, 7]. With the use of deep neural network models and a carefully-designed network structure, it is possible to achieve classification accuracy comparable to that of human expert cardiologists [8]. Moreover, AI algorithms can enable rapid and automated diagnosis, and reduce the risk of human error [9].

Researchers have also focused on the interpretability and robustness of neural network models, as their black-box perception can sometimes be a concern for medical professionals [9, 10, 11, 12]. By develo** explainable AI techniques and incorporating feature extraction methods, it is possible to provide insights into the key features of the data influencing the decision-making process of the neural network [13, 14]. In addition to deep learning models, classical machine learning approaches employing features extracted from ECG data have also been proposed for ECG classification [15]. Such models provide insights into the most important features responsible for the resulting diagnosis. However, this approach necessitates feature engineering, which requires prior domain knowledge and expertise for the appropriate selection and extraction of relevant features from the raw signals [9]. There is a need for a visually explainable model that can directly highlight the important regions of the ECG signal, without the prior necessity for extracting features from the ECG signals to serve as model inputs.

An automated and interpretable method for the identification of arrhythmias from single-lead ECG recordings has the potential to facilitate the remote monitoring of at-risk patients to enable early intervention, aid in understanding the language of the heart, and enhance the physician’s trust to utilize this diagnostic tool. Prior studies have demonstrated that high accuracy is achievable with a single-lead approach [7].

This work develops a vision transformer (ViT) approach for AFIB detection utilizing heartbeat segments contained between 3 consecutive R-peaks in the lead II ECG signals, and demonstrates the ability of the ViT [16] and ResNet [17] models to highlight the key regions of the ECG-recorded heartbeat that characterize AFIB and normal sinus rhythm (SR). In addition to AFIB, another common arrhythmia considered in this study is sinus bradycardia (SB), which is characterized by a slower-than-average heart rhythm. SB has gained attention recently, since bradycardia has been identified as a cardiac manifestation of COVID-19 [18, 19]. We used the well-established Chapman–Shaoxing dataset [15, 20] to develop our explainable multi-class classification approach to distinguish between AFIB, SB, and SR heartbeats.

Our work focuses on lead II with the aim of creating a model that could be transferable to smart watches and other wearable devices for remote monitoring [7, 21]. While previous studies have used fixed segment intervals [22] or RR segments [6] extracted from ECG signals, our study employs RRR segments. This selection was motivated by the fact that an RRR segment contains a complete heartbeat, starting with the P-wave marking the contraction of the heart, and ending with the T-wave associated with the relaxation of the heart. Faust et al. employed RR intervals extracted from 12-lead ECG signals and achieved over 97% accuracy in classifying AFIB with a ResNet-based model [6]. While Tutuko et al. demonstrated comparable accuracy of over 96% in classifying AFIB with a convolutional neural network using lead II alone, they employed longer signals containing 2 or more RR intervals [7]. Our work achieves comparable accuracy in spite of simultaneously utilizing a single lead and short RRR ECG segments.

Our work provides an interpretable approach that highlights the key regions of the heartbeat governing the classification, in addition to high classification accuracy. Our results demonstrate that P-waves and T-waves, in addition to heartbeat duration and signal amplitude, play a key role in distinguishing normal SR heartbeats from AFIB or SB arrhythmias.

II METHODS

II-A DATASET AND PREPROCESSING

We utilized the Chapman–Shaoxing database [15], and selected patients with AFIB, SB or SR labels. The database contains 10-second ECG recordings taken at a frequency of 500 Hz. We applied the neurokit2 library [23] to the raw ECG data for cleaning and peak detection. The cleaning step performs detrending and denoising via the default 0.5 Hz high-pass 5th order Butterworth filter, followed by powerline filtering. This step is followed by the default R-peak detection. We extracted non-overlap** segments between 3 consecutive R-peaks from the processed ECG data, referred to here as RRR segments, which were then supplied to the classification algorithms. Table I provides the number of patients and RRR segments obtained from the cleaned data for each category.

TABLE I: Dataset
Label Number of patients Number of RRR segments
AFIB 1654 11310
SB 3765 14635
SR 1789 9642

RRR segments were selected as inputs because each RRR segment contains a complete heartbeat, starting from the P-wave and ending with the T-wave. Moreover, we avoided the use of neighboring RR segments as inputs via our approach of using RRR segments, since adjacent R peaks are likely to be correlated. Each heartbeat extracted from a given patient was associated with the label corresponding to the patient. The application of explainable deep learning models to single heartbeats is intended to elucidate the underlying patterns in the heartbeat associated with different heart conditions. In particular, the use of RRR segments facilitates the visualization of key regions of the individual heartbeats that are associated with the disease diagnosis, leading to an explainable deep learning approach.

Since the extracted RRR segments are of varying lengths, the segments were padded with zeros to ensure that the model inputs have the same lengths. Each 10 s ECG recording originally had a length of 5000 points. Figure 1 depicts the histogram of RRR segment lengths for the 3 classes. A maximum length of 1500 was selected, as fewer than 0.1% of segments exceed this length. It is evident from Figure 1 that the 3 classes have distinct distributions of RRR segment lengths, and the segment length may be a factor in classification.

In prior studies, the ECG signals or extracted segments have been z𝑧zitalic_z-normalized [22] or rescaled by the maximum amplitude [24]. However, the ECG amplitude is influenced by the condition of the heart [25]. In other studies, non-normalized signals were directly employed, even in cases where multiple data sources were used [7]. In the present study, we utilized both non-normalized and z𝑧zitalic_z-normalized ECG signals to extract the RRR segments, for comparison.

Refer to caption
Figure 1: Histogram of RRR segment lengths for AFIB, SB and SR. The x𝑥xitalic_x-axis represents the segment length, and the y𝑦yitalic_y-axis represents the percentage in each bin.

II-B MODEL DEVELOPMENT

Refer to caption
Figure 2: Illustration of the ViT Approach. A detailed diagram of the transformer encoder was presented in [16].

We employed the ViT model including a learnable classification token [16] for the problem of classifying the RRR segments into the 3 classes of labels, AFIB, SB or SR. The model is illustrated in Fig. 2. Our ViT approach involved decomposing the RRR segment into patches, and associating learned patch and position embeddings with each patch. The sum of the patch and position embeddings constitutes the input to the transformer encoder, along with the inclusion of an extra learnable classification token. The encoder consists of alternating multihead attention and multilayer perceptron (MLP) blocks, with layer normalization preceding and residual connections following each block. Finally, a layer normalization and MLP block are employed, followed by a dense layer for classification with softmax activation.

The ViT model enables us to visualize the regions of the heartbeat to which the classification token attends via attention heatmaps [26]. For comparison, we employed a ResNet architecture based on ResNet50 [17] and visualized Grad-CAM heatmaps [27] of the key regions of the signals with regard to the final convolutional layer of the model. The models were developed using the TensorFlow library [28]. Upon comparing the model performance using the signals collected from each single ECG lead, we found that the best-performing leads for the classification model were leads II and aVF, followed by I and V1. For the present study, we selected lead II due to its consistency, reliability, and potential application to wearable devices [21].

The AFIB, SB and SR heartbeats differ in duration, with AFIB often having a faster-than-average heart rhythm. On the other hand, SB is characterized by a slower-than-average heart rhythm. As described earlier, zero-padding was employed to ensure that the model inputs had the same lengths. However, the signal lengths, reflected in the zero-padded regions, may play a role in the classification results. This was verified in the ViT model, by applying the model with and without the masking of the zeros in the input to the multihead attention layer.

The dataset was split into train, validation and test datasets using a 70-15-15 %-split. The splitting was performed following the inter-patient paradigm by using group shuffle splitting to ensure that the same patient did not appear in more than one dataset [4, 29]. Hyperparameter optimization to maximize the validation accuracy was conducted by a combination of manual tuning and the use of the automated keras-tuner library [30]. The hyperparameters tuned in the ViT model were the number of layers, the patch size, the embedding dimension, and the number of units and dropout rate in the MLP layer. The use of 2 or 3 attention heads was explored. In the ResNet model, the ResNet50 architecture was followed by dense and dropout layers preceding the final classification layer with softmax activation, wherein the number of units in the dense layer and the dropout rate were optimized. In addition to our primary ViT and ResNet approaches, we also trained a simpler model for comparison, where a 1D convolutional neural network (CNN) with max pooling followed by a long short-term memory (LSTM) layer was employed, preceding the classification layer with softmax activation. In this CNN–LSTM model, the number of filters and kernel size in the convolutional layer, and the number of units and recurrent dropout rate in the LSTM layer were tuned. The accuracy was reported by averaging the results of 5 independent iterations wherein the model was run after performing train-validation-test splitting. This method is similar to 5-fold cross-validation [22]. The model architectures and optimized hyperparameters are summarized in Table II. All computations were performed on Amazon Web Services (AWS) GPU instances.

TABLE II: Model Architectures
Model Hyperparameters
ResNet ResNet50 architecture followed by dense layer with 64 units and dropout rate of 0.2
ViT 2 or 3 attention layers, 2 or 3 attention heads, patch size of 30, embedding dimension of 16 and MLP layer with 128 units
CNN–LSTM CNN layer with 32 filters, kernel size of 3, max pooling with pool size of 2, followed by 96 LSTM units with recurrent dropout rate of 0.2

III RESULTS

The overall accuracies on the test dataset, averaged over 5 independent iterations of data splitting and model fitting, are reported in Table III. The ViT model based on the non-normalized dataset employed 3 attention layers, whereas 2 attention layers were found to be sufficient to optimize the model performance for the z𝑧zitalic_z-normalized dataset. The accuracies and other metrics for the ViT models are based on utilizing 2 attention heads. The accuracy upon employing 3 attention heads did not show notable improvement, at 0.9255 for the unmasked ViT model using non-normalized data.

TABLE III: Overall model accuracy for RRR segment classification
Model Normalization Accuracy
Non-normalized 0.9613
ResNet z-normalized 0.8755
Non-normalized 0.9246
ViT, unmasked z-normalized 0.8550
Non-normalized 0.9073
ViT, masked z-normalized 0.8918
Non-normalized 0.8878
CNN–LSTM z-normalized 0.8285

Table III demonstrates that the performance of the ViT model declined upon masking the zeros in the input segments. This indicates that the segment length is an important feature, and is captured indirectly in the model by attending to the zeros in the input data. Similarly, the model performance declined upon employing z𝑧zitalic_z-normalized ECG data, indicating that the signal amplitudes are an important indicator of the heart condition. The CNN–LSTM model had the lowest accuracy and longest run time among the tested approaches, and was not pursued further.

Table IV presents several performance metrics by label for the ResNet model and the unmasked ViT model using non-normalized data, based on a one-versus-rest calculation. The metrics are obtained at a patient-level by applying a majority vote of RRR heartbeat segment labels for each patient. The resulting patient-level metrics are listed in Table V. The gap between the heartbeat-level metrics and patient-level metrics is not very large, suggesting that a given patient’s heart condition is reflected in many of the heartbeats extracted from that patient’s ECG recording. This observation provides justification for performing classification based on individual heartbeats.

TABLE IV: Model performance by label for RRR segment classification
Model Label Accuracy Specificity Sensitivity Precision F1-score AUC
AFIB 0.9827 0.9894 0.9695 0.9789 0.9741 0.9978
ResNet SB 0.9722 0.9742 0.9691 0.9614 0.9652 0.9957
SR 0.9678 0.9781 0.9392 0.9400 0.9394 0.9922
AFIB 0.9460 0.9591 0.9203 0.9201 0.9197 0.9884
ViT (unmasked) SB 0.9622 0.9651 0.9578 0.9480 0.9528 0.9919
SR 0.9409 0.9629 0.8802 0.8964 0.8878 0.9807
TABLE V: Model performance by label for patient classification
Model Label Accuracy Specificity Sensitivity Precision F1-score AUC
AFIB 0.9928 0.9922 0.9947 0.9763 0.9853 0.9998
ResNet SB 0.9810 0.9691 0.9924 0.9710 0.9816 0.9990
SR 0.9811 0.9980 0.9296 0.9936 0.9604 0.9981
AFIB 0.9678 0.9705 0.9597 0.9132 0.9355 0.9953
ViT (unmasked) SB 0.9754 0.9642 0.9862 0.9664 0.9762 0.9983
SR 0.9636 0.9912 0.8794 0.9705 0.9225 0.9946

The ResNet model is able to achieve overall and one-versus-rest classification accuracies of over 96%, whereas the unmasked ViT model lags behind, with an overall accuracy of around 92-93%, as seen from Table III and the preceding discussion. This is expected, since the ViT model was proposed for large dataset sizes exceeding 10s of millions [16]. However, the ViT model is expected to perform better as larger datasets become available [16], and also provides a clear illustration of the key regions of the heartbeat influencing the resulting classification via the attention heatmaps obtained from each attention head.

The attention maps were obtained from the ViT self-attention calculation by employing the computed attention of the classification token toward the rest of the sequence, averaged for the corresponding labels, and scaled to lie between 0 and 1. Since the RRR segments and corresponding attention maps are of variable lengths, the attention heatmaps, as well as the RRR segments, were extrapolated to the maximum segment length of 1500, and averaged to produce average attention maps and RRR heartbeats. A similar procedure was applied to plot the Grad-CAM heatmaps.

Figures 3, 4 and 5 illustrate the averaged, scaled attention heatmaps from the unmasked ViT model using non-normalized input data for correctly classified AFIB, SB and SR cases, respectively, superimposed upon the corresponding average RRR segments. These attention heatmaps demonstrate that the model attends primarily to the P-waves, followed by the T-waves, and finally the R-peaks, for SR identification. This can be understood by the absence of P-waves in most AFIB cases. In addition, T-wave inversion has been reported to occur concurrently with AFIB in patients with long-standing hypertension [31]. Figure 3 depicts a large variation in the T-waves in AFIB cases, consistent with the inversion of the T-waves in several patients. The irregularity in the QRS complex is also evident, as is the absence of P-waves in AFIB cases. In the SB cases, the model attends to the region of the P-waves, as well as preceding the P-waves. SB is characterized by a longer heartbeat, and the segment length plays a role in the classification model. In addition, the P-wave amplitudes at the lower heart rates associated with SB cases are generally lower relative to normal SR, whereby the amplitude also plays a role in the classification result. In the AFIB cases, the model attends to the region where a P-wave would have otherwise been expected, but is generally absent in the case of AFIB, as well as the location of the R-peaks, which is indicative of the irregularity in the QRS complex in AFIB.

Refer to caption
((a)) 1stsuperscript1st1^{\text{st}}1 start_POSTSUPERSCRIPT st end_POSTSUPERSCRIPT attention head, correctly classified AFIB
Refer to caption
((b)) 2ndsuperscript2nd2^{\text{nd}}2 start_POSTSUPERSCRIPT nd end_POSTSUPERSCRIPT attention head, correctly classified AFIB
Figure 3: Attention heatmaps from the ViT model using 2 attention heads, based on correctly classified AFIB cases. The blue line represents the average signal amplitude, with the gray region corresponding to ±plus-or-minus\pm± 1 standard deviation. The y𝑦yitalic_y-axis represents the amplitude in μ𝜇\muitalic_μV, while the x𝑥xitalic_x-axis represents the index.
Refer to caption
((a)) 1stsuperscript1st1^{\text{st}}1 start_POSTSUPERSCRIPT st end_POSTSUPERSCRIPT attention head, correctly classified SB
Refer to caption
((b)) 2ndsuperscript2nd2^{\text{nd}}2 start_POSTSUPERSCRIPT nd end_POSTSUPERSCRIPT attention head, correctly classified SB
Figure 4: Attention heatmaps from the ViT model using 2 attention heads, based on correctly classified SB cases. The blue line represents the average signal amplitude, with the gray region corresponding to ±plus-or-minus\pm± 1 standard deviation. The y𝑦yitalic_y-axis represents the amplitude in μ𝜇\muitalic_μV, while the x𝑥xitalic_x-axis represents the index.
Refer to caption
((a)) 1stsuperscript1st1^{\text{st}}1 start_POSTSUPERSCRIPT st end_POSTSUPERSCRIPT attention head, correctly classified SR
Refer to caption
((b)) 2ndsuperscript2nd2^{\text{nd}}2 start_POSTSUPERSCRIPT nd end_POSTSUPERSCRIPT attention head, correctly classified SR
Figure 5: Attention heatmaps from the ViT model using 2 attention heads, based on correctly classified SR cases. The blue line represents the average signal amplitude, with the gray region corresponding to ±plus-or-minus\pm± 1 standard deviation. The y𝑦yitalic_y-axis represents the amplitude in μ𝜇\muitalic_μV, while the x𝑥xitalic_x-axis represents the index.

Figure 6 depicts the averaged and scaled Grad-CAM heatmaps from the ResNet model for correctly classified AFIB, SB and SR cases, respectively. Similar to the attention maps, the Grad-CAM maps emphasize the P-waves and T-waves for the SB and SR cases, and the region where P-waves would have been expected for the AFIB cases. The results obtained from the ViT attention maps and ResNet Grad-CAM maps are consistent with each other. However, the ViT model is computationally much faster. For example, one iteration of model fitting and inference, followed by the attention or Grad-CAM calculation on the test data, took 142 seconds for the ViT model and 1126 seconds for the ResNet model on an AWS ml.g5.24xlarge NVIDIA GPU instance. The CNN–LSTM model had a much longer run time of 14561 seconds for one iteration of model fitting and inference on the same instance. The ViT model is more readily parallelizable due to its multihead attention architecture, which provides a computational benefit relative to ResNet or sequential deep learning architectures, especially for larger datasets [16]. On account of this computational advantage, the ViT approach has the potential to provide accelerated inference, either on the cloud, or real-time on edge devices [32] for applications to wearable smart devices.

We also examined the attention heatmaps and Grad-CAM maps for the misclassified cases (not shown here). The misclassified cases appear much noisier, which may have resulted in the incorrect identification of spurious P-waves and T-waves.

Refer to caption
((a)) Correctly classified AFIB
Refer to caption
((b)) Correctly classified SB
Refer to caption
((c)) Correctly classified SR
Figure 6: Grad-CAM heatmaps from the ResNet model, based on correctly classified AFIB, SB and SR cases. The blue line represents the average signal amplitude, with the gray region corresponding to ±plus-or-minus\pm± 1 standard deviation. The y𝑦yitalic_y-axis represents the amplitude in μ𝜇\muitalic_μV, while the x𝑥xitalic_x-axis represents the index.

These results demonstrate the potential for automated classification of heart conditions by deep learning models, based on the identification of waveforms in the heartbeats, in addition to the heartbeat lengths and amplitudes. Furthermore, the ViT and ResNet approaches enable the identification of the key regions of the heartbeat responsible for the classification result, thereby rendering the results more easily explainable.

IV CONCLUSIONS

This paper aims at develo** a reliable and interpretable approach for AFIB detection using short single-lead ECG signals. For this purpose, we developed ViT and ResNet models for the classification of AFIB, SB and SR cases using RRR-segmented lead II ECG signals from the Chapman–Shaoxing dataset. The ViT model was evaluated versus the ResNet model. The ResNet model achieved an overall accuracy of over 96%, whereas the ViT model provided an accuracy in the range 92–93%. This is expected, since the ViT model was developed for much larger datasets in the size range of 10s of millions [16]. We are currently addressing this issue by identifying and analyzing larger datasets to improve the performance of our ViT approach.

The attention maps and Grad-CAM maps derived from the ViT and ResNet models illustrate the regions of the heartbeats that govern the resulting classification. The heatmaps emphasize the role played by P-waves and T-waves, in addition to other factors including the segment lengths and amplitudes, in distinguishing between AFIB, SB and SR cases.

The explainable deep learning models explored in this work facilitate the detection of atrial fibrillation, as well as other arrhythmias, from single-lead ECG data, while highlighting the regions of the heartbeat that determine the diagnosis. These models can potentially be employed in conjunction with wearable ECG devices for remote monitoring of patients to facilitate early detection and intervention. Further work will include the use of larger datasets, particularly for the ViT model, which relies on the availability of large amounts of data. In addition, the application of explainable deep learning models to other heart conditions will be explored in the future.

References

  • [1] R. Gopinathannair, L. Y. Chen, M. K. Chung, W. K. Cornwell et al., “Managing atrial fibrillation in patients with heart failure and reduced ejection fraction: A scientific statement from the American Heart Association,” Circulation: Arrhythmia and Electrophysiology, vol. 14, no. 7, 07 2021.
  • [2] M. K. Chung, L. L. Eckhardt, L. Y. Chen, H. M. Ahmed et al., “Lifestyle and risk factor modification for reduction of atrial fibrillation: A scientific statement from the American Heart Association,” Circulation, vol. 141, no. 16, 04 2020.
  • [3] Z. I. Attia, S. Kapa, F. López-Jiménez, P. M. McKie et al., “Screening for cardiac contractile dysfunction using an artificial intelligence–enabled electrocardiogram,” Nature Medicine, vol. 25, no. 1, pp. 70–74, 01 2019.
  • [4] A. Vaid, J. X. Jiang, A. Sawant, S. Lerakis et al., “A foundational vision transformer improves diagnostic performance for electrocardiograms,” npj Digital Medicine, vol. 6, no. 1, 06 2023.
  • [5] Ö. Yıldırım, M. Talo, E. J. Ciaccio, R. S. Tan et al., “Accurate deep neural network model to detect cardiac arrhythmia on more than 10,000 individual subject ECG records,” Computer Methods and Programs in Biomedicine, vol. 197, pp. 105 740–105 740, 12 2020.
  • [6] O. Faust, M. Kareem, A. Ali, E. J. Ciaccio et al., “Automated arrhythmia detection based on RR intervals,” Diagnostics, vol. 11, no. 8, pp. 1446–1446, 08 2021.
  • [7] B. Tutuko, S. Nurmaini, A. E. Tondas, M. N. Rachmatullah et al., “AFibNet: an implementation of atrial fibrillation detection with convolutional neural network,” BMC Medical Informatics and Decision Making, vol. 21, no. 1, 07 2021.
  • [8] B. He, A. C. Kwan, J. H. Cho, N. Yuan et al., “Blinded, randomized trial of sonographer versus AI cardiac function assessment,” Nature, vol. 616, no. 7957, pp. 520–524, 04 2023.
  • [9] J. Kwon, Y.-Y. Jo, S.-Y. Lee, and K.-H. Kim, “Artificial intelligence using electrocardiography: strengths and pitfalls,” European Heart Journal, vol. 42, no. 30, pp. 2896–2898, 03 2021.
  • [10] M. Reyes, R. Meier, S. Pereira, C. A. Silva et al., “On the interpretability of artificial intelligence in radiology: Challenges and opportunities,” Radiology, vol. 2, no. 3, pp. e190 043–e190 043, 05 2020.
  • [11] Y. Ayano, F. Schwenker, B. Dufera, and T. Debelee, “Interpretable machine learning techniques in ECG-based heart disease classification: A systematic review.” Diagnostics, vol. 13, no. 1, 2022.
  • [12] S. Mousavi, F. Afghah, and U. R. Acharya, “HAN-ECG: An interpretable atrial fibrillation detection model using hierarchical attention networks,” Computers in Biology and Medicine, vol. 127, p. 104057, 2020.
  • [13] Z. Attia, D. Harmon, E. R. Behr, and P. A. Friedman, “Application of artificial intelligence to the electrocardiogram,” European Heart Journal, vol. 42, no. 46, pp. 4717–4730, 09 2021.
  • [14] J. Belen, S. Mousavi, A. Shamsoshoara, and F. Afghah, “An uncertainty estimation framework for risk assessment in deep learning-based AFib classification,” in 2020 54th Asilomar Conference on Signals, Systems, and Computers, 2020, pp. 960–964.
  • [15] J. Zheng, J. Zhang, S. Danioko, H. Yao et al., “A 12-lead electrocardiogram database for arrhythmia research covering more than 10,000 patients,” Scientific Data, vol. 7, no. 48, 02 2020.
  • [16] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [18] S. Douedi, A. Mararenko, A. Alshami, M. Al-Azzawi et al., “COVID-19 induced bradyarrhythmia and relative bradycardia: An overview,” Journal of Arrhythmia, vol. 37, no. 4, pp. 888–892, 2021.
  • [19] E. A. Amaratunga, D. S. Corwin, L. Moran, and R. Snyder, “Bradycardia in patients with COVID-19: A calm before the storm?” Cureus, vol. 12, no. 6, 2020.
  • [20] S. Aziz, S. Ahmed, and M. Alouini, “ECG-based machine-learning algorithms for heartbeat classification,” Scientific Reports, vol. 11, no. 1, 09 2021.
  • [21] M. Strik, S. Ploux, D. J. Weigel, J. van der Zande et al., “The use of smartwatch electrocardiogram beyond arrhythmia detection,” Trends in Cardiovascular Medicine, 01 2023.
  • [22] M. Porumb, E. Iadanza, S. Massaro, and L. Pecchia, “A convolutional neural network approach to detect congestive heart failure,” Biomedical Signal Processing and Control, vol. 55, p. 101597, 2020.
  • [23] D. Makowski, T. Pham, Z. J. Lau, J. C. Brammer et al., “NeuroKit2: A python toolbox for neurophysiological signal processing,” Behavior Research Methods, vol. 53, no. 4, pp. 1689–1696, 02 2021.
  • [24] J. Zheng, H. Chu, D. Struppa, J. Zhang et al., “Optimal multi-stage arrhythmia classification approach,” Scientific Reports, vol. 10, no. 2898, 02 2020.
  • [25] K. K. Proniewska, R. Abächerli, and P. M. van Dam, “The ΔΔ\Deltaroman_Δ-wave ECG: The differences to the normal 12-lead ECG amplitudes,” Journal of Electrocardiology, vol. 76, pp. 45–54, 2023.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30.   Curran Associates, Inc., 2017.
  • [27] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam et al., “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626.
  • [28] M. Abadi, A. Agarwal, P. Barham, E. Brevdo et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
  • [29] S. Mousavi and F. Afghah, “Inter- and intra- patient ECG heartbeat classification for arrhythmia detection: A sequence to sequence deep learning approach,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 1308–1312.
  • [30] T. O’Malley, E. Bursztein, J. Long, F. Chollet et al., “Keras tuner,” 2019. [Online]. Available: https://github.com/keras-team/keras-tuner
  • [31] T. Hong, B. George, M. Hashim, and D. E. Feierman, “Preoperative atrial fibrillation and T-wave inversion associated with postoperative pulmonary edema,” Open Journal of Anesthesiology, vol. 6, no. 10, pp. 181–185, 2016.
  • [32] M. Li, W. Zhang, and D. Xia, “Transformer inference acceleration in edge computing environment,” in 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW), 2023, pp. 104–109.