-
Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training
Authors:
Muhammad Adnan,
Amar Phanishayee,
Janardhan Kulkarni,
Prashant J. Nair,
Divya Mahajan
Abstract:
In this paper, we present a novel technique to search for hardware architectures of accelerators optimized for end-to-end training of deep neural networks (DNNs). Our approach addresses both single-device and distributed pipeline and tensor model parallel scenarios, the latter being addressed for the first time. The search optimizes accelerators for training-relevant metrics such as throughput/TDP under fixed area and power constraints. However, with the proliferation of specialized architectures and complex distributed training mechanisms, the design space of hardware accelerators is very large. Prior work in this space has tried to tackle this by reducing the search space to either single-accelerator execution, and then only for inference, or tuning the architecture for specific layers (e.g., convolution). Instead, we take a unique heuristic-driven, critical-path-based approach to determine the best use of available resources (power and area) either for a set of DNN workloads or each workload individually. First, we perform local search to determine the architecture for each pipeline and tensor model stage. Specifically, the system iteratively generates architectural configurations and tunes the design using a novel heuristic-based approach that prioritizes accelerator resources and scheduling to critical operators in a machine learning workload. Second, to address the complexities of distributed training, the local search selects multiple (k) designs per stage. A global search then identifies an accelerator from the top-k sets to optimize training throughput across the stages. We evaluate this work on 11 different DNN models. Compared to a recent inference-only work Spotlight, our method converges to a design in, on average, 31x less time and offers 12x higher throughput. Moreover, designs generated using our method achieve 12% throughput improvement over TPU architecture.
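The two-level search described in this abstract (a local top-k selection per stage, then a global choice across stages) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the design names, the throughput numbers, the function names, and the rule that pipeline throughput is bounded by the slowest stage. It is not the authors' implementation.

```python
from typing import Dict, List


def local_top_k(stage_scores: Dict[str, float], k: int) -> List[str]:
    """Return the k designs with the highest score for a single stage."""
    return sorted(stage_scores, key=stage_scores.get, reverse=True)[:k]


def global_search(throughput: Dict[str, List[float]], k: int) -> str:
    """Pick one design from the union of the per-stage top-k sets.

    Assumption: end-to-end throughput is limited by the slowest stage,
    so each candidate is scored by its minimum per-stage throughput.
    """
    num_stages = len(next(iter(throughput.values())))
    candidates = set()
    for stage in range(num_stages):
        stage_scores = {design: t[stage] for design, t in throughput.items()}
        candidates.update(local_top_k(stage_scores, k))
    # Sort first so that ties are broken deterministically by name.
    return max(sorted(candidates), key=lambda design: min(throughput[design]))


# Hypothetical per-stage throughput (samples/s) for four candidate designs.
throughput = {
    "design_a": [120.0, 40.0, 90.0],
    "design_b": [80.0, 75.0, 85.0],
    "design_c": [60.0, 95.0, 50.0],
    "design_d": [30.0, 20.0, 25.0],
}
best = global_search(throughput, k=2)
print(best)
```

In this toy data, the design that wins any single stage is not the one chosen globally, which is the reason for keeping several candidates per stage rather than only the local best.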
Submitted 22 April, 2024;
originally announced April 2024.
-
Structured Model Pruning for Efficient Inference in Computational Pathology
Authors:
Mohammed Adnan,
Qinle Ba,
Nazim Shaikh,
Shivam Kalra,
Satarupa Mukherjee,
Auranuch Lorsakul
Abstract:
Recent years have seen significant efforts to adopt Artificial Intelligence (AI) in healthcare for various use cases, from computer-aided diagnosis to ICU triage. However, the size of AI models has been rapidly growing due to scaling laws and the success of foundational models, which makes it increasingly challenging to leverage advanced models in practical applications. It is thus imperative to develop efficient models, especially for deploying AI solutions under resource constraints or with time sensitivity. One potential solution is to perform model compression, a set of techniques that remove less important model components or reduce parameter precision, to reduce model computation demand. In this work, we demonstrate that model pruning, as a model compression technique, can effectively reduce inference cost for computational and digital pathology-based analysis with a negligible loss of analysis performance. To this end, we develop a methodology for pruning the widely used U-Net-style architectures in biomedical imaging, with which we evaluate multiple pruning heuristics on nuclei instance segmentation and classification, and empirically demonstrate that pruning can compress models by at least 70% with a negligible drop in performance.
Submitted 12 April, 2024;
originally announced April 2024.
-
Accelerating Recommender Model Training by Dynamically Skipping Stale Embeddings
Authors:
Yassaman Ebrahimzadeh Maboud,
Muhammad Adnan,
Divya Mahajan,
Prashant J. Nair
Abstract:
Training recommendation models poses significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variation, resulting in saturation. Consequently, updates to these embeddings lack any contribution to model quality. This paper presents Slipstream, a software framework that identifies stale embeddings on the fly and skips their updates to enhance performance. This capability enables Slipstream to achieve substantial speedup, optimize CPU-GPU bandwidth usage, and eliminate unnecessary memory access. Slipstream showcases training time reductions of 2x, 2.4x, 1.2x, and 1.175x across real-world datasets and configurations, compared to Baseline XDL, Intel-optimized DLRM, FAE, and Hotline, respectively.
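The idea of skipping updates to saturated embeddings can be sketched as follows. The abstract does not state how Slipstream detects staleness, so the criterion used here (update magnitude below a threshold for several consecutive steps) and every name in the code (`StaleSkipper`, `threshold`, `patience`) are hypothetical stand-ins, not the framework's actual mechanism.

```python
from typing import Dict, List, Set


class StaleSkipper:
    """Marks embedding rows as stale after repeated tiny updates (assumed rule)."""

    def __init__(self, threshold: float, patience: int) -> None:
        self.threshold = threshold  # largest step still counted as "tiny"
        self.patience = patience    # consecutive tiny steps before a row is stale
        self.small_steps: Dict[int, int] = {}
        self.stale: Set[int] = set()

    def apply_update(self, table: Dict[int, List[float]], row: int,
                     grad: List[float], lr: float) -> bool:
        """Apply one SGD step to `row`; return False if the update was skipped."""
        if row in self.stale:
            return False  # no memory access and no transfer for this row
        step = [lr * g for g in grad]
        table[row] = [w - s for w, s in zip(table[row], step)]
        if max(abs(s) for s in step) < self.threshold:
            self.small_steps[row] = self.small_steps.get(row, 0) + 1
            if self.small_steps[row] >= self.patience:
                self.stale.add(row)
        else:
            self.small_steps[row] = 0  # a large step resets the streak
        return True


table = {7: [1.0, 2.0]}
skipper = StaleSkipper(threshold=1e-3, patience=2)
results = [skipper.apply_update(table, 7, [1e-4, 1e-4], lr=0.1) for _ in range(3)]
print(results)
```

With these illustrative settings, the first two updates are applied and the third is skipped once the row has been marked stale.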
Submitted 21 March, 2024;
originally announced April 2024.
-
Cross-layer Modeling and Design of Content Addressable Memories in Advanced Technology Nodes for Similarity Search
Authors:
Siri Narla,
Piyush Kumar,
Mohammad Adnaan,
Azad Naeemi
Abstract:
In this paper, we present a comprehensive design and benchmarking study of Content Addressable Memory (CAM) at the 7nm technology node in the context of similarity search applications. We design CAM cells based on SRAM, spin-orbit torque, and ferroelectric field effect transistor devices and, from their layouts, extract cell parasitics using state-of-the-art EDA tools. These parasitics are used to develop SPICE netlists to model search operations. We use a CAM-based dataset search and a sequential recommendation system to highlight the application-level performance degradation due to interconnect parasitics. We propose and evaluate two solutions to mitigate interconnect effects.
Submitted 22 March, 2024;
originally announced March 2024.
-
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
Authors:
Muhammad Adnan,
Akhil Arunkumar,
Gaurav Jain,
Prashant J. Nair,
Ilya Soloveychik,
Purushotham Kamath
Abstract:
Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs.
This paper introduces "Keyformer", an innovative inference-time approach that mitigates the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. By shrinking the KV cache, Keyformer reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.
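A rough sketch of retaining only high-attention tokens in a KV cache is given below. The abstract mentions a novel score function without defining it, so this example substitutes plain accumulated attention weight as the score. The recent-token window, the function name, and the parameters `budget` and `recent` are likewise assumptions made for illustration rather than the paper's method.

```python
from typing import List


def select_kept_tokens(attn_rows: List[List[float]], budget: int,
                       recent: int) -> List[int]:
    """Choose which cached token positions to keep.

    attn_rows[t][i] is the attention weight that decoding step t placed on
    cached token i (all rows are assumed to have the same length).
    The `recent` newest tokens are always kept; the remaining slots of
    `budget` go to the older tokens with the highest accumulated attention.
    """
    n = len(attn_rows[0])
    scores = [sum(row[i] for row in attn_rows) for i in range(n)]
    recent_ids = list(range(max(0, n - recent), n))
    older = [i for i in range(n) if i not in recent_ids]
    num_key = max(0, budget - len(recent_ids))
    key_ids = sorted(older, key=lambda i: scores[i], reverse=True)[:num_key]
    return sorted(key_ids + recent_ids)


# Two decoding steps attending over five cached tokens (made-up weights).
attn_rows = [
    [0.5, 0.05, 0.05, 0.3, 0.1],
    [0.4, 0.05, 0.05, 0.1, 0.4],
]
kept = select_kept_tokens(attn_rows, budget=3, recent=2)
print(kept)
```

Only the keys and values at the returned positions would stay in the cache, so cache size is capped by `budget` regardless of context length.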
Submitted 5 April, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Optimal EEG Electrode Set for Emotion Recognition From Brain Signals: An Empirical Quest
Authors:
Rumman Ahmed Prodhan,
Sumya Akter,
Tanmoy Sarkar Pias,
Md. Akhtaruzzaman Adnan
Abstract:
The human brain is a complex organ, still not completely understood, that controls almost all the parts of the body. Apart from survival, the human brain stimulates emotions. Recent research indicates that brain signals can be very effective for emotion recognition. However, which parts of the brain exhibit most of the emotions is still under-explored. In this study, we empirically analyze the contribution of each part of the brain in exhibiting emotions. We use the DEAP dataset to find the optimal electrode set, which in turn points to the brain regions most associated with emotions. We use Fast Fourier Transformation for effective feature extraction and a 1D-CNN with residual connection for classification. Though all 32 electrodes from the DEAP dataset achieve an accuracy of 97.34%, only 12 electrodes (F7, P8, O1, F8, C4, T7, PO3, Fp1, Fp2, O2, P3, and Fz) achieve 95.81% accuracy. This study also shows that adding more than 10 electrodes does not improve performance significantly. Moreover, the frontal lobe is the most important for recognizing emotion.
Submitted 28 November, 2023;
originally announced November 2023.
-
Attention-Driven Multi-Modal Fusion: Enhancing Sign Language Recognition and Translation
Authors:
Zaber Ibn Abdul Hakim,
Rasman Mubtasim Swargo,
Muhammad Abdullah Adnan
Abstract:
In this paper, we devise a mechanism for adding multi-modal information to an existing pipeline for continuous sign language recognition and translation. In our procedure, we incorporate optical flow information with RGB images to enrich the features with movement-related information. This work studies the feasibility of such modality inclusion using a cross-modal encoder. The plugin we use is very lightweight and does not require a separate feature extractor for the new modality in an end-to-end manner. We apply the changes to both sign language recognition and translation, improving the result in each case. We evaluate the performance on the RWTH-PHOENIX-2014 dataset for sign language recognition and the RWTH-PHOENIX-2014T dataset for translation. On the recognition task, our approach reduced the WER by 0.9, and on the translation task, our approach increased most of the BLEU scores by ~0.6 on the test set.
Submitted 6 December, 2023; v1 submitted 4 September, 2023;
originally announced September 2023.
-
Ad-Rec: Advanced Feature Interactions to Address Covariate-Shifts in Recommendation Networks
Authors:
Muhammad Adnan,
Yassaman Ebrahimzadeh Maboud,
Divya Mahajan,
Prashant J. Nair
Abstract:
Recommendation models are vital in delivering personalized user experiences by leveraging the correlation between multiple input features. However, deep learning-based recommendation models often face challenges due to evolving user behaviour and item features, leading to covariate shifts. Effective cross-feature learning is crucial for handling data distribution drift and adapting to changing user behaviour. Traditional feature interaction techniques have limitations in achieving optimal performance in this context.
This work introduces Ad-Rec, an advanced network that leverages feature interaction techniques to address covariate shifts. This helps eliminate irrelevant interactions in recommendation tasks. Ad-Rec leverages masked transformers to enable the learning of higher-order cross-features while mitigating the impact of data distribution drift. Our approach improves model quality, accelerates convergence, and reduces training time, as measured by the Area Under Curve (AUC) metric. We demonstrate the scalability of Ad-Rec and its ability to achieve superior model quality through comprehensive ablation studies.
Submitted 28 August, 2023;
originally announced August 2023.
-
Go Together: Bridging the Gap between Learners and Teachers
Authors:
Asim Irfan,
Atif Nawaz,
Muhammad Turab,
Muhmmad Azeem,
Mashal Adnan,
Ahsan Mehmood,
Sarfaraz Ahmed,
Adnan Ashraf
Abstract:
After the pandemic, humanity has been facing different types of challenges. Social relationships, societal values, and academic and professional behavior have been hit the most. People are shifting their routines to social media and gadgets, and getting addicted to their isolation. This sudden change in their lives has caused an unusual social breakdown and endangered their mental health. In mid-2021, Pakistan's first Human Library was established under HelpingMind to overcome these effects. Despite online sessions and webinars, HelpingMind needs technology to reach the masses. In this work, we customized the UI or UX of a Go Together Mobile Application (GTMA) to meet the requirements of the client organization. GTMA introduces the concept of a human book (an expert listener or psychologist) and a human reader. It offers separate dashboards, separate review or rating systems, booking, and venue information to engage the human reader with his or her favorite human book. The loyalty program enables members to avail discounts through the mobile application, and membership is global: both human readers and human books can register on the platform. The minimum viable product has been approved by our client organization.
Submitted 23 July, 2023;
originally announced August 2023.
-
Design and Development of a Java Parallel I/O Library
Authors:
Muhammad Sohaib Ayub,
Muhammad Adnan,
Muhammad Yasir Shafi
Abstract:
Parallel I/O refers to the ability of scientific programs to concurrently read/write from/to a single file from multiple processes executing on distributed memory platforms like compute clusters. In the HPC world, I/O becomes a significant bottleneck for many real-world scientific applications. In the last two decades, there has been significant research in improving the performance of I/O operations in scientific computing for traditional languages including C, C++, and Fortran. As a result, several mature and high-performance libraries including ROMIO (implementation of MPI-IO), parallel HDF5, Parallel I/O (PIO), and parallel netCDF are available today that provide efficient I/O for scientific applications. However, very little research has been done to evaluate and improve the I/O performance of Java-based HPC applications. The main hindrance in the development of efficient parallel I/O Java libraries is the lack of a standard API (something equivalent to MPI-IO). Some ad hoc solutions have been developed and used in proprietary applications, but there is no general-purpose solution that can be used by performance-hungry applications. As part of this project, we plan to develop a Java-based parallel I/O API inspired by the MPI-IO bindings (MPI 2.0 standard document) for C, C++, and Fortran. Once the Java equivalent API of MPI-IO has been developed, we will develop a reference implementation on top of existing Java messaging libraries. Later, we will evaluate and compare the performance of our reference Java Parallel I/O library with C/C++ counterparts using benchmarks and real-world applications.
Submitted 12 May, 2023;
originally announced May 2023.
-
A Secure Healthcare 5.0 System Based on Blockchain Technology Entangled with Federated Learning Technique
Authors:
Abdur Rehman,
Sagheer Abbas,
M. A. Khan,
Taher M. Ghazal,
Khan Muhammad Adnan,
Amir Mosavi
Abstract:
In recent years, the global Internet of Medical Things (IoMT) industry has evolved at a tremendous speed. Security and privacy are key concerns on the IoMT, owing to the huge scale and deployment of IoMT networks. Machine learning (ML) and blockchain (BC) technologies have significantly enhanced the capabilities and facilities of healthcare 5.0, spawning a new area known as "Smart Healthcare." By identifying concerns early, a smart healthcare system can help avoid long-term damage. This will enhance the quality of life for patients while reducing their stress and healthcare costs. The IoMT enables a range of functionalities in the field of information technology, one of which is smart and interactive health care. However, combining medical data into a single storage location to train a powerful machine learning model raises concerns about privacy, ownership, and compliance with greater concentration. Federated learning (FL) overcomes the preceding difficulties by utilizing a centralized aggregate server to disseminate a global learning model. Simultaneously, the local participant keeps control of patient information, assuring data confidentiality and security. This article conducts a comprehensive analysis of the findings on blockchain technology entangled with federated learning in healthcare 5.0. The purpose of this study is to construct a secure health monitoring system in healthcare 5.0 by utilizing blockchain technology and an Intrusion Detection System (IDS) to detect any malicious activity in a healthcare network and to enable physicians to monitor patients through medical sensors and take necessary measures periodically by predicting diseases.
Submitted 16 September, 2022;
originally announced September 2022.
-
Traffic Congestion Prediction using Deep Convolutional Neural Networks: A Color-coding Approach
Authors:
Mirza Fuad Adnan,
Nadim Ahmed,
Imrez Ishraque,
Md. Sifath Al Amin,
Md. Sumit Hasan
Abstract:
Traffic video data has become a critical factor in determining the state of traffic congestion due to the recent advancements in computer vision. This work proposes a unique technique for traffic video classification using a color-coding scheme before training the traffic data in a deep convolutional neural network. First, the video data is transformed into an imagery dataset; then, vehicle detection is performed using the You Only Look Once algorithm. A color-coded scheme has been adopted to transform the imagery dataset into a binary image dataset. These binary images are fed to a deep convolutional neural network. Using the UCSD dataset, we have obtained a classification accuracy of 98.2%.
Submitted 16 September, 2022;
originally announced September 2022.
-
Bayesian Hyperparameter Optimization for Deep Neural Network-Based Network Intrusion Detection
Authors:
Mohammad Masum,
Hossain Shahriar,
Hisham Haddad,
Md Jobair Hossain Faruk,
Maria Valero,
Md Abdullah Khan,
Mohammad A. Rahman,
Muhaiminul I. Adnan,
Alfredo Cuzzocrea
Abstract:
Traditional network intrusion detection approaches encounter feasibility and sustainability issues in combating modern, sophisticated, and unpredictable security attacks. Deep neural networks (DNN) have been successfully applied for intrusion detection problems. The optimal use of DNN-based classifiers requires careful tuning of the hyperparameters. Manually tuning the hyperparameters is tedious, time-consuming, and computationally expensive. Hence, there is a need for an automatic technique to find optimal hyperparameters for the best use of DNN in intrusion detection. This paper proposes a novel Bayesian optimization-based framework for the automatic optimization of hyperparameters, ensuring the best DNN architecture. We evaluated the performance of the proposed framework on NSL-KDD, a benchmark dataset for network intrusion detection. The experimental results show the framework's effectiveness, as the resultant DNN architecture demonstrates significantly higher intrusion detection performance than the random search optimization-based approach in terms of accuracy, precision, recall, and F1-score.
Submitted 7 July, 2022;
originally announced July 2022.
-
Ransomware Classification and Detection With Machine Learning Algorithms
Authors:
Mohammad Masum,
Md Jobair Hossain Faruk,
Hossain Shahriar,
Kai Qian,
Dan Lo,
Muhaiminul Islam Adnan
Abstract:
Malicious attacks, malware, and ransomware families pose critical security issues to cybersecurity, and may cause catastrophic damage to computer systems, data centers, web, and mobile applications across various industries and businesses. Traditional anti-ransomware systems struggle to fight against newly created sophisticated attacks. Therefore, state-of-the-art techniques like traditional and neural network-based architectures can be immensely utilized in the development of innovative ransomware solutions. In this paper, we present a feature selection-based framework that adopts different machine learning algorithms, including neural network-based architectures, to classify the security level for ransomware detection and prevention. We applied multiple machine learning algorithms: Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR) as well as Neural Network (NN)-based classifiers on a selected number of features for ransomware classification. We performed all the experiments on one ransomware dataset to evaluate our proposed framework. The experimental results demonstrate that RF classifiers outperform other methods in terms of accuracy, F-beta, and precision scores.
Submitted 2 July, 2022;
originally announced July 2022.
-
Monitoring Shortcut Learning using Mutual Information
Authors:
Mohammed Adnan,
Yani Ioannou,
Chuan-Yung Tsai,
Angus Galloway,
H. R. Tizhoosh,
Graham W. Taylor
Abstract:
The failure of deep neural networks to generalize to out-of-distribution data is a well-known problem and raises concerns about the deployment of trained networks in safety-critical domains such as healthcare, finance and autonomous vehicles. We study a particular kind of distribution shift – shortcuts or spurious correlations in the training data. Shortcut learning is often only exposed when models are evaluated on real-world data that does not contain the same spurious correlations, posing a serious dilemma for AI practitioners to properly assess the effectiveness of a trained model for real-world applications. In this work, we propose to use the mutual information (MI) between the learned representation and the input as a metric to find where in training the network latches onto shortcuts. Experiments demonstrate that MI can be used as a domain-agnostic metric for monitoring shortcut learning.
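For readers unfamiliar with the quantity being monitored, the sketch below computes a plug-in estimate of mutual information between two discrete sequences, using I(X;Z) = sum over (x, z) of p(x, z) log2[p(x, z) / (p(x) p(z))]. The abstract does not say which estimator is applied to learned representations, so this discrete version and the function name `mutual_information` are assumptions chosen only to make the definition concrete.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], zs: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X; Z) in bits from paired discrete samples."""
    if len(xs) != len(zs):
        raise ValueError("xs and zs must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, zs))
    count_x = Counter(xs)
    count_z = Counter(zs)
    mi = 0.0
    for (x, z), c in joint.items():
        # p(x,z) / (p(x) p(z)) simplifies to c * n / (count_x * count_z).
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_z[z]))
    return mi


# A representation that copies the input carries 1 bit about a fair binary input.
dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# A representation that ignores the input carries no information about it.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(dependent, independent)
```

Tracking such a value across training checkpoints is one way to picture the monitoring described above, under the stated assumption that representations have been discretized first.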
Submitted 26 June, 2022;
originally announced June 2022.
-
Heterogeneous Acceleration Pipeline for Recommendation System Training
Authors:
Muhammad Adnan,
Yassaman Ebrahimzadeh Maboud,
Divya Mahajan,
Prashant J. Nair
Abstract:
Recommendation models rely on deep learning networks and large embedding tables, resulting in computationally and memory-intensive processes. These models are typically trained using hybrid CPU-GPU or GPU-only configurations. The hybrid mode combines the GPU's neural network acceleration with the CPUs' memory storage and supply for embedding tables but may incur significant CPU-to-GPU transfer time. In contrast, the GPU-only mode utilizes High Bandwidth Memory (HBM) across multiple GPUs for storing embedding tables. However, this approach is expensive and presents scaling concerns.
This paper introduces Hotline, a heterogeneous acceleration pipeline that addresses these concerns. Hotline develops a data-aware and model-aware scheduling pipeline by leveraging the insight that only a few embedding entries are frequently accessed (popular). This approach utilizes CPU main memory for non-popular embeddings and GPUs' HBM for popular embeddings. To achieve this, the Hotline accelerator fragments a mini-batch into popular and non-popular micro-batches. It gathers the necessary working parameters for non-popular micro-batches from the CPU, while GPUs execute popular micro-batches. The hardware accelerator dynamically coordinates the execution of popular embeddings on GPUs and non-popular embeddings from the CPU's main memory. Real-world datasets and models confirm Hotline's effectiveness, reducing average end-to-end training time by 2.2x compared to an Intel-optimized CPU-GPU DLRM baseline.
Submitted 28 April, 2024; v1 submitted 11 April, 2022;
originally announced April 2022.
-
A Systematic Study and Analysis of Bengali Folklore with Natural Language Processing Systems
Authors:
Mustain Billah,
Md. Mynoddin,
Mostafijur Rahman Akhond,
Md. Nasim Adnan,
Syed Md. Galib,
Rizwanur Rahad,
M Nurujjaman Khan
Abstract:
Folklore, a solid branch of folk literature, is the hallmark of any nation or any society. Beyond oral traditions such as proverbs and jokes, it also includes material culture as well as traditional folk beliefs and various customs. Bengali folklore is as rich in depth as it is amazing. Nevertheless, in the womb of time, it is determined to sustain its existence. Therefore, our aim in this study is to make our rich folklore more comprehensible to everyone in a more sophisticated computational way. Some studies have addressed various aspects of the Bengali language with NLP. Our proposed model is specific to Bengali folklore. Technically, it will be the first step towards Bengali natural language processing for studying and analyzing the folklore of Bengal.
Submitted 13 March, 2022;
originally announced March 2022.
-
Domain-Agnostic Clustering with Self-Distillation
Authors:
Mohammed Adnan,
Yani A. Ioannou,
Chuan-Yung Tsai,
Graham W. Taylor
Abstract:
Recent advancements in self-supervised learning have reduced the gap between supervised and unsupervised representation learning. However, most self-supervised and deep clustering techniques rely heavily on data augmentation, rendering them ineffective for many learning tasks where insufficient domain knowledge exists for performing augmentation. We propose a new self-distillation-based algorithm for domain-agnostic clustering. Our method builds upon the existing deep clustering frameworks and requires no separate student model. The proposed method outperforms existing domain-agnostic (augmentation-free) algorithms on CIFAR-10. We empirically demonstrate that knowledge distillation can improve unsupervised representation learning by extracting richer 'dark knowledge' from the model than using predicted labels alone. Preliminary experiments also suggest that self-distillation improves the convergence of DeepCluster-v2.
Submitted 20 December, 2021; v1 submitted 23 November, 2021;
originally announced November 2021.
-
Pay Attention with Focus: A Novel Learning Scheme for Classification of Whole Slide Images
Authors:
Shivam Kalra,
Mohammed Adnan,
Sobhan Hemati,
Taher Dehkharghanian,
Shahryar Rahnamayan,
Hamid Tizhoosh
Abstract:
Deep learning methods such as convolutional neural networks (CNNs) are difficult to apply directly to whole slide images (WSIs) due to the large image dimensions. We overcome this limitation by proposing a novel two-stage approach. First, we extract a set of representative patches (called a mosaic) from a WSI. Each patch of a mosaic is encoded to a feature vector using a deep network. The feature extractor model is fine-tuned using hierarchical target labels of WSIs, i.e., anatomic site and primary diagnosis. In the second stage, a set of encoded patch-level features from a WSI is used to compute the primary diagnosis probability through the proposed Pay Attention with Focus scheme, an attention-weighted averaging of predicted probabilities for all patches of a mosaic modulated by a trainable focal factor. Experimental results show that the proposed model can be robust and effective for the classification of WSIs.
Submitted 11 June, 2021;
originally announced June 2021.
-
A Bagging and Boosting Based Convexly Combined Optimum Mixture Probabilistic Model
Authors:
Mian Arif Shams Adnan,
H. M. Miraz Mahmud
Abstract:
Unlike previous studies on mixture distributions, this work proposes a bagging and boosting based convexly combined mixture probabilistic model. The model results from an iterative search for the optimum probabilistic model, the one that provides the maximum p-value.
Submitted 8 June, 2021;
originally announced June 2021.
-
Accelerating Recommendation System Training by Leveraging Popular Choices
Authors:
Muhammad Adnan,
Yassaman Ebrahimzadeh Maboud,
Divya Mahajan,
Prashant J. Nair
Abstract:
Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models requires ever-increasing data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper takes a deep dive into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000x more often. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding-aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the execution of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3x and 1.52x in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
Submitted 28 September, 2021; v1 submitted 28 February, 2021;
originally announced March 2021.
-
Representation Learning of Histopathology Images using Graph Neural Networks
Authors:
Mohammed Adnan,
Shivam Kalra,
Hamid R. Tizhoosh
Abstract:
Representation learning for Whole Slide Images (WSIs) is pivotal in developing image-based systems to achieve higher precision in diagnostic pathology. We propose a two-stage framework for WSI representation learning. We sample relevant patches using a color-based method and use graph neural networks to learn relations among sampled patches to aggregate the image information into a single vector representation. We introduce attention via graph pooling to automatically infer patches with higher relevance. We demonstrate the performance of our approach for discriminating two sub-types of lung cancer, Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LUSC). We collected 1,026 lung cancer WSIs at 40x magnification from The Cancer Genome Atlas (TCGA) dataset, the largest public repository of histopathology images, and achieved state-of-the-art accuracy of 88.8% and AUC of 0.89 on lung cancer sub-type classification by extracting features from a pre-trained DenseNet.
Submitted 17 April, 2020; v1 submitted 15 April, 2020;
originally announced April 2020.
-
Learning Permutation Invariant Representations using Memory Networks
Authors:
Shivam Kalra,
Mohammed Adnan,
Graham Taylor,
Hamid Tizhoosh
Abstract:
Many real-world tasks such as classification of digital histopathology images and 3D object detection involve learning from a set of instances. In these cases, only a group of instances or a set, collectively, contains meaningful information and therefore only the sets have labels, and not individual data instances. In this work, we present a permutation invariant neural network called Memory-based Exchangeable Model (MEM) for learning set functions. The MEM model consists of memory units that embed an input sequence to high-level features enabling the model to learn inter-dependencies among instances through a self-attention mechanism. We evaluated the learning ability of MEM on various toy datasets, point cloud classification, and classification of lung whole slide images (WSIs) into two subtypes of lung cancer: Lung Adenocarcinoma and Lung Squamous Cell Carcinoma. We systematically extracted patches from lung WSIs downloaded from The Cancer Genome Atlas (TCGA) dataset, the largest public repository of WSIs, achieving a competitive accuracy of 84.84% for classification of two sub-types of lung cancer. The results on other datasets are promising as well, and demonstrate the efficacy of our model.
Submitted 3 July, 2020; v1 submitted 18 November, 2019;
originally announced November 2019.
-
A Review on Cooperative Diversity Techniques Bypassing Channel Estimation
Authors:
Sylvia Ong Ai Ling,
Hushairi Zen,
Al-Khalid B Hj Othman,
Mahmood Adnan,
Olalekan Bello
Abstract:
Wireless communication technology has seen a remarkably fast evolution due to its capability to provide high-quality, reliable, and high-speed data transmission amongst users. However, transmission of information in wireless channels is primarily impaired by deleterious multipath fading, which affects the quality and reliability of the system. To overcome the detrimental effects of fading, Multiple-Input Multiple-Output (MIMO) technology is an attractive scheme that employs multiple transceiver antennas to carry the data over the same frequency band over a variety of signal paths. This technology has shown great promise due to its ability to provide better spectral efficiency, capacity, throughput, and robustness of the data transmission. In practice, however, it is impractical to install multiple antennas on small-sized devices. Hence, to overcome the limitations of MIMO gain in future wireless networks, cooperative diversity has recently drawn attention due to its ability to circumvent the difficulties of implementing actual antenna arrays in MIMO. By exploiting the broadcast feature of the wireless medium, cooperation among multiple nearby nodes is formed for data transmission. At the receiver, the signals are either coherently or differentially detected. Coherent detection requires exact channel estimation, which is difficult to apply in a time-varying channel. Hence, when the nodes are mobile, or when the channel is inaccurately estimated, differential detection techniques that omit channel estimation become an alternative to coherent detection. This article presents a review of the differential transmission techniques for cooperative diversity networks.
Submitted 28 November, 2017;
originally announced November 2017.
-
Efficient Kernel Fusion Techniques for Massive Video Data Analysis on GPGPUs
Authors:
Asif M Adnan,
Sridhar Radhakrishnan,
Suleyman Karabuk
Abstract:
Kernels are executable code segments, and kernel fusion is a technique for combining the segments in a coherent manner to improve execution time. For the first time, we have developed a technique to fuse image processing kernels to be executed on GPGPUs for improving execution time and total throughput (amount of data processed in unit time). We have applied our techniques to feature tracking on video images captured by a high-speed digital video camera where the number of frames captured varies between 600 and 1000 frames per second. Image processing kernels are composed of multiple simple kernels, which execute on the input image in a given sequence. A set of kernels that can be fused together forms a partition (or fused kernel). Given a set of kernels and the data dependencies between them, it is difficult to determine the partitions of kernels such that the total performance is maximized (execution time and throughput). We have developed and implemented an optimization model to find such a partition. We also developed an algorithm to fuse multiple kernels based on their data dependencies. Additionally, to further improve performance on GPGPU systems, we have provided methods to distribute data and threads to processors. Our model was able to reduce data traffic, which resulted in better performance. The performance (both execution time and throughput) of the proposed method for kernel fusing and its subsequent execution is shown to be 2 to 3 times higher than executing kernels in sequence. We have demonstrated our technique for facial feature tracking with applications to neuroscience.
Submitted 15 September, 2015;
originally announced September 2015.
-
Properties of Stochastic Kronecker Graph
Authors:
Ahmed Mehedi Nizam,
Md. Nasim Adnan,
Md. Rashedul Islam,
Mohammad Akbar Kabir
Abstract:
The stochastic Kronecker graph model can generate large random graphs that closely resemble many real-world networks. For example, the output graph has a heavy-tailed degree distribution, has a (low) diameter that effectively remains constant over time, and obeys the so-called densification power law [1]. Aside from this list of very important graph properties, one may ask for some additional information about the output graph: What will be the expected number of isolated vertices? How many edges and self-loops are there in the graph? What will be the expected number of triangles in a random realization? Here we try to answer the above questions. In the first phase, we bound the expected values of the aforementioned features from above. Next, we establish sufficient conditions to generate stochastic Kronecker graphs with a wide range of interesting properties. Finally, we show two phase transitions for the appearance of edges and self-loops in stochastic Kronecker graphs.
Submitted 4 October, 2012;
originally announced October 2012.
-
Design and implementation of a digital clock showing digits in Bangla font using microcontroller AT89C4051
Authors:
Nasif Muslim,
Md. Tanvir Adnan,
Mohammad Zahidul Kabir,
Md. Humayun Kabir,
Sheikh Mominul Islam
Abstract:
In this paper, a digital clock is designed in which a microcontroller serves as the timing controller, and the font of the Bangla digits is designed and programmed within the microcontroller. The design is cost-effective, simple, and easy to maintain.
Submitted 5 August, 2012;
originally announced August 2012.
-
Energy Efficient Geographical Load Balancing via Dynamic Deferral of Workload
Authors:
Muhammad Abdullah Adnan,
Ryo Sugihara,
Rajesh Gupta
Abstract:
With the increasing popularity of cloud computing and mobile computing, individuals, enterprises, and research centers have started outsourcing their IT and computational needs to on-demand cloud services. Recently, geographical load balancing techniques have been suggested for data centers hosting cloud computation in order to reduce energy cost by exploiting the electricity price differences across regions. However, these algorithms do not distinguish among the diverse responsiveness requirements of various workloads. In this paper, we use the flexibility from the Service Level Agreements (SLAs) to differentiate among workloads under bounded latency requirements and propose a novel approach for cost savings for geographical load balancing. We investigate how much workload should be executed in each data center and how much workload should be delayed and migrated to other data centers for energy saving while meeting deadlines. We present an offline formulation for the geographical load balancing problem with dynamic deferral and give online algorithms to determine the assignment of workload to the data centers and the migration of workload between data centers in order to adapt to dynamic electricity price changes. We compare our algorithms with the greedy approach and show that significant cost savings can be achieved by migration of workload and dynamic deferral with future electricity price prediction. We validate our algorithms on MapReduce traces and show that geographic load balancing with dynamic deferral can provide 20-30% cost savings.
Submitted 10 April, 2012;
originally announced April 2012.
-
Dynamic Deferral of Workload for Capacity Provisioning in Data Centers
Authors:
Muhammad Abdullah Adnan,
Ryo Sugihara,
Yan Ma,
Rajesh Gupta
Abstract:
The recent increase in energy prices has led researchers to find better ways for capacity provisioning in data centers to reduce energy wastage due to the variation in workload. This paper explores the opportunity for cost saving utilizing the flexibility from the Service Level Agreements (SLAs) and proposes a novel approach for capacity provisioning under bounded latency requirements of the workload. We investigate how many servers should be kept active and how much workload should be delayed for energy saving while meeting every deadline. We present an offline LP formulation for capacity provisioning by dynamic deferral and give two online algorithms to determine the capacity of the data center and the assignment of workload to servers dynamically. We prove the feasibility of the online algorithms and show that their worst-case performance is bounded by a constant factor with respect to the offline formulation. We validate our algorithms on a MapReduce workload by provisioning capacity on a Hadoop cluster and show that the algorithms actually perform much better in practice compared to the naive 'follow the workload' provisioning, resulting in 20-40% cost savings.
Submitted 13 November, 2012; v1 submitted 17 September, 2011;
originally announced September 2011.
-
Characterizing Graphs of Zonohedra
Authors:
Muhammad Abdullah Adnan,
Masud Hasan
Abstract:
A classic theorem by Steinitz states that a graph G is realizable by a convex polyhedron if and only if G is 3-connected planar. Zonohedra are an important subclass of convex polyhedra having the property that the faces of a zonohedron are parallelograms and are in parallel pairs. In this paper we give a characterization of graphs of zonohedra. We also give a linear time algorithm to recognize such a graph. In our quest for finding the algorithm, we prove that in a zonohedron P both the number of zones and the number of faces in each zone are $O(\sqrt{n})$, where n is the number of vertices of P.
Submitted 3 November, 2008;
originally announced November 2008.