-
SkipViT: Speeding Up Vision Transformers with a Token-Level Skip Connection
Authors:
Foozhan Ataiefard,
Walid Ahmed,
Habib Hajimolahoseini,
Saina Asani,
Farnoosh Javadi,
Mohammad Hassanpour,
Omar Mohamed Awad,
Austin Wen,
Kangling Liu,
Yang Liu
Abstract:
Vision transformers are known to be more computationally and data-intensive than CNN models. Transformer models such as ViT require all the input image tokens to learn the relationships among them. However, many of these tokens are not informative and may contain irrelevant information such as unrelated background or unimportant scenery. These tokens are overlooked by the multi-head self-attention (MHSA), resulting in many redundant and unnecessary computations in MHSA and the feed-forward network (FFN). In this work, we propose a method to reduce the unnecessary interactions between unimportant tokens by separating them and sending them through a different low-cost computational path. Our method does not add any parameters to the ViT model and aims to find the best trade-off between training throughput and achieving a 0% loss in the Top-1 accuracy of the final model. Our experimental results on training ViT-small from scratch show that SkipViT is capable of effectively dropping 55% of the tokens while gaining more than 13% training throughput and maintaining classification accuracy at the level of the baseline model on Huawei Ascend910A.
Submitted 26 January, 2024;
originally announced January 2024.
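The token-routing idea in the SkipViT abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how token importance is scored or what the low-cost path computes, so `scores` is a hypothetical per-token importance value, the low-cost path is taken to be an identity skip, and `heavy_path` is a stand-in for the MHSA and FFN layers.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]


def split_tokens(scores: Sequence[float], keep_ratio: float) -> Tuple[List[int], List[int]]:
    """Split token indices into (kept, skipped), each in original order.

    `scores` is a hypothetical importance value per token; higher means
    more important. The scoring rule itself is not taken from the paper.
    """
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), sorted(ranked[n_keep:])


def skip_block(
    tokens: List[Token],
    scores: Sequence[float],
    keep_ratio: float,
    heavy_path: Callable[[List[Token]], List[Token]],
) -> List[Token]:
    """Run only the important tokens through the expensive path.

    Assumption: skipped tokens take an identity shortcut and rejoin the
    sequence at their original positions.
    """
    kept, _skipped = split_tokens(scores, keep_ratio)
    processed = heavy_path([tokens[i] for i in kept])
    out = list(tokens)  # skipped tokens pass through unchanged
    for idx, tok in zip(kept, processed):
        out[idx] = tok
    return out


def double(batch: List[Token]) -> List[Token]:
    """Toy stand-in for the expensive transformer layers."""
    return [[2.0 * x for x in tok] for tok in batch]


if __name__ == "__main__":
    toy_tokens = [[1.0], [2.0], [3.0], [4.0]]
    toy_scores = [0.1, 0.9, 0.2, 0.8]
    print(skip_block(toy_tokens, toy_scores, keep_ratio=0.5, heavy_path=double))
```

With `keep_ratio=0.45` the expensive path would see 45% of the tokens, which corresponds to the 55% drop rate quoted in the abstract; the saving comes from `heavy_path` operating on a shorter sequence.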
-
SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling
Authors:
Habib Hajimolahoseini,
Omar Mohamed Awad,
Walid Ahmed,
Austin Wen,
Saina Asani,
Mohammad Hassanpour,
Farnoosh Javadi,
Mehdi Ahmadi,
Foozhan Ataiefard,
Kangling Liu,
Yang Liu
Abstract:
In this paper, we present SwiftLearn, a data-efficient approach to accelerate training of deep learning models using a subset of data samples selected during the warm-up stages of training. This subset is selected based on an importance criterion measured over the entire dataset during warm-up stages, aiming to preserve the model performance with fewer examples during the rest of training. The importance measure we propose can be updated periodically during training, so that all of the data samples have a chance to return to the training loop if they show a higher importance. The model architecture is unchanged, but since the number of data samples controls the number of forward and backward passes during training, we can reduce the training time by reducing the number of training samples used in each epoch of training. Experimental results on a variety of CV and NLP models during both pretraining and finetuning show that the model performance can be preserved while achieving a significant speed-up during training. More specifically, BERT finetuning on the GLUE benchmark shows that almost 90% of the data can be dropped, achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop less than 0.92%.
Submitted 25 November, 2023;
originally announced November 2023.
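The training schedule described in the SwiftLearn abstract above (full data during warm-up, then a high-importance subset that is refreshed periodically) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `score_fn` is a hypothetical importance function, and `warmup_epochs`, `keep_fraction`, and `refresh_every` are made-up parameter names.

```python
from typing import Callable, List, Sequence


def select_subset(importance: Sequence[float], keep_fraction: float) -> List[int]:
    """Return the indices of the most important samples, in dataset order."""
    n_keep = max(1, int(len(importance) * keep_fraction))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:n_keep])


def epoch_plan(
    num_samples: int,
    num_epochs: int,
    warmup_epochs: int,
    keep_fraction: float,
    refresh_every: int,
    score_fn: Callable[[int, int], float],
) -> List[List[int]]:
    """List the sample indices that would be used in each epoch.

    Assumptions: warm-up epochs see the whole dataset; afterwards only the
    selected subset is used, and importance is re-scored over all samples
    every `refresh_every` epochs so dropped samples can return.
    """
    everything = list(range(num_samples))
    subset = everything
    plan = []
    for epoch in range(num_epochs):
        if epoch < warmup_epochs:
            plan.append(everything)
            continue
        if (epoch - warmup_epochs) % refresh_every == 0:
            # score_fn(sample_index, epoch) is a hypothetical importance measure
            importance = [score_fn(i, epoch) for i in everything]
            subset = select_subset(importance, keep_fraction)
        plan.append(subset)
    return plan


if __name__ == "__main__":
    plan = epoch_plan(
        num_samples=10, num_epochs=4, warmup_epochs=1,
        keep_fraction=0.2, refresh_every=2,
        score_fn=lambda i, epoch: float(i),  # toy score: later samples matter more
    )
    print([len(indices) for indices in plan])
```

Because the number of forward and backward passes scales with the number of samples per epoch, the speed-up in this sketch comes entirely from the shorter index lists after warm-up.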
-
GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values
Authors:
Farnoosh Javadi,
Walid Ahmed,
Habib Hajimolahoseini,
Foozhan Ataiefard,
Mohammad Hassanpour,
Saina Asani,
Austin Wen,
Omar Mohamed Awad,
Kangling Liu,
Yang Liu
Abstract:
Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.
Submitted 13 December, 2023; v1 submitted 6 November, 2023;
originally announced November 2023.
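To make the model-size effect of grouping in the GQKVA abstract above concrete, here is a rough sketch of how sharing projections across heads shrinks the Q/K/V parameter count. The layout is an assumption: heads are split into contiguous, equally sized groups that share one projection, which may differ from the paper's exact scheme, and the sizes in the usage example are illustrative only. Biases and the output projection are ignored.

```python
from typing import List


def head_to_group(num_heads: int, num_groups: int) -> List[int]:
    """Map each attention head to the group whose projection it shares.

    Assumption: heads are divided into contiguous, equally sized groups.
    """
    if num_groups <= 0 or num_heads % num_groups != 0:
        raise ValueError("num_heads must be a positive multiple of num_groups")
    heads_per_group = num_heads // num_groups
    return [head // heads_per_group for head in range(num_heads)]


def qkv_projection_params(d_model: int, head_dim: int, q_groups: int, kv_groups: int) -> int:
    """Count weights in the Q, K, and V projections of one attention layer.

    Each distinct projection maps d_model -> head_dim. There are q_groups
    query projections and kv_groups each of key and value projections.
    """
    per_projection = d_model * head_dim
    return per_projection * (q_groups + 2 * kv_groups)


if __name__ == "__main__":
    d_model, head_dim, num_heads = 384, 64, 6  # illustrative sizes
    baseline = qkv_projection_params(d_model, head_dim, num_heads, num_heads)
    grouped = qkv_projection_params(d_model, head_dim, q_groups=6, kv_groups=2)
    print(head_to_group(num_heads, 2), baseline, grouped)
```

Setting both group counts equal to the number of heads recovers ordinary multi-head attention, so the group counts act as the knob for the size-versus-accuracy trade-off the abstract describes.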
-
Multi-Task Learning For Reduced Popularity Bias In Multi-Territory Video Recommendations
Authors:
Phanideep Gampa,
Farnoosh Javadi,
Belhassen Bayar,
Ainur Yessenalina
Abstract:
Various data imbalances that naturally arise in a multi-territory personalized recommender system can lead to a significant item bias for globally prevalent items. A locally popular item can be overshadowed by a globally prevalent item. Moreover, users' viewership patterns and statistics can change drastically from one geographic location to another, which may suggest learning specific user embeddings. In this paper, we propose a multi-task learning (MTL) technique, along with an adaptive upsampling method, to reduce popularity bias in multi-territory recommendations. Our proposed framework is designed to enrich training examples with active users' representation through upsampling, and is capable of learning geographic-based user embeddings by leveraging MTL. Through experiments, we demonstrate the effectiveness of our framework in multiple territories compared to a baseline not incorporating our proposed techniques. Notably, we show an improved relative gain of up to 65.27% in the PR-AUC metric. A case study is presented to demonstrate the advantages of our methods in attenuating the popularity bias of global items.
Submitted 24 September, 2023;
originally announced October 2023.