Search | arXiv e-print repository

Finding the SWEET Spot: Analysis and Improvement of Adaptive Inference in Low Resource Settings

Authors: Daniel Rotem, Michael Hassid, Jonathan Mamou, Roy Schwartz

Abstract: Adaptive inference is a simple method for reducing inference costs. The method works by maintaining multiple classifiers of different capacities, and allocating resources to each test instance according to its difficulty. In this work, we compare the two main approaches for adaptive inference, Early-Exit and Multi-Model, when training data is limited. First, we observe that for models with the same architecture and size, individual Multi-Model classifiers outperform their Early-Exit counterparts by an average of 2.3%. We show that this gap is caused by Early-Exit classifiers sharing model parameters during training, resulting in conflicting gradient updates of model weights. We find that despite this gap, Early-Exit still provides a better speed-accuracy trade-off due to the overhead of the Multi-Model approach. To address these issues, we propose SWEET (Separating Weights in Early Exit Transformers), an Early-Exit fine-tuning method that assigns each classifier its own set of unique model weights, not updated by other classifiers. We compare SWEET's speed-accuracy curve to standard Early-Exit and Multi-Model baselines and find that it outperforms both methods at fast speeds while maintaining comparable scores to Early-Exit at slow speeds. Moreover, SWEET individual classifiers outperform Early-Exit ones by 1.1% on average. SWEET enjoys the benefits of both methods, paving the way for further reduction of inference costs in NLP.

Submitted 4 June, 2023; originally announced June 2023.

Comments: Proceedings of ACL 2023
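
As a rough illustration of the adaptive-inference idea described in the abstract above (several classifiers of increasing capacity, with harder instances routed to larger ones), here is a minimal Python sketch. The confidence-threshold exit rule, the name `adaptive_predict`, and the toy classifiers are assumptions made for illustration; they are not taken from the paper, and the sketch does not implement SWEET's weight separation.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a classifier maps an input to class probabilities.
Classifier = Callable[[str], Sequence[float]]


def adaptive_predict(
    x: str, classifiers: List[Classifier], threshold: float
) -> Tuple[int, int]:
    """Try classifiers from cheapest to most expensive.

    Stops at the first classifier whose top probability reaches
    `threshold` (an assumed exit rule); the last classifier always answers.
    Returns (predicted_class, index_of_classifier_used).
    """
    if not classifiers:
        raise ValueError("at least one classifier is required")
    last = len(classifiers) - 1
    for i, clf in enumerate(classifiers):
        probs = clf(x)
        label = max(range(len(probs)), key=lambda c: probs[c])
        if probs[label] >= threshold or i == last:
            return label, i
    raise AssertionError("unreachable")


# Toy stand-ins: the small model is unsure about inputs containing "?".
def small(x: str) -> Sequence[float]:
    return [0.55, 0.45] if "?" in x else [0.95, 0.05]


def large(x: str) -> Sequence[float]:
    return [0.10, 0.90]


easy = adaptive_predict("easy", [small, large], threshold=0.9)
hard = adaptive_predict("hard?", [small, large], threshold=0.9)
```

In this toy setup the "easy" input exits at the small classifier, while the "hard?" input falls through to the large one, so compute is spent only where the cheap classifier is not confident.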

arXiv:2211.03495

How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

Authors: Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah A. Smith, Roy Schwartz

Abstract: The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones -- the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance -- an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.

Submitted 7 November, 2022; originally announced November 2022.

Comments: Findings of EMNLP 2022
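
The core substitution described in the abstract above (replacing input-dependent attention matrices with their average over multiple inputs) can be sketched in a few lines of pure Python. This is only a toy under stated assumptions: a single head, no learned projections (queries, keys, and values are the raw input), and all inputs of equal length. The helper names are hypothetical and this is not the authors' PAPA implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q: Matrix, k: Matrix) -> Matrix:
    """Input-dependent attention: row-wise softmax of q k^T / sqrt(d)."""
    d = len(q[0])
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        for qi in q
    ]


def average_matrices(mats: List[Matrix]) -> Matrix:
    """Element-wise mean of equally shaped matrices (the constant matrix)."""
    n, rows, cols = len(mats), len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)] for i in range(rows)]


def apply_attention(weights: Matrix, values: Matrix) -> Matrix:
    """Mix value vectors using the given attention weights."""
    width = len(values[0])
    return [[sum(w * v[c] for w, v in zip(row, values)) for c in range(width)] for row in weights]


# Toy "inputs": three sequences of 2 tokens with 2-dimensional embeddings.
inputs = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.0, 2.0], [1.0, 1.0]],
]
constant = average_matrices([attention_weights(x, x) for x in inputs])

# A new input is contextualized with the constant matrix, not its own attention.
new_input = [[2.0, 0.0], [0.0, 2.0]]
output = apply_attention(constant, new_input)
```

Because each per-input attention matrix is row-stochastic, their average is too, so the constant matrix still produces a convex combination of value vectors; it simply no longer depends on the input being processed.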

arXiv:1103.0260

A Linear Approximation Algorithm for 2-Dimensional Vector Packing

Authors: Ekow Otoo, Ali Pinar, Doron Rotem

Abstract: We study the 2-dimensional vector packing problem, which is a generalization of the classical bin packing problem where each item has 2 distinct weights and each bin has 2 corresponding capacities. The goal is to group items into the minimum number of bins, without violating the bin capacity constraints. We propose a Θ(n)-time approximation algorithm that is inspired by the O(n^2) algorithm proposed by Chang, Hwang, and Park.

Submitted 1 March, 2011; originally announced March 2011.
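
To make the problem statement in the abstract above concrete, the following Python sketch packs items with two weights into bins with two capacities using a plain first-fit heuristic. First-fit is swapped in here purely for illustration: it takes quadratic time in the worst case and is neither the paper's Θ(n)-time algorithm nor the Chang, Hwang, and Park algorithm. The function name and the sample data are hypothetical.

```python
from typing import List, Tuple

Item = Tuple[int, int]  # (weight in dimension 1, weight in dimension 2)


def first_fit_2d(items: List[Item], capacity: Tuple[int, int]) -> List[List[int]]:
    """Assign each item to the first bin where both capacities still hold.

    Returns a list of bins, each a list of item indices.
    """
    loads: List[Tuple[int, int]] = []  # current (dim 1, dim 2) load per bin
    bins: List[List[int]] = []
    for idx, (w1, w2) in enumerate(items):
        if w1 > capacity[0] or w2 > capacity[1]:
            raise ValueError(f"item {idx} does not fit in an empty bin")
        for b, (l1, l2) in enumerate(loads):
            # Both capacity constraints must be satisfied simultaneously.
            if l1 + w1 <= capacity[0] and l2 + w2 <= capacity[1]:
                loads[b] = (l1 + w1, l2 + w2)
                bins[b].append(idx)
                break
        else:
            # No existing bin can take the item: open a new one.
            loads.append((w1, w2))
            bins.append([idx])
    return bins


sample_items = [(6, 2), (3, 7), (4, 3), (2, 6)]
packing = first_fit_2d(sample_items, capacity=(10, 10))
# packing == [[0, 1], [2, 3]]
```

In the sample, items 0 and 1 share a bin with load (9, 9); item 2 would overflow the first dimension there, so it opens a second bin, which item 3 then joins.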

Showing 1–3 of 3 results for author: Rotem, D