-
Finding the SWEET Spot: Analysis and Improvement of Adaptive Inference in Low Resource Settings
Authors:
Daniel Rotem,
Michael Hassid,
Jonathan Mamou,
Roy Schwartz
Abstract:
Adaptive inference is a simple method for reducing inference costs. The method works by maintaining multiple classifiers of different capacities, and allocating resources to each test instance according to its difficulty. In this work, we compare the two main approaches for adaptive inference, Early-Exit and Multi-Model, when training data is limited. First, we observe that for models with the same architecture and size, individual Multi-Model classifiers outperform their Early-Exit counterparts by an average of 2.3%. We show that this gap is caused by Early-Exit classifiers sharing model parameters during training, resulting in conflicting gradient updates of model weights. We find that despite this gap, Early-Exit still provides a better speed-accuracy trade-off due to the overhead of the Multi-Model approach. To address these issues, we propose SWEET (Separating Weights in Early Exit Transformers), an Early-Exit fine-tuning method that assigns each classifier its own set of unique model weights, not updated by other classifiers. We compare SWEET's speed-accuracy curve to standard Early-Exit and Multi-Model baselines and find that it outperforms both methods at fast speeds while maintaining comparable scores to Early-Exit at slow speeds. Moreover, SWEET individual classifiers outperform Early-Exit ones by 1.1% on average. SWEET enjoys the benefits of both methods, paving the way for further reduction of inference costs in NLP.
Submitted 4 June, 2023;
originally announced June 2023.
-
How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers
Authors:
Michael Hassid,
Hao Peng,
Daniel Rotem,
Jungo Kasai,
Ivan Montero,
Noah A. Smith,
Roy Schwartz
Abstract:
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones -- the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance -- an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.
Submitted 7 November, 2022;
originally announced November 2022.
-
Crystallographic orientation errors in mechanical exfoliation
Authors:
Y. Kolumbus,
A. Zalic,
N. Fardian-Melamed,
Z. Barkay,
D. Rotem,
D. Porath,
H. Steinberg
Abstract:
We evaluate the effect of mechanical exfoliation of van der Waals materials on crystallographic orientations of the resulting flakes. Flakes originating from a single crystal of graphite, whose orientation is confirmed using STM, are studied using facet orientations and electron back-scatter diffraction (EBSD). While facets exhibit a wide distribution of angles after a single round of exfoliation ($\sigma \approx 5^\circ$), EBSD shows that the true crystallographic orientations are more narrowly distributed ($\sigma \approx 1.5^\circ$), and facets have an approximately error from the true orientation. Furthermore, we find that the majority of graphite fractures are along armchair lines, and that the cleavage process results in an increase of the zigzag lines portion. Our results place values on the rotation caused by a single round of the exfoliation process, and suggest that when a 1-2 degree precision is necessary, the orientation of a flake can be gauged by the orientation of the macroscopic single crystal from which it was exfoliated.
Submitted 16 June, 2020;
originally announced June 2020.
-
A Linear Approximation Algorithm for 2-Dimensional Vector Packing
Authors:
Ekow Otoo,
Ali Pinar,
Doron Rotem
Abstract:
We study the 2-dimensional vector packing problem, which is a generalization of the classical bin packing problem where each item has 2 distinct weights and each bin has 2 corresponding capacities. The goal is to group items into the minimum number of bins, without violating the bin capacity constraints. We propose a Θ(n)-time approximation algorithm that is inspired by the O(n^2) algorithm proposed by Chang, Hwang, and Park.
Submitted 1 March, 2011;
originally announced March 2011.