Greedy-layer Pruning: Speeding up Transformer Models for Natural Language Processing
Authors:
David Peer,
Sebastian Stabinger,
Stefan Engl,
Antonio Rodriguez-Sanchez
Abstract:
Fine-tuning transformer models after unsupervised pre-training reaches a very high performance on many different natural language processing tasks. Unfortunately, transformers suffer from long inference times which greatly increases costs in production. One possible solution is to use knowledge distillation, which solves this problem by transferring information from large teacher models to smaller…
▽ More
Fine-tuning transformer models after unsupervised pre-training reaches a very high performance on many different natural language processing tasks. Unfortunately, transformers suffer from long inference times which greatly increases costs in production. One possible solution is to use knowledge distillation, which solves this problem by transferring information from large teacher models to smaller student models. Knowledge distillation maintains high performance and reaches high compression rates, nevertheless, the size of the student model is fixed after pre-training and can not be changed individually for a given downstream task and use-case to reach a desired performance/speedup ratio. Another solution to reduce the size of models in a much more fine-grained and computationally cheaper fashion is to prune layers after the pre-training. The price to pay is that the performance of layer-wise pruning algorithms is not on par with state-of-the-art knowledge distillation methods. In this paper, Greedy-layer pruning is introduced to (1) outperform current state-of-the-art for layer-wise pruning, (2) close the performance gap when compared to knowledge distillation, while (3) providing a method to adapt the model size dynamically to reach a desired performance/speedup tradeoff without the need of additional pre-training phases. Our source code is available on https://github.com/deepopinion/greedy-layer-pruning.
△ Less
Submitted 29 March, 2022; v1 submitted 31 May, 2021;
originally announced May 2021.
Adapt or Get Left Behind: Domain Adaptation through BERT Language Model Finetuning for Aspect-Target Sentiment Classification
Authors:
Alexander Rietzler,
Sebastian Stabinger,
Paul Opitz,
Stefan Engl
Abstract:
Aspect-Target Sentiment Classification (ATSC) is a subtask of Aspect-Based Sentiment Analysis (ABSA), which has many applications e.g. in e-commerce, where data and insights from reviews can be leveraged to create value for businesses and customers. Recently, deep transfer-learning methods have been applied successfully to a myriad of Natural Language Processing (NLP) tasks, including ATSC. Buildi…
▽ More
Aspect-Target Sentiment Classification (ATSC) is a subtask of Aspect-Based Sentiment Analysis (ABSA), which has many applications e.g. in e-commerce, where data and insights from reviews can be leveraged to create value for businesses and customers. Recently, deep transfer-learning methods have been applied successfully to a myriad of Natural Language Processing (NLP) tasks, including ATSC. Building on top of the prominent BERT language model, we approach ATSC using a two-step procedure: self-supervised domain-specific BERT language model finetuning, followed by supervised task-specific finetuning. Our findings on how to best exploit domain-specific language model finetuning enable us to produce new state-of-the-art performance on the SemEval 2014 Task 4 restaurants dataset. In addition, to explore the real-world robustness of our models, we perform cross-domain evaluation. We show that a cross-domain adapted BERT language model performs significantly better than strong baseline models like vanilla BERT-base and XLNet-base. Finally, we conduct a case study to interpret model prediction errors.
△ Less
Submitted 19 November, 2019; v1 submitted 30 August, 2019;
originally announced August 2019.