Greedy-layer Pruning: Speeding up Transformer Models for Natural Language Processing

Peer, David; Stabinger, Sebastian; Engl, Stefan; Rodriguez-Sanchez, Antonio

doi:10.1016/j.patrec.2022.03.023

Computer Science > Computation and Language

arXiv:2105.14839 (cs)

[Submitted on 31 May 2021 (v1), last revised 29 Mar 2022 (this version, v2)]

Abstract:Fine-tuning transformer models after unsupervised pre-training reaches very high performance on many different natural language processing tasks. Unfortunately, transformers suffer from long inference times, which greatly increase costs in production. One possible solution is knowledge distillation, which addresses this problem by transferring information from large teacher models to smaller student models. Knowledge distillation maintains high performance and reaches high compression rates; nevertheless, the size of the student model is fixed after pre-training and cannot be changed individually for a given downstream task and use case to reach a desired performance/speedup ratio. Another solution, which reduces model size in a much more fine-grained and computationally cheaper fashion, is to prune layers after pre-training. The price to pay is that the performance of layer-wise pruning algorithms is not on par with state-of-the-art knowledge distillation methods. In this paper, Greedy-layer pruning is introduced to (1) outperform the current state of the art for layer-wise pruning, (2) close the performance gap when compared to knowledge distillation, while (3) providing a method to adapt the model size dynamically to reach a desired performance/speedup tradeoff without the need for additional pre-training phases. Our source code is available on this https URL.

Comments:	Accepted at Pattern Recognition Letters
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2105.14839 [cs.CL]
	(or arXiv:2105.14839v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2105.14839
Related DOI:	https://doi.org/10.1016/j.patrec.2022.03.023

Submission history

From: David Peer [view email]
[v1] Mon, 31 May 2021 09:52:41 UTC (576 KB)
[v2] Tue, 29 Mar 2022 09:47:45 UTC (572 KB)
