Search | arXiv e-print repository

Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency

Authors: Leonidas Gee, Milan Gritta, Gerasimos Lampouras, Ignacio Iacobacci

Abstract: Code Language Models have been trained to generate accurate solutions, typically with no regard for runtime. On the other hand, previous works that explored execution optimisation have observed corresponding drops in functional correctness. To that end, we introduce Code-Optimise, a framework that incorporates both correctness (passed, failed) and runtime (quick, slow) as learning signals via self-generated preference data. Our framework is both lightweight and robust as it dynamically selects solutions to reduce overfitting while avoiding a reliance on larger models for learning signals. Code-Optimise achieves significant improvements in pass@k while decreasing the competitive baseline runtimes by an additional 6% for in-domain data and up to 3% for out-of-domain data. As a byproduct, the average length of the generated solutions is reduced by up to 48% on MBPP and 23% on HumanEval, resulting in faster and cheaper inference. The generated data and codebase will be open-sourced at www.open-source.link.

Submitted 18 June, 2024; originally announced June 2024.

Comments: Under review at ARR (for EMNLP 2024)

arXiv:2403.17811 [pdf, other]

doi 10.18653/v1/2023.emnlp-main.983

Are Compressed Language Models Less Subgroup Robust?

Authors: Leonidas Gee, Andrea Zugarini, Novi Quadrianto

Abstract: To reduce the inference cost of large language models, model compression is increasingly used to create smaller scalable models. However, little is known about their robustness to minority subgroups defined by the labels and attributes of a dataset. In this paper, we investigate the effects of 18 different compression methods and settings on the subgroup robustness of BERT language models. We show that worst-group performance does not depend on model size alone, but also on the compression method used. Additionally, we find that model compression does not always worsen the performance on minority subgroups. Altogether, our analysis serves to further research into the subgroup robustness of model compression.

Submitted 26 March, 2024; originally announced March 2024.

Comments: The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)

Journal ref: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Main Track

arXiv:2402.09977 [pdf, other]

doi 10.18653/v1/2022.emnlp-industry.41

Fast Vocabulary Transfer for Language Model Compression

Authors: Leonidas Gee, Andrea Zugarini, Leonardo Rigutini, Paolo Torroni

Abstract: Real-world business applications require a trade-off between language model performance and size. We propose a new method for model compression that relies on vocabulary transfer. We evaluate the method on various vertical domains and downstream tasks. Our results indicate that vocabulary transfer can be effectively used in combination with other compression techniques, yielding a significant reduction in model size and inference time while marginally compromising on performance.

Submitted 15 February, 2024; originally announced February 2024.

Comments: The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)

Journal ref: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022): Industry Track

arXiv:2402.09949 [pdf, other]

doi 10.18653/v1/2023.emnlp-industry.58

Multi-word Tokenization for Sequence Compression

Authors: Leonidas Gee, Leonardo Rigutini, Marco Ernandes, Andrea Zugarini

Abstract: Large Language Models have proven highly successful at modelling a variety of tasks. However, this comes at a steep computational cost that hinders wider industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens. MWTs produce a more compact and efficient tokenization that yields two benefits: (1) Increase in performance due to a greater coverage of input data given a fixed sequence length budget; (2) Faster and lighter inference due to the ability to reduce the sequence length with negligible drops in performance. Our results show that MWT is more robust across shorter sequence lengths, thus allowing for major speedups via early sequence truncation.

Submitted 4 April, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)

Journal ref: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

arXiv:2007.00824 [pdf, ps, other]

Lightme: Analysing Language in Internet Support Groups for Mental Health

Authors: Gabriela Ferraro, Brendan Loo Gee, Shenjia Ji, Luis Salvador-Carulla

Abstract: Background: Assisting moderators to triage harmful posts in Internet Support Groups is relevant to ensure its safe use. Automated text classification methods analysing the language expressed in posts of online forums is a promising solution. Methods: Natural Language Processing and Machine Learning technologies were used to build a triage post classifier using a dataset from Reachout mental health forum for young people. Results: When comparing with the state-of-the-art, a solution mainly based on features from lexical resources, received the best classification performance for the crisis posts (52%), which is the most severe class. Six salient linguistic characteristics were found when analysing the crisis post; 1) posts expressing hopelessness, 2) short posts expressing concise negative emotional responses, 3) long posts expressing variations of emotions, 4) posts expressing dissatisfaction with available health services, 5) posts utilising storytelling, and 6) posts expressing users seeking advice from peers during a crisis. Conclusion: It is possible to build a competitive triage classifier using features derived only from the textual content of the post. Further research needs to be done in order to translate our quantitative and qualitative findings into features, as it may improve overall performance.

Submitted 2 July, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

Showing 1–5 of 5 results for author: Gee, L