Model Cascading for Code: Reducing Inference Costs with Model Cascading for LLM Based Code Generation
Abstract
The rapid development of large language models (LLMs) has led to significant advancements in code completion tasks. While larger models have higher accuracy, they also cost much more to run. Meanwhile, model cascading has been proven effective at conserving computational resources while enhancing accuracy in LLMs on natural language generation tasks. It generates output with the smallest model in a set, and only queries the larger models when the output fails to meet predefined quality criteria. However, this strategy has not been used in code completion tasks, primarily because assessing the quality of code completions differs substantially from assessing natural language: the former relies heavily on functional correctness. To address this, we propose letting each model generate and execute a set of test cases for its solutions, and use the test results as the cascading threshold. We show that our model cascading strategy reduces computational costs while increasing accuracy compared to generating the output with a single model. We also introduce a heuristic to determine the optimal combination of the number of solutions, test cases, and test lines each model should generate, based on the budget. Compared to speculative decoding, our method works on black-box models and achieves the same level of cost-accuracy trade-off, yet provides many more choices based on the server’s budget. Ours is the first work to optimize the cost-accuracy trade-off for LLM code generation with model cascading.
1 Introduction
Recently, large language models (LLMs) trained on vast datasets of code have proven to be highly capable at programming tasks (Chen et al., 2021; Li et al., 2022; Nijkamp et al., 2022). These models have been incorporated into popular AI programming assistants, e.g., GitHub Copilot (Git, 2021), and have attracted millions of users. However, code completion using LLMs is expensive, since it requires multiple forward passes of large models. Moreover, prompts tend to be unique and cannot be cached. The high computational demands and large memory consumption of LLMs, even during inference, result in high energy consumption and massive operational costs to provide code completion as a service for millions of users Zhu et al. (2023). Therefore, it is imperative to reduce the inference cost of code completion LLMs. Prior methods to reduce LLM inference costs include quantization Liu et al. (2023), pruning Ma et al. (2023), knowledge distillation Hsieh et al. (2023), altered attention mechanisms Dao et al. (2022), etc. These methods reduce inference costs for a single given model.
State-of-the-art code completion LLMs are typically available in multiple sizes, for instance, Codegen-2 Nijkamp et al. (2023) (1B, 3B, 7B, 16B), StarCoder Li et al. (2023) (1B, 3B, 15B), and DeepSeek-Coder Guo et al. (2024) (1B, 7B, 33B), representing not just a single model but a family of models. Smaller models have lower performance in terms of code correctness, but also lower inference-time and memory requirements. In this paper, we ask the following question: can the capabilities of different models in a family be combined to improve the cost versus accuracy trade-offs?
To answer this question, we propose and evaluate model cascading for code, which we will refer to as model cascading for short. Model cascading builds on a simple, intuitive idea: we only query big models when absolutely necessary, i.e., if smaller models cannot provide correct code completions. In our implementation, prompts are sent to the smallest model first, working their way up to progressively bigger models depending on the quality of the models’ responses in each stage.
We use our selected model families to further demonstrate the potential of model cascading in Table 1: Wizard-Python-V1.0’s smallest 7B model achieves 56.7% accuracy on HumanEval, while the largest 34B model achieves 72.6% accuracy. The additional 16-point boost in accuracy comes at roughly 10× the estimated cost. With model cascading, the hope is to route most queries to the smallest model, reserving challenging queries for the large model.
A key challenge is deciding when to request assistance from a larger model. That is, a model must not only generate code, but also have the ability to test its code for correctness. One solution is for the designer to provide a test suite. However, this entails considerable design effort, and prior work in LLM-based code generation does not assume access to test suites. In model cascading, we instruct models to generate both code (up to a fixed maximum number of potential solutions) and a set of test cases. We then employ a learned threshold for the success rate to decide when to seek assistance from a larger model. In this way, an initial user prompt cascades up from the smallest to the largest model, stopping when the pass rate exceeds a pre-defined learned threshold. The self-testing scheme also increases the final output’s accuracy, as previous work has shown Chen et al. (2023a).
To the best of our knowledge, model cascading is the first solution that leverages the combined power of a family of black-box LLMs to improve the operational cost versus accuracy tradeoffs. Although we evaluate model families that differ in parameter size, model cascading is easily extended to other approaches that generate model families, for example, an LLM model with varying quantization levels. While speculative decoding is another algorithm to reduce inference time in a similar spirit, we list its drawbacks in Section 2, and demonstrate that our results are superior with a comparison.
Our empirical results show that the optimal cascading strategy is always better than using a single model, and in the best case, our model cascading scheme can save 49% of cost while achieving the same accuracy compared to using a single model.
Our contributions include:
- We show with experiments that model cascading with a self-testing threshold can decrease the computation cost while increasing accuracy for the one-best answer in LLMs on code completion tasks.
- We propose a full heuristic pipeline to find the optimal set of parameters, in terms of lowest inference cost and highest pass@1 accuracy at a cost budget, for a given set of models on a given dataset. Our method fits the demands of an industrial code completion server. It is compatible with any multi-LLM system for code completion, and is fully black-box.
- We demonstrate with reasons and experiments that our model cascading pipeline is compatible with more model families and provides users with a wider range of cost-accuracy solutions in code generation tasks compared to speculative decoding.
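Table 1: Estimated cost and greedy accuracy of each model in each family on HumanEval (H.Eval), MBPP, and APPS.

| LLM Family | Size | Cost ($) | H.Eval Acc. (%) | MBPP Acc. (%) | APPS Acc. (%) |
|---|---|---|---|---|---|
| Wizard-Python-V1.0 | 7B | 2.22 | 56.7 | 53.9 | 16.4 |
| Wizard-Python-V1.0 | 13B | 7.87 | 64.6 | 55.0 | 21.4 |
| Wizard-Python-V1.0 | 34B | 23.34 | 72.6 | 61.8 | 23.7 |
| Wizard-V1.0 | 1B | 0.23 | 23.8 | 33.0 | 3.9 |
| Wizard-V1.0 | 3B | 0.45 | 34.8 | 41.5 | 8.6 |
| Wizard-V1.0 | 15B | 3.22 | 60.4 | 53.2 | 13.8 |
| Codegen-Mono | 350M | 1.28 | 14.6 | 22.0 | 0.4 |
| Codegen-Mono | 2B | 3.12 | 26.2 | 34.2 | 4.9 |
| Codegen-Mono | 6B | 7.96 | 27.4 | 41.9 | 5.1 |
| Codegen-Mono | 16B | 17.40 | 30.5 | 45.4 | 5.9 |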
2 Related Works
LLMs for Code There is a large and growing body of work on LLM-based code generation. Recent fine-tuned LLMs for code include Code LLAMA Rozière et al. (2023), WizardCoder Luo et al. (2023), etc. These models are typically evaluated using Pass@k metrics. That is, they generate $k$ solutions and “pass” a test as long as at least one solution is correct. In practice, however, a developer would have to pick from the $k$ solutions, requiring significant human effort.
LLM Cascading Prior work has investigated cascading for natural language prompts. FrugalGPT Chen et al. (2023b) trains a small DistilBERT model Sanh et al. (2020) to output a score based on the query and the answer, and uses this score to determine when to query a larger model. So far, there is no published work on model cascading for code completion tasks, in part because output evaluation is more challenging.
Self Testing for Code Generation To assist models in ranking their own solutions, CodeT Chen et al. (2023a) has recently proposed that LLMs, in addition to solutions, generate test cases as extra outputs, and run solutions through these test cases to rank the best solutions. In our work, we use their solution-test score as the cascading threshold. While CodeT’s goal is only to rank-order solutions from a single model, model cascading uses the same self-testing scheme to determine when to query the next larger model in the cascade.
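As a point of reference for the Pass@k metric mentioned above, the following is a minimal sketch of the unbiased estimator introduced with HumanEval (Chen et al., 2021): given `n` sampled solutions of which `c` are correct, it estimates the probability that at least one of `k` randomly drawn samples is correct. The function name and argument checks are illustrative assumptions, not code from this paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total number of sampled solutions for one problem
    c: number of those solutions that pass the reference tests
    k: number of solutions the user is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when fewer than k incorrect samples exist,
    # in which case every draw of k samples contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k = 1` the estimate reduces to the fraction of correct samples, which is the setting closest to the single-answer accuracy that model cascading targets.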
Speculative Decoding Speculative decoding (SD) is a very recent technique with the same spirit as model cascading that reduces inference time of a larger (“target”) model with a smaller (“draft”) model Leviathan et al. (2023). The method first uses the smaller model to speculatively generate a certain number of tokens, lets the bigger model compare tokens with its own predictions, and has the bigger model take over when the two results diverge. Inference time is reduced because the checking procedure is faster than generating new tokens. Nonetheless, there are several limitations on this method:
1. It does not work with black-box models, which are increasingly common in commercial deployments.
2. It queries both draft and target models, thus saving inference time but not computation and energy. In contrast, model cascading seeks to use the small model as frequently as possible, seeking to avoid using the big(ger) models altogether.
3. By forcing the draft model to comply with the target model’s predictions, it ignores the fact that the smaller model can sometimes solve problems the bigger model cannot solve, even when they are trained on the same dataset.
4. It only works for two models, missing the opportunity to include 3 or more models for more diverse levels of capacity.
5. Since the draft model brings additional computation, researchers found that in order for the system to save time, the draft model must be significantly smaller (around 20 times) than the target model Leviathan et al. (2023) while still having a similar output to the latter, otherwise the acceptance rate would be too low. This strict requirement makes suitable model pairs hard to find.
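To make the draft-and-verify loop described above concrete, here is a toy sketch of greedy speculative decoding. It assumes hypothetical `draft` and `target` callables that map a token prefix to the next token; real systems verify all drafted positions in a single batched forward pass of the target model, which this sketch only imitates with a per-token call. The parameter name `gamma` (number of drafted tokens per round) is also an assumption.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # hypothetical model interface


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with a single model."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    draft: NextToken, target: NextToken, prompt: List[int], n_new: int, gamma: int = 4
) -> Tuple[List[int], float]:
    """Greedy speculative decoding; returns the tokens and the acceptance rate."""
    out = list(prompt)
    accepted = checked = 0
    while len(out) - len(prompt) < n_new:
        # 1) The draft model speculatively proposes gamma tokens.
        ctx, guesses = list(out), []
        for _ in range(gamma):
            guesses.append(draft(ctx))
            ctx.append(guesses[-1])
        # 2) The target checks the guesses in order and takes over on divergence.
        for tok in guesses:
            expected = target(out)
            checked += 1
            if tok != expected:
                out.append(expected)  # target's own token replaces the guess
                break
            out.append(tok)
            accepted += 1
            if len(out) - len(prompt) >= n_new:
                break
    rate = accepted / checked if checked else 0.0
    return out, rate
```

Because every emitted token equals the target's greedy choice, the output matches target-only greedy decoding; only the number of sequential target steps changes, which is why both models are still queried for every prompt.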
3 Methodology
We now describe our proposed model cascading for code solution (see Figure 1 for an overview). We begin by describing our method to evaluate when to send queries to the next larger model.
3.1 Cascading Pipeline
Let there be a model family $M$ with $n$ models $M_1, M_2, \dots, M_n$, ordered in terms of their parameter sizes from smallest (and hence fastest) to largest. For a user prompt $x$, a natural language description of the function the user seeks to implement, we first ask the smallest model $M_1$ to generate $k$ solutions, in addition to $k$ tests, each consisting of up to $l$ test lines, i.e., the assert statements as shown in Figure 1. The test lines produced across all tests are pooled together into a test suite (consisting of a total of up to $k \cdot l$ test lines) and used to evaluate each solution.
We adopt the same strategy to select a solution out of multiple candidates as the prior works in self-testing Chen et al. (2023a); Xiong et al. (2023). For each solution $s$, we define $T_s$, the subset of test lines that the solution passes, and we compute two measures of solution quality: $|T_s|$, the number of passing test lines, and $q_s$, a measure of test quality among the passing test lines. $q_s$ gives the highest number of solves that a passing test line has across all solutions; that is, if $n_t$ is the number of solutions that pass test line $t$, then $q_s = \max_{t \in T_s} n_t$. Intuitively, $q_s$ captures the idea that a test line (assuming it is not trivial) is higher quality if it is passed by multiple solutions. The overall solution quality is the product $|T_s| \cdot q_s$, which we use to rank solutions and decide if we should adopt the current model’s answer or proceed to a larger model; the highest scoring solution-test pair is denoted ($s^*$, $t^*$) and its score is $|T_{s^*}| \cdot q_{s^*}$.
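The scoring rule above can be sketched as follows. The input is assumed to be a boolean matrix in which entry `[i][j]` records whether solution `i` passes pooled test line `j`; how that matrix is obtained (sandboxed execution of each assert) is outside the sketch, and the function name is hypothetical.

```python
from typing import List


def score_solutions(passes: List[List[bool]]) -> List[int]:
    """Score each solution as (#test lines it passes) x (max solves among them).

    passes[i][j] is True when solution i passes pooled test line j.
    """
    if not passes:
        return []
    n_tests = len(passes[0])
    # Number of solutions that pass each test line.
    solves = [sum(row[j] for row in passes) for j in range(n_tests)]
    scores = []
    for row in passes:
        passed = [j for j in range(n_tests) if row[j]]
        # A solution that passes no test line gets a score of 0.
        best_test_solves = max((solves[j] for j in passed), default=0)
        scores.append(len(passed) * best_test_solves)
    return scores
```

For three solutions and three test lines with `passes = [[True, True, False], [True, False, False], [False, False, True]]`, the first test line is solved by two solutions, so the first solution scores 2 × 2 = 4 and is ranked highest.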
To decide whether a solution is acceptable, we define a threshold parameter $\theta$, which takes values from 0 to 1. Let there be $m$ valid test lines in total. Note that $m \le k \cdot l$, where $k$ is the number of generated solutions (and tests) and $l$ is the maximum number of test lines per test, because a model might stop generating before it reaches the maximum number of test lines $l$. If the highest solution score is greater than or equal to $\theta \cdot k \cdot m$, i.e., a fraction $\theta$ of the maximum attainable score, we adopt the corresponding solution-test pair ($s^*$, $t^*$) and end the pipeline. Otherwise, we repeat the above procedure with the next larger model $M_2$: pass the same prompt to $M_2$, let it generate $k$ solutions and $k$ tests of up to $l$ test lines each, and score the solutions. We continue until we find a solution whose quality meets our threshold. When we use the largest model $M_n$, we skip the threshold check and directly adopt the highest-scoring solution.
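The cascade described in this subsection can be sketched as a simple loop. This is a minimal illustration under stated assumptions rather than the authors' implementation: each stage is a hypothetical callable that returns its candidate solutions together with a pass matrix (solution × pooled test line), and the acceptance rule compares the best score against `theta` times the maximum attainable score, which is our reading of the threshold.

```python
from typing import Callable, List, Sequence, Tuple

PassMatrix = List[List[bool]]
# Hypothetical stage interface: prompt -> (solutions, pass matrix).
Stage = Callable[[str], Tuple[List[str], PassMatrix]]


def best_solution(passes: PassMatrix) -> Tuple[int, int]:
    """Return (index, score) of the highest-scoring solution."""
    n_tests = len(passes[0])
    solves = [sum(row[j] for row in passes) for j in range(n_tests)]
    scores = []
    for row in passes:
        passed = [j for j in range(n_tests) if row[j]]
        scores.append(len(passed) * max((solves[j] for j in passed), default=0))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def cascade(prompt: str, stages: Sequence[Stage], theta: float) -> Tuple[str, int]:
    """Query stages from smallest to largest; return (solution, stage index)."""
    for index, stage in enumerate(stages):
        solutions, passes = stage(prompt)  # each stage must return >= 1 solution
        best, score = best_solution(passes)
        k, m = len(solutions), len(passes[0])
        is_largest = index == len(stages) - 1
        # Assumed acceptance rule: score >= theta * (maximum attainable score k * m).
        # With no valid test lines (m == 0) the bound is 0, so the stage is accepted.
        if is_largest or score >= theta * k * m:
            return solutions[best], index
    raise ValueError("stages must not be empty")
```

A small `theta` lets the smallest model answer most prompts, while a `theta` near 1 pushes most prompts to the larger stages; the last stage always answers.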
3.2 Finding Optimal Parameters
We introduce the parameter selection procedure in this section. Our cascading system has three main sets of parameters: the number of solutions and tests generated for each model $M_i$, denoted $k_i$; the number of test lines in each test for each model, denoted $l_i$; and the global cascading threshold parameter $\theta$.
We first set a range for each parameter. In our experiments, $k_i$ is drawn from a small discrete set of values for each model, and its value selects one of four generation modes. In the first mode, we skip this model; in the second, we only generate one solution with greedy search and exit with that solution as output, without generating tests; in the third, we generate one solution and one test with greedy search; and in the fourth, we generate $k_i$ solutions and $k_i$ tests with sampling, using temperature 0.8. Not every assignment of modes is valid. For example, in a cascading system with 3 models, if model 2 generates a single solution without tests, the pipeline must exit on model 2, so model 3 must be skipped.
Similarly, we let $l_i$ range over a discrete set of values for each model. $l_i = 0$ if and only if the model generates no tests, which means we do not generate any test code; when $l_i > 0$, we let the model generate a maximum of $l_i$ lines starting with assert. Note that the model might stop the generation before it reaches $l_i$ lines. In that case, we will have fewer than $l_i$ test lines for the current model.
Both $k_i$ and $l_i$ are parameters for each model in a cascading system. However, $\theta$ is a global parameter. The cascading thresholds from model 1 to model 2, and from model 2 to model 3, use the same value of $\theta$. In our experiments, we sampled $\theta$ from a fixed set of values between 0 and 1.
Second, we sample a training set out of the full dataset to find the optimal parameters, and test our selected combinations on the validation set. In our experiments, the training set is randomly sampled, and it contains 30% of the questions in the full dataset.
Third, on the training set, we run the cascading system with all the valid $[k_i, l_i]$ combinations as described in Section 3.1, and calculate the cost and accuracy of each. Then, we make a cost-accuracy plot of all $k_i$ and $l_i$ combinations for each $\theta$, similar to the plots in Figure 3. We select the optimal cascading combinations by finding all the Pareto points in the plot (the points for which there is no other point with a higher accuracy and lower cost) and save these combinations as our optimal parameters. For comparison, we also save the optimal singular combinations by finding the Pareto points among the singular points, where only one model is used in the system (i.e., all other models are skipped). We compare the difference between the Pareto points and singular Pareto points across plots, and find the best $\theta$.
Finally, on the validation set, we compare our optimal cascading combinations at the selected $\theta$ with the optimal singular combinations.
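The 30% random split can be sketched as below. The seed, the rounding of the subset size, and the function name are assumptions made for illustration; the paper does not specify them.

```python
import random
from typing import Iterable, List, Tuple


def split_questions(
    question_ids: Iterable[int], search_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly split questions into a parameter-search subset and a held-out subset."""
    ids = list(question_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_search = round(len(ids) * search_fraction)  # rounding rule is an assumption
    chosen = set(rng.sample(ids, n_search))
    search = [q for q in ids if q in chosen]
    held_out = [q for q in ids if q not in chosen]
    return search, held_out
```

For the 164 HumanEval questions this yields 49 questions for the parameter search and 115 held-out questions under the rounding assumed here.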
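Selecting the Pareto points from a set of (cost, accuracy) measurements can be sketched with a single sorted sweep. The function below is an illustrative assumption, not the authors' code; it drops points that are tied with a kept point on accuracy at higher or equal cost, which is slightly stricter than the definition in the text.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost, accuracy)


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep points not dominated by a cheaper (or equally cheap) more accurate point."""
    front: List[Point] = []
    best_accuracy = float("-inf")
    # Sort by increasing cost; among equal costs, visit the most accurate first.
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            front.append((cost, accuracy))
            best_accuracy = accuracy
    return front
```

As a usage example, the three single-model points for Wizard-Python-V1.0 on HumanEval in Table 1, `(2.22, 56.7)`, `(7.87, 64.6)` and `(23.34, 72.6)`, all lie on the front, whereas a hypothetical configuration at `(10.0, 60.0)` would be discarded because the 13B model is both cheaper and more accurate.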
3.3 Cascading Threshold Parameter ($\theta$)
In this section we discuss the impact of the threshold parameter $\theta$ on the system’s cost and accuracy. As mentioned in Section 3.1, $\theta$ decides when to use a larger model. When $\theta$ is close to 0, the small model’s output is more likely to meet the standard, and the system exits with the current output; on the other hand, when $\theta$ is close to 1, the system is more likely to use bigger models on each question.
We demonstrate the impact of $\theta$ with an example in Figure 2. We used the WizardCoder-Python-V1.0 family to run the full HumanEval dataset, with fixed values for the number of solutions $k$ and the number of test lines $l$. As the value of $\theta$ increases, the cost increases monotonically, because using larger models more often always results in more computation. However, the highest accuracy is achieved at an intermediate value of $\theta$ rather than at $\theta = 1$. The reason is that when smaller models have larger $k$ values, they have more solutions to pick from and more tests to validate their solutions. That makes the outputs from smaller models potentially more accurate than those of larger models. On the other hand, an overly rigorous bar would miss correct solutions from smaller models. Among the $k$ solutions, only a few might be correct, and the same holds for the tests. If we set $\theta = 1$, all solutions must pass all tests, which is very unlikely for a large $k$.
3.4 Cost Calculation
All the models that we used are public, and thus we need to estimate the cost of running them based on the inference time they take on GPUs. To this end, we calculate the cost by multiplying the number of tokens generated by the per-token cost, similar to the pricing scheme used by commercial model providers like OpenAI.
We deploy a server with multiple NVIDIA GeForce RTX 3090 GPUs, each with 24GB of VRAM. We run each model with $g$ GPUs, using the minimum $g$ that provides enough VRAM to run inference on a batch of 10 prompts. For each dataset, we randomly sample the maximum number of questions that can fit into a batch without running out of VRAM. We estimate the time to complete the inference, $h$ (in hours), and the number of non-eos tokens from all answers, $N$. Finally, we look up the hourly cost to rent an RTX 3090 GPU online, $c$. In this case, we use the price from runpod.io RunPod (2023), where $c = \$0.44$ per GPU-hour. Hence, the per-token cost on each model is $g \cdot h \cdot c / N$.
Our cost statistics for all models on the HumanEval dataset are provided in Table 2. We did not include the length of the input prompt in the calculation, even though it also affects the computation cost. We account for this issue by collecting averaged time statistics for each model family on each dataset.
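The cost estimate can be written out as a short calculation. The sketch below assumes a rental price of 0.44 dollars per GPU-hour, a value inferred from the costs, GPU counts, and times listed in Table 2 rather than quoted from the provider; the function names and the token count in the usage note are hypothetical.

```python
PRICE_PER_GPU_HOUR = 0.44  # assumed rate, inferred from Table 2


def run_cost(n_gpus: int, hours: float, price: float = PRICE_PER_GPU_HOUR) -> float:
    """Total rental cost of one inference run: GPUs x hours x hourly price."""
    return n_gpus * hours * price


def per_token_cost(
    n_gpus: int, hours: float, n_tokens: int, price: float = PRICE_PER_GPU_HOUR
) -> float:
    """Cost per generated (non-eos) token for one model."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return run_cost(n_gpus, hours, price) / n_tokens
```

For instance, the 34B model in Table 2 uses 8 GPUs for 5.75 hours, giving 8 × 5.75 × 0.44 = 20.24 dollars for the run; dividing by the number of generated tokens (not reported here) would give its per-token cost.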
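Table 2: Cost statistics for all models on the HumanEval dataset.

| Model Family | Size | N.gpu | Time (h) | Batch | Cost ($) |
|---|---|---|---|---|---|
| WizardCoder-Python-V1.0 | 7B | 2 | 2.17 | 20 | 1.91 |
| WizardCoder-Python-V1.0 | 13B | 4 | 3.78 | 24 | 6.65 |
| WizardCoder-Python-V1.0 | 34B | 8 | 5.75 | 48 | 20.24 |
| WizardCoder-V1.0 | 1B | 1 | 0.31 | 180 | 0.13 |
| WizardCoder-V1.0 | 3B | 1 | 0.58 | 80 | 0.26 |
| WizardCoder-V1.0 | 15B | 2 | 1.97 | 32 | 1.74 |
| Codegen-mono | 350M | 1 | 2.36 | 48 | 1.04 |
| Codegen-mono | 2B | 1 | 5.97 | 12 | 2.63 |
| Codegen-mono | 6B | 2 | 6.44 | 12 | 5.67 |
| Codegen-mono | 16B | 4 | 9.00 | 12 | 15.84 |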
4 Experiment Setup
4.1 Models
There are multiple open-source code generation language models, including Codegen Nijkamp et al. (2022), InCoder Fried et al. (2022), StarCoder Li et al. (2023), Code LLaMA Rozière et al. (2023), CodeGeeX Zheng et al. (2023), WizardCoder Luo et al. (2023), DeepSeek-Coder Guo et al. (2024), etc.
For our experiments, we chose three model families to work with: the Codegen-mono family (4 models, with 350M, 2B, 6B, and 16B parameters), the WizardCoder-V1.0 family (3 models, with 1B, 3B, and 15B parameters), and the WizardCoder-Python-V1.0 family (3 models, with 7B, 13B, and 34B parameters). The Codegen-mono family models were trained on the THEPILE, BIGQUERY and BIGPYTHON datasets. The two WizardCoder families were fine-tuned from the StarCoder family and the LLAMA-2 family Touvron et al. (2023), respectively, on the instruction dataset Code Alpaca Chaudhary (2023). Their models share the same structure as their respective foundation models. The three model families have competitive performance at their corresponding sizes, and share features that are suitable for our experiments:
1. They are open-source families of multiple sizes spanning a relatively wide range. The larger the difference in size, the more inference time we can save when adopting a smaller checkpoint’s output.
2. Within each family, all checkpoints are trained on the same datasets, so the larger checkpoints are expected to have a higher accuracy than the smaller checkpoints. If a larger checkpoint has lower accuracy than a smaller one, we should eliminate it from the cascading system.
4.2 Datasets
For our experiments, we use HumanEval Chen et al. (2021), MBPP-sanitized Austin et al. (2021), and APPS-test introductory level subset Hendrycks et al. (2021). They have 164, 427, 1000 questions respectively. We show the greedy accuracy of all models in each model family on each dataset in Table 1.
In APPS-test, we only provide cascading results on the introductory level questions using the WizardCoder-Python-V1.0 model family. The reason is that the accuracy of the other model families on the introductory level, as well as that of all model families on the interview and competition levels, is below 10% on average. When the test set is too challenging for the model family, the cascading system will waste time on smaller models rather than save time.
4.3 Environment
Our server consists of NVIDIA GeForce RTX 3090 GPUs. We use CUDA Toolkit 11.8 and a PyTorch-based Python implementation from Hugging Face transformers Wolf et al. (2020), version 4.31.0, with the accelerate library Gugger et al. (2022), which mirrors the training environment of WizardCoder. We did not use vLLM Kwon et al. (2023), Triton Inference Server, DeepSpeed or quantization, but our method is compatible with these acceleration strategies. We generated code completions with a maximum of 1024 generated tokens and early stopping.
4.4 Speculative Decoding
Previous work has shown that the accelerating effect of batching is stronger than that of SD Su et al. (2023), so we implemented speculative decoding (SD) in a multi-batch setting. The per-token cost of each model pair is calculated in the same way as in Section 3.4. As in Su et al. (2023), we used argmax sampling, so the output of SD has the same accuracy as that of the target model in greedy search.
We implemented SD on the two more recent WizardCoder families. We did not compare our method with SD on HumanEval and MBPP. Since the models are instruction-tuned in the Alpaca format, their outputs always begin by copying the function’s definition and instructions, which yields a 100% acceptance rate for over half of the generated tokens and thus gives SD an unfair advantage.
5 Results
In Figure 3, we plot the throughput and accuracy of all the 3 model families on HumanEval and MBPP datasets, and the 2 model families on APPS-Intro. For each run, in the validation set, we randomly sampled 30% of all questions, selected the Pareto points as optimal combinations, and estimated their cost and accuracy on the test set with the rest of the 70% questions.
In general, for all model families on all datasets in the experiment, our scheme provides near-optimal combinations at any budget range. We find that the optimal parameter combinations from our cascading scheme are more accurate and less costly than single models.
Its effect on the WizardCoder-Python-V1.0 family (right-most column) was the strongest. The selected cascading combinations achieve better results on all datasets in all budget ranges. We speculate that the reason is that highly capable models generate high-quality tests, even when the questions are difficult. For the other two model families, while the advantage over a single model is less pronounced, the scheme still correctly finds single-model solutions when they are indeed optimal.
Our cascading scheme achieves a more obvious acceleration compared to using a single model on higher budgets. In the best case, the cascading scheme can save 49% of cost while achieving the same level of accuracy as using a single model. Therefore, our work is most valuable for accuracy-sensitive tasks.
Speculative decoding achieves a cost-accuracy optimization similar to our scheme on the APPS-Intro dataset. Its largest batch size, cost and hit rate are provided in Table 3. It is less effective than in other speculative decoding works Chen et al. (2023c); Leviathan et al. (2023), because its effect decreases when we fully utilize the GPUs by increasing the batch size above 1. Moreover, it only provides one solution, neglecting the chance of increasing accuracy by 2-5% with a small increase in cost.

Table 3: Speculative decoding results (largest batch size, cost and hit rate) for each model pair.

| Model Family | Size | N.gpu | Hit (%) | Batch | Cost ($) | Cost reduction (%) |
|---|---|---|---|---|---|---|
| WC-P-V1.0 | 7-34B | 8 | 63.9 | 8 | 37.1 | 3.1 |
| WC-V1.0 | 1-15B | 2 | 72.5 | 8 | 5.57 | 15.2 |
6 Conclusion
We proposed a full heuristic pipeline to find the optimal cascading strategy, and showed that it achieves lower cost and higher accuracy for various model families and datasets. Furthermore, it generates multiple optimal solutions for all ranges of budget, so users can pick the best solution based on their sensitivity to cost and accuracy, while speculative decoding only offers one choice. Our cascading pipeline is more effective when the model family has higher quality in code generation.
Our experiments focused on Python code generation tasks, but the method is compatible with all languages. In our experiments, we only attempted a global value for the threshold $\theta$. To further examine the potential of model cascading for code completion, one can explore the results using a different value of $\theta$ for each model and each $[k_i, l_i]$ combination.
References
- Chen et al. [2021] Mark Chen et al. Evaluating large language models trained on code. 2021.
- Li et al. [2022] Yujia Li et al. Competition-level code generation with AlphaCode. Science, 378:1092–1097, 2022.
- Nijkamp et al. [2022] Erik Nijkamp et al. CodeGen: An open large language model for code with multi-turn program synthesis. 2022.
- Git [2021] GitHub Copilot · your AI pair programmer, 2021. URL https://copilot.github.com/.
- Zhu et al. [2023] Xunyu Zhu et al. A survey on model compression for large language models. 2023.
- Liu et al. [2023] Zechun Liu et al. LLM-QAT: Data-free quantization aware training for large language models. 2023.
- Ma et al. [2023] Xinyin Ma et al. LLM-Pruner: On the structural pruning of large language models. 2023.
- Hsieh et al. [2023] Cheng-Yu Hsieh et al. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. 2023.
- Dao et al. [2022] Tri Dao et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- Nijkamp et al. [2023] Erik Nijkamp et al. CodeGen2: Lessons for training LLMs on programming and natural languages. 2023.
- Li et al. [2023] Raymond Li et al. StarCoder: May the source be with you! 2023.
- Guo et al. [2024] Daya Guo et al. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. 2024.
- Chen et al. [2023a] Bei Chen et al. CodeT: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023a.
- Rozière et al. [2023] Baptiste Rozière et al. Code Llama: Open foundation models for code. 2023.
- Luo et al. [2023] Ziyang Luo et al. WizardCoder: Empowering code large language models with Evol-Instruct. 2023.
- Chen et al. [2023b] Lingjiao Chen et al. FrugalGPT: How to use large language models while reducing cost and improving performance. 2023b.
- Sanh et al. [2020] Victor Sanh et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2020.
- Leviathan et al. [2023] Yaniv Leviathan et al. Fast inference from transformers via speculative decoding. 2023.
- Xiong et al. [2023] Weimin Xiong et al. The program testing ability of large language models for code. 2023.
- RunPod [2023] RunPod. GPU cloud service, 2023. URL http://runpod.io.
- Fried et al. [2022] Daniel Fried et al. InCoder: A generative model for code infilling and synthesis. 2022.
- Zheng et al. [2023] Qinkai Zheng et al. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. 2023.
- Touvron et al. [2023] Hugo Touvron et al. Llama 2: Open foundation and fine-tuned chat models. 2023.
- Chaudhary [2023] Sahil Chaudhary. Code Alpaca: An instruction-following LLaMA model for code generation. GitHub repository. https://github.com/sahil280114/codealpaca, 2023.
- Austin et al. [2021] Jacob Austin et al. Program synthesis with large language models. 2021.
- Hendrycks et al. [2021] Dan Hendrycks et al. Measuring coding challenge competence with APPS. 2021.
- Wolf et al. [2020] Thomas Wolf et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, 2020.
- Gugger et al. [2022] Sylvain Gugger et al. Accelerate: Training and inference at scale made simple, efficient and adaptable. GitHub repository. https://github.com/huggingface/accelerate, 2022.
- Kwon et al. [2023] Woosuk Kwon et al. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- Su et al. [2023] Qidong Su et al. The synergy of speculative decoding and batching in serving large language models. 2023.
- Chen et al. [2023c] Charlie Chen et al. Accelerating large language model decoding with speculative sampling. 2023c.