Model Cascading for Code: Reducing Inference Costs with Model Cascading for LLM Based Code Generation

Boyuan Chen
New York University
Mingzhi Zhu
New York University
Brendan Dolan-Gavitt
New York University
Muhammad Shafique
New York University Abu Dhabi
Siddharth Garg
New York University
Abstract

The rapid development of large language models (LLMs) has led to significant advancements in code completion tasks. While larger models have higher accuracy, they also cost much more to run. Meanwhile, model cascading has been shown to conserve computational resources while enhancing accuracy in LLMs on natural language generation tasks. It generates output with the smallest model in a set, and queries the larger models only when the output fails to meet predefined quality criteria. However, this strategy has not been used in code completion tasks, primarily because assessing the quality of code completions differs substantially from assessing natural language: the former relies heavily on functional correctness. To address this, we propose letting each model generate and execute a set of test cases for its solutions, and using the test results as the cascading threshold. We show that our model cascading strategy reduces computational costs while increasing accuracy compared to generating the output with a single model. We also introduce a heuristic to determine the optimal combination of the number of solutions, test cases, and test lines each model should generate, based on the budget. Compared to speculative decoding, our method works on black-box models and achieves the same level of cost-accuracy trade-off, yet provides many more choices based on the server’s budget. Ours is the first work to optimize the cost-accuracy trade-off for LLM code generation with model cascading.

1 Introduction

Recently, large language models (LLMs) trained on vast datasets of code have proven to be highly capable at programming tasks (Chen et al., 2021; Li et al., 2022; Nijkamp et al., 2022). These models have been incorporated into popular AI programming assistants, e.g., GitHub Copilot Git (2021), and have attracted millions of users. However, code completion using LLMs is expensive, since it requires multiple forward passes of large models. Moreover, prompts tend to be unique and cannot be cached. The high computational demands and large memory consumption of LLMs, even during inference, result in high energy consumption and massive operational costs to provide code completion as a service for millions of users Zhu et al. (2023). Therefore, it is imperative to reduce the inference cost of code completion LLMs. Prior methods to reduce LLM inference costs include quantization Liu et al. (2023), pruning Ma et al. (2023), knowledge distillation Hsieh et al. (2023), altered attention mechanisms Dao et al. (2022), etc. These methods reduce inference costs for a single given model.

State-of-the-art code completion LLMs are typically available in multiple sizes, for instance, Codegen-2 Nijkamp et al. (2023) (1B, 3B, 7B, 16B), StarCoder Li et al. (2023) (1B, 3B, 15B), and DeepSeek-Coder Guo et al. (2024) (1B, 7B, 33B), representing not just a single model but a family of models. Smaller models have lower performance in terms of code correctness, but also lower inference-time and memory requirements. In this paper, we ask the following question: can the capabilities of different models in a family be combined to improve the cost versus accuracy trade-offs?

To answer this question, we propose and evaluate model cascading for code, which we will refer to as model cascading for short. Model cascading builds on a simple, intuitive idea: we only query big models when absolutely necessary, i.e., if smaller models cannot provide correct code completions. In our implementation, prompts are sent to the smallest model first, working their way up to progressively bigger models depending on the quality of the models’ responses in each stage.

We use our selected model families to further demonstrate the potential of model cascading in Table 1: Wizard-Python-V1.0’s smallest 7B model achieves 56.7% accuracy on HumanEval, while the largest 34B model achieves 72.6% accuracy. The additional 15.9 percentage points of accuracy come at an estimated 10× cost. With model cascading, the hope is to route most queries to the smallest model, reserving challenging queries for the large model.
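To make the potential saving concrete, the sketch below estimates the expected per-token cost of a two-model cascade from the Table 1 figures. It is a back-of-the-envelope illustration rather than the cost model used in the experiments: the function name `expected_cascade_cost` and the escalation rates are hypothetical, and it assumes a single greedy solution per model with no generated tests and equal output lengths.

```python
def expected_cascade_cost(cost_small, cost_large, escalation_rate):
    """Expected cost when every query goes to the small model and a
    fraction `escalation_rate` of queries is also sent to the large one."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return cost_small + escalation_rate * cost_large

# Table 1 costs ($ per 1M tokens) for the 7B and 34B models.
COST_7B, COST_34B = 2.22, 23.34

# Under these assumptions the cascade is cheaper than always calling the
# 34B model while the escalation rate stays below this fraction.
break_even = (COST_34B - COST_7B) / COST_34B

for rate in (0.2, 0.5, 0.8):  # hypothetical escalation rates
    print(rate, round(expected_cascade_cost(COST_7B, COST_34B, rate), 2))
print("break-even escalation rate:", round(break_even, 3))
```

Generating several solutions and tests per model raises the small-model cost, so the real break-even point depends on the chosen parameters.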

A key challenge is deciding when to request assistance from a larger model. That is, a model must not only generate code, but also have the ability to test its code for correctness. One solution is for the designer to provide a test suite. However, this entails considerable design effort, and prior work in LLM-based code generation does not assume access to test suites. In model cascading, we instruct models to generate both code (up to a maximum of $k$ potential solutions) and a set of test cases. We then employ a learned threshold for the success rate to decide when to seek assistance from a larger model. In this way, an initial user prompt cascades up from the smallest to the largest model, stopping when the pass rate exceeds a pre-defined learned threshold. The self-testing scheme also increases the final output’s accuracy, as previous work has shown Chen et al. (2023a).

To the best of our knowledge, model cascading is the first solution that leverages the combined power of a family of black-box LLMs to improve the operational cost versus accuracy tradeoffs. Although we evaluate model families that differ in parameter size, model cascading is easily extended to other approaches that generate model families, for example, an LLM model with varying quantization levels. While speculative decoding is another algorithm to reduce inference time in a similar spirit, we list its drawbacks in Section 2, and demonstrate that our results are superior with a comparison.

Our empirical results show that the optimal cascading strategy is always better than using a single model, and in the best case, our model cascading scheme can save 49% of cost while achieving the same accuracy compared to using a single model.

Our contributions include:

- We show with experiments that model cascading with a self-testing threshold can decrease the computation cost while increasing accuracy for the one-best answer in LLMs on code completion tasks.

- We propose a full heuristic pipeline to find the optimal set of parameters, in terms of lowest inference cost and highest pass@1 accuracy at a cost budget, for a given set of models on a given dataset. Our method fits the demands of an industrial code completion server. It is compatible with any multi-LLM system for code completion, and is fully black-box.

- We demonstrate with reasons and experiments that our model cascading pipeline is compatible with more model families and provides users with a wider range of cost-accuracy solutions in code generation tasks compared to speculative decoding.

Figure 1: An overview of our proposed model cascading solution with $n$ models. Models with higher indices have more parameters. Starting with model 1, we generate multiple code solutions and test cases. We score each solution-test pair and identify the best pair; if the score exceeds a threshold, we accept the solution and output; otherwise, we move on to the next model and repeat the process, until we take the highest-scored output from the largest model.
Table 1: Cost and pass@1 accuracy with greedy search of all models on each dataset. Cost is estimated as $ per 1M tokens averaged on the three datasets. The APPS dataset refers to the introductory questions in the test set only.

| LLM Family | Size | Cost ($) | HumanEval (%) | MBPP (%) | APPS (%) |
|---|---|---|---|---|---|
| Wizard-Python-V1.0 | 7B | 2.22 | 56.7 | 53.9 | 16.4 |
| Wizard-Python-V1.0 | 13B | 7.87 | 64.6 | 55.0 | 21.4 |
| Wizard-Python-V1.0 | 34B | 23.34 | 72.6 | 61.8 | 23.7 |
| Wizard-V1.0 | 1B | 0.23 | 23.8 | 33.0 | 3.9 |
| Wizard-V1.0 | 3B | 0.45 | 34.8 | 41.5 | 8.6 |
| Wizard-V1.0 | 15B | 3.22 | 60.4 | 53.2 | 13.8 |
| Codegen-Mono | 350M | 1.28 | 14.6 | 22.0 | 0.4 |
| Codegen-Mono | 2B | 3.12 | 26.2 | 34.2 | 4.9 |
| Codegen-Mono | 6B | 7.96 | 27.4 | 41.9 | 5.1 |
| Codegen-Mono | 16B | 17.40 | 30.5 | 45.4 | 5.9 |

2 Related Works

LLMs for Code There is a large and growing body of work on LLM-based code generation. Recent fine-tuned LLMs for code include Code LLAMA Rozière et al. (2023), WizardCoder Luo et al. (2023), etc. These models are typically evaluated using Pass@k metrics. That is, they generate $k$ solutions and “pass” a test as long as at least one solution is correct. In practice, however, a developer would have to pick from the $k$ solutions, requiring significant human effort.
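For reference, Pass@k is commonly reported with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): draw n ≥ k samples per problem, count the c correct ones, and compute 1 − C(n−c, k)/C(n, k). The following is a minimal sketch of that estimator; the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect, i.e.
    # every size-k subset contains at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the fraction of correct samples, c / n.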

LLM Cascading Prior work has investigated cascading for natural language prompts. FrugalGPT Chen et al. (2023b) trains a small DistilBERT model Sanh et al. (2020) to output a score based on the query and the answer, and uses this score to determine when to query a larger model. So far, there is no published work on model cascading for code completion tasks, in part because output evaluation is more challenging.

Self Testing for Code Generation To assist models in ranking their own solutions, CodeT Chen et al. (2023a) has recently proposed that LLMs generate test cases as extra outputs in addition to solutions, and run the solutions through these test cases to rank them. In our work, we use their solution-test score as the cascading threshold. While CodeT’s goal is only to rank-order solutions from a single model, model cascading uses the same self-testing scheme to determine when to query the next larger model in the cascade.

Speculative Decoding Speculative decoding (SD) is a very recent technique in the same spirit as model cascading that reduces the inference time of a larger (“target”) model with a smaller (“draft”) model Leviathan et al. (2023). The method first uses the smaller model to speculatively generate a certain number of tokens, lets the bigger model compare these tokens with its own predictions, and has the bigger model take over when the two results diverge. Inference time is reduced because the checking procedure is faster than generating new tokens. Nonetheless, this method has several limitations:

1. It does not work with black-box models, which are increasingly common in commercial deployments.

2. It queries both draft and target models, thus saving inference time but not computation and energy. In contrast, model cascading seeks to use the small model as frequently as possible, seeking to avoid using the big(ger) models altogether.

3. By making the draft model conform to the target model’s predictions, it ignores the fact that the smaller model can sometimes solve problems the bigger model cannot solve, even when they are trained on the same dataset.

4. It only works for two models, missing the opportunity to include 3 or more models for more diverse levels of capacity.

5. Since the draft model brings additional computation, researchers found that in order for the system to save time, the draft model must be significantly smaller (around 20 times) than the target model Leviathan et al. (2023) while still having a similar output to the latter, otherwise the acceptance rate would be too low. This strict requirement makes suitable model pairs hard to find.
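The draft-then-verify loop described at the start of this subsection can be sketched for the greedy (argmax) case as follows. This is a toy illustration and not the implementation evaluated in this paper: the two models are stand-in callables mapping a token prefix to the next token, and `speculative_greedy` and `gamma` (the number of drafted tokens per round) are hypothetical names. A real system verifies all drafted positions in a single batched forward pass of the target model, which is where the speed-up comes from; here the checks are sequential for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function

def speculative_greedy(draft: NextToken, target: NextToken,
                       prompt: List[int], max_new: int,
                       gamma: int = 4) -> List[int]:
    """Greedy speculative decoding with stand-in next-token functions."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes gamma tokens autoregressively.
        proposal: List[int] = []
        for _ in range(gamma):
            proposal.append(draft(out + proposal))
        # 2. The target model checks the proposal position by position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target takes over at divergence
                break
        else:
            # Every drafted token matched; the target adds one more token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new]
```

Because every accepted token equals the target model's own greedy choice, the output matches plain greedy decoding with the target model; only the number of sequential target calls changes.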

3 Methodology

We now describe our proposed solution, model cascading for code (see Figure 1 for an overview). We begin by describing our method to evaluate when to send queries to the next larger model.

3.1 Cascading Pipeline

Let there be a model family $\mathcal{M}$ with $n$ models, $\{m_1, \ldots, m_n\}$, ordered in terms of their parameter sizes from smallest (and hence fastest) to largest. For a user prompt $p$, a natural language description of the function the user seeks to implement, we first ask the smallest model $m_1$ to generate $k_1$ solutions, in addition to $k_1$ tests, each consisting of up to $l_1$ test lines, i.e., the assert statements as shown in Figure 1. The test lines produced across all solutions are pooled together into a test suite $T_1$ (consisting of a total of up to $k_1 \times l_1$ tests) and used to evaluate each solution.
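Evaluating each solution against the pooled test suite amounts to building a boolean pass matrix. A minimal sketch follows, assuming solutions are Python source strings and test lines are single `assert` statements; the helper names `passes` and `pass_matrix` are hypothetical. The sketch runs code with the built-in `exec` and has no sandboxing or timeouts, both of which would be needed before executing untrusted model output.

```python
from typing import List

def passes(solution_src: str, test_line: str) -> bool:
    """Return True if the assert in `test_line` holds for the solution."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate function
        exec(test_line, namespace)     # run one assert statement against it
        return True
    except Exception:
        # A failed assertion, a syntax error, or a runtime error all
        # count as the solution not passing this test line.
        return False

def pass_matrix(solutions: List[str], test_suite: List[str]) -> List[List[bool]]:
    """passed[i][j] is True when solution i passes pooled test line j."""
    return [[passes(s, t) for t in test_suite] for s in solutions]
```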

We adopt the same strategy to select a solution out of multiple as the prior works in self-testing Chen et al. (2023a); Xiong et al. (2023). For each solution $s \in \mathcal{S} = \{s_1, \ldots, s_{k_1}\}$, we define $T_{pass} \subseteq T_1$, the subset of test lines that the solution passes, and we compute two measures of solution quality: $n_s = |T_{pass}|$, the number of passing test lines, and $n_t$, a measure of test quality among the passing test lines. $n_t$ gives the highest number of solves that a passing test line has across all solutions; that is, if $solves(t, \mathcal{S})$ is the number of solutions that pass test line $t$, then $n_t = \max_{t \in T_{pass}} solves(t, \mathcal{S})$. Intuitively, $n_t$ captures the idea that a test line (assuming it is not trivial) is higher quality if it is passed by multiple solutions. The overall solution quality is the product $n_s \times n_t$, which we use to rank solutions and decide if we should adopt the current model’s answer or proceed to a larger model; the highest scoring solution-test pair is denoted ($s_{\alpha_1}$, $t_{\alpha_1}$) and its score is $n_{s_{\alpha_1}} \times n_{t_{\alpha_1}}$.
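The scoring rule above can be sketched directly from its definitions. The input is assumed to be a boolean matrix `passed[i][j]` recording whether solution i passes pooled test line j; the function names are hypothetical, and treating a solution that passes no test line as having score 0 is an assumption, since the text does not cover that case.

```python
from typing import List, Tuple

def score_solutions(passed: List[List[bool]]) -> List[int]:
    """Score each solution as n_s * n_t from a boolean pass matrix."""
    if not passed:
        return []
    num_tests = len(passed[0])
    # solves[j]: number of solutions that pass test line j.
    solves = [sum(row[j] for row in passed) for j in range(num_tests)]
    scores = []
    for row in passed:
        passing = [j for j in range(num_tests) if row[j]]  # T_pass
        n_s = len(passing)
        # Assumption: n_t is 0 when the solution passes no test line.
        n_t = max((solves[j] for j in passing), default=0)
        scores.append(n_s * n_t)
    return scores

def best_solution(passed: List[List[bool]]) -> Tuple[int, int]:
    """Return (index, score) of the top solution; needs >= 1 solution."""
    scores = score_solutions(passed)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

For example, with three solutions and three test lines where the per-test solve counts are 2, 3, and 1, a solution passing the first two lines has n_s = 2 and n_t = 3, giving a score of 6.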

To decide whether a solution is acceptable, we define a threshold parameter $\theta$, which takes a value from 0 to 1. Let there be $N_{t_1}$ valid test lines in total. Note that $N_{t_1} \leq k_1 \times l_1$, because a model might stop generating before it reaches the maximum number of test lines $l_1$. If the highest solution score $n_{s_{\alpha_1}} \times n_{t_{\alpha_1}}$ is greater than or equal to $\theta \times k_1 \times N_{t_1}$, we adopt the corresponding solution-test pair ($s_{\alpha_1}$, $t_{\alpha_1}$) and end the pipeline. Otherwise, we repeat the above procedure with $m_2$: pass the same prompt to $m_2$, let it generate $k_2$ solutions and tests of $l_2$ test lines each, and score the solutions. We continue until we find a solution whose quality exceeds our threshold $\theta$. When we use the largest model $m_n$, we skip the threshold check and directly adopt the highest-scoring solution.
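The control flow of the pipeline can be summarized in a few lines. In this sketch each model is abstracted as a hypothetical `Stage` callable that returns its best solution, that solution's score, the number of solutions generated, and the number of valid test lines; generation, test execution, and scoring are assumed to happen inside the stage. The interface and names are illustrative only.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical stage interface: prompt -> (best solution, best score,
# k = number of solutions, N_t = number of valid test lines).
Stage = Callable[[str], Tuple[str, int, int, int]]

def cascade(prompt: str, stages: Sequence[Stage],
            theta: float) -> Tuple[str, int]:
    """Return (solution, index of the model that produced it)."""
    if not stages:
        raise ValueError("at least one model is required")
    for i, stage in enumerate(stages):
        solution, score, k, n_valid = stage(prompt)
        is_last = i == len(stages) - 1
        # Accept when score >= theta * k * N_t; the largest model's
        # top-ranked solution is adopted without a threshold check.
        if is_last or score >= theta * k * n_valid:
            return solution, i
    raise AssertionError("unreachable: the last stage always returns")
```

Stages are ordered from the smallest to the largest model, so an early return means the larger models are never queried for that prompt.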

3.2 Finding Optimal Parameters

We introduce the parameter selection procedure in this section. Our cascading system has three main sets of parameters: the number of solutions and tests generated for each model, $k_1, \ldots, k_n$; the number of test lines in each test for each model, $l_1, \ldots, l_n$; and the global cascading threshold parameter $\theta$.

We first set a range for each parameter. In our experiments, we let $k \in [-1, 0, 1, 3, 5, 10]$ for each model. When $k = -1$, we skip this model; when $k = 0$, we generate only one solution with greedy search and exit with that solution as output, without generating tests; when $k = 1$, we generate one solution and one test with greedy search; and when $k > 1$, we generate $k$ solutions and $k$ tests with sampling, using temperature 0.8. For example, if we have 3 models in a cascading system, then [$k_1 = 3$, $k_2 = 5$, $k_3 = 0$] and [$k_1 = 10$, $k_2 = -1$, $k_3 = 3$] are both valid sets of parameters, but [$k_1 = 1$, $k_2 = 0$, $k_3 = 3$] is not, because $k_2 = 0$ means we must exit on model 2, so we must have $k_3 = -1$.
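The validity rule for the per-model k values can be expressed as a small check. This is a sketch under stated assumptions: the names `ALLOWED_K`, `is_valid_k`, and `enumerate_k` are hypothetical, and rejecting a configuration that skips every model is an added assumption not spelled out in the text.

```python
from itertools import product

ALLOWED_K = (-1, 0, 1, 3, 5, 10)  # range of k used in the experiments

def is_valid_k(ks):
    """Check one per-model assignment of k, ordered smallest model first."""
    if any(k not in ALLOWED_K for k in ks):
        return False
    if all(k == -1 for k in ks):
        return False  # assumption: at least one model must be used
    for i, k in enumerate(ks):
        if k == 0:
            # k = 0 exits unconditionally, so later models must be skipped.
            return all(later == -1 for later in ks[i + 1:])
    return True

def enumerate_k(n_models):
    """All valid k assignments for a family of n_models models."""
    return [ks for ks in product(ALLOWED_K, repeat=n_models) if is_valid_k(ks)]
```

Applied to the examples in the text, `(3, 5, 0)` and `(10, -1, 3)` are accepted while `(1, 0, 3)` is rejected.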

Similarly, we let $l \in [0, 2, 4]$ for each model. $l = 0$ if and only if $k = 0$, which means we do not generate any test code; when $l > 0$, we let the model generate a maximum of $l$ lines starting with assert. Note that the model might stop generating before it reaches $l$ lines. In that case, we will have fewer than $k \times l$ test lines for the current model.
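Collecting at most l test lines from a generated test can be sketched as below. The function name `extract_test_lines` is hypothetical, and the sketch assumes each test line is a single-line statement beginning with `assert`; multi-line assertions would need a real parser.

```python
from typing import List

def extract_test_lines(generated: str, max_lines: int) -> List[str]:
    """Keep up to `max_lines` lines that start with `assert`."""
    lines: List[str] = []
    for raw in generated.splitlines():
        if len(lines) >= max_lines:
            break  # the cap l has been reached
        line = raw.strip()
        if line.startswith("assert"):
            lines.append(line)
    return lines
```

If the model stops early, the returned list is simply shorter than `max_lines`, which is why the pooled suite can hold fewer than k × l lines.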

Both $k$ and $l$ are parameters for each model in a cascading system. However, $\theta$ is a global parameter. The cascading thresholds from model 1 to model 2, and from model 2 to model 3, use the same value of $\theta$. In our experiments, we sampled $\theta$ from $[0.0, 0.1, 0.2, \ldots, 1.0]$.

Second, we sample a training set $d_t$ out of the full dataset $d$ to find the optimal parameters, and test our selected combinations on the validation set $d_v$. In our experiments, $d_t$ is randomly sampled and contains 30% of the questions in $d$.

Third, on $d_t$, we run the cascading system with all the valid [$\theta$, $k_1, \ldots, k_n$, $l_1, \ldots, l_n$] combinations as described in Section 3.1, and calculate the cost and accuracy of each. Then, we make a cost-accuracy plot for all $k$ and $l$ combinations for each $\theta$, similar to the plots in Figure 3. We select the optimal cascading combinations by finding all the Pareto points in the plot, that is, the points for which there is no other point with a higher accuracy and lower cost, and save these combinations as our optimal parameters. For comparison, we also save the optimal singular combinations by finding the Pareto points among the singular points, where only one model is used in the system (e.g., $k_1 = -1$, $k_2 = 3$, $k_3 = -1$). We compare the difference between the Pareto points and singular Pareto points across plots, and find the best $\theta$.
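Selecting the Pareto points from a set of (cost, accuracy) measurements can be sketched as follows. The names are hypothetical, and the sketch uses the standard weak-dominance form, which also drops a point that merely ties another on accuracy at a higher cost; this is slightly stricter than the wording above.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost, accuracy)

def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing cost."""
    front: List[Point] = []
    best_acc = float("-inf")
    # Visit points from cheapest to most expensive; at equal cost the
    # most accurate point comes first.
    for cost, acc in sorted(set(points), key=lambda p: (p[0], -p[1])):
        if acc > best_acc:  # strictly better than every cheaper point
            front.append((cost, acc))
            best_acc = acc
    return front
```

Running the same function on only the single-model configurations yields the singular Pareto points used for comparison.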

Finally, on $d_v$, we compare our optimal cascading combinations at the selected $\theta$ with the optimal singular combinations.

3.3 Cascading Threshold Parameter ($\theta$)

Figure 2: Cascading result using the three models in the Wizard-Python-V1.0 family on the full HumanEval dataset, with $k_1 = 5$, $k_2 = 3$, $k_3 = 0$, $l_1 = 4$, $l_2 = 4$, $l_3 = 4$. We sample $\theta$ at 0, 0.1, 0.2, …, 1.0. The highest accuracy is 76.7%, achieved at $\theta = 0.8$ with a cost of $8.03 per 1k queries.

In this section we discuss the impact of $\theta$ on the system’s cost and accuracy. As mentioned in Section 3.1, $\theta$ decides when to use a larger model. When $\theta$ is close to 0, the small model’s output is more likely to meet the standard, so the system exits with the current output; on the other hand, when $\theta$ is close to 1, the system is more likely to use bigger models on each question.

We demonstrate the impact of $\theta$ with an example in Figure 2. We used the WizardCoder-Python-V1.0 family to run the full HumanEval dataset, with fixed values for $k_1, \ldots, k_3$ and $l_1, \ldots, l_3$. As the value of $\theta$ increases, the cost increases monotonically, because using larger models more often always results in more computation. However, the highest accuracy is achieved at $\theta = 0.8$ rather than $1.0$. The reason is that when smaller models have larger $k$ values, they have more solutions to pick from and more tests to validate their solutions. That makes the outputs from smaller models potentially more accurate than those from larger models. On the other hand, too rigorous a bar would miss correct solutions from smaller models. Among the $k$ solutions, only a few might be correct, and the same holds among the $k \times l$ tests. If we set $\theta = 1.0$, all solutions must pass all tests, which is very unlikely for a large $k$.
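As a worked illustration, take the Figure 2 setting for the first model ($k_1 = 5$, $l_1 = 4$) and assume, for simplicity, that all $N_{t_1} = 20$ generated test lines are valid. The score bounds and acceptance bars from Section 3.1 then evaluate to:

```latex
\begin{align*}
\text{maximum score} &= \max n_s \times \max n_t = N_{t_1} \times k_1 = 20 \times 5 = 100,\\
\text{bar at } \theta = 0.8 &= \theta \times k_1 \times N_{t_1} = 0.8 \times 5 \times 20 = 80,\\
\text{bar at } \theta = 1.0 &= 1.0 \times 5 \times 20 = 100.
\end{align*}
```

Under these assumptions, $\theta = 1.0$ accepts only a solution that reaches the maximum score, whereas $\theta = 0.8$ also accepts, for example, a solution that passes 16 test lines when one of them is passed by all 5 solutions ($16 \times 5 = 80$).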

3.4 Cost Calculation

All the models that we used are public, and thus we need to estimate the cost of running them based on the inference time they take on GPUs. To this end, we calculate the cost by multiplying the number of tokens generated by the per-token cost, similar to the pricing scheme used by commercial model providers like OpenAI.

We deploy a server with multiple NVIDIA GeForce RTX 3090 GPUs, each with 24GB of VRAM. We run each model with $2^k$ GPUs using the minimum $k$ that provides enough VRAM to run inference on a batch of 10 prompts (here $k$ is an exponent, unrelated to the number of solutions). For each dataset, we randomly sample the maximum number of questions that can fit into a batch without running out of VRAM. We estimate the time to complete the inference, $T$, and the number of non-EOS tokens from all answers, $N_t$. Finally, we look up the hourly cost to rent an RTX 3090 GPU online, $C$. In this case, we use the price from runpod.io, where $C = \$0.44/\text{hr}$ RunPod (2023). Hence, the per-token cost on each model is $c = \frac{T}{N_t} C$.

Our cost statistics for all models on the HumanEval dataset are provided in Table 2. We did not include the length of the input prompt in the calculation, even though it also affects the computation cost. We account for this issue by collecting averaged time statistics for each model family on each dataset.

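| Model Family | Size | N. GPU | Time (h) | Batch | Cost ($) |
|---|---|---|---|---|---|
| WizardCoder-Python-V1.0 | 7B | 2 | 2.17 | 20 | 1.91 |
| WizardCoder-Python-V1.0 | 13B | 4 | 3.78 | 24 | 6.65 |
| WizardCoder-Python-V1.0 | 34B | 8 | 5.75 | 48 | 20.24 |
| WizardCoder-V1.0 | 1B | 1 | 0.31 | 180 | 0.13 |
| WizardCoder-V1.0 | 3B | 1 | 0.58 | 80 | 0.26 |
| WizardCoder-V1.0 | 15B | 2 | 1.97 | 32 | 1.74 |
| Codegen-mono | 350M | 1 | 2.36 | 48 | 1.04 |
| Codegen-mono | 2B | 1 | 5.97 | 12 | 2.63 |
| Codegen-mono | 6B | 2 | 6.44 | 12 | 5.67 |
| Codegen-mono | 16B | 4 | 9.00 | 12 | 15.84 |

Table 2: Time and cost statistics per 1M tokens generated for all selected models on the HumanEval dataset. We generated all completions with sampling. Time is averaged across 10 runs.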
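As a sanity check, the Table 2 costs can be reproduced, up to rounding, as (number of GPUs) × (hours per 1M tokens) × (hourly GPU rate). This reading is inferred from the table values rather than stated in the text, so the GPU-count factor should be treated as an assumption; the names below are hypothetical.

```python
GPU_HOURLY_RATE = 0.44  # $/hr for one RTX 3090, as quoted in Section 3.4

def cost_per_million_tokens(num_gpus: int, hours: float,
                            rate: float = GPU_HOURLY_RATE) -> float:
    """Rental cost of generating 1M tokens on `num_gpus` GPUs."""
    return num_gpus * hours * rate

# (family, size, GPUs, hours per 1M tokens, reported cost in $), Table 2.
TABLE_2 = [
    ("WizardCoder-Python-V1.0", "7B", 2, 2.17, 1.91),
    ("WizardCoder-Python-V1.0", "13B", 4, 3.78, 6.65),
    ("WizardCoder-Python-V1.0", "34B", 8, 5.75, 20.24),
    ("WizardCoder-V1.0", "1B", 1, 0.31, 0.13),
    ("WizardCoder-V1.0", "3B", 1, 0.58, 0.26),
    ("WizardCoder-V1.0", "15B", 2, 1.97, 1.74),
    ("Codegen-mono", "350M", 1, 2.36, 1.04),
    ("Codegen-mono", "2B", 1, 5.97, 2.63),
    ("Codegen-mono", "6B", 2, 6.44, 5.67),
    ("Codegen-mono", "16B", 4, 9.00, 15.84),
]

for family, size, gpus, hours, reported in TABLE_2:
    estimate = cost_per_million_tokens(gpus, hours)
    print(f"{family} {size}: estimated {estimate:.2f}, reported {reported:.2f}")
```

Small differences in the last digit are expected because the published times are themselves rounded.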

4 Experiment Setup

Figure 3: Cost-accuracy results of all $[k, t]$ combinations for the Codegen-mono, WizardCoder-V1.0, and WizardCoder-Python-V1.0 families on the HumanEval, MBPP and APPS-Intro datasets, each plot with a selected optimal $\theta$. The two WizardCoder families have 3 models, and the Codegen-mono family has 4 models. Light blue dots represent all parameter combinations; green dots represent the cascading optimal combinations; purple crosses represent single-model optimal combinations; yellow dots are speculative decoding results. All the marked combinations are selected from the training set. The procedure is as introduced in Section 3.2.

4.1 Models

There are multiple open-source code generation language models, including Codegen Nijkamp et al. (2022), InCoder Fried et al. (2022), StarCoder Li et al. (2023), Code LLaMA Rozière et al. (2023), CodeT5+ Zheng et al. (2023), WizardCoder Luo et al. (2023), DeepSeek-Coder Guo et al. (2024), etc.

For our experiments, we chose three model families to work with: the Codegen-mono family (4 models, with 350M, 2B, 6B, and 16B parameters), the WizardCoder-V1.0 family (3 models, with 1B, 3B, and 15B parameters), and the WizardCoder-Python-V1.0 family (3 models, with 7B, 13B, and 34B parameters). The Codegen-mono family models were trained on the THEPILE, BIGQUERY and BIGPYTHON datasets. The two WizardCoder families were fine-tuned from the StarCoder family and the LLAMA-2 family Touvron et al. (2023), respectively, on the instruction dataset Code Alpaca Chaudhary (2023). Their models share the same structure as their respective foundation models. The three model families have competitive performance at their respective sizes, and share features that are suitable for our experiments:

1. They are open-source families with multiple sizes spanning a relatively wide range. The larger the difference in size, the more inference time we can save when adopting a smaller checkpoint’s output.

2. Within each family, all checkpoints are trained on the same datasets, so the larger checkpoints always have a higher accuracy than the smaller checkpoints. If a larger checkpoint has lower accuracy than a smaller one, we should eliminate it from the cascading system.

4.2 Datasets

For our experiments, we use HumanEval Chen et al. (2021), MBPP-sanitized Austin et al. (2021), and the introductory-level subset of APPS-test Hendrycks et al. (2021). They have 164, 427, and 1000 questions, respectively. We show the greedy accuracy of all models in each model family on each dataset in Table 1.

In APPS-test, we only provide cascading results on the introductory-level questions using the WizardCoder-Python-V1.0 model family. The reason is that the accuracy of the other model families on the introductory level, as well as that of all model families on the interview and competition levels, is below 10% on average. When the test set is too challenging for the model family, the cascading system will waste time rather than save time on smaller models.

4.3 Environment

Our server consists of NVIDIA GeForce RTX 3090 GPUs. We use CUDA Toolkit 11.8 and a PyTorch-based Python implementation from Hugging Face transformers Wolf et al. (2020), version 4.31.0, with the accelerate library Gugger et al. (2022), which matches the training environment of WizardCoder. We did not use vLLM Kwon et al. (2023), Triton Inference Server, DeepSpeed, or quantization, but our method is compatible with these acceleration strategies. We capped code completions at 1024 generated tokens, with early stopping.
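For illustration, the per-model setting $k$ from Section 3.2 might map onto keyword arguments of the transformers `generate` method as sketched below. The generation code used in the experiments is not given in this paper, so the helper `generation_kwargs` and the mapping are assumptions; `do_sample`, `temperature`, `num_return_sequences`, and `max_new_tokens` are existing generation parameters in transformers.

```python
def generation_kwargs(k: int, max_new_tokens: int = 1024,
                      temperature: float = 0.8) -> dict:
    """Keyword arguments for `model.generate` given the setting k."""
    if k < 0:
        raise ValueError("k = -1 means the model is skipped")
    if k <= 1:
        # k = 0 or k = 1: a single completion with greedy search.
        return {
            "do_sample": False,
            "num_return_sequences": 1,
            "max_new_tokens": max_new_tokens,
        }
    # k > 1: k sampled completions at the stated temperature.
    return {
        "do_sample": True,
        "temperature": temperature,
        "num_return_sequences": k,
        "max_new_tokens": max_new_tokens,
    }
```

The resulting dictionary would be unpacked into a call such as `model.generate(**inputs, **generation_kwargs(5))`, where `model` and `inputs` are a loaded model and its tokenized prompt.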

4.4 Speculative Decoding

Previous work has shown that batching accelerates inference more than speculative decoding (SD) does Su et al. (2023), so we implemented SD in a multi-batch setting. The per-token cost of each model pair is calculated in the same way as in Section 3.4. As in Su et al. (2023), we used argmax sampling, so the output of SD has the same accuracy as that of the target model under greedy search.

We implemented SD on the two more recent WizardCoder families. We did not compare our method with SD on HumanEval and MBPP: because the models are instruction-tuned in the Alpaca format, their outputs always begin by copying the function’s definition and instructions, which yields a 100% acceptance rate for more than half of the generated tokens and gives SD an unfair advantage.
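The claim that SD with argmax sampling reproduces the target model's greedy output can be checked on a toy example. The sketch below is illustrative only and is not the multi-batch implementation used in the experiments: `draft` and `target` are hypothetical stand-in functions that map a token prefix to its argmax next token, and the verification loop calls the target once per position where a real system would use one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # prefix -> argmax next token

def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq[len(prompt):]

def speculative_greedy(draft: NextToken, target: NextToken, prompt: List[int],
                       n_new: int, lookahead: int = 4) -> Tuple[List[int], float]:
    """Return (generated tokens, fraction of drafted tokens that were accepted)."""
    seq = list(prompt)
    drafted = accepted = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The draft model proposes `lookahead` tokens autoregressively.
        proposal: List[int] = []
        for _ in range(lookahead):
            proposal.append(draft(seq + proposal))
        drafted += len(proposal)
        # 2. The target verifies the proposal position by position.
        for tok in proposal:
            expected = target(seq)
            if tok == expected:
                seq.append(tok)
                accepted += 1
            else:
                seq.append(expected)  # first mismatch: use the target's token
                break
        else:
            seq.append(target(seq))  # all accepted: the target adds one more token
    return seq[len(prompt):len(prompt) + n_new], accepted / drafted

# Toy stand-ins for the two models (hypothetical, deterministic).
def target(seq: List[int]) -> int:
    return (3 * seq[-1] + 1) % 7

def draft(seq: List[int]) -> int:
    # Agrees with the target except at every third position.
    return target(seq) if len(seq) % 3 else 0

tokens, hit_rate = speculative_greedy(draft, target, prompt=[2], n_new=12)
print(tokens == greedy_decode(target, [2], 12))  # True
```

Every token appended to the sequence is the target's own argmax choice, so the draft model only affects speed (through the acceptance rate), never the output.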

5 Results

In Figure 3, we plot the throughput and accuracy of all 3 model families on the HumanEval and MBPP datasets, and of the 2 model families on APPS-Intro. For each run, we randomly sampled 30% of all questions as the validation set, selected the Pareto-optimal points on it as the optimal combinations, and estimated their cost and accuracy on the test set formed by the remaining 70% of the questions.
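The validation procedure can be sketched as a random split followed by a Pareto filter over (cost, accuracy) pairs. This is a minimal illustration under stated assumptions: the helper names, the fixed seed, and the configuration values are hypothetical and do not come from the experiments.

```python
import random
from typing import Dict, List, Sequence, Tuple

def split_questions(question_ids: Sequence[int], val_fraction: float = 0.3,
                    seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split question ids into (validation, test)."""
    ids = list(question_ids)
    random.Random(seed).shuffle(ids)
    n_val = round(len(ids) * val_fraction)
    return ids[:n_val], ids[n_val:]

def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of configurations that no other configuration dominates.

    Each value is (cost, accuracy). A point is dominated if another point is
    no more costly and no less accurate, and strictly better in at least one.
    """
    front = []
    for name, (cost, acc) in points.items():
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for other, (c, a) in points.items() if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: points[n][0])

val_ids, test_ids = split_questions(range(164))  # HumanEval has 164 questions
print(len(val_ids), len(test_ids))  # 49 115

# Made-up validation measurements, for illustration only.
val_points = {"cfg_a": (1.0, 0.30), "cfg_b": (2.0, 0.28),
              "cfg_c": (3.0, 0.45), "cfg_d": (3.5, 0.45)}
print(pareto_front(val_points))  # ['cfg_a', 'cfg_c']
```

The configurations returned by `pareto_front` on the validation split are the ones whose cost and accuracy would then be re-measured on the held-out test split.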

In general, for all model families on all datasets in the experiment, our scheme provides near-optimal combinations at any budget range. We find that the optimal parameter combinations from our cascading scheme are more accurate and less costly than single models.

The effect of cascading on the WizardCoder-Python-V1.0 family (right-most column) was the strongest. The selected cascading combinations achieve better results than single models on all datasets in all budget ranges. We speculate that this is because highly capable models generate high-quality tests even when the questions are difficult. For the other two model families, the advantage over a single model is less pronounced, but the scheme still correctly selects single-model solutions when they are indeed optimal.

Our cascading scheme achieves a more pronounced acceleration over a single model at higher budgets. In the best case, it saves 49% of the cost while achieving the same level of accuracy as a single model. Therefore, our work is most valuable for accuracy-sensitive tasks.
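A saving at matched accuracy can be computed by comparing a single model against the cheapest cascading configuration that is at least as accurate. The sketch below shows one way to do this. The function name and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (cost, accuracy)

def saving_at_matched_accuracy(single: Point, cascades: List[Point]) -> Optional[float]:
    """Relative cost saved by the cheapest cascade that is at least as accurate.

    Returns None when no cascade configuration reaches the single model's accuracy.
    """
    single_cost, single_acc = single
    eligible = [cost for cost, acc in cascades if acc >= single_acc]
    if not eligible:
        return None
    return 1.0 - min(eligible) / single_cost

# Made-up numbers, not measurements from the paper.
single_model = (10.0, 0.40)
cascade_points = [(4.0, 0.35), (7.5, 0.41), (9.0, 0.45)]
print(saving_at_matched_accuracy(single_model, cascade_points))  # 0.25
```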

Speculative decoding achieves a cost-accuracy trade-off similar to that of our scheme on the APPS-Intro dataset. Its largest batch size, cost, and hit rate are provided in Table 3. It is less effective than in other speculative decoding works Chen et al. (2023c); Leviathan et al. (2023), because its benefit decreases when we fully utilize the GPUs by increasing the batch size beyond 1. Moreover, it provides only one solution, forgoing the chance to increase accuracy by 2-5% with a small increase in cost.

| Model family | Size | GPUs | Hit rate (%) | Batch size | Cost ($) | Cost reduction (%) |
| WizardCoder-Python-V1.0 | 7-34B | 8 | 63.9 | 8 | 37.1 | 3.1 |
| WizardCoder-V1.0 | 1-15B | 2 | 72.5 | 8 | 5.57 | 15.2 |
Table 3: Cost and acceptance rate statistics for the speculative decoding pair from each WizardCoder family on the APPS-intro dataset at maximum memory consumption.

6 Conclusion

We proposed a full heuristic pipeline to find the optimal cascading strategy, and showed empirically that it achieves lower cost and higher accuracy for various model families and datasets. Furthermore, it generates multiple optimal solutions across all budget ranges, so users can pick the best solution based on their sensitivity to cost and accuracy, while speculative decoding offers only one choice. Our cascading pipeline is more effective when the model family produces higher-quality code.

Our experiments focused on Python code generation tasks, but the method is compatible with all languages. In our experiments, we only attempted a global value for θ. To further examine the potential of model cascading for code completion, one can explore the results of using a different θ value for each model and each [k, t] combination.
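The difference between a global θ and per-model θ values can be sketched with a toy cascade. This is only an illustration under stated assumptions: each stage is a hypothetical callable that returns its best solution together with a score in [0, 1] standing in for the self-test result, and a solution is accepted when that score is at least θ. The actual scoring rule is the one defined earlier in the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage: prompt -> (best solution, score in [0, 1]).
Stage = Callable[[str], Tuple[str, float]]

def cascade(prompt: str, stages: List[Tuple[str, Stage]],
            thetas: Dict[str, float]) -> Tuple[str, str]:
    """Query stages from smallest to largest; return (model name, solution).

    A stage's answer is accepted when its score reaches that stage's threshold.
    The last stage is always accepted because there is nothing left to escalate to.
    """
    for i, (name, stage) in enumerate(stages):
        solution, score = stage(prompt)
        is_last = i == len(stages) - 1
        if is_last or score >= thetas[name]:
            return name, solution
    raise ValueError("stages must not be empty")

# Toy stages with fixed, made-up scores.
def small(prompt: str) -> Tuple[str, float]:
    return "small_solution", 0.6

def large(prompt: str) -> Tuple[str, float]:
    return "large_solution", 0.9

stages = [("small", small), ("large", large)]
global_theta = {"small": 0.8, "large": 0.8}     # one θ shared by all models
per_model_theta = {"small": 0.5, "large": 0.8}  # a looser θ for the small model

print(cascade("task", stages, global_theta))     # ('large', 'large_solution')
print(cascade("task", stages, per_model_theta))  # ('small', 'small_solution')
```

With the shared threshold the small model's answer is rejected and the query escalates, whereas a lower threshold for the small model alone lets it answer directly, which is the kind of trade-off a per-model search over θ would explore.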

References

  • Chen et al. [2021] Mark Chen et al. Evaluating large language models trained on code. 2021.
  • Li et al. [2022] Yujia Li et al. Competition-level code generation with alphacode. Science, 378:1092–1097, 2022.
  • Nijkamp et al. [2022] Erik Nijkamp et al. Codegen: An open large language model for code with multi-turn program synthesis. 2022.
  • Git [2021] Github copilot · your ai pair programmer, 2021. URL https://copilot.github.com/.
  • Zhu et al. [2023] Xunyu Zhu et al. A survey on model compression for large language models. 2023.
  • Liu et al. [2023] Zechun Liu et al. Llm-qat: Data-free quantization aware training for large language models. 2023.
  • Ma et al. [2023] Xinyin Ma et al. Llm-pruner: On the structural pruning of large language models. 2023.
  • Hsieh et al. [2023] Cheng-Yu Hsieh et al. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. 2023.
  • Dao et al. [2022] Tri Dao et al. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
  • Nijkamp et al. [2023] Erik Nijkamp et al. Codegen2: Lessons for training llms on programming and natural languages. 2023.
  • Li et al. [2023] Raymond Li et al. Starcoder: may the source be with you! 2023.
  • Guo et al. [2024] Daya Guo et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. 2024.
  • Chen et al. [2023a] Bei Chen et al. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023a.
  • Rozière et al. [2023] Baptiste Rozière et al. Code llama: Open foundation models for code, 2023.
  • Luo et al. [2023] Ziyang Luo et al. Wizardcoder: Empowering code large language models with evol-instruct. 2023.
  • Chen et al. [2023b] Lingjiao Chen et al. Frugalgpt: How to use large language models while reducing cost and improving performance. 2023b.
  • Sanh et al. [2020] Victor Sanh et al. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. 2020.
  • Leviathan et al. [2023] Yaniv Leviathan et al. Fast inference from transformers via speculative decoding. 2023.
  • Xiong et al. [2023] Weimin Xiong et al. The program testing ability of large language models for code. 2023.
  • RunPod [2023] RunPod. Gpu cloud service, 2023. URL http://runpod.io.
  • Fried et al. [2022] Daniel Fried et al. Incoder: A generative model for code infilling and synthesis. 2022.
  • Zheng et al. [2023] Qinkai Zheng et al. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. 2023.
  • Touvron et al. [2023] Hugo Touvron et al. Llama 2: Open foundation and fine-tuned chat models. 2023.
  • Chaudhary [2023] Sahil Chaudhary. Code alpaca: An instruction-following llama model for code generation. Github repository. https://github.com/sahil280114/codealpaca, 2023.
  • Austin et al. [2021] Jacob Austin et al. Program synthesis with large language models. 2021.
  • Hendrycks et al. [2021] Dan Hendrycks et al. Measuring coding challenge competence with apps. 2021.
  • Wolf et al. [2020] Thomas Wolf et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, 2020.
  • Gugger et al. [2022] Sylvain Gugger et al. Accelerate: Training and inference at scale made simple, efficient and adaptable. Github repository. https://github.com/huggingface/accelerate, 2022.
  • Kwon et al. [2023] Woosuk Kwon et al. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  • Su et al. [2023] Qidong Su et al. The synergy of speculative decoding and batching in serving large language models. 2023.
  • Chen et al. [2023c] Charlie Chen et al. Accelerating large language model decoding with speculative sampling. 2023c.