Activation Steering for Robust Type Prediction in CodeLLMs

Francesca Lucchetti & Arjun Guha
Khoury College of Computer Sciences
Northeastern University
Boston, MA 02115
{lucchetti.f,a.guha}@northeastern.edu
Abstract

Contemporary LLMs pretrained on code are capable of succeeding at a wide variety of programming tasks. However, their performance is very sensitive to syntactic features, such as the names of variables and types, the structure of code, and presence of type hints. We contribute an inference-time technique to make CodeLLMs more robust to syntactic distractors that are semantically irrelevant. Our methodology relies on activation steering, which involves editing internal model activations to steer the model towards the correct prediction. We contribute a novel way to construct steering vectors by taking inspiration from mutation testing, which constructs minimal semantics-breaking code edits. In contrast, we construct steering vectors from semantics-preserving code edits. We apply our approach to the task of type prediction for the gradually typed languages Python and TypeScript. This approach corrects up to 90% of type mispredictions. Finally, we show that steering vectors calculated from Python activations reliably correct type mispredictions in TypeScript, and vice versa. This result suggests that LLMs may be learning to transfer knowledge of types across programming languages.

1 Introduction

Large Language Models trained on code (CodeLLMs) are one of the most successful applications of LLMs to date. They are the foundation for both developer tools (e.g., Weiss & Yahav (2013); git (2021)) and for reasoning agents (e.g., Zelikman et al. (2024); Yang et al. (2024); Hu et al. (2024); Li et al. (2022)). However, despite their utility, CodeLLMs can still be unreliable. Recent work shows that even the most capable models are not robust to small semantics-preserving changes such as refactorings and variable renamings (Hooda et al., 2024; Tambon et al., 2024). Thus CodeLLMs often cannot be deployed without careful supervision, such as a human-in-the-loop or robust code sandboxing.

In this paper, we present an approach to make CodeLLMs more robust using activation steering (Li et al., 2024; Rimsky et al., 2023; Turner et al., 2023; Subramani et al., 2022). Activation steering is an inference-time model editing technique that modifies the intermediate computations (activations) of a model using steering vectors to steer model behavior toward desired outcomes. We present a novel technique for constructing steering vectors for the code domain by taking inspiration from mutation testing (DeMillo et al., 1978). Mutation testing automatically constructs minimal semantics-breaking code edits. In contrast, we construct steering vectors from minimal semantics-preserving code edits that lead to CodeLLM mispredictions. The nature of code allows us to construct these code edits in a sound and scaleable way.

This paper focuses on robust type prediction for gradually typed programming languages, specifically Python and TypeScript. A gradually typed programming language allows programs to freely mix typed and untyped code, giving programmers more flexibility than traditional static ty** affords (Siek & Taha, 2006; Tobin-Hochstadt & Felleisen, 2006). Given a partially typed program p𝑝pitalic_p written in a gradually typed language, the type prediction task is defined as follows (Migeed & Palsberg, 2020). Choose an untyped variable binding 𝑣𝑎𝑟p𝑣𝑎𝑟𝑝\mathit{var}\in pitalic_var ∈ italic_p, predict a type annotation 𝑣𝑎𝑟:T:𝑣𝑎𝑟𝑇\mathit{var}:Titalic_var : italic_T, and insert the annotation back into the program to get a new program psuperscript𝑝p^{\prime}italic_p start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. The task is successful if psuperscript𝑝p^{\prime}italic_p start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT continues to pass the type-checker.

We present an approach for steering type prediction from pairs of prompts (x1,x2)subscript𝑥1subscript𝑥2(x_{1},x_{2})( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) that represent two type prediction tasks for the same variable binding 𝑣𝑎𝑟:T:𝑣𝑎𝑟𝑇\mathit{var}:Titalic_var : italic_T. However, the CodeLLM successfully predicts T𝑇Titalic_T for x1subscript𝑥1x_{1}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT but mispredicts it for x2subscript𝑥2x_{2}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. In each pair, x1subscript𝑥1x_{1}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is from natural data, whereas x2subscript𝑥2x_{2}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is constructed from x1subscript𝑥1x_{1}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT using semantics-preserving program edits (§ 3.1). We carefully construct steering vectors that narrowly target particular types of program edits. Moreover, we also combine edits to study the robustness of steering for different types of mispredictions. In our experiments, we find that steering activations at multiple adjacent layers can significantly improve accuracy (§ 4.1).

Finally, we are surprised to find that type prediction steering transfers across programming languages. In particular, the steering vectors that we construct from TypeScript prompts are very effective at steering Python type predictions, and vice versa. We conduct a careful analysis and conclude that is it likely that multi-lingual CodeLLMs are learning a representation of types that is shared across the programming languages we study (§ 4.5).

2 Background and Related Work

return s[::-1] == s

   return s[::-1] == s<fim_middle>

Figure 1: An example of a code-infilling prompt using FIM special tokens. The correct prediction expected after the FIM middle token is str.

CodeLLMs and fill-in-the-middle

CodeLLMs are capable of performing a variety of downstream tasks such as generating code from natural language, explaining code, generating tests, and more (Nam et al., 2024; Schäfer et al., 2024). Contemporary CodeLLMs are decoder-only transformers trained on vast amounts of source code (Li et al., 2023; Rozière et al., 2024; Guo et al., 2024). The aforementioned models are also trained to fill-in-the-middle (FIM) (Bavarian et al., 2022; Fried et al., 2023). FIM training does the following (1) It splits 50%absentpercent50\approx 50\%≈ 50 % of training items into three chunks—prefix, middle, and suffix—of random lengths; (2) It adds special tokens to the start of each chunk to demarcate chunk boundaries; and (3) It reorders the middle chunk to appear last. The language modelling training objective remains unchanged. Thus FIM inference allows us to generate the middle chunk, conditioned on the prefix and the suffix (figure 1). The resulting model can still be used for conventional left-to-right generation.

Neural type prediction

In this paper we study activation steering for type prediction, formulated as a fill-in-the-middle task. Prior work uses this formulation to evaluate base models (Yee & Guha, 2023; Fried et al., 2023), but does not study models’ internal mechanisms or try to improve task performance in a targeted way. There is prior work that trains smaller, specialized models for type prediction (Hellendoorn et al., 2018; Jesse et al., 2022; 2021; Pandi et al., 2021; Wei et al., 2020), but contemporary CodeLLMs outperform these specialized models (Yee & Guha, 2023; Fried et al., 2023).

Classical type prediction and type inference

Type prediction is also known as type migration, and is distinct from type inference as found in languages such as OCaml and Haskell. In those languages, every variable is typed, even if the types are implicit (Harper & Mitchell, 1993). In contrast, gradual type prediction can change program semantics (Phipps-Costin et al., 2021). There is work on gradual type prediction that uses human-written constraints and constraint solvers (Phipps-Costin et al., 2021; Rastogi et al., 2012; Siek & Vachharajani, 2008; Campora et al., 2018; Henglein & Rehof, 1995; Cartwright & Fagan, 1991). But, these papers present algorithms for variations of the lambda calculus or simple functional languages such as Scheme, and have not been scaled to more complex, modern programming languages.

Mechanistic interpretability

Interpretability seeks to understand the causal mechanisms behind model inference. In particular, mechanistic interpretability studies how model behaviors arise from specific components within the transformer architecture. One of the driving motivations of the field is that by understanding how a model works, we can understand how to align models to desired behavior.

Prior work in mechanistic interpretability includes localizing and editing factual associations within the transformer (Meng et al., 2022), as well as probing hidden representations for knowledge of high-level concepts (Li et al., 2024; Dong et al., 2023). These works are forms of implicit evaluation of model ability, as opposed to explicit evaluation using benchmarks (Dong et al., 2023). Mechanistic interpretability has contributed a wealth of methods for implicit evaluation, for example activation patching (Vig et al., 2020; Variengien & Winsor, 2023), which applies patches to model activations in order to produce a change in behavior. Research leveraging activation patching has suggested the existence of task vectors (Hendel et al., 2023; Ilharco et al., 2022). Task vectors are representations within the transformer which encode a high-level notion of the task described by the prompt. The existence of task vectors is further supported by work in activation steering. Steering has been employed to diminish model deceitfulness and sycophancy (Rimsky et al., 2023; Li et al., 2024), which suggests that steering works by modifying the activations corresponding to task vectors. We thus draw on the idea of task vectors to perform our steering experiments. Importantly, steering relies on the linear representation hypothesis, which argues that concepts are represented as directions in the embedding space of a model (Park et al., 2023). We posit that successful steering performs transformations over model activations that shift the task representation towards a target direction.

Despite much work on the mechanistic interpretability of LLMs, similar research on CodeLLMs has been limited. In this paper we focus on types, which are a fundamental feature of programming languages. One obstacle in this domain is that within a program, a type annotation is semantically constrained by both the tokens that precede it and those that succeed it. For example, the type of a function argument is constrained by how the argument is subsequently used in the function body. To overcome this, we use CodeLLMs that are trained on FIM. This allows us to construct prompts for type prediction.

Mutation testing and program transformations

We take inspiration from mutation testing (DeMillo et al., 1978), but our technique is different. The goal of mutation testing is to test a program’s test suite. To do so, a mutator injects small bugs that alter the semantics of a program, such as changing a 00 to a 1111 or turning x>y𝑥𝑦x>yitalic_x > italic_y into x<y𝑥𝑦x<yitalic_x < italic_y. The hypothesis is that a good test suite should be able to catch these artificial bugs, and there is a substantial evidence that the ability to catch both artificial and real-world bugs is strongly correlated (Just et al., 2014). In this paper, we make program edits that would not affect test cases.111Reflection in TypeScript and Python allows a program to detect any source code change. But, most test suites do not use these features. Instead, we make syntactic edits that lead to type mispredictions for a given CodeLLM.

3 Methodology

3.1 Constructing Steering Datasets

Model choice

We build steering datasets for 1B and 7B parameter StarCoderBase models. These models are trained to fill-in-the-middle (§ 2), which is necessary for the type prediction task.

Source datasets

A distinguishing feature of our work is that we construct steering pairs from natural data (source code from GitHub) instead of using templates to construct simple examples. This provides more evidence that activation steering is robust. For Python, we use ManyTypes4Py (Mir et al., 2021), a dataset for neural type prediction. It features code from 5,382 Python projects that use Python type annotations and successfully type-check. For TypeScript, we filter The Stack (Kocetkov et al., 2023) to find 1.1M TypeScript files that type-check, of which we sample up to 2,000 for steering.

def __init__(self, x, y):
self.x = x
self.y = y
def delta_x(p: Point , x: float):
p.x = p.x + x
(a) The original program.
def __init__(self, x, y):
self.x = x
self.y = y
def delta_x(p: Type0, x: float):
p.x = p.x + x
(b) Type renaming.
def __init__(self, x, y):
self.x = x
self.y = y
def delta_x(p , x: float):
p.x = p.x + x
(d) Type annotation removal.

        

def __init__(self, x, y):
self.x = x
self.y = y
def delta_x(p: Point, tmp: float):
p.x = p.x + tmp
(c) Variable renaming.
Figure 2: Examples of three semantics-preserving edits. The type prediction site is float. We carefully ensure that each edit is internally consistent. E.g., in (2(c)), when we rename the binding x to tmp, we rename references to the binding. To construct a steering pair, we repeatedly apply edits until the CodeLLM mispredicts.
class KafkaAvroBackend(RepositoryBackend):
def __init__(
self, config__tmp0: dict, producer=AvroProducer, loader=AvroMessageLoader,
value_serializer: Callable = to_message_from_dto,
get_producer_config: Callable = get_producer_config,
get_loader_config: Callable = get_loader_config
) -> None:
producer_config = get_producer_config(config__tmp0)
loader_config = get_loader_config(config__tmp0)
...
Figure 3: A fragment of a Python steering pair. The original code is 70 lines of text. The dict is the expected prediction. But, renaming config to __tmp0 makes the model mispredict Repository, which is a hallucinated type.

Semantics-preserving code edits

For a given model M𝑀Mitalic_M, our steering dataset is a set of triples (x+,x,t)superscript𝑥superscript𝑥𝑡(x^{+},x^{-},t)( italic_x start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT , italic_x start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT , italic_t ) where t𝑡titalic_t is a type and x+,xsuperscript𝑥superscript𝑥x^{+},x^{-}italic_x start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT , italic_x start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT are prompts that represent fill-in-the-middle type prediction tasks with t𝑡titalic_t as the expected type for both prompts. However, the maximum likelihood generation for x+superscript𝑥x^{+}italic_x start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT is M(x+)=t𝑀superscript𝑥𝑡M(x^{+})=titalic_M ( italic_x start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ) = italic_t and for xsuperscript𝑥x^{-}italic_x start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT it is M(x)=k𝑀superscript𝑥𝑘M(x^{-})=kitalic_M ( italic_x start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ) = italic_k, where kt𝑘𝑡k\neq titalic_k ≠ italic_t.

To construct the positive prompt and determine its target type t𝑡titalic_t, we select a Python or TypeScript file from one of our source datasets. Within that file, we select the type on a type-annotated variable binding 𝑣𝑎𝑟:t:𝑣𝑎𝑟𝑡\mathit{var}:titalic_var : italic_t as the target type. With these selected, we build a fill-in-the-middle prompt. Let the text of the file be pts𝑝𝑡𝑠p\cdot t\cdot sitalic_p ⋅ italic_t ⋅ italic_s, where p𝑝pitalic_p is the text that precedes the selected type annotation t𝑡titalic_t and s𝑠sitalic_s is the text that succeeds t𝑡titalic_t. The positive prompt thus becomes 𝚏𝚒𝚖_𝚙𝚛𝚎𝚏𝚒𝚡p𝚏𝚒𝚖_𝚜𝚞𝚏𝚏𝚒𝚡s𝚏𝚒𝚖_𝚖𝚒𝚍𝚍𝚕𝚎delimited-⟨⟩𝚏𝚒𝚖_𝚙𝚛𝚎𝚏𝚒𝚡𝑝delimited-⟨⟩𝚏𝚒𝚖_𝚜𝚞𝚏𝚏𝚒𝚡𝑠delimited-⟨⟩𝚏𝚒𝚖_𝚖𝚒𝚍𝚍𝚕𝚎\langle\mathtt{fim\_prefix}\rangle p\langle\mathtt{fim\_suffix}\rangle s% \langle\mathtt{fim\_middle}\rangle⟨ typewriter_fim _ typewriter_prefix ⟩ italic_p ⟨ typewriter_fim _ typewriter_suffix ⟩ italic_s ⟨ typewriter_fim _ typewriter_middle ⟩ (figure 1).

To construct the negative prompt, we incrementally apply semantics-preserving code edits to the positive prompt until M𝑀Mitalic_M mispredicts t𝑡titalic_t. We implement the following edits. 1) Rename variable:We select a variable binding (from a function definition) and rename it to an arbitrary name that does not conflict with other variables in the program. We also rename all bound occurrences of the same variable so that the program’s semantics remain unchanged. 2) Remove type annotation:We select a type annotation (excluding the target t𝑡titalic_t) and delete it. In a gradually typed language, removing or relaxing an annotation does not alter program semantics. 3) Rename user-defined type:We select an arbitrary type definition (e.g., a class name or a type alias) and rename it to an arbitrary name that does not conflict with other names in the program. 4) Rename builtin type:We also support renaming builtin types by introducing type aliases. Figure 2 illustrates several separate edits to a program.

By incrementally applying edits, the model eventually mispredicts t𝑡titalic_t (or else we fail to find an xsuperscript𝑥x^{-}italic_x start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT and thus discard x+superscript𝑥x^{+}italic_x start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT). Figure 3 illustrates a real example from our dataset where StarCoderBase-1B mispredicts after an edit. Note that a single edit often alters the positive prompt x+superscript𝑥x^{+}italic_x start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT at several points. Moreover, when we apply several edits, the two prompts x+,xsuperscript𝑥superscript𝑥x^{+},x^{-}italic_x start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT , italic_x start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT may become significantly different from each other. We find that the mean edit-distance for our Python and TypeScript steering pairs is 74747474 and 327327327327 respectively. This is in contrast to steering pairs constructed from sentence templates, that only differ by a small, fixed number of words at a far smaller number of places.

We hypothesize that mispredictions happen because the model fails to identify the inference task as a type prediction task. This likely occurs because our edits shift the prompt outside the distribution of a model’s training data. We thus call our edits distractors, since they distract from the correct task. We leverage activation steering to make the model more robust against such distractors and other semantically irrelevant features of code. For example, distractors often remove important in-context clues like function names which models rely on for prediction. Relying on textual meaning rather than program structure is a limitation of CodeLLMs that we improve with activation steering.

Class balance

The distribution of type annotations in natural data is heavily skewed: builtin types occur far more often than the long tail of user-defined types. To prevent bias towards majority-class types in our steering vectors, we balance every dataset 𝒟𝒟\mathcal{D}caligraphic_D to ensure that no target type t𝑡titalic_t occurs more than 25252525 times. In our evaluation, we also measure the effectiveness of steering by type (§ 4.2).

Python TypeScript
Edit Type Steering Held-out Steering Held-out
Rename variables 1,92419241,9241 , 924 773773773773 1,40014001,4001 , 400 539539539539
Rename types 1,41314131,4131 , 413 484484484484 1,09410941,0941 , 094 457457457457
Remove type annotations 1,95219521,9521 , 952 892892892892 698698698698 297297297297
Rename variables & types 2,00020002,0002 , 000 1,05110511,0511 , 051 2,00020002,0002 , 000 811811811811
Rename variables & remove type annotations 1,84218421,8421 , 842 1,77317731,7731 , 773 1,66816681,6681 , 668 599599599599
Rename types & remove type annotations 2,00020002,0002 , 000 781781781781 1,61916191,6191 , 619 445445445445
All edits 2,00020002,0002 , 000 1,57215721,5721 , 572 2,00020002,0002 , 000 762762762762
Table 1: Overview of StarCoderBase-1B dataset sizes and applied code edits. Note that steering subsets are used for constructing steering vectors, while held out subsets are used to evaluate steering performance on out of distribution data. Dataset sizes vary because we only apply edits to programs where the model correctly predicts types. We then filter to remove overlap in source programs between held-out and steering sets. This accounts for the variation in sizes.

Ablation splits and evaluation sets

We construct several datasets 𝒟𝒟\mathcal{D}caligraphic_D, where each dataset applies a subset (or all) of the program edits described above to programs in a language. We further split each dataset into a steering subset and a held-out evaluation subset. We ensure that the two subsets have prompts based on different source programs, to test how steering generalizes to unseen programs. Table 1 summarizes the dataset sizes for StarCoderBase-1B. We report the sizes for StarCoderBase-7B in appendix A.

3.2 Constructing Steering Vectors

Given a dataset of steering pairs (xi+,xi,t)𝒟superscriptsubscript𝑥𝑖superscriptsubscript𝑥𝑖𝑡𝒟(x_{i}^{+},x_{i}^{-},t)\in\mathcal{D}( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT , italic_t ) ∈ caligraphic_D and model M𝑀Mitalic_M, we apply a forward pass to every M(xi+),M(xi)𝑀superscriptsubscript𝑥𝑖𝑀superscriptsubscript𝑥𝑖M(x_{i}^{+}),M(x_{i}^{-})italic_M ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ) , italic_M ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ) to collect model activations on the last token. Specifically, we extract activations from the residual stream of the model, which is the output of a transformer layer isubscript𝑖\ell_{i}roman_ℓ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT that in turn becomes the input to subsequent layers i+1nsubscript𝑖1𝑛\ell_{i+1...n}roman_ℓ start_POSTSUBSCRIPT italic_i + 1 … italic_n end_POSTSUBSCRIPT. We write A(x)subscript𝐴𝑥A_{\ell}(x)italic_A start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( italic_x ) to denote this activation tensor at some layer \ellroman_ℓ for prompt x𝑥xitalic_x. Thus we compute steering vectors 𝐭subscript𝐭\mathbf{t}_{\ell}bold_t start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT—one for each layer—as the mean difference between positive and negative activations at that layer:

𝐭=1|𝒟|(xi+,xi,t)𝒟(A(xi+)A(xi))subscript𝐭1𝒟subscriptsuperscriptsubscript𝑥𝑖superscriptsubscript𝑥𝑖𝑡𝒟subscript𝐴superscriptsubscript𝑥𝑖subscript𝐴superscriptsubscript𝑥𝑖\mathbf{t}_{\ell}=\frac{1}{|\mathcal{D}|}\sum_{(x_{i}^{+},x_{i}^{-},t)\in% \mathcal{D}}\left(A_{\ell}(x_{i}^{+})-A_{\ell}(x_{i}^{-})\right)bold_t start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | caligraphic_D | end_ARG ∑ start_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT , italic_t ) ∈ caligraphic_D end_POSTSUBSCRIPT ( italic_A start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ) - italic_A start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ) ) (1)

The intuition behind equation 1 is that the distance between positive and negative pairs in activation space encodes the transformation for steering towards the correct type.

Following the FIM format, the last token in all prompts is fim_middledelimited-⟨⟩fim_middle\langle\texttt{fim\_middle}\rangle⟨ fim_middle ⟩. The prompt provides the model with context indicating that a type annotation should follow—namely, the code preceding the type annotation site, followed by a colon and the remaining code (e.g. figure 1). We choose to intervene on the fim_middledelimited-⟨⟩fim_middle\langle\texttt{fim\_middle}\rangle⟨ fim_middle ⟩ token because we hypothesize that models capable of FIM type prediction build latent representations of types in the residual stream of the last token before prediction. This is in contrast to Rimsky et al. (2023) that calculates steering vectors using the residual stream of the expected token. Conversely, in our experiments we steer a model by adding the steering vector 𝐭subscript𝐭\mathbf{t}_{\ell}bold_t start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT to the last token’s residual stream at layer \ellroman_ℓ. In the next section, we perform steering at the level of both individual layers and sets of adjacent layers simultaneously.

4 Evaluation

4.1 Layer Ablations

Refer to caption
Figure 4: Steering accuracy for StarCoderBase-1B on the held-out set of the TypeScript Datasets. Steering vectors are patched onto different subsets of adjacent layers.

We evaluate the accuracy of steering on each edit dataset (table 1). To find the optimal layer for applying steering vectors, we conduct an ablation on TypeScript data with StarCoderBase-1B. We consider patching at single layers as well as intervals of three and five adjacent layers. Figure 4 shows that for code edits where steering is most effective (e.g., renaming types), patching at single versus multiple layers makes little difference. However, for the less effective edits (e.g., removing type annotations), patching multiple layers significantly increases accuracy. We hypothesize that this performance gap exists because task vector refinement occurs over multiple layers, thus needs to be steered over multiple layers. This follows from previous ideas that transformers build predictions incrementally throughout layers (Elhage et al., 2021; Geva et al., 2022). Following this finding, we chose to patch layers 10–14 in StarCoderBase-1B and 19–23 in StarCoderBase-7B.

Python TypeScript
Edit Type Steering Random Steering Random
Rename variables 0.24 0.17 0.29 0.26
Rename types 0.90 0.10 0.75 0.11
Remove type annotations 0.48 0.20 0.41 0.31
Rename types & variables 0.56 0.13 0.49 0.09
Rename types & remove annotations 0.60 0.11 0.63 0.11
Rename variables & remove annotations 0.39 0.18 0.30 0.22
All edits 0.36 0.16 0.51 0.11
(a)
Python TypeScript
Edit Type Steering Random Steering Random
Rename variables 0.20 0.11 0.26 0.09
Rename types 0.84 0.07 0.69 0.07
Remove type annotations 0.49 0.11 0.39 0.18
Rename types & variables 0.56 0.10 0.36 0.05
Rename types & remove annotations 0.59 0.09 0.67 0.05
Rename variables & remove annotations 0.41 0.10 0.27 0.08
All edits 0.50 0.10 0.40 0.04
(b)
Table 2: Steering StarCoderBase-1B and 7B on several datasets of semantics-preserving code edits. For each dataset, we report the accuracy on a held out set of negative prompts, i.e., the model mispredicts all of these types without steering and steering corrects a significant number of mispredictions. The Random column reports accuracy after steering with a randomly initialized steering vector.

4.2 Evaluating Different Edits and Effectiveness By Type

Refer to caption
Figure 5: Mean accuracy of steering Python by expected type label frequency. Error bars indicate the interquartile range. Note that we constrain the maximum size of any class to 25.

Having picked the layers to steer, table 2 summarizes the results of steering on each of our models. For StarCoderBase-1B, steering is most effective at fixing distractors from renaming types, achieving an accuracy of 90% in Python and 75% in TypeScript. We report accuracy over the held-out evaluation sets for each edit type. However, we also evaluated steering accuracy on the steering set and found no significant difference in performance between steering splits or held-out splits (appendix B). This is a first indication that our steering vectors generalize well.

To ensure that steering performance is not disproportionately affected by accuracy on a single type, we conduct an analysis of the results of steering per type label. In figure 5, we report steering accuracy on Python factored by target type frequency. Appendix C has a similar result for TypeScript. We find that the performance of steering vectors is well-distributed across types, and not skewed towards the majority class of labels.

4.3 Backup Circuits

A threat to validity for any intervention that involves activation patching is that the patch may not be directly improving performance, but just triggering a backup circuit (McGrath et al., 2023). For example, it could be that patching is just introducing noise into the embedding space that causes other mechanisms to trigger, thus producing the expected outcome. This effect complicates the interpretability of patches as well as steering vectors.

As part of our evaluation, we steer the model with a random steering vector (the Random columns in table 2). We find that steering in this way has non-zero accuracy, though it is generally significantly lower than our computed steering vectors. We posit that the non-zero accuracy of the random vector occurs due to backup circuits. However, the more significant gains from our computed steering vectors indicate that our steering method performs true transformations towards the correct type.

4.4 Steering Versus Finetuning

Fine-tuning is the traditional approach to improving model performance on a downstream task, so we now compare steering to fine-tuning. We fine-tune StarCoderBase-1B on the Python Renamed Types dataset. We fine-tune for five epochs using the AdamW optimizer and a learning rate of 3×1053superscript1053\times 10^{-5}3 × 10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT with weight decay of 0.10.10.10.1. At the end of each epoch, we evaluate on the held-out set.

In figure 6(a), we see that fine-tuning is approximately as effective as steering on the type prediction task. However, figure 6(b) shows that this comes with a caveat: fine-tuning on type-prediction significantly degrades the model’s code completion ability, as measured with HumanEval Chen et al. (2021). In contrast, since the steering vectors for type prediction are a small, localized patch, one can easily toggle them on or off based on the task at hand. On the other hand, fine-tuning for type prediction produces a new specialized model. We have shown that this fine-tuned model is weak at code completion, but in general, other abilities may have been impacted too.

Refer to caption
(a) Type prediction accuracy.
Refer to caption
(b) HumanEval pass@1.
Figure 6: We fine-tune StarCoderBase-1B on the Python Renamed Types dataset. We find that within 1 epoch, the fine-tuned model achieves the same accuracy on the held-out set as the steered model (figure 6(a)). However, this fine-tuning makes the model much worse at code completion (figure 6(b)).

4.5 Language Transfer in Steering Vectors

Edit Type TSPyTSPy\mathrm{TS}\rightarrow\mathrm{Py}roman_TS → roman_Py PyPyPyPy\mathrm{Py}\rightarrow\mathrm{Py}roman_Py → roman_Py PyTSPyTS\mathrm{Py}\rightarrow\mathrm{TS}roman_Py → roman_TS TSTSTSTS\mathrm{TS}\rightarrow\mathrm{TS}roman_TS → roman_TS
Rename variables 0.24 0.24 0.20 0.29
Rename types 0.85 0.90 0.77 0.75
Remove type annotations 0.38 0.48 0.36 0.41
Rename types & variables 0.38 0.56 0.55 0.49
Rename types & remove annotations 0.60 0.60 0.64 0.63
Rename variables & remove annotations 0.26 0.39 0.36 0.30
All edits 0.27 0.36 0.34 0.51
Table 3: Accuracy of steering StarCoderBase-1B. A column labelled AB𝐴𝐵A\rightarrow Bitalic_A → italic_B indicates that the steering vector is computed from language A𝐴Aitalic_A but evaluated on language B𝐵Bitalic_B.

Syntactically, Python and TypeScript look very different. However, semantically the two languages have a lot in common Politz et al. (2013); Bierman et al. (2014). In particular, both of them use gradual ty** disciplines with many shared semantic features. So, we wondered, could CodeLLMs be learning a representation of types that is shared between these two languages?

To test this hypothesis, we test if steering vectors built on TypeScript data can improve the accuracy of Python type prediction, and vice versa. We conduct this experiment with each of our datasets: we steer StarCoderBase-1B using vectors from language A𝐴Aitalic_A but evaluate on the corresponding held-out test set from language B𝐵Bitalic_B. Table 3 shows that this is nearly as effective as steering type prediction on the same language.

From this observation, we hypothesize that within the model there exists a representation of program types that is shared across TypeScript and Python. This is in contrast to the scenario where distinct model components are specialized to each programming language. Thus, steering vectors are effective across languages because they intervene on the shared representation between them. Moreover, these results suggest that steering vectors generalize by targeting a high-level representation of types.

5 Conclusion

We investigate activation steering for type prediction by making CodeLLMs more robust against semantically-irrelevant aspects of code. We find that by constructing steering pairs using semantics-preserving code edits we can construct highly effective steering vectors. Our experiments show that steering vectors generalize outside the programs used for steering and outperform naive baselines. We find that our steering vectors are transferable across languages. This suggests the existence of a shared representation of a type within the CodeLLM.

Activation steering can be a powerful technique for improving model performance on tasks where fine-tuning is not advantageous. As CodeLLMs are trained on more programming languages and downstream tasks, activation steering may be used as a lightweight alternative to multiple fine-tuned experts. Rather than fine-tuning several models on different tasks, the same base model could be specialized through steering vectors which are added and removed as needed. This could be especially convenient for resource-constrained applications. Our type prediction vectors, for example, could make a convenient lightweight expert for a coding assistant, useful for applications like type migration.

In future work, we aim to study the underlying causal mechanism in CodeLLMs responsible for type prediction. We further wish to explore how type-prediction steering vectors may generalize to open generation problems like code completion.

6 Ethical Considerations

The purpose of this work is to investigate methods for making CodeLLMs more robust and aligned to user intent. It is our view that interpreting CodeLLMs is necessary for understanding whether models approach programming in a principled way. As models become more integrated into developers’ workflows, model errors could compromise the security of entire systems. For this reason, we focus on making models more robust against distractors.

The datasets used in our investigation are sourced from publicly available code. Our TypeScript dataset is derived from a subset of The Stack v1.2, which contains permissively licensed data with personal identifying information (PII) filtered. The ManyTypes4Py dataset is funded by the European Commission, which follows data privacy laws under the EU General Data Protection Regulation (GDPR). We utilize the StarCoder family of models for their open and publicly available weights and training data.

7 Reproducibility

We commit to making all of our code and datasets public for the purpose of reproducibility and furthering research in the mechanistic interpretability of CodeLLMs.

Acknowledgments

We thank Ming-Ho Yee for providing the TypeScript dataset that we use in this work. We thank Northeastern Research Computing for providing the computing resources used for this work. This work is partially supported by the U.S. National Science Foundation (CCF-2052696).

References

  • git (2021) GitHub Copilot: Your AI pair programmer, 2021. URL https://github.com/features/copilot/.
  • Bavarian et al. (2022) Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.
  • Bierman et al. (2014) Gavin Bierman, Martín Abadi, and Mads Torgersen. Understanding TypeScript. In Richard Jones (ed.), ECOOP 2014 – Object-Oriented Programming, Lecture Notes in Computer Science, pp.  257–281, Berlin, Heidelberg, 2014. Springer. ISBN 978-3-662-44202-9. doi: 10.1007/978-3-662-44202-9˙11.
  • Campora et al. (2018) John Peter Campora, Sheng Chen, Martin Erwig, and Eric Walkingshaw. Migrating Gradual Types. Proceedings of the ACM on Programming Languages (PACMPL), 2(POPL), 2018.
  • Cartwright & Fagan (1991) Robert Cartwright and Mike Fagan. Soft ty**. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1991.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating Large Language Models Trained on Code, July 2021. URL http://arxiv.longhoe.net/abs/2107.03374. arXiv:2107.03374 [cs].
  • DeMillo et al. (1978) R.A. DeMillo, R.J. Lipton, and F.G. Sayward. Hints on Test Data Selection: Help for the Practicing Programmer. Computer, 11(4):34–41, April 1978. ISSN 1558-0814. doi: 10.1109/C-M.1978.218136. URL https://ieeexplore.ieee.org/document/1646911. Conference Name: Computer.
  • Dong et al. (2023) Xiangjue Dong, Yibo Wang, Philip S Yu, and James Caverlee. Probing explicit and implicit gender bias through llm conditional text generation. arXiv preprint arXiv:2311.00306, 2023.
  • Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1:1, 2021.
  • Fried et al. (2023) Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. InCoder: A Generative Model for Code Infilling and Synthesis. In International Conference on Learning Representations (ICLR), 2023.
  • Geva et al. (2022) Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  30–45, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.3. URL https://aclanthology.org/2022.emnlp-main.3.
  • Guo et al. (2024) Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence, January 2024. URL http://arxiv.longhoe.net/abs/2401.14196. arXiv:2401.14196 [cs].
  • Harper & Mitchell (1993) Robert Harper and John C. Mitchell. On the type structure of standard ML. ACM Transactions on Programming Languages and Systems, 15(2):211–252, April 1993. ISSN 0164-0925. doi: 10.1145/169701.169696. URL https://dl.acm.org/doi/10.1145/169701.169696.
  • Hellendoorn et al. (2018) Vincent J. Hellendoorn, Christian Bird, Earl T. Barr, and Miltiadis Allamanis. Deep Learning Type Inference. In ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2018.
  • Hendel et al. (2023) Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. arXiv preprint arXiv:2310.15916, 2023.
  • Henglein & Rehof (1995) Fritz Henglein and Jakob Rehof. Safe polymorphic type inference for a dynamically typed language: Translating Scheme to ML. In International Conference on Functional Programming Languages and Computer Architecture (FPCA), 1995.
  • Hooda et al. (2024) Ashish Hooda, Mihai Christodorescu, Miltos Allamanis, Aaron Wilson, Kassem Fawaz, and Somesh Jha. Do large code models understand programming concepts? a black-box approach. arXiv preprint arXiv:2402.05980, 2024.
  • Hu et al. (2024) Zichao Hu, Francesca Lucchetti, Claire Schlesinger, Yash Saxena, Anders Freeman, Sadanand Modak, Arjun Guha, and Joydeep Biswas. Deploying and Evaluating LLMs to Program Service Mobile Robots. IEEE Robotics and Automation Letters, pp.  1–8, 2024. ISSN 2377-3766. doi: 10.1109/LRA.2024.3360020. URL https://ieeexplore.ieee.org/document/10416558.
  • Ilharco et al. (2022) Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
  • Jesse et al. (2021) Kevin Jesse, Premkumar T. Devanbu, and Toufique Ahmed. Learning type annotation: is big data enough? In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp.  1483–1486, Athens Greece, August 2021. ACM. ISBN 978-1-4503-8562-6. doi: 10.1145/3468264.3473135. URL https://dl.acm.org/doi/10.1145/3468264.3473135.
  • Jesse et al. (2022) Kevin Jesse, Premkumar Devanbu, and Anand Ashok Sawant. Learning To Predict User-Defined Types. IEEE Transactions on Software Engineering, pp.  1–1, 2022. ISSN 0098-5589, 1939-3520, 2326-3881. doi: 10.1109/TSE.2022.3178945. URL https://ieeexplore.ieee.org/document/9785755/.
  • Just et al. (2014) René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. Are mutants a valid substitute for real faults in software testing? In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pp.  654–665, New York, NY, USA, November 2014. Association for Computing Machinery. ISBN 978-1-4503-3056-5. doi: 10.1145/2635868.2635929. URL https://doi.org/10.1145/2635868.2635929.
  • Kocetkov et al. (2023) Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. The Stack: 3 TB of permissively licensed source code. In Deep Learning for Code Workshop (DL4C), 2023. URL http://arxiv.longhoe.net/abs/2211.15533.
  • Li et al. (2024) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024.
  • Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. StarCoder: may the source be with you! Transactions of Machine Learning Research (TMLR), December 2023.
  • Li et al. (2022) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, December 2022. doi: 10.1126/science.abq1158. URL https://www.science.org/doi/full/10.1126/science.abq1158. Publisher: American Association for the Advancement of Science.
  • McGrath et al. (2023) Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, and Shane Legg. The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771, 2023.
  • Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT, October 2022. URL http://arxiv.longhoe.net/abs/2202.05262. arXiv:2202.05262 [cs].
  • Migeed & Palsberg (2020) Zeina Migeed and Jens Palsberg. What is Decidable about Gradual Types? Proceedings of the ACM on Programming Languages (PACMPL), 4(POPL), 2020.
  • Mir et al. (2021) Amir M Mir, Evaldas Latoškinas, and Georgios Gousios. Manytypes4py: A benchmark python dataset for machine learning-based type inference. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pp.  585–589. IEEE, 2021.
  • Nam et al. (2024) Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. Using an llm to help with code understanding. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), pp.  881–881. IEEE Computer Society, 2024.
  • Pandi et al. (2021) Irene Vlassi Pandi, Earl T. Barr, Andrew D. Gordon, and Charles Sutton. Probabilistic Type Inference by Optimising Logical and Natural Constraints, 2021. URL https://arxiv.longhoe.net/abs/2004.00348v3.
  • Park et al. (2023) Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023.
  • Phipps-Costin et al. (2021) Luna Phipps-Costin, Carolyn Jane Anderson, Michael Greenberg, and Arjun Guha. Solver-based Gradual Type Migration. Proceedings of the ACM on Programming Languages (PACMPL), 5(OOPSLA), 2021. doi: https://doi.org/10.1145/3485488.
  • Politz et al. (2013) Joe Gibbs Politz, Alejandro Martinez, Mae Milano, Sumner Warren, Daniel Patterson, Junsong Li, Anand Chitipothu, and Shriram Krishnamurthi. Python: the full monty. In ACM SIGPLAN Conference on Object Oriented Programmingm, Systems, Languages and Applications (OOPSLA), pp.  217–232, Indianapolis, IN, USA, October 2013. ACM. ISBN 978-1-4503-2374-1. doi: 10.1145/2509136.2509536. URL https://dl.acm.org/doi/10.1145/2509136.2509536.
  • Rastogi et al. (2012) Aseem Rastogi, Avik Chaudhuri, and Basil Hosmer. The Ins and Outs of Gradual Type Inference. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2012.
  • Rimsky et al. (2023) Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.
  • Rozière et al. (2024) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, **gyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code Llama: Open Foundation Models for Code, January 2024. URL http://arxiv.longhoe.net/abs/2308.12950. arXiv:2308.12950 [cs].
  • Schäfer et al. (2024) Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. Adaptive Test Generation Using a Large Language Model. IEEE Transactions on Software Engineering (TSE), 50(1), 2024. doi: 10.1109/TSE.2023.3334955.
  • Siek & Taha (2006) Jeremy G. Siek and Walid Taha. Gradual Ty** for Functional Languages. In Scheme Workshop, 2006.
  • Siek & Vachharajani (2008) Jeremy G. Siek and Manish Vachharajani. Gradual Ty** with Unification-based Inference. In ACM SIGPLAN International Symposium on Dynamic Languages (DLS), 2008.
  • Subramani et al. (2022) Nishant Subramani, Nivedita Suresh, and Matthew E Peters. Extracting latent steering vectors from pretrained language models. arXiv preprint arXiv:2205.05124, 2022.
  • Tambon et al. (2024) Florian Tambon, Arghavan Moradi Dakhel, Amin Nikanjam, Foutse Khomh, Michel C Desmarais, and Giuliano Antoniol. Bugs in large language models generated code. arXiv preprint arXiv:2403.08937, 2024.
  • Tobin-Hochstadt & Felleisen (2006) Sam Tobin-Hochstadt and Matthias Felleisen. Interlanguage Migration: From Scripts to Programs. In ACM SIGPLAN International Symposium on Dynamic Languages (DLS), 2006.
  • Turner et al. (2023) Alex Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
  • Variengien & Winsor (2023) Alexandre Variengien and Eric Winsor. Look before you leap: A universal emergent decomposition of retrieval tasks in language models. arXiv preprint arXiv:2312.10091, 2023.
  • Vig et al. (2020) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber. Causal mediation analysis for interpreting neural nlp: The case of gender bias. arXiv preprint arXiv:2004.12265, 2020.
  • Wei et al. (2020) Jiayi Wei, Maruth Goyal, Greg Durrett, and Isil Dillig. LambdaNet: Probabilistic Type Inference using Graph Neural Networks. In International Conference on Learning Representations (ICLR), 2020.
  • Weiss & Yahav (2013) Dror Weiss and Eran Yahav. Tabnine: AI Assistant for software developers, 2013. URL https://www.tabnine.com/.
  • Yang et al. (2024) Ke Yang, Jiateng Liu, John Wu, Chaoqi Yang, Yi R Fung, Sha Li, Zixuan Huang, Xu Cao, Xingyao Wang, Yiquan Wang, et al. If llm is the wizard, then code is the wand: A survey on how code empowers large language models to serve as intelligent agents. arXiv preprint arXiv:2401.00812, 2024.
  • Yee & Guha (2023) Ming-Ho Yee and Arjun Guha. Do Machine Learning Models Produce TypeScript Types that Type Check? In European Conference on Object Oriented Programming (ECOOP), 2023.
  • Zelikman et al. (2024) Eric Zelikman, Qian Huang, Gabriel Poesia, Noah Goodman, and Nick Haber. Parsel: Algorithmic reasoning with language models by composing decompositions. Advances in Neural Information Processing Systems, 36, 2024.

Appendix A StarCoderBase-7B Steering Dataset Sizes

Table 4 reports the sizes of our steering and evaluation datasets for StarCoderBase-7B.

Python TypeScript
Edit Type Steering Held-out Steering Held-out
Rename variables 1,09410941,0941 , 094 465 944 385
Rename types 796 343 773 332
Remove type annotations 1,52815281,5281 , 528 706 875 344
Rename variables & types 1,62216221,6221 , 622 759 1,42314231,4231 , 423 577
Rename variables & remove type annotations 2,00020002,0002 , 000 1,24612461,2461 , 246 1,27012701,2701 , 270 478
Rename types & remove type annotations 2,00020002,0002 , 000 977 893 383
All edits 2,00020002,0002 , 000 1,04410441,0441 , 044 1,66316631,6631 , 663 630
Table 4: Overview of StarCoderBase-7B dataset sizes and applied code edits.

Appendix B Steering Accuracy on Construction Set

In § 4.2 we evaluated steering accuracy on held-out evaluation data. In this section, we evaluate accuracy on programs used to construct the steering vectors themselves. Our results in table 5 show that there is no significant difference in accuracy between held-out and steering sets.

Python TypeScript
Edit Type Steering Held-out Steering Held-out
Rename variables 0.23 0.24 0.29 0.29
Rename types 0.89 0.90 0.71 0.75
Remove type annotations 0.44 0.48 0.41 0.41
Rename types & variables 0.56 0.56 0.47 0.49
Rename types & remove annotations 0.60 0.60 0.61 0.63
Rename variables & remove annotations 0.39 0.39 0.33 0.30
All edits 0.42 0.36 0.49 0.51
(a)
Python TypeScript
Edit Type Steering Held-out Steering Held-out
Rename variables 0.19 0.20 0.29 0.26
Rename types 0.82 0.84 0.62 0.69
Remove type annotations 0.46 0.49 0.40 0.39
Rename types & variables 0.50 0.56 0.36 0.36
Rename types & remove annotations 0.52 0.59 0.63 0.67
Rename variables & remove annotations 0.33 0.41 0.29 0.27
All edits 0.46 0.50 0.40 0.40
(b)
Table 5: Steering StarCoderBase-1B and 7B on several datasets of semantics-preserving code edits. For each dataset, we report the accuracy on a held out set of negative prompts. Under the Steering column, we report the accuracy of steering vectors on the set of negative prompts used to construct the vectors.

Appendix C Accuracy by Type

We investigate whether steering vectors are biased towards particular types. Specifically, we look at whether steering vectors have higher accuracy on type labels that appear more frequently in the steering set. Typically, these types are builtin types such as string or str which are most commonly represented in type annotations. We look at the All Edits subsets of Python and Typescript in figure 5 and figure 7, respectively. For each type label frequency ranging from 1 to 25, we plot the mean accuracy of a steering vector on the held-out set. Results show that steering vectors are not biased towards more frequent type labels.

Refer to caption
Figure 7: Accuracy by type for the TypeScript All Edits dataset. Error bars represent the interquartile range.