Hybrid Human-LLM Corpus Construction and LLM Evaluation for Rare Linguistic Phenomena

Leonie Weissweiler, Abdullatif Köksal, Hinrich Schütze
LMU Munich & Munich Center for Machine Learning
[email protected]

Abstract

Argument Structure Constructions (ASCs) are one of the most well-studied construction groups, providing a unique opportunity to demonstrate the usefulness of Construction Grammar (CxG). For example, the caused-motion construction (CMC, “She sneezed the foam off her cappuccino”) demonstrates that constructions must carry meaning, otherwise the fact that “sneeze” in this context causes movement cannot be explained. We form the hypothesis that this remains challenging even for state-of-the-art Large Language Models (LLMs), for which we devise a test based on substituting the verb with a prototypical motion verb. To be able to perform this test at statistically significant scale, in the absence of adequate CxG corpora, we develop a novel pipeline of NLP-assisted collection of linguistically annotated text. We show how dependency parsing and GPT-3.5 can be used to significantly reduce annotation cost and thus enable the annotation of rare phenomena at scale. We then evaluate GPT, Gemini, Llama2 and Mistral models for their understanding of the CMC using the newly collected corpus. We find that all models struggle with understanding the motion component that the CMC adds to a sentence.

1 Introduction

\ex

. She sneezed the foam off her cappuccino.

\ex

. They laughed him off the stage.

These are two examples of the caused-motion construction (CMC) in which the verb behaves unusually: sneeze and laugh typically do not take multiple arguments, nor do they typically convey that something was moved by sneezing/laughing. This poses a challenge to any naive form of lexical semantics: it would not make sense for someone writing a dictionary to include, for each intransitive verb, the meaning and valency of the CMC. Almost any verb can appear in the CMC as long as we can imagine a scenario in which the action it describes causes motion. The fact that humans easily understand the CMC showcases a main feature of Construction Grammar (Croft, 2001; Goldberg, 1995): the meaning is attached to the construction itself, and not the verb. Putting the verb into this construction adds the new meaning and valency. This is one reason that constructions pose a challenge to Large Language Models (LLMs), as they would have to learn to attach the meaning to this construction and retrieve it when necessary. Its extreme rareness and productivity makes it impossible to memorise all instances and memorization would not be sufficient because the meaning shift to the verb is creative and is influenced by the specific context.

The research questions of this paper therefore are: Have LLMs learned the meaning of the CMC and how can we construct the resources needed to determine the status of CMC in LLMs?

We first address the second question, of collecting data for this at scale. This is challenging for several reasons. First, the CMC is a very rare phenomenon. Second, we are mostly interested in instances that are non-prototypical, i.e., where the verb does not typically encode motion, unlike e.g. ‘kick’ or ‘throw’. Third, this construction cannot be automatically identified using only syntactic criteria: words might be in the correct syntactic slots required by the CMC, but not create a CMC reading if the semantics of the sentence do not fit. For example, “I would take that into account” is structurally identical to the examples above, but nothing is moving.

This shows that there is a crucial semantic component. The rarity makes it infeasible to manually sift through a corpus to collect a dataset of the CMC, while the semantic complexity makes it infeasible to do so fully automatically.

In this way, we consider the CMC exemplary of rare phenomena of language that have been largely set aside in Computational Linguistics and in recent evaluation of LLMs in particular. This may be due to them being considered the periphery of language, rather than the core Chomsky (1993), or simply due to the described difficulty in finding appropriate data to investigate both the phenomena and their representation in LLMs. However, it is our point of view that as the performance of such models increases across the board, it is vital to turn to “edge cases” to accurately identify performance gaps. This is particularly important as rare phenomena may be indicators of systematic underlying problems of an NLP paradigm.

To study rare phenomena, we need data. To this end, in section 3 we propose a novel annotation pipeline that combines dependency parsing with an advanced LLM, GPT-3.5 (OpenAI, 2022). The aim of our pipeline is to minimise the cost of running GPT-3.5 and compensating human annotators, while maximising the number of positive, manually verified, linguistically diverse instances in the dataset.

We further use insights about the semantic regularities in the CMC to automatically expand this to a second corpus of 127,955 sentences, which have not been manually verified, but with high likelihood are also CMC instances. After creating our corpus, we now return to our aim of evaluating state-of-the-art LLMs for their understanding of the CMC, as an example of a semantically challenging “edge case”.

In Section 4, we therefore develop a test for different LLMs’ understanding of the CMC, by giving an instance and asking if the direct object is physically moving. We then replace the non-prototypical verb (e.g., “sneeze”) by a prototypical one that always encodes motion (e.g., “throw”) and ask the model again if the direct object is moving. We expect models that do not fully understand the CMC to fail to consistently answer both questions with “yes”. We observe that all models struggle with this task. Mixtral 8x7b performs best, but still has an error rate of over 30%.

We make three main contributions:

•

We propose a hybrid human-LLM corpus construction method and show its effectiveness for non-prototypical CMC, an extremely rare phenomenon. We discuss how our design and our guidelines can be applied to data collection needs for other linguistic phenomena.
•

We release a corpus of manually verified instances of the CMC of 765 sentences and a second, semi-automatically annotated corpus of 127,955 sentences.
•

We evaluate different sizes of GPT, Llama2, Mistral, and Gemini models on their understanding of the CMC and find that all models perform poorly, with the best model only achieving 69.75%.

2 Related Work

Evaluation of LLMs’ Understanding of Constructions. Tayyar Madabushi et al. (2020) conclude that BERT Devlin et al. (2019) can classify whether two sentences contain instances of the same construction. Tseng et al. (2022) show that LMs have higher prediction accuracy on fixed than on variable syntactic slots and infer that LMs acquire constructional knowledge (i.e., they understand the “syntactic context” needed to identify a fixed slot). Weissweiler et al. (2022) find that LLMs reliably discriminate instances of the English Comparative Correlative (CC) from superficially similar contexts. However, LLMs do not produce correct inferences from them, i.e., they do not understand its meaning.

Most related to this work, Li et al. (2022) probe for LMs’ handling of four ASCs: ditransitive, resultative, caused-motion, and removal. They adapt the findings of Bencini and Goldberg (2000), who used a sentence sorting task to determine whether human participants perceive the argument structure or the verb as the main factor in the sentence meaning. They find that, while human participants prefer sorting by the construction more if they are more proficient English speakers, PLMs show the same effect in relation to training data size. In a second experiment, they then insert random verbs that are incompatible with one of the constructions, and measure the Euclidean distance between the verbs’ contextual embedding and that of a verb that is prototypical for the construction. They demonstrate that construction information is picked up by the model, as the contextual embedding of the verb is brought closer to the corresponding prototypical verb embedding.

Mahowald (2023) investigates GPT-3’s Brown et al. (2020) understanding of the English AANN construction, assessing its grasp of the construction’s semantic and syntactic constraints. Utilising a few-shot prompt based on the CoLA corpus of linguistic acceptability (Warstadt et al., 2019), he creates artificial AANN variants as probing data. GPT-3’s performance on the linguistic acceptability task is found to align with human judgments across most conditions.

Linguistic Annotation with GPT. Since the release of ChatGPT, numerous papers have proposed to use it as an annotator. Gilardi et al. (2023) find that ChatGPT outperforms crowd-workers on tasks such as topic detection. Yu et al. (2023) and Savelka and Ashley (2023) evaluate the accuracy of GPT-3.5 and GPT-4 against human annotators, while Koptyra et al. (2023) annotate a corpus of data labelled for emotion by ChatGPT, but acknowledge its lower accuracy compared to a human-annotated version. In the area of Construction Grammar, Torrent et al. (2023) use ChatGPT to generate novel instances of constructions.

Most related to our work are papers that propose a cooperation between the LLM and the human annotator. Holter and Ell (2023) create a small gold standard for industry requirements by generating an initial parse tree with GPT-3 and then correcting it with a human annotator. Pangakis et al. (2023) investigate LLM annotation performance on 27 different tasks in two steps. First, annotators compile a codebook of annotation guidelines, which is then given to the LLM as help for annotation, and then the codebook is refined by the annotators in a second step. However, they find little to no improvement from the second step. Gray et al. (2023) make an LLM pre-generate labels for legal text analytics tasks which are then corrected by human annotators, but find that this does not speed up the annotation process.

In contrast, our work proposes a hybrid human-LLM pipeline that minimizes the cost of dataset creation. We emphasise prompt design and engineering, a critical factor in effective use of LLMs.

Computational Approaches to Argument Structure Constructions (ASCs). In addition to probing work discussed above, ASCs have also been studied from a computational perspective. Kyle and Sung (2023) leverage a UD-parsed corpus as well as FrameNet Ruppenhofer et al. (2016) semantic labelling to annotate a range of ASCs.

Hwang and Palmer (2015) identify CMCs and four different subtypes based on linguistic features. Some of these are automatically generated, but others are gold annotations. This limits the applicability to large, unannotated corpora.

Hwang and Kim (2023) conduct an automatic analysis of constructional diversity to predict ESL speakers’ language proficiency. Similar to our first filtering step, they perform an automatic dependency parse and then identify a range of constructions, including the CMC, using a decision tree built on the parse. They do not employ any further filtering.

Refer to caption — Figure 1: Flowchart of our annotation pipeline. For details of each step refer to §3.

3 Data Collection

We address the following problem setting.

A linguistic researcher (a computational or corpus linguist) wants to investigate a rare phenomenon. They want to find as many examples of the phenomenon as possible, which must be verified by human annotators. There are three tools at their disposal: linguistic resources, GPT-3.5, and human annotators. The human annotators will generally be experts in linguistic annotation.

Our key idea is that data collection will proceed in a pipeline, where a corpus is first filtered using dependency parsing and the syntactic constraints of the phenomenon, the output is further filtered with prompt-based classification using GPT, and the sentences which it labels as positive are then manually annotated by a human. Each step in the pipeline is meant to further concentrate the rate of instances in the corpus that will then be manually annotated, therefore reducing total annotation effort.

The main cost of data collection is the cost for the GPT-3.5 API and for human annotators. We assume that any expenses for linguistic resources and the computational infrastructure at our disposal are negligible in comparison. Our aim is to minimise the cost for GPT-3.5 and annotators while maximising the number of positive, manually verified, diverse instances. We assume that the use case of the linguistic resource is such that it needs to be manually annotated, so that we cannot simply fully rely on an LLM.

We propose a way of computing the cost for this problem setting and a pipeline for producing a novel linguistic resource while minimising cost.

Our main goal is minimising the cost per confirmed instance; however, we also have a secondary goal: the final set of instances should be diverse. Regardless of the specific goals of the linguistic researcher, it is unlikely that they would be served by a set of sentences that do not represent the true diversity of the phenomenon. Extreme cost-minimising measures – such as making the dependency filtering rules described in §3.1 too strict or asking GPT-3.5 to provide examples of the phenomenon – would therefore be counterproductive.

The baseline here is to take an annotator, give them a corpus, set them on the task of reading through it and marking all sentences that contain instances of the phenomenon. Let $C\mbox{${}_{\hbox{HR}}$}$ be the fixed cost of the human review of one sentence and $N$ the size of the corpus. Then the total cost of creating the resource is $J\mbox{${}_{\hbox{naive}}$}=C\mbox{${}_{\hbox{HR}}$}N$ as the annotator has to sift through all sentences. Our problem setting is that the phenomenon is rare, i.e., the corpus contains very few true positives. Thus, $\mbox{TP}\ll N$ and the number of positive instances TP found after annotating the corpus will be small. This makes this method cost-prohibitive. We therefore turn to dependency parsing with spacy Honnibal and Montani (2017) for prefiltering.

3.1 Step 1: Dependency Parsing

3.1.1 General Considerations

Figure 1 shows our pipeline. In the first step, we dependency-parse the corpus and apply a pattern to filter out all sentences that, with high likelihood, are not instances of the phenomenon.

For this dependency annotation, we could rely on annotated treebanks such as Universal Dependencies de Marneffe et al. (2021). But to find a diverse and sufficiently large set of instances, particularly in languages other than English, available treebanks may not be large enough for the rare phenomenon that we are targeting.

We therefore turn to automated dependency parsing to annotate large amounts of data. We assume that this part of the pipeline has negligible cost, as it is carried out either through a web service or on existing infrastructure.

After dependency parsing, we want filters that preserve the diversity of the found sentences. We therefore design subtree filters that preserve recall above all else. This is especially advisable as parsing will lead to some parsing errors that we want to be tolerant of, and the rarer the phenomena that we are targeting, the more likely they are to be parsed incorrectly, as even human annotation guidelines might be inconsistent for them.

To write the patterns, we recommend to start with a list of gold instances, dependency parse them and manually look at the similarities. If one wants to work only with a treebank, it might be advisable to start at first with a highly lexicalised version of the phenomenon to find a first example sentence and go on from there. In terms of tools, if one does not want to write the code for finding the instances themselves, available tools include GrewMatch,¹¹1https://match.grew.fr/ DepEdit Peng and Zeldes (2018), Corpus Workbench²²2https://cwb.sourceforge.io/index.php and SPIKE Shlain et al. (2020).

3.1.2 Designing a Dependency Subtree for the CMC

Figure 2 shows the dependency subtree we use as a filter for CMC. It was designed by automatically parsing examples of positive and negative instances from Goldberg (1992), and manually designing subtrees that cover the positive instances except for clear parsing errors, while excluding as many of the negative examples as possible. We run an automatic dependency parser from the spacy toolkit³³3version 3.2.0 on the reddit corpus Baumgartner et al. (2020) and extract all matching sentences. This also allows us to extract the location of the potential CMC instance and its parts as a side product of the filtering step: We extract the sentence, the lemmatised verb, direct object, preposition, and prepositional object, as well as their positions in the sentence.

3.2 Step 2: Prompt-based Few-shot Classification with GPT

3.2.1 General Considerations

Even after dependency-based filtering, the positive instances would still be very rare in the output and it is therefore not feasible that the output is directly annotated by a human. We therefore introduce a further filtering step with GPT to “concentrate” the positive instances even more (Step 2 of the pipeline, Figure 1), i.e. we want GPT to remove most negative instances while kee** as many positive instances as possible. The remaining data can then be cost-effectively annotated by the human annotator. The aim is to reduce the cost per instance (i.e., cost per true positive, TP) as much as possible.

There are two components of the cost: the cost of querying GPT and the cost of human annotation. Our two key ideas are:

•

We consider the two costs jointly and optimize the pipeline for overall lowest cost per TP.
•

Design and selection of the prompt used with the API is a major determinant for the cost of the pipeline. We propose a workflow for creating effective prompts.

A particular prompt may require many tokens in total, thereby incurring a higher API cost. But it may also have high accuracy, thereby reducing the cost of human annotation. We jointly consider both cost components when designing and selecting prompts.

Minimizing the cost per true positive

Given an annotated development set $V$ that is representative of the corpus that we want to annotate with our pipeline, let $J(C\mbox{${}_{\hbox{HR}}$},C\mbox{${}_{\hbox{API}}$},i)$ be the cost per true positive where $C\mbox{${}_{\hbox{HR}}$}$ is the human annotation cost per sentence, $C\mbox{${}_{\hbox{API}}$}$ is the cost of processing an input/output token with the API and $i$ (for instruction) is a prompt. We can then estimate $J(C\mbox{${}_{\hbox{HR}}$},C\mbox{${}_{\hbox{API}}$},i),$ the cost per true positive, as follows:

\frac{C\mbox{${}_{\hbox{API}}$}t(V,i)+C\mbox{${}_{\hbox{HR}}$}(\mbox{TP}(V,i)+% \mbox{FP}(V,i))}{\mbox{TP}(V,i)}

(1)

where we process the development set using the API and prompt $i$ and record: $\mbox{TP}(V,i)$ , the number of true positives, $\mbox{FP}(V,i))$ , the number of false positives, and $t(V,i)$ , the sum of the number of tokens input to the API and the number of tokens returned by the API.

We create a variety of different prompts $i$ and then select our final prompt $i^{\prime}$ as the one with the lowest per-TP cost:

i^{\prime}=\mbox{argmin}_{i}J(C\mbox{${}_{\hbox{HR}}$},C\mbox{${}_{\hbox{API}}% $},i)

Determining the size of the input corpus

When applying our pipeline, we often will have a requirement for a specific number TP ${}_{\hbox{req}}$ of the phenomenon, e.g., $\mbox{TP}\mbox{${}_{\hbox{req}}$}=1000$ instances of the CMC. After selecting a prompt $i$ and determining $\mbox{TP}(V,i)$ on the development set, we can estimate the size $N$ of the input corpus that will result in a set of TP ${}_{\hbox{req}}$ instances to be output by the pipeline as:

N:=|V|\frac{\mbox{TP}\mbox{${}_{\hbox{req}}$}}{\mbox{TP}(V,i)}

3.2.2 Prompt Engineering and Few-shot Selection

We suggest to first manually annotate a development set of positive and negative instances. The development set should be as representative of the corpus to be annotated as possible, including the rate of TPs. This development set can then be used, along with a small fixed budget, to develop prompts and finally select the best-performing prompt. We present possible strategies for prompt development, using CMC as the running example.

Development Set

For creating the development set $V$ , we manually annotate 504 (133 $P$ , 371 $N$ ) sentences from the output of the dependency filtering step. To ensure that $V$ is both diverse and relevant, we group the prefiltered dataset by verb, and starting with the highest-frequency verbs, take at most 5 positive and 5 negative sentences from every verb, where no preposition appears twice in either the positive or the negative sentences selected. We choose 5 shots from each class to be included as examples in the prompt, which are a mixture of sentences taken from Goldberg (1992) and additional data that was not used for $V$ . Using this development set $V$ , we can determine TP $(V,i)$ , FP $(V,i)$ and $t(V,i)$ of each prompt and compute its per-TP cost using Eq. 1.

Iterative Prompt Development

We start with a simple base prompt and iteratively attempt improvements to it. The total cost of this experimentation was about $10. The full details of all attempted prompts are given in the appendix in Section C.⁴⁴4The base model used was gpt-3.5-turbo-1106, the GPT-4 model was gpt-4-0125-preview.

Most improvements on the base prompt are uncontroversial, as they decrease the total cost under the assumption that $C\mbox{${}_{\hbox{HR}}$}$ is not less than $0.0001 (and the cost per API token is dominated by the cost of human annotation). These include:

•

long task description
•

long system prompt
•

JSON format for input and output
•

few-shots alternate by class
•

explanations for the labels added to the few-shots and demanded from the model
•

only 10 sentences classified at one time

				Sent’s to Annotate			Total Cost
P	Details	Prec.	Rec.	LLM	Human	API	$C\mbox{${}_{\hbox{HR}}$}$ =$.001	$C\mbox{${}_{\hbox{HR}}$}$ =$0.2	$C\mbox{${}_{\hbox{HR}}$}$ =$1	$C\mbox{${}_{\hbox{HR}}$}$ =$2
5	Simple	56.54	71.42	5,304	1,768	$0.4	$2.2	$177	$1,768	$3,537
12	5 + expl.	83.75	50.37	7,522	1,194	$1.4	$2.6	$120	$1,195	$2,389
17	12 + GPT-4	90.09	81.96	5,040	1,110	$16.2	$17.3	$127	$1,126	$2,236
18	17 + best-of-3	91.81	75.93	4,989	1,089	$48.3	$49.4	$157	$1,137	$2,226
-	Human only	-	-	-	3,789	$0.0	$3.8	$575	$3,789	$7,578

Table 1: A comparison of the four best-performing prompts (5, 12, 17, 18) for different values of

C\mbox{${}_{\hbox{HR}}$}

. P = Prompt. We give numbers (sentences that need to be annotated by LLM/human) for a scenario in which the desired size of the final resource (output of pipeline when applied to the raw corpus) is

N=1000

. The human baseline depends solely on the rate of TPs (which is higher here than for the raw corpus to be processed by the pipeline as the development set contains more positive instances).

Table 1 illustrates the prompt development process and how the choice of most cost-effective prompt depends on human cost $C\mbox{${}_{\hbox{HR}}$}$ and API cost $C\mbox{${}_{\hbox{API}}$}$ . Prompts are numbered (reflecting the sequence in which we created them). Prompt 5 is a simple short prompt (consequently: low API cost). Its precision is low and so the human has to annotate more sentences to produce the final dataset of $N=1000$ TPs of CMC. Prompts 12, 17, 18 are consecutive refinements of 5. They result in more input to and output from the API because explanations are given with the few-shots, the model is asked to provide explanations and (for 18) we classify each sentence three times (tripling the API expense). For 17 and 18, we use GPT-4, which is more capable, but also more expensive than GPT-3.5. Each refinement improves precision, which then decreases human cost.

The right four columns of the table illustrate how prompt selection depends on human cost. For (unrealistic) very low human cost ( $C\mbox{${}_{\hbox{HR}}$}$ =$.001), precision of the prompt does not matter much and the simplest prompt (which has low API cost) is most cost-effective. As human cost rises, the more expensive prompts become competitive since now human cost is the main factor and the API cost component becomes small in relative terms. For $C\mbox{${}_{\hbox{HR}}$}$ =$.2, prompt 12 is best; for $C\mbox{${}_{\hbox{HR}}$}$ =$1, prompt 17; and for $C\mbox{${}_{\hbox{HR}}$}$ =$2, prompt 18.

As our final prompt, we select prompt 12 as it is a good tradeoff between API cost and human cost.

3.3 Final Dataset Collection

I laughed myself onto the floor .

She had a bit of sauce on her lip so he kissed it off of her .

She was in pretty bad shape from drinking , so I helped her into bed and started the water .

He rushed me across the street , and down the sidewalk west towards Bedlam Mental Hospital .

I can pop my shoulder out of my socket .

Table 2: Examples from the final dataset. Verbs are highlighted in green, direct objects in purple, prepositions in blue, and prepositional objects in red.

Using prompt 12, we collect a final dataset with a $6 budget for the GPT-3.5-API and manually annotate the output. Out of 20,408 sentences classified, 1303 are judged by the model as positive. We annotate these by hand and find 632 positive instances. We combine these with the 133 positive instances from our development set for a final number of 765 hand-annotated CMC instances.⁵⁵5We hope to be able to release larger datasets using this method in the future, given a larger computational and annotation budget.

We observe that the 4-tuple of $<$ verb, direct object, preposition, prepositional object $>$ almost always perfectly determines the class. We use this observation to extrapolate from our manually annotated sentences to all other sentences with the same 4-tuples and release a second dataset of 127,955 high-likelihood CMC instances.

4 Evaluation of LLMs’ Understanding of the CMC

4.1 Methods

The goal of our evaluation is to assess different LLMs for their understanding of the CMC. The fact that the prompt engineering above, even with 10 few-shot examples and extensive task descriptions and explanations, is still far from perfect, suggests that this is a challenging task even for advanced models like GPT-3.5, which leads us to question if the same holds for other LLMs.

Our LLM evaluation setup in this section differs from prompt evaluation as we do not explicitly refer to the “caused-motion construction”, but rather prompt implicitly for the model’s understanding of the situation described. The key idea is that in a CMC sentence, something is always physically moving, even if the verb (e.g., “sneeze”) does not indicate this. The distinction between prototypical vs. non-prototypical instances is crucial here: for prototypical CMC instances (“throw”, “kick”), the verb already conveys the meaning component of motion while for non-prototypical CMC instances (“sneeze”, “laugh”) it does not and the LLM has to infer the additional meaning component of motion from the construction.

Our setup is to ask “In the sentence sentence, is direct_object moving, yes or no?”. We then replace the verb of the CMC with the appropriately inflected form of “throw”, and ask the same question again, using the structural information extracted by the dependency filtering step. We expect that models with no understanding of the CMC would answer “yes” both times only for prototypical instances, and switch from “no” to “yes” for non-prototypical ones. Models with a perfect understanding of the CMC would always answer “yes”. However, we do not group verbs into prototypical and non-prototypical ones a priori, as we support the view that this is on a gradient, and therefore the classes are not clear-cut: verbs like “brush” or “apply” may not always convey motion outside of the CMC, but certainly more often than “laugh”.

We conduct this experiment on our corpus of 765 hand-annotated sentences. As API-based LLMs, we investigate GPT-3.5, GPT-4 OpenAI (2022), and Gemini Pro Team et al. (2023). From the family of open LLMs, we further choose Llama2 Touvron et al. (2023) with sizes 7b, 13b, and the quantised version of 70b, and Mistral 7b Jiang et al. (2023) and Mixtral 8x7b Jiang et al. (2024), as well as their respective instruction-tuned versions. Models generate a sentence in response, which we then parse for versions of “yes” and “no”.

4.2 Quantitative Results

Table 3 presents the results in three groups. (i) Y $\rightarrow$ Y. The model answers “yes” both times and therefore demonstrates that it understands the CMC. (ii) N $\rightarrow$ Y. The model answers with “no” for the original sentence but changes its answer to “yes” when the verb is changed to “throw”, meaning that it does not understand the CMC. (iii) X $\rightarrow$ N. Even with “throw”, the model does not answer correctly that the direct object is moving. We consider these to be general failures of the model to understand the instruction, rather than the CMC specifically.

The main result of this evaluation is that all models perform poorly, with Mixtral 8x7b instruction-tuned (IT) performing best with 69.75% and an overall error rate of 30.25% (12.13+18.12). Mistral 7b IT is a close second. Surprisingly, GPT-4 follows at a distance of more than 10 points at 57.07%. All other models perform worse. This demonstrates that, despite the impressive progress LLMs have recently made, they are still far from perfect natural language understanding. Presumably, one reason is that non-prototypical CMC instances (e.g., “sneeze”) do not occur frequently in the training data. In addition, this type of construction is creative and complex use of language that is harder to generalize to than the linguistic behaviors that are tested in standard NLP benchmarks. Our finding may be useful as guidance for the further development of large language models. In addition to semantic/pragmatic shortcomings of LLMs (e.g., on reasoning) and “bad behavior” (output that is untruthful, offensive etc.), it suggests that syntax and in particular the syntax-semantics interface is also an area in which more progress is needed for LLMs to come closer to human levels of performance.

Surprisingly, our expectation that instruction-tuned models will generally be better at understanding the premise of the question (and therefore reduce X $\rightarrow$ N answers) only holds for Mistral, but not for Llama2. Higher rates of X $\rightarrow$ N are generally associated with lower rates of N $\rightarrow$ Y, suggesting that models that understand the question better also understand the CMC better. Most surprisingly, the worst-performing models were the two GPT models, Gemini Pro, and Mixtral 8x7b with instruction tuning, while the best-performing models were smaller sizes of Llama2.

4.3 Qualitative Results

We hypothesised that prototypical motion verbs would lead to the model answering yes both times (Y $\rightarrow$ Y), as the original sentence already obviously contains motion. It might also make sense for these verbs to be more frequent in the category of the model answering “no” both times (X $\rightarrow$ N). While this would not be the correct answer, a model that is confused and answers incorrectly about a sentence with “kick” would likely give the same answer for “throw”. We also anticipated that the frequency of a model changing its answer would be the higher, the less prototypical the original verb is (i.e., a “no” for highly non-prototypical “sneeze” more often than a “no” for more prototypical “brush”).

Investigating the distribution of verbs over the output classes for GPT-4, we indeed find that the most frequent five verbs of the Y $\rightarrow$ Y are all highly prototypical motion verbs: fling, chuck, pull, teleport, slam. In contrast, the most frequent five verbs for N $\rightarrow$ Y are not: steal, eat, tie, smash, separate.

Family	Model	IT	Y $\rightarrow$ Y	N $\rightarrow$ Y	X $\rightarrow$ N
GPT	3.5	+	43.20	10.70	46.10
GPT	4	+	57.07	11.23	31.70
Gemini	Pro	+	43.43	12.70	43.87
Llama2	7b	–	9.54	1.09	89.37
	7b	+	21.93	1.77	76.29
	13b	–	53.00	8.72	38.28
	13b	+	5.59	1.23	93.19
	70b ${}_{Q}$	–	36.65	7.36	55.99
	70b ${}_{Q}$	+	37.87	5.59	56.54
Mistral	7b	–	34.20	4.50	61.31
	7b	+	68.12	8.45	23.43
	8x7b	–	35.29	9.95	54.77
	8x7b	+	69.75	12.13	18.12

Table 3: LLM evaluation results. IT=instruction-tuned. Q=quantised.

5 Conclusion

Our paper has made several contributions. We have introduced an annotation pipeline aided by dependency parsing and prompting GPT-3.5, which can be specifically used for phenomena that are so rare that little to no corpora have been created, as the human annotation effort would be too great. We have demonstrated this pipeline on the example of the caused-motion construction, and created corpora of 765 manually annotated and 127,955 automatically annotated (but high-confidence) CMC examples. We have used the manually annotated corpus to evaluate state-of-the-art LLMs for their understanding of the CMC, and found that they have high error rates ( $>$ 30%) when asked to interpret situations described with a non-prototypical CMC.

We hope that our work will inspire more computational and corpus-based studies of rare linguistic phenomena. We note that even though prompt engineering is complex, large gains can be achieved by using intermediate-complexity prompts and basic knowledge of LLMs. We are confident that further advances in instruction-tuned LLMs will make the cost-benefit ratio of incorporating them into this hybrid annotation pipeline even stronger.

We see several opportunities for interesting future work in both halves of the paper. For the data collection part, it is a promising engineering direction to develop tools that automate parts of this process so that it becomes available to linguists without the need for complex prompt engineering. Continued progress in LLMs is likely to make the process even more efficient.

Concerning the evaluation of LLMs’ understanding of constructions, a straightforward direction for future work would be to expand to the other three Argument Structure Constructions described in Goldberg (1992).

Limitations

The data collection section of our work is limited because it draws on GPT-3.5, a commercial non-open model. While the general principles and guidelines for data collection hold for any LLM that is used for data annotation, some of the specific takeaways, especially the cost formulation, are specific to GPT-3.5. Further, the final size of our dataset is limited by our budget for prompting GPT-3.5, as well as the time available for annotation.

References

Baumgartner et al. (2020) Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The pushshift reddit dataset. CoRR, abs/2001.08435.
Bencini and Goldberg (2000) Giulia ML Bencini and Adele E Goldberg. 2000. The contribution of argument structure constructions to sentence meaning. Journal of Memory and Language, 43(4):640–651.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
Chomsky (1993) Noam Chomsky. 1993. Lectures on government and binding: The Pisa lectures. 9. Walter de Gruyter.
Croft (2001) William Croft. 2001. Radical construction grammar: Syntactic theory in typological perspective. Oxford University Press on Demand.
de Marneffe et al. (2021) Marie-Catherine de Marneffe, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. 2021. Universal Dependencies. Computational Linguistics, 47(2):255–308.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Gilardi et al. (2023) Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. Chatgpt outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.
Goldberg (1995) Adele E.. Goldberg. 1995. Constructions: A construction grammar approach to argument structure. University of Chicago Press.
Goldberg (1992) Adele Eva Goldberg. 1992. Argument structure constructions. University of California, Berkeley.
Gray et al. (2023) Morgan Gray, Jaromir Savelka, Wesley Oliver, and Kevin Ashley. 2023. Can gpt alleviate the burden of annotation? In Legal Knowledge and Information Systems, pages 157–166. IOS Press.
Holter and Ell (2023) Ole Magnus Holter and Basil Ell. 2023. Human-machine collaborative annotation: A case study with gpt-3. In Proceedings of the 4th Conference on Language, Data and Knowledge, pages 193–206.
Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
Hwang and Kim (2023) Haerim Hwang and Hyunwoo Kim. 2023. Automatic analysis of constructional diversity as a predictor of efl students’ writing proficiency. Applied Linguistics, 44(1):127–147.
Hwang and Palmer (2015) Jena D. Hwang and Martha Palmer. 2015. Identification of caused motion construction. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 51–60, Denver, Colorado. Association for Computational Linguistics.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b.
Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts.
Koptyra et al. (2023) Bartłomiej Koptyra, Anh Ngo, Łukasz Radliński, and Jan Kocoń. 2023. Clarin-emo: Training emotion recognition models using human annotation and chatgpt. In International Conference on Computational Science, pages 365–379. Springer.
Kyle and Sung (2023) Kristopher Kyle and Hakyung Sung. 2023. An argument structure construction treebank. In Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023), pages 51–62, Washington, D.C. Association for Computational Linguistics.
Li et al. (2022) Bai Li, Zining Zhu, Guillaume Thomas, Frank Rudzicz, and Yang Xu. 2022. Neural reality of argument structure constructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7410–7423, Dublin, Ireland. Association for Computational Linguistics.
Mahowald (2023) Kyle Mahowald. 2023. A discerning several thousand judgments: Gpt-3 rates the article+ adjective+ numeral+ noun construction. arXiv preprint arXiv:2301.12564.
OpenAI (2022) OpenAI. 2022. Chatgpt: Optimizing language models for dialogue.
Pangakis et al. (2023) Nicholas Pangakis, Samuel Wolken, and Neil Fasching. 2023. Automated annotation with generative ai requires validation. arXiv preprint arXiv:2306.00176.
Peng and Zeldes (2018) Siyao Peng and Amir Zeldes. 2018. All roads lead to UD: Converting Stanford and Penn parses to English Universal Dependencies with multilayer annotations. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pages 167–177, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Ruppenhofer et al. (2016) Josef Ruppenhofer, Michael Ellsworth, Myriam Schwarzer-Petruck, Christopher R Johnson, and Jan Scheffczyk. 2016. Framenet ii: Extended theory and practice. Technical report, International Computer Science Institute.
Savelka and Ashley (2023) Jaromir Savelka and Kevin D Ashley. 2023. The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts. Frontiers in Artificial Intelligence, 6.
Shlain et al. (2020) Micah Shlain, Hillel Taub-Tabib, Shoval Sadde, and Yoav Goldberg. 2020. Syntactic search by example. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 17–23, Online. Association for Computational Linguistics.
Tayyar Madabushi et al. (2020) Harish Tayyar Madabushi, Laurence Romain, Dagmar Divjak, and Petar Milin. 2020. CxGBERT: BERT meets construction grammar. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4020–4032, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, Alexandre Frechette, Charlotte Smith, Laura Culp, Lev Proleev, Yi Luan, Xi Chen, James Lottes, Nathan Schucher, Federico Lebron, Alban Rrustemi, Natalie Clay, Phil Crone, Tomas Kocisky, Jeffrey Zhao, Bartek Perz, Dian Yu, Heidi Howard, Adam Bloniarz, Jack W. Rae, Han Lu, Laurent Sifre, Marcello Maggioni, Fred Alcober, Dan Garrette, Megan Barnes, Shantanu Thakoor, Jacob Austin, Gabriel Barth-Maron, William Wong, Rishabh Joshi, Rahma Chaabouni, Deeni Fatiha, Arun Ahuja, Ruibo Liu, Yunxuan Li, Sarah Cogan, Jeremy Chen, Chao Jia, Chenjie Gu, Qiao Zhang, Jordan Grimstad, Ale Jakse Hartman, Martin Chadwick, Gaurav Singh Tomar, Xavier Garcia, Evan Senter, Emanuel Taropa, Thanumalayan Sankaranarayana Pillai, Jacob Devlin, Michael Laskin, Diego de Las Casas, Dasha Valter, Connie Tao, Lorenzo Blanco, Adrià Puigdomènech Badia, David Reitter, Mianna Chen, Jenny Brennan, Clara Rivera, Sergey Brin, Shariq Iqbal, Gabriela Surita, Jane Labanowski, Abhi Rao, Stephanie Winkler, Emilio Parisotto, Yiming Gu, Kate Olszewska, Yu**g Zhang, Ravi Addanki, Antoine Miech, Annie Louis, Laurent El Shafey, Denis Teplyashin, Geoff Brown, Elliot Catt, Nithya Attaluri, Jan Balaguer, Jackie Xiang, Pidong Wang, Zoe Ashwood, Anton Briukhov, Albert Webson, Sanjay Ganapathy, Smit Sanghavi, Ajay Kannan, Ming-Wei Chang, Axel Stjerngren, Josip Djolonga, Yuting Sun, Ankur Bapna, Matthew Aitchison, Pedram Pejman, Henryk Michalewski, Tianhe Yu, Cindy Wang, Juliette Love, Junwhan Ahn, Dawn Bloxwich, Kehang Han, Peter Humphreys, Thibault Sellam, James Bradbury, Varun Godbole, Sina Samangooei, Bogdan Damoc, Alex Kaskasoli, Sébastien M. R. Arnold, Vijay Vasudevan, Shubham Agrawal, Jason Riesa, Dmitry Lepikhin, Richard Tanburn, Srivatsan Srinivasan, Hyeontaek Lim, Sarah Hodkinson, Pranav Shyam, Johan Ferret, Steven Hand, Ankush Garg, Tom Le Paine, Jian Li, Yujia Li, Minh Giang, Alexander Neitz, Zaheer Abbas, Sarah York, Machel Reid, Elizabeth Cole, Aakanksha Chowdhery, Dipanjan Das, Dominika Rogozińska, Vitaly Nikolaev, Pablo Sprechmann, Zachary Nado, Lukas Zilka, Flavien Prost, Luheng He, Marianne Monteiro, Gaurav Mishra, Chris Welty, Josh Newlan, Dawei Jia, Miltiadis Allamanis, Clara Huiyi Hu, Raoul de Liedekerke, Justin Gilmer, Carl Saroufim, Shruti Rijhwani, Shaobo Hou, Disha Shrivastava, Anirudh Baddepudi, Alex Goldin, Adnan Ozturel, Albin Cassirer, Yunhan Xu, Daniel Sohn, Devendra Sachan, Reinald Kim Amplayo, Craig Swanson, Dessie Petrova, Shashi Narayan, Arthur Guez, Siddhartha Brahma, Jessica Landon, Miteyan Patel, Ruizhe Zhao, Kevin Villela, Luyu Wang, Wenhao Jia, Matthew Rahtz, Mai Giménez, Legg Yeung, Hanzhao Lin, James Keeling, Petko Georgiev, Diana Mincu, Boxi Wu, Salem Haykal, Rachel Saputro, Kiran Vodrahalli, James Qin, Zeynep Cankara, Abhanshu Sharma, Nick Fernando, Will Hawkins, Behnam Neyshabur, Solomon Kim, Adrian Hutter, Priyanka Agrawal, Alex Castro-Ros, George van den Driessche, Tao Wang, Fan Yang, Shuo yiin Chang, Paul Komarek, Ross McIlroy, Mario Lučić, Guodong Zhang, Wael Farhan, Michael Sharman, Paul Natsev, Paul Michel, Yong Cheng, Yamini Bansal, Siyuan Qiao, Kris Cao, Siamak Shakeri, Christina Butterfield, Justin Chung, Paul Kishan Rubenstein, Shivani Agrawal, Arthur Mensch, Kedar Soparkar, Karel Lenc, Timothy Chung, Aedan Pope, Loren Maggiore, Jackie Kay, Priya Jhakra, Shibo Wang, Joshua Maynez, Mary Phuong, Taylor Tobin, Andrea Tacchetti, Maja Trebacz, Kevin Robinson, Yash Katariya, Sebastian Riedel, Paige Bailey, Kefan Xiao, Nimesh Ghelani, Lora Aroyo, Ambrose Slone, Neil Houlsby, Xuehan Xiong, Zhen Yang, Elena Gribovskaya, Jonas Adler, Mateo Wirth, Lisa Lee, Music Li, Thais Kagohara, Jay Pavagadhi, Sophie Bridgers, Anna Bortsova, Sanjay Ghemawat, Zafarali Ahmed, Tianqi Liu, Richard Powell, Vijay Bolina, Mariko Iinuma, Polina Zablotskaia, James Besley, Da-Woon Chung, Timothy Dozat, Ramona Comanescu, Xiance Si, Jeremy Greer, Guolong Su, Martin Polacek, Raphaël Lopez Kaufman, Simon Tokumine, Hexiang Hu, Elena Buchatskaya, Yingjie Miao, Mohamed Elhawaty, Aditya Siddhant, Nenad Tomasev, **wei Xing, Christina Greer, Helen Miller, Shereen Ashraf, Aurko Roy, Zizhao Zhang, Ada Ma, Angelos Filos, Milos Besta, Rory Blevins, Ted Klimenko, Chih-Kuan Yeh, Soravit Changpinyo, Jiaqi Mu, Oscar Chang, Mantas Pajarskas, Carrie Muir, Vered Cohen, Charline Le Lan, Krishna Haridasan, Amit Marathe, Steven Hansen, Sholto Douglas, Rajkumar Samuel, Mingqiu Wang, Sophia Austin, Chang Lan, Jiepu Jiang, Justin Chiu, Jaime Alonso Lorenzo, Lars Lowe Sjösund, Sébastien Cevey, Zach Gleicher, Thi Avrahami, Anudhyan Boral, Hansa Srinivasan, Vittorio Selo, Rhys May, Konstantinos Aisopos, Léonard Hussenot, Livio Baldini Soares, Kate Baumli, Michael B. Chang, Adrià Recasens, Ben Caine, Alexander Pritzel, Filip Pavetic, Fabio Pardo, Anita Gergely, Justin Frye, Vinay Ramasesh, Dan Horgan, Kartikeya Badola, Nora Kassner, Subhrajit Roy, Ethan Dyer, Víctor Campos, Alex Tomala, Yunhao Tang, Dalia El Badawy, Elspeth White, Basil Mustafa, Oran Lang, Abhishek **dal, Sharad Vikram, Zhitao Gong, Sergi Caelles, Ross Hemsley, Gregory Thornton, Fangxiaoyu Feng, Wojciech Stokowiec, Ce Zheng, Phoebe Thacker, Çağlar Ünlü, Zhishuai Zhang, Mohammad Saleh, James Svensson, Max Bileschi, Piyush Patil, Ankesh Anand, Roman Ring, Katerina Tsihlas, Arpi Vezer, Marco Selvi, Toby Shevlane, Mikel Rodriguez, Tom Kwiatkowski, Samira Daruki, Keran Rong, Allan Dafoe, Nicholas FitzGerald, Keren Gu-Lemberg, Mina Khan, Lisa Anne Hendricks, Marie Pellat, Vladimir Feinberg, James Cobon-Kerr, Tara Sainath, Maribeth Rauh, Sayed Hadi Hashemi, Richard Ives, Yana Hasson, YaGuang Li, Eric Noland, Yuan Cao, Nathan Byrd, Le Hou, Qingze Wang, Thibault Sottiaux, Michela Paganini, Jean-Baptiste Lespiau, Alexandre Moufarek, Samer Hassan, Kaushik Shivakumar, Joost van Amersfoort, Amol Mandhane, Pratik Joshi, Anirudh Goyal, Matthew Tung, Andrew Brock, Hannah Sheahan, Vedant Misra, Cheng Li, Nemanja Rakićević, Mostafa Dehghani, Fangyu Liu, Sid Mittal, Junhyuk Oh, Seb Noury, Eren Sezener, Fantine Huot, Matthew Lamm, Nicola De Cao, Charlie Chen, Gamaleldin Elsayed, Ed Chi, Mahdis Mahdieh, Ian Tenney, Nan Hua, Ivan Petrychenko, Patrick Kane, Dylan Scandinaro, Rishub Jain, Jonathan Uesato, Romina Datta, Adam Sadovsky, Oskar Bunyan, Dominik Rabiej, Shimu Wu, John Zhang, Gautam Vasudevan, Edouard Leurent, Mahmoud Alnahlawi, Ionut Georgescu, Nan Wei, Ivy Zheng, Betty Chan, Pam G Rabinovitch, Piotr Stanczyk, Ye Zhang, David Steiner, Subhajit Naskar, Michael Azzam, Matthew Johnson, Adam Paszke, Chung-Cheng Chiu, Jaume Sanchez Elias, Afroz Mohiuddin, Faizan Muhammad, ** Miao, Andrew Lee, Nino Vieillard, Sahitya Potluri, Jane Park, Elnaz Davoodi, Jiageng Zhang, Jeff Stanway, Drew Garmon, Abhijit Karmarkar, Zhe Dong, Jong Lee, Aviral Kumar, Luowei Zhou, Jonathan Evens, William Isaac, Zhe Chen, Johnson Jia, Anselm Levskaya, Zhenkai Zhu, Chris Gorgolewski, Peter Grabowski, Yu Mao, Alberto Magni, Kaisheng Yao, Javier Snaider, Norman Casagrande, Paul Suganthan, Evan Palmer, Geoffrey Irving, Edward Loper, Manaal Faruqui, Isha Arkatkar, Nanxin Chen, Izhak Shafran, Michael Fink, Alfonso Castaño, Irene Giannoumis, Wooyeol Kim, Mikołaj Rybiński, Ashwin Sreevatsa, Jennifer Prendki, David Soergel, Adrian Goedeckemeyer, Willi Gierke, Mohsen Jafari, Meenu Gaba, Jeremy Wiesner, Diana Gage Wright, Yawen Wei, Harsha Vashisht, Yana Kulizhskaya, Jay Hoover, Maigo Le, Lu Li, Chimezie Iwuanyanwu, Lu Liu, Kevin Ramirez, Andrey Khorlin, Albert Cui, Tian LIN, Marin Georgiev, Marcus Wu, Ricardo Aguilar, Keith Pallo, Abhishek Chakladar, Alena Repina, Xihui Wu, Tom van der Weide, Priya Ponnapalli, Caroline Kaplan, Jiri Simsa, Shuangfeng Li, Olivier Dousse, Fan Yang, Jeff Piper, Nathan Ie, Minnie Lui, Rama Pasumarthi, Nathan Lintz, Anitha Vijayakumar, Lam Nguyen Thiet, Daniel Andor, Pedro Valenzuela, Cosmin Paduraru, Daiyi Peng, Katherine Lee, Shuyuan Zhang, Somer Greene, Duc Dung Nguyen, Paula Kurylowicz, Sarmishta Velury, Sebastian Krause, Cassidy Hardin, Lucas Dixon, Lili Janzer, Kiam Choo, Ziqiang Feng, Biao Zhang, Achintya Singhal, Tejasi Latkar, Mingyang Zhang, Quoc Le, Elena Allica Abellan, Dayou Du, Dan McKinnon, Natasha Antropova, Tolga Bolukbasi, Orgad Keller, David Reid, Daniel Finchelstein, Maria Abi Raad, Remi Crocker, Peter Hawkins, Robert Dadashi, Colin Gaffney, Sid Lall, Ken Franko, Egor Filonov, Anna Bulanova, Rémi Leblond, Vikas Yadav, Shirley Chung, Harry Askham, Luis C. Cobo, Kelvin Xu, Felix Fischer, Jun Xu, Christina Sorokin, Chris Alberti, Chu-Cheng Lin, Colin Evans, Hao Zhou, Alek Dimitriev, Hannah Forbes, Dylan Banarse, Zora Tung, Jeremiah Liu, Mark Omernick, Colton Bishop, Chintu Kumar, Rachel Sterneck, Ryan Foley, Rohan Jain, Swaroop Mishra, Jiawei Xia, Taylor Bos, Geoffrey Cideron, Ehsan Amid, Francesco Piccinno, Xingyu Wang, Praseem Banzal, Petru Gurita, Hila Noga, Premal Shah, Daniel J. Mankowitz, Alex Polozov, Nate Kushman, Victoria Krakovna, Sasha Brown, MohammadHossein Bateni, Dennis Duan, Vlad Firoiu, Meghana Thotakuri, Tom Natan, Anhad Mohananey, Matthieu Geist, Sidharth Mudgal, Sertan Girgin, Hui Li, Jiayu Ye, Ofir Roval, Reiko Tojo, Michael Kwong, James Lee-Thorp, Christopher Yew, Quan Yuan, Sumit Bagri, Danila Sinopalnikov, Sabela Ramos, John Mellor, Abhishek Sharma, Aliaksei Severyn, Jonathan Lai, Kathy Wu, Heng-Tze Cheng, David Miller, Nicolas Sonnerat, Denis Vnukov, Rory Greig, Jennifer Beattie, Emily Caveness, Libin Bai, Julian Eisenschlos, Alex Korchemniy, Tomy Tsai, Mimi Jasarevic, Weize Kong, Phuong Dao, Zeyu Zheng, Frederick Liu, Fan Yang, Rui Zhu, Mark Geller, Tian Huey Teh, Jason Sanmiya, Evgeny Gladchenko, Nejc Trdin, Andrei Sozanschi, Daniel Toyama, Evan Rosen, Sasan Tavakkol, Linting Xue, Chen Elkind, Oliver Woodman, John Carpenter, George Papamakarios, Rupert Kemp, Sushant Kafle, Tanya Grunina, Rishika Sinha, Alice Talbert, Abhimanyu Goyal, Diane Wu, Denese Owusu-Afriyie, Cosmo Du, Chloe Thornton, Jordi Pont-Tuset, Pradyumna Narayana, **g Li, Sabaer Fatehi, John Wieting, Omar Ajmeri, Benigno Uria, Tao Zhu, Yeongil Ko, Laura Knight, Amélie Héliou, Ning Niu, Shane Gu, Chenxi Pang, Dustin Tran, Yeqing Li, Nir Levine, Ariel Stolovich, Norbert Kalb, Rebeca Santamaria-Fernandez, Sonam Goenka, Wenny Yustalim, Robin Strudel, Ali Elqursh, Balaji Lakshminarayanan, Charlie Deck, Shyam Upadhyay, Hyo Lee, Mike Dusenberry, Zonglin Li, Xuezhi Wang, Kyle Levin, Raphael Hoffmann, Dan Holtmann-Rice, Olivier Bachem, Summer Yue, Sho Arora, Eric Malmi, Daniil Mirylenka, Qijun Tan, Christy Koh, Soheil Hassas Yeganeh, Siim Põder, Steven Zheng, Francesco Pongetti, Mukarram Tariq, Yanhua Sun, Lucian Ionita, Mojtaba Seyedhosseini, Pouya Tafti, Ragha Kotikalapudi, Zhiyu Liu, Anmol Gulati, Jasmine Liu, Xinyu Ye, Bart Chrzaszcz, Lily Wang, Nikhil Sethi, Tianrun Li, Ben Brown, Shreya Singh, Wei Fan, Aaron Parisi, Joe Stanton, Chenkai Kuang, Vinod Koverkathu, Christopher A. Choquette-Choo, Yunjie Li, TJ Lu, Abe Ittycheriah, Prakash Shroff, Pei Sun, Mani Varadarajan, Sanaz Bahargam, Rob Willoughby, David Gaddy, Ishita Dasgupta, Guillaume Desjardins, Marco Cornero, Brona Robenek, Bhavishya Mittal, Ben Albrecht, Ashish Shenoy, Fedor Moiseev, Henrik Jacobsson, Alireza Ghaffarkhah, Morgane Rivière, Alanna Walton, Clément Crepy, Alicia Parrish, Yuan Liu, Zongwei Zhou, Clement Farabet, Carey Radebaugh, Praveen Srinivasan, Claudia van der Salm, Andreas Fidjeland, Salvatore Scellato, Eri Latorre-Chimoto, Hanna Klimczak-Plucińska, David Bridson, Dario de Cesare, Tom Hudson, Piermaria Mendolicchio, Lexi Walker, Alex Morris, Ivo Penchev, Matthew Mauger, Alexey Guseynov, Alison Reid, Seth Odoom, Lucia Loher, Victor Cotruta, Madhavi Yenugula, Dominik Grewe, Anastasia Petrushkina, Tom Duerig, Antonio Sanchez, Steve Yadlowsky, Amy Shen, Amir Globerson, Adam Kurzrok, Lynette Webb, Sahil Dua, Dong Li, Preethi Lahoti, Surya Bhupatiraju, Dan Hurt, Haroon Qureshi, Ananth Agarwal, Tomer Shani, Matan Eyal, Anuj Khare, Shreyas Rammohan Belle, Lei Wang, Chetan Tekur, Mihir Sanjay Kale, **liang Wei, Ruoxin Sang, Brennan Saeta, Tyler Liechty, Yi Sun, Yao Zhao, Stephan Lee, Pandu Nayak, Doug Fritz, Manish Reddy Vuyyuru, John Aslanides, Nidhi Vyas, Martin Wicke, Xiao Ma, Taylan Bilal, Evgenii Eltyshev, Daniel Balle, Nina Martin, Hardie Cate, James Manyika, Keyvan Amiri, Yelin Kim, Xi Xiong, Kai Kang, Florian Luisier, Nilesh Tripuraneni, David Madras, Mandy Guo, Austin Waters, Oliver Wang, Joshua Ainslie, Jason Baldridge, Han Zhang, Garima Pruthi, Jakob Bauer, Feng Yang, Riham Mansour, Jason Gelman, Yang Xu, George Polovets, Ji Liu, Honglong Cai, Warren Chen, XiangHai Sheng, Emily Xue, Sherjil Ozair, Adams Yu, Christof Angermueller, Xiaowei Li, Weiren Wang, Julia Wiesinger, Emmanouil Koukoumidis, Yuan Tian, Anand Iyer, Madhu Gurumurthy, Mark Goldenson, Parashar Shah, MK Blake, Hongkun Yu, Anthony Urbanowicz, Jennimaria Palomaki, Chrisantha Fernando, Kevin Brooks, Ken Durden, Harsh Mehta, Nikola Momchev, Elahe Rahimtoroghi, Maria Georgaki, Amit Raul, Sebastian Ruder, Morgan Redshaw, **hyuk Lee, Komal Jalan, Dinghua Li, Ginger Perng, Blake Hechtman, Parker Schuh, Milad Nasr, Mia Chen, Kieran Milan, Vladimir Mikulik, Trevor Strohman, Juliana Franco, Tim Green, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, and Oriol Vinyals. 2023. Gemini: A family of highly capable multimodal models.
Torrent et al. (2023) Tiago Timponi Torrent, Thomas Hoffmann, Arthur Lorenzi Almeida, and Mark Turner. 2023. Copilots for Linguists: AI, Constructions, and Frames. Cambridge University Press.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.
Tseng et al. (2022) Yu-Hsiang Tseng, Cing-Fang Shih, Pin-Er Chen, Hsin-Yu Chou, Mao-Chang Ku, and Shu-Kai Hsieh. 2022. CxLM: A construction and context-aware language model. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6361–6369, Marseille, France. European Language Resources Association.
Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
Weissweiler et al. (2022) Leonie Weissweiler, Valentin Hofmann, Abdullatif Köksal, and Hinrich Schütze. 2022. The better your syntax, the better your semantics? probing pretrained language models for the English comparative correlative. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10859–10882, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Yu et al. (2023) Danni Yu, Luyang Li, Hang Su, and Matteo Fuoli. 2023. Assessing the potential of llm-assisted annotation for corpus-based pragmatics and discourse analysis: The case of apologies. International Journal of Corpus Linguistics.

Appendix A Detailed Graphs of Prompt and Total Costs

We include a visual representation of the cost of each prompt dependent on the human annotation cost per sentence. All prompts are displayed in Figure 3. In Figure 4, we remove most of the worse-performing prompts and show the different slopes and intercept points. In Figure 5, we highlight the points in the human annotation cost where the optimal prompt changes.

Appendix B Full Statistics on each prompt

Prompt	Prec.	Rec.	F1	HR $	GPT $ in ct
1	41.84	57.89	48.58	2.389	0.01
2	46.70	69.69	55.92	2.141	0.01
3	55.13	76.69	64.15	1.813	0.04
4	55.61	78.19	65.00	1.798	0.04
5	56.54	71.42	63.12	1.768	0.04
6	52.28	77.44	62.42	1.912	0.06
7	53.53	79.69	64.04	1.867	0.03
8	43.93	87.21	58.43	2.275	0.07
9	34.63	93.23	50.50	2.887	0.04
10	76.62	44.36	56.19	1.305	0.13
11	76.54	46.61	57.94	1.306	0.12
12	83.75	50.37	62.91	1.194	0.14
13	83.33	52.63	64.51	1.200	0.17
14	78.35	57.14	66.08	1.276	0.44
15	78.02	53.38	63.39	1.281	0.18
16	85.50	44.36	58.41	1.169	0.55
17	90.09	75.18	81.96	1.110	1.62
18	91.81	75.93	83.12	1.089	4.83

Table 4: Full Evaluation Results for Every Prompt

In Table 4, we report the sequence of prompts that we tried and their performance. Details of all prompts and the changes from the previous prompt are available in Appendix Section C.

Appendix C Full details for each prompt

We report in Tables 5 to 19 the details of the prompt, along with the change that it represents from a previous prompt.

System Prompt	-
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not.
Input Format	Here are 5 positive examples: id, sentence, label Here are 5 negative examples: id, sentence, label. Classify the following sentences: id, sentence
Output Format	Reply with a csv codeblock (wrapped in three backticks), with the headers ’id’ and ’label’. label should be either True or False. Label all 50 sentences.
Shots per Class	5
Sentences	50
Model	GPT-3.5
Majority Vote	No

Table 5: Prompt 1

System Prompt	-
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 5 positive examples: id, sentence, label Here are 5 negative examples: id, sentence, label. Classify the following sentences: id, sentence
Output Format	Reply with a csv codeblock (wrapped in three backticks), with the headers ’id’ and ’label’. label should be either True or False. Label all 50 sentences.
Shots per Class	5
Sentences	50
Model	GPT-3.5
Majority Vote	No
Change	Prompt 1 with longer instruction

Table 6: Prompt 2.

System Prompt	-
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 5 positive examples: { "id": id,"sentence": sentence, "label": label }. Here are 5 negative examples: { "id": id,"sentence": sentence, "label": label }. Classify the following sentences: { "id": id,"sentence": sentence }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "sentence", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	50
Model	GPT-3.5
Majority Vote	No
Change	Prompt 2 with input and output format changed to JSON

Table 7: Prompt 3

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 5 positive examples: { "id": id,"sentence": sentence, "label": label }. Here are 5 negative examples: { "id": id,"sentence": sentence, "label": label }. Classify the following sentences: { "id": id,"sentence": sentence }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "sentence", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	50
Model	GPT-3.5
Majority Vote	No
Change	Prompt 3 with added system prompt

Table 8: Prompt 4

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 10 examples with ground truth labels: : { "id": id,"sentence": sentence, "label": label }. Classify the following sentences: { "id": id,"sentence": sentence }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "sentence", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	50
Model	GPT-3.5
Majority Vote	No
Change	Prompt 4 with few shots alternating between True and False labels.

Table 9: Prompt 5

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 10 examples with ground truth labels: : { "id": id, "verb" : verb, "direct object": direct object, "preposition" : preposition, "prepositional object" : prepositional object, "label": label }. Classify the following sentences: { "id": id, "verb" : verb, "direct object": direct object, "preposition" : preposition, "prepositional object" : prepositional object }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "verb", "direct object", "preposition", "prepositional object", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	50
Model	GPT-3.5
Majority Vote	No
Change	Prompt 5 with the sentences replaced by the information extracted with the dependency filtering step

Table 10: Prompt 6

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 10 examples with ground truth labels: : { "id": id, "string" : verb direct object preposition prepositional object, "label": label }. Classify the following sentences: { "id": id, "string" : verb direct object preposition prepositional object }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", a "string", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	50
Model	GPT-3.5
Majority Vote	No
Change	Prompt 5 with the sentences replaced by the substring from the verb to the prepositional object

Table 11: Prompt 7

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 10 examples with ground truth labels: : { "id": id, "sentence" : sentence, "verb" : verb, "direct object": direct object, "preposition" : preposition, "prepositional object" : prepositional object, "label": label }. Classify the following sentences: { "id": id, "sentence" : sentence, "verb" : verb, "direct object": direct object, "preposition" : preposition, "prepositional object" : prepositional object }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "sentence", "verb", "direct object", "preposition", "prepositional object", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	50
Model	GPT-3.5
Majority Vote	No
Change	Prompt 6 with the full sentence added back in

Table 12: Prompt 8

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 10 examples with ground truth labels: : { "id": id, "sentence": sentence, "string" : verb direct object preposition prepositional object, "label": label }. Classify the following sentences: { "id": id, "sentence": sentence, "string" : verb direct object preposition prepositional object }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", a "sentence", a "string", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	50
Model	GPT-3.5
Majority Vote	No
Change	Prompt 7 with the full sentence added back in

Table 13: Prompt 9

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 10 examples with examples with explanations and ground truth labels: : { "id": id,"sentence": sentence, "explanation": explanation, "label": label }. Classify the following sentences: { "id": id,"sentence": sentence }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "sentence", "explanation", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	50
Model	GPT-3.5
Majority Vote	No
Change	Prompt 5 with explanations for the labels added to input and required from the model for the output

Table 14: Prompt 10

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 10 examples with examples with explanations and ground truth labels: : { "id": id,"sentence": sentence, "explanation": explanation, "label": label }. Classify the following sentences: { "id": id,"sentence": sentence }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "sentence", "explanation", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	50 $\rightarrow$ 25 $\rightarrow$ 10 $\rightarrow$ 5 $\rightarrow$ 1
Model	GPT-3.5
Majority Vote	No
Change	Prompt 10 with different numbers of sentences classified per prompt

Table 15: Prompts 11-14

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 20 examples with examples with explanations and ground truth labels: : { "id": id,"sentence": sentence, "explanation": explanation, "label": label }. Classify the following sentences: { "id": id,"sentence": sentence }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "sentence", "explanation", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	10
Sentences	10
Model	GPT-3.5
Majority Vote	No
Change	Prompt 12 with the number of shots per class doubled from 5 to 10

Table 16: Prompt 15

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 10 examples with examples with explanations and ground truth labels: : { "id": id,"sentence": sentence, "explanation": explanation, "label": label }. Classify the following sentences: { "id": id,"sentence": sentence }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "sentence", "explanation", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	10
Model	GPT-3.5
Majority Vote	Yes
Change	Prompt 12 with a majority vote from three separate runs

Table 17: Prompt 16

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 10 examples with examples with explanations and ground truth labels: : { "id": id,"sentence": sentence, "explanation": explanation, "label": label }. Classify the following sentences: { "id": id,"sentence": sentence }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "sentence", "explanation", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	10
Model	GPT-4
Majority Vote	No
Change	Prompt 12 using GPT-4

Table 18: Prompt 17

System Prompt	You are a linguistic expert specializing in syntax, specifically the caused-motion construction in English sentences. Your task is to analyze given sentences and classify whether they exhibit this construction or not. Remember to carefully consider the structure and meaning of each sentence to make the most accurate determination.
Instruction	The task is to classify whether the sentences are instances of the caused motion construction as first introduced by Goldberg (1992) or not. A caused-motion construction is a linguistic phenomenon where a verb describes an action that results in a change of location or motion for a specific object. Your task will be to understand what is going on in the sentence and determine if the verb describes an action that results in a change of location or motion for a specific object. Keep in mind that the caused-motion construction is rare, and label the sentences accordingly.
Input Format	Here are 10 examples with examples with explanations and ground truth labels: : { "id": id,"sentence": sentence, "explanation": explanation, "label": label }. Classify the following sentences: { "id": id,"sentence": sentence }.
Output Format	Respond with a jsonl codeblock (wrapped in three backticks). Each object should include an "id", "sentence", "explanation", and finally a "label" field with either "true" or "false". Label all 50 sentences.
Shots per Class	5
Sentences	10
Model	GPT-4
Majority Vote	Yes
Change	Prompt 16 using GPT-4

Table 19: Prompt 18

Appendix D Few Shots

In Table 20, we give the five shots from each class given to ChatGPT as examples.

Sentence	Verb	Dir Obj	Prep	P-Obj	Lab.	Explanation
Sam sneezed the napkin off the table .	sneeze	napkin	off	table	True	This is caused-motion, because Sam sneezing is causing the napkin to move off the table.
Joey grated the cheese onto a serving plate .	grate	cheese	onto	plate	True	This is caused-motion, because the grating is causing the cheese to move onto the plate.
Sam assisted her out of the room .	assist	she	out of	room	True	This is caused-motion, because Sam assisting is causing her to move out of the room.
He nudged the golf ball into the hole .	nudge	ball	into	hole	True	This is caused-motion, because him nudging the ball is causing it to move into the hole.
Frank squeezed the ball through the crack .	squeeze	ball	through	crack	True	This is caused-motion, because Frank is moving the ball through the whole by squeezing it.
The hammer broke the vase into pieces.	break	vase	into	piece	False	This is not caused-motion, because the vase is changing its state into pieces, the pieces are not a destination.
Christy blew Sam under the table .	instruct	he	into	room	False	This is not caused-motion, because you are not moving under the table because Christy is blowing, the blowing action is taking place under the table.
Adele raised her eyebrows at Sam .	raise	eyebrow	at	I	False	This is not caused-motion, because while Adele is moving her eyebrows, they are not literally moving towards Sam.
They separated people into groups .	separate	people	into	group	False	This is not caused-motion, because the people aren’t moving towards groups, they are becoming the groups.
His cane helped him into the car .	help	he	into	car	False	This is not caused-motion because the cane isn’t causing the motion, it is being used as a tool to assist with the motion.

Table 20: Few Shots used in all prompts with the structured information and explanations that are used in some prompts. P-Obj stands for Prepositional Object, Dir Obj for Direct Object.