Can Language Models Recognize Convincing Arguments?

Paula Rescala,¹ Manoel Horta Ribeiro,¹ Tiancheng Hu,² Robert West¹
¹EPFL, ²University of Cambridge
Corresponding author: [email protected]

Abstract

The remarkable and ever-increasing capabilities of Large Language Models (LLMs) have raised concerns about their potential misuse for creating personalized, convincing misinformation and propaganda. To gain insights into LLMs’ persuasive capabilities without directly engaging in experimentation with humans, we propose studying their performance on the related task of detecting convincing arguments. We extend a dataset by Durmus & Cardie (2018) with debates, votes, and user traits and propose tasks measuring LLMs’ ability to (1) distinguish between strong and weak arguments, (2) predict stances based on beliefs and demographic characteristics, and (3) determine the appeal of an argument to an individual based on their traits. We show that LLMs perform on par with humans in these tasks and that combining predictions from different LLMs yields significant performance gains, even surpassing human performance. The data and code released with this paper contribute to the crucial ongoing effort of continuously evaluating and monitoring the rapidly evolving capabilities and potential impact of LLMs.

1 Introduction

As LLMs rise in capacity and popularity, so has the concern that they may enable the creation of tailor-made, convincing arguments capable of pushing false or misleading narratives online (De Angelis et al., 2023; Buchanan et al., 2021). While the dissemination of tailor-made misinformation predates the rise of LLMs (DiResta et al., 2019), the increasing accessibility and capability of state-of-the-art models such as GPT-4 and Claude could exacerbate this issue by enabling the creation of highly personalized misleading content with unprecedented ease (Bommasani et al., 2021; Goldstein et al., 2023).

Closely related to the task of generating persuasive content is the task of detecting content that would be particularly persuasive to individuals with specific demographics or beliefs. Malicious actors might not need to create content themselves; instead, they may promote existing content to specific users (Broniatowski et al., 2018). Furthermore, both tasks require that the underlying (large language) model encodes correlations between demographics and beliefs (e.g., U.S. Democrats tend to favor legalizing abortion), as well as the correlation between beliefs (e.g., anti-abortion individuals are more likely to be against gun control). Empirical work suggests that generation is easier than detection for LLMs (West et al., 2023), yet the latter task is better suited to continuously benchmark models’ persuasiveness capacities. Assessing models’ capacity to generate arguments requires continuous human experimentation as LLMs evolve, which can be time-consuming and resource-intensive. On the contrary, measuring a model’s capacity to detect content persuasive to specific demographics can be done quickly and without interaction with human subjects, making it a more efficient approach for benchmarking the persuasive capabilities of LLMs over time.

Refer to caption — Figure 1: Our approach to study LLMs’ persuasiveness capabilities. We measure to which extent LLMs can reproduce human judgments on the quality and persuasiveness of arguments. If LLMs can predict users’ positions on stances (e.g., The death penalty should be legal) before and after reading a debate, as well as judge who was the better debater, we argue it would be well suited to generate personalized misinformation and propaganda.

Present Work. We study whether LLMs can detect content that would be persuasive to individuals with specific demographics or beliefs. We center our investigation around three research questions. Namely, can LLMs $\ldots$

RQ1: judge the quality of arguments and identify convincing arguments as well as humans?

RQ2: judge how demographics and beliefs influence people’s stances on specific topics?

RQ3: determine how arguments appeal to individuals depending on their demographics?

Answering these questions can help assess the threat of LLM-generated misinformation and propaganda. If LLMs can detect good arguments (RQ1), determine the correlation between demographics and previously stated beliefs with people’s stances on new specific topics (RQ2), and determine whether an argument will convince specific individuals (RQ3), they are likely better at generating misinformation and propaganda.

To investigate these research questions, we extended a dataset collected by Durmus & Cardie (2018) from a defunct debate platform (debate.org). We annotate 833 politics-related debates with clear propositions, such as “The electoral college should remain unchanged.” Each debate contains arguments for (“Pro”) and against (“Con”) the proposition, along with votes from debate.org participants indicating the winning side. Importantly, the dataset includes demographic information of the voters as well as their stances on 48 so-called “big issues”. For 121 debates with 751 votes on three of the most prominent topics in the dataset, we obtained crowdsourced labels to compare the capabilities of LLMs to those of humans. Then, Using this enriched dataset, we evaluate the performance of four LLMs (GPT-3.5, GPT-4, Llama 2, and Mistral 7B) on three tasks: 1) identifying the side with more convincing arguments (RQ1); 2) predicting individuals’ stances on specific propositions before the debate, given their demographic and basic belief information (RQ2); and 3) predicting individuals’ stances on specific propositions after the debate, given their demographic and basic belief information (RQ3). Figure 1 illustrates our overall approach.

Our key finding is that LLM exhibits human-like performance across the three proposed tasks. In judging the better debater (RQ1), GPT-4 (Accuracy: 60.50%; RQ1) is as good as an individual voter in the dataset (Accuracy: 60.69%; RQ1). When predicting users’ stances on specific issues before and after reading the debate (RQ2 and RQ3), LLMs again perform similarly to humans. For instance, in the before the debate scenario (RQ2), Mistral yields an accuracy of 42.27%, whereas crowdworkers achieve 39.86% (random guessing would yield 33.3% accuracy). However, zero-shot prediction with LLMs still underperforms a supervised machine learning model [XGBoost (Chen & Guestrin, 2016)], which achieves 58.25% accuracy in cross-validation. Nevertheless, stacking the predictions of LLMs and using them as features in a supervised learning setting reduces the performance gap (45.91%).

Overall, our work contributes to the growing body of research on the societal impact of LLMs (Bommasani et al., 2021; Solaiman et al., 2023; Weidinger et al., 2023, inter alia). By investigating LLMs’ ability to detect persuasive content tailored to specific demographics, we shed light on their potential for generating targeted misinformation and propaganda.

2 Related Work

We review related work in three broad directions closely related to the task.

Demographics, beliefs, and persuasion.

Demographics have long been known to impact people’s political beliefs and attitudes. Group-level demographic factors such as race, religion, and education shape individuals’ perspectives on various political issues and voting behavior in the U.S. context (Campbell et al., 1960; Erikson & Tedin, 2019). For example, 78% of Black, 72% of Asian, and 65% of Hispanic workers see efforts on increasing diversity, equity, and inclusion at work positively, compared to 47% of White workers (Minkin, 2023). Similarly, previous work indicates that persuasion depends on the message recipients’ existing values and that individual differences can influence persuasion (O’Keefe, 2015). For instance, Hirsh et al. (2012) showed that tailoring messages to different personality traits can make them more persuasive; Orji et al. (2015) showed that men and women differ significantly in their responsiveness to the different persuasive strategies. And, closer to the work at hand, Durmus & Cardie (2018) and Al Khatib et al. (2020) have shown that modeling or considering debate characteristics can enhance the prediction of argument persuasiveness. In this work, we examine how LLMs can capture the correlations between demographics and beliefs (RQ2) and how personal traits determine the persuasiveness of arguments (RQ3).

Argument quality.

Defining argument quality is no easy task, or as persuasion scholars O’Keefe & Jackson (1995) put it: “there is no clear general abstract characterization of what constitutes argument quality.” For example, an argument may be good because of its effectiveness, i.e., in practice, it convinces people (O’Keefe & Jackson, 1995), or because its congency, i.e., that the argument is composed of individually accepted premises that allow its conclusion to be reached (Johnson & Blair, 2006), or even because it is reasonable, i.e., the argument and the way it is put is acceptable contributes to resolving the disagreement (Walton, 2005). Over the last decades, there’s been excitement in the NLP and artificial intelligence communities about automatically extracting arguments from text (Lawrence & Reed, 2020; Cabrio & Villata, 2018) (or even multimodal data, see Liu et al. (2023)) and using computational methods to better understand what makes arguments effective (Wachsmuth et al., 2017; Tan et al., 2016). In this work, we examine to which extent LLMs can capture which debates were considered to have better arguments (RQ1) and to which extent arguments were effective across different demographics (RQ3).

Personalized misinformation and propaganda.

Microtargeting or “personalized persuasion” refers to tailoring the language or content of messages to individuals based on their characteristics (e.g., demographics and prior political beliefs) to make them maximally persuasive. Evidence on the effect sizes of microtargeting is mixed (Guess & Coppock, 2020; Coppock et al., 2020; Matz et al., 2017; Tappin et al., 2023), which has led Teeny et al. (2021) to propose that research on microtargeting should move from “does microtargeting work?”, to “when micro-targeting works?” At the same time, the increasing popularity and capabilities of large language models (LLMs) have raised concerns that they may make microtargeting cheaper and better and that they may enable new ways to “microtarget” misinformation and propaganda, e.g., personalized chatbots – see Goldstein et al. (2023) for a comprehensive discussion. This concern has been corroborated by new work suggesting that LLMs are capable of generating messages perceived as equally or more persuasive than humans (Bai et al., 2023) that they can personalize messages to make them more persuasive (Simchon et al., 2024) and that LLMs can successfully persuade humans in debates by exploiting their personal traits Salvi et al. (2024). This work is primarily motivated by the theme of microtargeting, as our goal is to continuously benchmark LLMs capabilities of tailor messages so that they are persuasive to different individuals (RQ3).

3 Data

Data for this study was collected by Durmus & Cardie (2018) from an online debate platform (debate.org; no longer operational). The platform allowed users to participate in and vote on debates covering a breadth of topics, including politics, religion, and science. Each debate within the dataset consists of multiple rounds, each round with an argument from both the “Pro” and “Con” perspectives. Users on the platform could vote on various aspects of the debate, such as which side they believed provided a more convincing argument. The raw dataset contains 78,376 debates, 45,348 users, and 195,724 votes. Each user has corresponding demographic information, such as gender and age, as well as their stances on 48 so-called “big issues,” such as abortion, capital punishment, and national health care. Nevertheless, most demographic data is missing from the dataset. Most important to the work at hand, voters had to indicate which side: 1) Made more convincing arguments; 2) They agreed with before the debate; 3) They agreed with after the debate. We measure LLMs’ capacity to recognize convincing arguments by predicting the responses to these three questions, each of which could be answered “Pro,” “Con,” or “Tie.” Note that predicting question #1–#3 corresponds to our research questions RQ1–RQ3.

Although each debate in the dataset has a corresponding title indicative of its content, these titles are user-defined and do not always take the form of a proposition. As a result, it is not always clear from reading the debate title alone what the “Pro” and “Con” stances are. Hence, we contribute clear, manually written propositions for 833 debates that (1) were categorized under “Politics,” (2) contained at least 300 total tokens (tokens are counted using the tiktoken library with the GPT-3.5-turbo model encoding), (3) contained at least two complete rounds, (4) The debater who spoke the most in the debate did not speak more than 25% more than the other debater, (5) the debate had at least three votes. We discarded an additional 199 debates that fulfilled the aforementioned criteria but were troll debates (e.g., just profanity toward the other debater), incorrectly categorized as Politics, or impossible to paraphrase into a proposition (see Table 1 for examples).

Original Title	Hand-written Proposition
A Debate On The Electoral College	The electoral college should remain unchanged.
Gay Marriage Should Be Legal	Gay marriage and equal rights.

Table 1: Examples of titles used for debates in the dataset and the corresponding manually written propositions we created to replace them.

PoliProp [PP]. We study these 833 annotated debates, considering all votes ( $n=4$ , $871$ ) in these debates for users with no more than five missing values in demographic information ( $4$ , $871$ out of $7{,}797$ ). We also trimmed each debate in the dataset larger than the smallest context window (4096 tokens) among LLMs considered. Trimming is done by removing one round at a time from the end of the debate until the token count is small enough, an approach that equally penalizes both debaters (unlike simply removing tokens at the end of the debate). Hereafter, we call this dataset the PoliProp dataset.

PoliIssues [IS]. We also separately consider all debates on abortion ( $n=50$ ), gay marriage ( $n=51$ ), and capital punishment ( $n=31$ ), the most prominent topics in the dataset. Given that debates within the three themes are similar, we use this data to compare LLM performances with traditional machine learning methods, predicting participants’ votes using their demographic and stances on big issues as features. To obtain a human baseline, we collect crowdsourced labels using Amazon Mechanical Turk (MTurk) for each of the 751 votes cast on these 121 debates. Crowdworkers are essentially presented with the same questions as the LLMs. Given a debate, we ask who gave the better arguments. Given a set of characteristics by a voter as well as the debate, we ask whether the voter would have agreed with the proposition before and after reading the debate. Hereafter, we call this dataset the PoliIssues dataset. We provide additional information on the crowdsourcing methodology in Appendix B.

4 Methods

LLMs considered.

For this study, we compare the performance of two open-source LLMs, namely Mistral 7B (Mistral-7b-Instruct-v0.1) and Meta’s Llama 2 70B (Llama-2-70b-chat), with OpenAI’s closed-source GPT-3.5 (gpt-3.5-turbo-1106) and GPT-4 (gpt-4-0613).

Prompting.

We follow Staab et al. (2023) to develop our prompt: each had a system role, context, question, and constraint. We experimented with different structures and found that, overall, the structure mattered little as long as the wording was clear and concise. Since we had three research questions to answer, we had three prompt structures that combined the debate proposition, the debate itself, and user demographics. We show the prompt structure used for RQ1 in Figure 2. All prompts indicated that LLMs should respond only with the labels “Pro,” “Con,” or “Tie”. Nevertheless, many of the models failed to adhere to this instruction, necessitating post-processing to extract the actual answer. Generally, instances of incorrect responses involved the answer accompanied by additional spaces or punctuation or presented in a complete sentence format, such as: “Based on the given demographics, the person is most likely to agree with the ‘Con’ side in the debate.” We used heuristics to extract the responses in these cases. Additionally, there were occasions when the LLMs failed to produce any answer, resulting in responses akin to: “I cannot determine the person’s position in the debate without additional information.” We depict the remaining prompts and provide details in Appendix A.

Evaluation.

We evaluate the accuracy of language models by comparing the answers they provide with the ground truth data from PoliProp and PoliIssues. We obtain confidence intervals through bootstrap**. Besides considering each LLM individually, we also consider the performance of a “stacked” LLM, obtained by using the output of several large language models (with slightly different instantiations) as features in a supervised machine learning model.

Baselines/Benchmarks

We interpret the LLM accuracies by establishing the following baselines and benchmarks as metrics of comparison:

•

Random; (RQ1—RQ3) Since there are three possible stances (Pro, Con, and Tie) for any given task and each debate and voter pair, the random baseline has an accuracy of 33.3%.
•

Majority; (RQ1) For RQ1, the ground truth was established by aggregating the votes for who made more convincing arguments in each debate through a simple majority vote. The Majority benchmark is the percentage of users in our dataset that agreed with the computed ground truth for this question.
•

MTurk; (RQ2–RQ3) We crowdsourced the tasks for each research question for the PoliIssues dataset, obtaining a human equivalent answer to the questions we asked the LLMs. These are detailed in Appendix B.
•

XGBoost; (RQ2) For each issue (abortion, gay marriage, capital punishment) in the PoliIssues dataset, we train a Gradient Boosting classifier to predict the stance of a user as in RQ2. We train one model separately per issue since labels are not equivalent (e.g., Pro-abortion differs from Pro-capital punishment), but we report the aggregated accuracy.

#	Question	Model	Dataset	Accuracy (%)	95% Confidence Interval
1	RQ1	Llama 2	PP	24.91	(20.65, 26.53)
2		Mistral 7B	PP	37.69	(32.89, 39.26)
3		GPT-3.5	PP	42.74	(39.38, 46.1)
4		GPT-4	PP	60.50	(57.26, 63.87)
5		Stacked	PP	61.94	(58.54, 65.34)
6		Majority	PP	60.69	(59.56, 61.79)
7		Random	PP	33.33	—
8	RQ2	Llama 2	IS	41.56	(38.16, 45.1)
9		Mistral 7B	IS	41.39	(38.4, 44.86)
10		GPT-3.5	IS	41.73	(38.52, 44.98)
11		GPT-4	IS	42.82	(39.59, 46.53)
12		MTurk	IS	39.32	(35.89, 42.88)
13		Stacked	IS	45.91	(40.02, 51.81)
14		XGBoost	IS	55.34	(50.45, 60.22)
15		Random	IS	33.33	—
16	RQ3	Llama 2	IS	41.24	(37.2, 44.02)
17		Mistral 7B	IS	42.28	(32.06, 38.28)
18		GPT-3.5	IS	38.97	(35.41, 42.11)
19		GPT-4	IS	44.38	(40.91, 47.73)
20		MTurk	IS	39.86	(36.44, 43.42)
21		Stacked	IS	46.86	(41.17, 52.55)
22		Random	IS	33.33	—

Table 2: Key results for RQ1–RQ3. We show that LLMs perform on par with humans across various tasks related to recognizing convincing arguments. When stacked using a logistic regression, LLMs outperform humans in predicting stances on prepositions before and after the debate (RQ2, RQ3). The random baseline has accuracy of 33.33% for all settings.

5 Results

Judging argument quality (RQ1).

Considering the PoliProp dataset [PP], we summarize the accuracy of the different LLMs and baseline methods in determining argument quality in Table 2 (rows #1–#7). We find a substantial performance gap between GPT-4 (60.50% accuracy) and the other models, e.g., Llama 2, which performs worse than random guessing (24.91%). GPT-4 performance is similar to human performance, as measured by the agreement of any individual vote with the remaining votes in each debate (Majority; 60.69%).

Correlating beliefs and demographic characteristics with stances (RQ2).

Considering the PoliIssues dataset [IS], we summarize the accuracy of the different LLMs and baseline methods in correlating beliefs and demographic characteristics with stances on Table 2 (rows #8–#15). Here the accuracy range of different LLMs is much more narrow, ranging from 41.39% (Mistral) to 42.82% (GPT-4). Most important however, is that the performance of LLMs is similar to that of crowdworkers (39.32%; MTurk).

Recognizing convincing arguments (RQ3).

Again, considering the PoliIssues dataset [IS], we summarize the accuracy of the different methods in recognizing users’ opinion after reading the debate on Table 2 (rows #16–#22). Different models perform similarly on the task and similar to crowdworkers, e.g., GPT-4: 44.38% of vs. crowdworkers: 39.86%.

LLMs vs. supervised learning.

Considering RQ2, we train a Gradient Boosting classifier to predict stances given user traits (row #14). We run a 20-fold cross-validation and report the mean accuracy. This model performs significantly better than LLMs at predicting stances (Accuracy: 58.25%; 95% CI: [54.02, 62.47]).

Model	Big Issues	Reasoning	Accuracy (%)	95% CI
Llama	False	False	41.30	(37.92, 44.38)
	False	True	40.05	(36.48, 43.3)
	True	False	38.92	(34.81, 41.51)
	True	True	37.38	(29.07, 35.17)
Mistral 7B	False	False	40.67	(37.2, 44.02)
	False	True	41.83	(38.04, 44.98)
	True	False	40.60	(36.96, 44.14)
	True	True	40.91	(36.6, 43.18)
GPT-3.5	False	False	42.94	(39.83, 46.17)
	False	True	39.45	(36.0, 42.58)
	True	False	41.80	(38.28, 45.1)
	True	True	37.80	(34.57, 41.03)
GPT-4	False	False	42.70	(39.47, 46.17)
	False	True	43.30	(39.95, 46.65)
	True	False	42.46	(39.11, 45.93)
	True	True	45.03	(41.51, 48.09)

Table 3: We repeat the analysis to answer RQ2 using the PoliIssues dataset but varying the prompt, either by considering big issues in the prompt (Big Issues) or by asking the LLM to reason before answering the question (Reasoning). The scenario without ‘Big issues’ or ‘Reasoning’ corresponds to lines #8–10 in Table 2,

Sensitivity to prompt.

We study whether the results obtained were sensitive to prompt used by re-running the analysis from RQ2 on the PoliIssues dataset. For each model, we rerun the analysis considering the “big issues” in user profiles and/or asking for models to reason before answering.¹¹1The prompt constraint was changed to Evaluate step-by-step the data given in the proposition before coming to an answer. Provide your reasoning for selecting an answer and then give your answer in the form of ‘Pro,’ ‘Con,’ or ‘Tie’ without using other words or punctuation. Provide your response in the following format: ‘Reasoning: your reasoning goes here. Answer: your answer goes here.’ Results are shown in Table 3. Overall, we find that the results are not sensitive to the experimented changes.

Stacking LLMs.

While the performance of language models is similar in RQ2, we find that their inter-annotator agreement is quite low (Cohen’s $\kappa$ is smaller than 0.2 for most pairs of models, see Figure 3). This is surprising since, upon our inspection of the reasoning different LLMSs’ provided for their answers, all made similar assumptions. Nevertheless, each model seems to perform well on a different subset of the debates. This motivated us to experiment with stacking LLMs, i.e., using the outputs of the different large language models outlined above as input to a simple logistic regression model. We find this strategy yields a small boost in accuracy in RQ1 (see row #5; Table 2), but a substantial one in RQ2 and RQ3 (see rows #13 and #21). Indeed, in this scenario, the accuracy is significantly better than crowdworkers for both research questions ( $p<0.05$ ). Note that the accuracy reported for the stacked model is the average of a 20-fold cross-validation.

6 Discussion

Our findings indicate that large language models demonstrate human-level performance in (1) judging argument quality, (2) predicting users’ stances on specific topics given users’ demographics and basic beliefs, and (3) detecting arguments that would be persuasive to individuals with specific demographics or beliefs. However, the overall human performance is not high in each of the three tasks [around 60% for (1), and around 40% for (2) and (3)], which could be due to the inherent difficulty of the tasks, as well as variance and randomness in the data. This does not necessarily imply that LLMs do not pose any additional risk of tailor-made misinformation in the future. It is plausible that with access to more personal information about an individual, such as personality traits, LLMs could perform better at detecting persuasive arguments (Hirsh et al., 2012). Nevertheless, it is important to consider that the more fine-grained the target, the harder and more costly it becomes to reach the targeted population, and the cost-benefit analysis is not straightforward (Tappin et al., 2023).

One hypothesis that could explain the relatively low accuracy for both LLMs and human performance is that these demographic questions and big-issue stances may not be highly relevant for the task, as suggested by Hu & Collier (2024). However, this is contradicted by the fact that a supervised XGBoost model trained with these factors yields much better results. Interestingly, stacking various LLM predictions yields performance closer to XGBoost. This indicates that while an individual LLM may not excel at detecting persuasive arguments for an individual, combining the predictions of several LLMs could achieve much more competitive performance. Consequently, LLMs can potentially generate highly effective tailored misinformation and propaganda, particularly in a multi-agent setting Schoenegger et al. (2024).

One limitation of our work is that the ability to detect personalized persuasive content, while closely related, is not equivalent to the ability to generate such content. However, we argue that since generation is easier than detection for LLMs (West et al., 2023), our work contributes a valuable and efficient framework for continuously assessing the persuasive capabilities of LLMs. Another limitation is that our dataset, from debate.org, may not be representative of the general population. The demographics of individuals opting to participate in online debates are likely skewed compared to the U.S. population and even more so globally. Additionally, we could not test the language models on non-English data due to data access limitations. However, recent research has shown that language models’ performance is considerably lower for non-English languages, especially low-resource ones (Ahuja et al., 2023). Consequently, it is plausible that the risk of misuse for microtargeting in non-English settings is currently lower. Nevertheless, as language models continue to improve, it is crucial to expand this line of research to a wider range of languages and demographics to ensure a comprehensive understanding of the risks associated with personalized persuasion.

Ethics.

In this study, we employ demographic and belief-related questions drawn from datasets that are publicly accessible and have been anonymized before release. It is crucial to emphasize the importance of responsible development and deployment of LLMs and the need for ongoing research into mitigating their potential risks Bommasani et al. (2021). Our work can inform the development of safeguards and countermeasures against the misuse of LLMs for personalized misinformation and propaganda. Finally, in this work, we used crowdsourcing to rate the persuasiveness of debates. We paid crowd workers (based in the US) $12.00/hour median rate; higher than the US federal minimum wage.

Reproducibility.

We release our code for modeling and analysis.²²2Code: https://anonymous.4open.science/r/debate-gpt-x-88F3/³³3Data: https://osf.io/8jh62/?view_only=c8c28914ca664330b02cfd476e097abc We include details about the crowdsourcing setup in Appendix B.

References

Ahuja et al. (2023) Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. MEGA: Multilingual evaluation of generative AI. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4232–4267, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.258. URL https://aclanthology.org/2023.emnlp-main.258.
Al Khatib et al. (2020) Khalid Al Khatib, Michael Völske, Shahbaz Syed, Nikolay Kolyada, and Benno Stein. Exploiting personal characteristics of debaters for predicting persuasiveness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7067–7072, 2020.
Bai et al. (2023) Hui Bai, Jan G. Voelkel, johannes Christopher Eichstaedt, and Robb Willer. Artificial intelligence can persuade humans on political issues, February 2023.
Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
Broniatowski et al. (2018) David A Broniatowski, Amelia M Jamison, SiHua Qi, Lulwah AlKulaib, Tao Chen, Adrian Benton, Sandra C Quinn, and Mark Dredze. Weaponized health communication: Twitter bots and russian trolls amplify the vaccine debate. American journal of public health, 108(10):1378–1384, 2018.
Buchanan et al. (2021) Ben Buchanan, Andrew Lohn, Micah Musser, and Katerina Sedova. Truth, lies, and automation. Center for Security and Emerging Technology, 1(1):2, 2021.
Cabrio & Villata (2018) Elena Cabrio and Serena Villata. Five years of argument mining: A data-driven analysis. In IJCAI, volume 18, pp. 5427–5433, 2018.
Campbell et al. (1960) Angus Campbell, Philip E. Converse, Warren E. Miller, and Donald E. Stokes. The American Voter. Wiley, 1960.
Chen & Guestrin (2016) Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 785–794, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450342322. doi: 10.1145/2939672.2939785. URL https://doi.org/10.1145/2939672.2939785.
Coppock et al. (2020) Alexander Coppock, Seth J Hill, and Lynn Vavreck. The small effects of political advertising are small regardless of context, message, sender, or receiver: Evidence from 59 real-time randomized experiments. Science advances, 6(36):eabc4046, 2020.
De Angelis et al. (2023) Luigi De Angelis, Francesco Baglivo, Guglielmo Arzilli, Gaetano Pierpaolo Privitera, Paolo Ferragina, Alberto Eugenio Tozzi, and Caterina Rizzo. Chatgpt and the rise of large language models: the new ai-driven infodemic threat in public health. Frontiers in Public Health, 11:1166120, 2023.
DiResta et al. (2019) Renee DiResta, Kris Shaffer, Becky Ruppel, David Sullivan, Robert Matney, Ryan Fox, Jonathan Albright, and Ben Johnson. The tactics & tropes of the internet research agency. 2019.
Durmus & Cardie (2018) Esin Durmus and Claire Cardie. Exploring the role of prior beliefs for argument persuasion. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018.
Erikson & Tedin (2019) Robert S Erikson and Kent L Tedin. American public opinion: Its origins, content, and impact. Routledge, 2019.
Goldstein et al. (2023) Josh A Goldstein, Girish Sastry, Micah Musser, Renee DiResta, Matthew Gentzel, and Katerina Sedova. Generative language models and automated influence operations: Emerging threats and potential mitigations. arXiv preprint arXiv:2301.04246, 2023.
Guess & Coppock (2020) Andrew Guess and Alexander Coppock. Does counter-attitudinal information cause backlash? results from three large survey experiments. British Journal of Political Science, 50(4):1497–1515, 2020.
Hirsh et al. (2012) Jacob B Hirsh, Sonia K Kang, and Galen V Bodenhausen. Personalized persuasion: Tailoring persuasive appeals to recipients’ personality traits. Psychological science, 23(6):578–581, 2012.
Hu & Collier (2024) Tiancheng Hu and Nigel Collier. Quantifying the persona effect in llm simulations. arXiv preprint arXiv:2402.10811, 2024.
Johnson & Blair (2006) Ralph Henry Johnson and J Anthony Blair. Logical self-defense. Idea, 2006.
Lawrence & Reed (2020) John Lawrence and Chris Reed. Argument mining: A survey. Computational Linguistics, 45(4):765–818, 2020.
Liu et al. (2023) Zhexiong Liu, Mohamed Elaraby, Yang Zhong, and Diane Litman. Overview of imagearg-2023: The first shared task in multimodal argument mining. In Proceedings of the 10th Workshop on Argument Mining, pp. 120–132, 2023.
Matz et al. (2017) Sandra C Matz, Michal Kosinski, Gideon Nave, and David J Stillwell. Psychological targeting as an effective approach to digital mass persuasion. Proceedings of the national academy of sciences, 114(48):12714–12719, 2017.
Minkin (2023) Rachel Minkin. Diversity, equity and inclusion in the workplace: A survey report, 2023.
O’Keefe (2015) Daniel J O’Keefe. Persuasion: Theory and research. Sage Publications, 2015.
Orji et al. (2015) Rita Orji, Regan L Mandryk, and Julita Vassileva. Gender, age, and responsiveness to cialdini’s persuasion strategies. In Persuasive Technology: 10th International Conference, PERSUASIVE 2015, Chicago, IL, USA, June 3-5, 2015, Proceedings 10, pp. 147–159. Springer, 2015.
O’Keefe & Jackson (1995) Daniel J O’Keefe and Sally Jackson. Argument quality and persuasive effects: A review of current approaches. In Argumentation and values: Proceedings of the ninth Alta conference on argumentation, pp. 88–92, 1995.
Salvi et al. (2024) Francesco Salvi, Manoel Horta Ribeiro, Riccardo Gallotti, and Robert West. On the conversational persuasiveness of large language models: A randomized controlled trial. arXiv preprint arXiv:2403.14380, 2024.
Schoenegger et al. (2024) Philipp Schoenegger, Indre Tuminauskaite, Peter S Park, and Philip E Tetlock. Wisdom of the silicon crowd: Llm ensemble prediction capabilities match human crowd accuracy. arXiv preprint arXiv:2402.19379, 2024.
Simchon et al. (2024) Almog Simchon, Matthew Edwards, and Stephan Lewandowsky. The persuasive effects of political microtargeting in the age of generative artificial intelligence. PNAS Nexus, 3(2):pgae035, January 2024. ISSN 2752-6542. doi: 10.1093/pnasnexus/pgae035.
Solaiman et al. (2023) Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Hal Daumé III, Jesse Dodge, Ellie Evans, Sara Hooker, et al. Evaluating the social impact of generative ai systems in systems and society. arXiv preprint arXiv:2306.05949, 2023.
Staab et al. (2023) Robin Staab, Mark Vero, Mislav Balunović, and Martin Vechev. Beyond memorization: Violating privacy via inference with large language models, 2023.
Tan et al. (2016) Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th international conference on world wide web, pp. 613–624, 2016.
Tappin et al. (2023) Ben M Tappin, Chloe Wittenberg, Luke B Hewitt, Adam J Berinsky, and David G Rand. Quantifying the potential persuasive returns to political microtargeting. Proceedings of the National Academy of Sciences, 120(25):e2216261120, 2023.
Teeny et al. (2021) Jacob D Teeny, Joseph J Siev, Pablo Briñol, and Richard E Petty. A review and conceptual framework for understanding personalized matching effects in persuasion. Journal of Consumer Psychology, 31(2):382–414, 2021.
Wachsmuth et al. (2017) Henning Wachsmuth, Nona Naderi, Yufang Hou, Yonatan Bilu, Vinodkumar Prabhakaran, Tim Alberdingk Thijm, Graeme Hirst, and Benno Stein. Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 176–187, 2017.
Walton (2005) Douglas Walton. Fundamentals of critical argumentation. Cambridge University Press, 2005.
Weidinger et al. (2023) Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, et al. Sociotechnical safety evaluation of generative ai systems. arXiv preprint arXiv:2310.11986, 2023.
West et al. (2023) Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, et al. The generative AI paradox: What it can create, it may not understand”. In The Twelfth International Conference on Learning Representations, 2023.

Appendix A Prompts

In all tasks conducted for this study, the LLMs were prompted to respond to questions using only one of the options: ”Pro,” ”Con,” or ”Tie,” without any additional words or punctuation. Nevertheless, many of the models failed to adhere to this instruction, necessitating post-processing to extract the actual answer. Generally, incorrect responses involved the answer being accompanied by additional spaces or punctuation or presented in a complete sentence format, such as: “Based on the given demographics, the person is most likely to agree with the ‘Con’ side in the debate.”⁴⁴4There were occasions when the LLMs failed to produce any answer, resulting in responses akin to: “I cannot determine the person’s position in the debate without additional information.” To extract the answer in these cases, we used a simple Regex expression, finding the first occurrence of the words “Pro,” “Con,” or “Tie.”

Table 4 shows what percentage of responses followed instructions and the corresponding percentage from which answers could be successfully extracted for the PoliProp dataset across each question. The answer extracted percentage indicates the highest achievable accuracy for each model in its results. Notably, both open-source models encountered challenges in complying with the instructions, with particular difficulty in addressing Q3.

Question	Model	Correct Form (%)	Answer Extracted (%)
1	GPT-3.5	99.88	100.00
1	GPT-4	99.06	100.00
1	Llama 2	0.00	95.08
1	Mistral	62.76	95.55
2	GPT-3.5	99.71	99.77
2	GPT-4	99.80	99.80
2	Llama 2	0.00	97.17
2	Mistral	67.14	100.00
3	GPT-3.5	99.82	99.94
3	GPT-4	99.61	99.98
3	Llama 2	0.04	97.17
3	Mistral	17.19	81.48

Table 4: Some models had difficulty following instructions and giving the answer in the correct form of either ”Pro,” ”Con,” or ”Tie.” In this table, we see what percentage of the answers were given in the correct form and what percentages contained an answer after processing the result for the PoliProp dataset.

Appendix B Crowdsourcing

We recruited participants for our study through Amazon Mechanical Turk between December 2023 and March 2024, requiring that they be 18+ years old, located in the US, and have a master’s qualification provided by Amazon. The study was paid $2.25 and had a median completion time of 11 minutes, corresponding to a pay rate of about $12.00/hour. To ensure the quality of answers, we asked users to justify their responses to each question. We then manually assessed the responses and considered them to be high-quality. We reproduce the crowdsourcing questions on the next page. We also provide an example justification below.

•

S#1: Being a Democrat and to a lesser extent white and female all correlate with being pro-LGBTQ.
•

S#3: The con side goes off on an unusual, libertarian leaning bend that probably just wouldn’t appeal to this type of person who would simply connect with the pro side more.
•

S#3: The con side argues less directly about this particular topic and more about some kind of libertarian; the state should have nothing to do with any of this kind of thing, which just isn’t as compelling as the pro side making clear why gay people should be integrated into the current system. The con side also repeatedly appeals to some really weak slippery slope stuff and doesn’t engage well with how the pro side responds.

Subtask 1

Read the following proposition, i.e., a statement that affirms or denies something.

Proposition: Gay marriage should be legal.

Consider an individual with the following demographic characteristics.

1.

Education: Graduate Degree
2.

Gender: Female
3.

Party: Undecided
4.

Political Ideology: Progressive
5.

Religious Ideology: Christian

•

In your opinion, would this person agree (Pro), disagree (Con), or be neutral or undecided (Tie) with the proposition?
•

Write a brief justification for your answer. A sensible justification is required for your HIT to get approved.

Subtask 2

Consider the following debate on the proposition, where one individual argues for the proposition (Pro) and another against (Con).

Proposition: Gay marriage should be legal.

[debate]

Consider an individual with the following demographic characteristics.

1.

Education: Graduate Degree
2.

Gender: Female
3.

Party: Undecided
4.

Political Ideology: Progressive
5.

Religious Ideology: Christian

•

Given this information, what stance do you think this person would take on the above proposition after reading the debate? Answer the same as before if you believe the debate had no effect on their opinion, and choose a different answer if you believe the debate had an effect on their opinion.
•

Write a brief justification for your answer. A sensible justification is required for your HIT to get approved.

Subtask 3

Again, consider the same debate on the proposition.

Proposition: Gay marriage should be legal.

[debate]

•

Disregarding your own point of view on the debate, please determine which debater you believe had more convincing arguments. The individual arguing for the proposition (Pro) or against it (Con)? If both were similarly convincing, indicate that it was a ”Tie.”
•

Write a brief justification for your answer. A sensible justification is required for your HIT to get approved.