LEXI: Large Language Models Experimentation Interface

Guy Laban
Department of Computer Science and Technology
University of Cambridge
Cambridge, UK
[email protected]
&Tomer Laban
[email protected]
&Hatice Gunes
Department of Computer Science and Technology
University of Cambridge
Cambridge, UK
[email protected]
Abstract

The recent developments in Large Language Models (LLM), mark a significant moment in the research and development of social interactions with artificial agents. These agents are widely deployed in a variety of settings, with potential impact on users. However, the study of social interactions with agents powered by LLM is still emerging, limited by access to the technology and to data, the absence of standardised interfaces, and challenges to establishing controlled experimental setups using the currently available business-oriented platforms. To answer these gaps, we developed LEXI, LLMs Experimentation Interface, an open-source tool enabling the deployment of artificial agents powered by LLM in social interaction behavioural experiments. Using a graphical interface, LEXI allows researchers to build agents, and deploy them in experimental setups along with forms and questionnaires while collecting interaction logs and self-reported data. The outcomes of usability testing indicate LEXI’s broad utility, high usability and minimum mental workload requirement, with distinctive benefits observed across disciplines. A proof-of-concept study exploring the tool’s efficacy in evaluating social HAIs was conducted, resulting in high-quality data. A comparison of empathetic versus neutral agents indicated that people perceive empathetic agents as more social, and write longer and more positive messages towards them.

Keywords Human-Agent Interaction  \cdot Large Language Models  \cdot Open-Source  \cdot Behavioural Experimentation  \cdot Usability Testing

1 Introduction

Conversational artificial agents (e.g., chatbots) are employed regularly in behavioural studies, aimed at engaging users in social interactions and elicit affective responses. Such interactions with human users across various contexts have shown how these agents can change users’ perceptions and behaviours towards technology [e.g., 35, 38, 3, 36], while also demonstrating the impact that engagement with these agents can have on users’ emotions and well-being [11, 30, 57]. The introduction of Large Language Models (LLMs) into conversational artificial agents, ranging from chatbots [7] to robots [56] represents a meaningful step in the research and development of human-agent interactions (HAI). However, the field is at a critical juncture due to a significant knowledge gap in empirical investigations of interactions with LLM-powered agents. Despite their increasing use, current empirical research in social and behavioural HAI is limited in its application of LLMs, as researchers face constraints in deploying interactions with LLM-powered artificial agents [53, 2]. This limitation primarily arises due to technical and resource constraints. Many social and behavioural HAI researchers may not have access to the necessary infrastructure or resources to effectively deploy and manage these advanced models within a conversational agent interface for social interaction. The absence of standardized interfaces and the necessity for integrating various technical components for deployment further hinder the ability to conduct large-scale, methodologically rigorous studies. Accordingly, many of the social and behavioural studies in the field are focused on users’ observations of content produced by these agents [e.g., 10, 25, 19, 42], rather than evaluating users’ behaviour in actual interactions with agents. This gap underscores the necessity for empirical behavioural research to explore how humans interact with these advanced agents, the impact of different LLM configurations and prompting schemes on these interactions, and the subsequent effects on user behaviour, affect, and perception.

To bridge these gaps, we introduce LEXI, LLMs Experimentation Interface, an open-source tool for conducting social interaction experiments with conversational artificial agents that are powered with LLM. The tool is designed to offer researchers a graphical interface (GUI) for prompting LLM-powered agents and deploying them within experimental designs. It integrates questionnaires and annotation features, enabling efficient collection of social interaction logs and self-reported data. This tool aims to enhance methodological precision, facilitate standardized research conditions, and improve research efficiency and accessibility. By offering an open-source tool with GUI, we aspire to foster better experimental control and replicability in social and behavioural HAI research, allowing researchers to extend from observational HAI studies to interaction-based experimental studies with artificial agents powered by-LLMs. Thereby, LEXI is aimed at enabling more comprehensive and systematic empirical studies across diverse populations and settings, supporting interdisciplinary research efforts and contributing to the democratization of AI and HAI research. LEXI provides empirical researchers with access to the tools they need to facilitate HAI experimental research with LLMs, paving the way for a deeper understanding of the social and behavioural implications of HAI.

2 The State of the art

Before exploring LLM-powered chatbots, various online providers supported users in develo** their chatbots. Some, like "Manychat" [43] and "Chatfuel" [17], were responding only to predefined scripts and integrated into existing platforms such as Facebook Messenger, WhatsApp, or Telegram. These chatbots primarily catered to small and medium businesses seeking to automate their online presence and customer communication [23]. Another notable platform is "Dialogflow" by Google [26], which allows users to build rule-based chatbots with customization options. While popular for business use, Dialogflow has also gained traction in the research community, boasting a substantial developer community (over 1.5 million developers) supporting experimentation and data collection (of interaction logs) using chatbots [see 4], with studies reporting for the collection of self-reported data via interactions with agents [61]. Several studies used Dialogflow chatbots in studies with single interactions [e.g., 37, 38, 58], and repeated interactions [e.g., 5, 61].

Since the widespread adoption of LLMs, the deployment of chatbots has evolved into a distinct task, with a focus on implementing LLM-powered chatbots. As evidenced by previous studies [e.g., 22, 33], some researchers have successfully managed to deploy their own LLM-powered chatbots. Most researchers utilise existing LLM APIs, such as OpenAI API, to create a diverse range of chatbots deployed in various contexts. These do-it-yourself (DIY) chatbots offer several advantages. They are resource-efficient and benefit from the abundance of available LLM APIs that keeps on growing [62], some of which provide convenient platforms for fine tuning prompts for these chatbots (e.g., the OpenAI playground [48], enabling users to test prompts and deploy chatbots directly from the platform). Furthermore, there are various packages and open-source tools like "LangChain" [15, 31] and "FastChat" [63] that streamline the process by offering tools to load, execute, and integrate LLM into conversational interfaces. This facilitates the reproducibility of such projects, as researchers can share their chatbot’s code with others, providing access to tools and codes to support the replication of their research. However, despite these tools simplifying the DIY process and eliminating the need for extensive AI or software development expertise, building an agent requires more than just utilising an LLM-API. Specifically, it involves develo** a user interface framework for user interaction using environments like React, Vue.js, or Angular, and connecting it to the agent as well as to additional required functionalities. The implementation of additional features necessitates additional resources from researchers invested in development. Consequently, many DIY (LLM-powered) chatbots tend to be relatively simple. They lack manipulation of experimental conditions, lack complex rules or iterations, and do not reference external data or memory related to participants’ previous interactions, among other limitations.

Much like the pre-LLMs era chatbot development platforms, the contemporary landscape features several emerging platforms enabling users to construct their own LLM-powered chatbots through GUIs. Examples include the updated version of "Chatfuel" [17], "ChatBoost" [16], "Oneai" [46], and "Officely AI" [45]. These platforms are explicitly designed for small to medium business owners, providing advanced customization options such as utilising external data, extensive memory, and a GUI for managing interactions. While they offer a single-component deployment for user convenience, they lack essential features for the integration into empirical research settings and experimentation. Primarily tailored for incorporation into instant messaging apps, these platforms fall short in supporting experimental control and user allocation into conditions. Some limit the collection of interaction logs, and their access is often restricted and requires payment for extensive systematic experiments with large sample size. As of March 2024, many of these platforms have a free access tier, but these often include many restrictions (e.g., limited number of messages, interactions or users, limited time period, and others). Despite their user-friendly nature, these platforms pose challenges for research, impeding experimental control, methodological robustness, and the ability to replicate and share agents or code within the research community. One exception is Officely AI, which, although not entirely replicable, it supports the deployment of chatbots on websites as JavaScript widgets. Its dialogue manager facilitates a degree of experimental control, while also allowing researchers to switch between different LLMs according to their preferences. Another example is the "Custom GPTs" [47] by OpenAI, allowing users to build their own GPT-powered chatbot and share it in OpenAI marketplace. Nevertheless, these chatbots are limited to OpenAI LLMs and to OpenAI’s interface, and researchers cannot collect interaction logs from users’ interactions.

An intermediary option is the latest version of Google’s Dialogflow, known as "Dialogflow CX" [26]. This version offers comparable functionalities to the conventional Dialogflow ES, but it incorporates Google’s LLM to enhance the chatbot’s capabilities. Dialogflow’s user-friendly interface can support the construction of chatbots by researchers, enabling them to effectively manage interactions through the dialogue manager. Dialogflow CX supports the deployment across various interfaces as well as the option to share the agent’s code or a repository with others. However, it’s important to note that chatbots developed in Dialogflow CX are constrained to Google’s LLM, limiting researchers ability to compare different models. Despite the availability of a free tier, sustained use of Dialogflow incurs costs, especially when integrating LLM, making it less accessible to a wider audience. Moreover, while deployment is relatively straightforward, Dialogflow CX differs from marketing-oriented chatbots as it is not a single component chatbot, and it requires researchers to assemble several components together to deploy such an agent in experimental settings, including interface and database [4]. Hence, Dialogflow CX does not allow the collection of interaction logs directly from the chat interface, requiring researchers to manage this independently.

LLM API GUI - Researcher GUI - User Collecting Logs Experimental Control Dialogue Manager External Data Questionnaires Memory Access Reproducibility Components
LEXI \checkmark \checkmark \checkmark \checkmark \checkmark \circ \circ \checkmark \neq \checkmark \checkmark \checkmark
Officely.ai \checkmark \checkmark \checkmark \circ \circ \checkmark \checkmark \neq \circ \circ \circ \checkmark
Dialogflow CX \circ \circ \checkmark \circ \neq \checkmark \circ \neq \circ \circ \checkmark \circ
Custom GPT’s \circ \checkmark \circ \neq \neq \neq \checkmark \neq \circ \neq \circ \circ
Oneai \circ \circ \checkmark \checkmark \neq \circ \checkmark \neq \checkmark \circ \neq \checkmark
DIY \checkmark \neq \circ \circ \circ \circ \circ \circ \circ \circ \checkmark \neq
Dialogflow ES \neq \circ \checkmark \circ \neq \checkmark \neq \neq \neq \circ \circ \circ
Chatfuel \circ \checkmark \circ \circ \circ \checkmark \checkmark \neq \circ \circ \neq \checkmark
ChatBoost \circ \checkmark \circ \neq \neq \circ \circ \neq \neq \circ \neq \checkmark
ManyChats \neq \checkmark \circ \neq \neq \circ \neq \neq \neq \circ \neq \checkmark
Table 1: Evaluation and comparison of different tools for deploying artificial agents online based on multiple criteria, including (from left to right): the availability of different LLM APIs (LLM API), the GUI of the researcher dashboard (GUI - Researcher) and user interaction with the agent (GUI - User), the option to collect interaction logs using the tool (Collecting Logs), implementation of experimental control and conditions (Experimental Control), features of dialogue management for controlling agents’ prompting and message iteration (Dialogue Manager), the option for training agents with external data (External Data), the use of questionnaires for collecting self reported data in deployed experiments (Questionnaires), memory capabilities of agents for remembering information from previous interactions (Memory), the tool’s accessibility in terms of ease of deployment and cost (Access), reproducibility of experimental designs deployed and their reporting via these tools and whether they support such features (Reproducibility), and the extent to which the platform depends on multiple components (Components). Each criterion is evaluated as confirming to the criterion (\checkmark), partly confirming to the criterion (\circ), or not confirming to the criterion (\neq).

3 Gaps and Goals

The current landscape of platforms and tools for deploying customized artificial agents powered by LLM reveals a significant gap in the ability to conduct rigorous, replicate, and methodologically sound HAI experiments. This gap is attributed to existing platforms primarily catering to business and service applications, lacking the necessary features for managing ongoing experiments, simulating complex experimental manipulation and control, collecting interaction logs and self-reported data, and open-source frameworks that ensure transparency and support replicability across diverse research settings. Moreover, the diversity in commercial platforms and the reliance on external interfaces complicates access, as well as introducing variety of confounds through visual stimuli such as inconsistent user interfaces, differing interaction designs and modalities, and varying in graphical elements and navigation flows, further challenging the standardisation and replication of empirical research findings [34, 50]. These platforms often work with a single model and provide little to no flexibility in changing various aspects of the agent’s operation. Moreover, most of these platforms limit efficient data collection, with no access to interaction logs and data collected via external questionnaires. Finally, DIY approach facilitated by LLM APIs, despite its resource efficiency, often results in oversimplified agents and interfaces and requires several components to operate successfully in experimental settings, which limits the accessibility to HAI research for many social and behavioural researchers.

To address these challenges, LEXI is designed specifically for HAI experimentation that leverages LLM technology while providing researchers a GUI, aiming to fulfill several critical requirements (see table 1). First, LEXI seeks to improve access to LLM technology while maintaining experimental control, facilitating the manipulation of experimental conditions and allowing users to structurally prompt agents powered by LLM via GUI (see section 4.1.3). This allows for better standardization across studies and enhances the replicability of experimentation methods and research findings in HAI. Moreover, by using LEXI, research teams could potentially save resources and time, supporting more effective research endeavors. Researchers can test and compare prompts, models, and user affect and behaviour in a systematic manner, without the need for creating in-house LLM-powered agents from scratch. Furthermore, given the widespread use of LLM across different artificial agents (e.g., chatbots, social robots, voice assistants, virtual agents, recommender systems, etc.), a standardised disembodied agent could serve as a control condition for experimentally evaluating the behavioural, social, cognitive, and ethical implications of incorporating LLMs into various artificial agents of different embodiment. Thus, researchers could compare user interactions between agents (e.g., social robots and chatbots) in a comparable manner [e.g., 39, 54].

LEXI introduces a GUI that mirrors the design of current chatbot interfaces used for personal interactions, adhering to familiar user interface (UI) conventions from CUI (conversational user interface) applications [55], as opposed to those used for corporate and service interactions. This choice aims to maintain high ecological validity by ensuring that the UI accurately reflects real-world chatbot use scenarios and user experiences (see section 4.2.2). Additionally, LEXI uses a participant registration system that mimics current user onboarding and access processes (see section 4.2.1), facilitating long-term interactions for longitudinal research. This approach seeks to make research participants feel like they are interacting with a familiar app rather than a surveying platform, aimed at sustaining genuine participant engagement over time. Finally, by recognizing the interdisciplinary nature of HAI research, LEXI is designed to be accessible to researchers with diverse backgrounds, including social and behavioural sciences, enabling them to contribute to HAI research and address complex social and behavioural questions empirically using LEXI’s GUI.

4 The Current Tool

Refer to caption
Figure 1: An interface map that explains the ’Admin Dashboard’ of LEXI, detailing the interface and functionalities available to researchers for managing experiments, agents, and forms. Yellow boxes indicate information communicated on these pages.

4.1 Researcher side

The following section, together with Figure 1, provides a structured overview of LEXI’s main components and how to add, configure, and control various aspects of running an experiment using LEXI.

4.1.1 Accessibility and Open-Source

LEXI contains a GUI, ensuring ease of use for researchers regardless of their technical background. LEXI is open-source, available for non-commercial use. Researchers can download it from GitHub111https://github.com/Tomer-Lavan/Lexi and deploy it either locally or online. LEXI is licensed under the CC BY-NC-SA 4.0 license, encouraging others to contribute and reuse LEXI for non-commercial purposes under the same license222https://creativecommons.org/licenses/by-nc-sa/4.0/.The interface includes intuitive elements like dialogue boxes, buttons, bars, and text fields. When installing the system, researchers can set up their log-in credentials for an admin account that will provide them access to the ’Admin Dashboard’ (see Figure 1). In addition, when installing the system using the source code, researchers need to insert their credentials for MongoDB and OpenAI API (or any other LLM used with LEXI).

4.1.2 Experiment Management

The main screen of the admin dashboard is the ’Experiments Management’ page, where researchers can manage their experiments. On this page, researchers can view key information about their experiments, including titles, descriptions, the number of participants recruited, the number of sessions conducted, the number of open (incomplete) sessions, the launch date of the experiment, and its current status (active or inactive). Researchers can create new experiments using the ’Add Experiment’ field, where they can specify a title and provide a description. In the ’Experiment Agents’ field, they can select the experimental design and allocate agents along with their distribution to conditions (see Figure 2). LEXI supports both single-agent studies and between-subject experimental designs (’A/B testing’), allowing for the deployment of two conditions using two distinct agents. Participants can be allocated to conditions either through random allocation, dividing them equally with 50% for each agent, or through custom distribution, where the researcher decides the allocation of participants per agent. In the ’Experiment Features’ field researchers can include additional features, such as the ’Stream Message’ feature that alters the visual stimuli during interactions, and the ’User Annotation’ feature that allows participants to rate messages with "Likes" (1) and "Dislikes" (-1). This feature provides researchers with the ability to annotate messages communicated by the agent, providing them with additional data to work with and potentially constructing models. In the ’Experiment Forms’ field researchers can select the relevant forms that will appear in the experiment, including forms for registration, a form that appears Before Conversation and a form that appears After Conversation (see Section 4.1.4). Finally, in the ’Experiment Boundaries’ field, researchers can set limits on key parameters, including the maximum number of participants, conversations per participant, and messages per interaction for each participant. This feature gives researchers more control over their experiment and adds an extra layer of security. It ensures the security of their API credentials and the integrity of the collected data, preventing misuse of the agent. Upon completing data collection, researchers can deactivate their experiment. This important step ensures that the agents cannot be used by users outside the intended scope of the experiment. After setting up an experiment, LEXI generates an ’Experiment Address’ (a URL) that researchers can share with potential participants or embed in external platforms as needed, such as Qualtrics or Google Forms. When accessing the experiment, researchers can change the content of the experiment’s main page, providing different title and main body text with instructions. LEXI stores all collected data in a MongoDB database, allowing access in both JSON and Excel formats. Researchers can download the collected data from the ’Experiment Settings’.

Refer to caption
Refer to caption
Figure 2: Left to right: (1) The ’Experiment Management’ page when adding a new experiment. (2) The ’Agents Management’ page when adding a new agent.

4.1.3 Building and Managing Agents

On the ’Agents Management’ page, researchers can add new agents, providing them with a title and description for identification. Researchers can then select the LLM, for example, between GPT-3.5-turbo and GPT-4-1106-preview. For building the agent, researchers start by setting the ’First Chat Sentence’ to establish how the agent initiates interactions and by defining the ’System Starter Prompt’ to give the agent context and guidelines for interacting with participants (see Figure 2). The ’Before User Sentence Prompt’ and ’After User Sentence Prompt’ fields can be used to further instruct the model’s responses based on user input at each iteration. LEXI offers several additional options for customizing the agent’s responses to these prompts, including: Temperature, Maximum Tokens, Top P, Frequency Penalty, Presence Penalty, and Stop Sequence [for description of these, see 52].

4.1.4 Building Forms and Questionnaires

On the ’Forms Management’ page researchers can set up forms to gather a wide array of self-reported data and responses from questionnaires at various points throughout an experiment. When adding a new form, researchers start by assigning a distinctive name to each form for identification, a title that will be displayed to participants, as well as specific instructions (see Figure 3). Then, researchers can begin listing questions in the form, choosing a relevant ’Question Type’ (see Figure 1). In the current version, each form is limited to a maximum of 15 questions/items, with a unique ’Key’ assigned to each question for subsequent storage in the dataset. For each question, researchers can enter the text in the ’Question Text’ field, specify scoring options (see Figure 1), and determine whether the question is mandatory (’Required’), has a default value, and if the scoring options are to be visible (’Numbered’). Researchers can create a ’Registration’ form designed to collect demographic information during the registration process of participants. It allows participants to enter this information once, when registering a username (served as a participant ID) for the study participation, thus avoiding the repetition of filling out this information across multiple forms and after the first session of the study. In future versions of LEXI we intend to extend this feature for screening participants and allocating them accordingly to experimental conditions (i.e., agents). Researchers can also create questionnaires that will appear before and/or after interactions for examining changes in emotions, behaviours, and perceptions. This guarantees the collection of data immediately before and after interactions, reducing the need for external survey tools, which may interrupt the study’s flow and usability. Forms linked to the ’Before Conversation’ and ’After Conversation’ fields in the ’Experiment Management’ page (see section 4.1.2) will include ’Pre’ and ’Post’ prefixes in their dataset keys, respectively. If necessary, researchers can direct participants to external surveys through the ’Post-interaction Message’ (see section 4.2.2) by editing the source code. By embedding essential parameters in these links, such as username, session, and condition, these can be transferred to the chosen survey tool.

Refer to caption
Figure 3: The ’Forms Management’ section, existing forms are displayed on the left, the form currently being edited by the researcher is in the middle, and editing fields are on the right

4.2 Participant Side

4.2.1 Registration and screening data

When participants access LEXI they can sign up by selecting the ’First Time’ button. They need to pick a username, which will act as their unique participant ID for the dataset inclusion and to facilitate their return for subsequent engagements with LEXI. Participant ID is used as a username to resemble existing real-world platforms online that should help participants remember their login credentials for future visits. When choosing a username, participants are also asked to enter their age and identified gender. Researchers can remove these fields by modifying the tool’s source code. To collect additional demographic data, researchers can add a demographic survey to the ’Registration’ form (see Section 4.1.4). Upon completing registration, participants are directed to the ’Experiment’s Main Page’. Returning participants can log in with their username through the ’Not First Time?’ page and directly access the ’Experiment’s Main Page’.

4.2.2 Interaction

On the ’Experiment’s Main Page’, participants are provided with the study’s instructions and can start the interaction by clicking ’Start Conversation’ If included by the researcher, the interaction will start with the ’Before Conversation’ form. Upon submitting their responses, participants encounter the agent’s ’First Chat Sentence’, marking the beginning of the interaction (see Figure 4). Should participants decide to end the interaction, they can do so by clicking the ’Finish’ button. Then, if included by the researcher, the ’After Conversation’ form will appear on the screen. Finally, participants are thanked for their contribution through the ’Post-interaction Message’, signifying the end of the interaction.

Refer to caption
Figure 4: In the Interaction page, participants can change the font size on the left. They can communicate with the agent through the text bar at the bottom, annotate the agent’s messages using the like/dislike buttons, and conclude the interaction by pressing the ’Finish’ button on the left.

5 Usability Testing

5.1 Methods

To evaluate whether LEXI is simple and easy to use, we conducted a usability test where 9 researchers with computer science and engineering background and 5 researchers with social and behavioural sciences background attempted to use LEXI for setting up a between-subjects experiment with two distinct agents. The participants were not familiar with LEXI before the experiment. The participants received temporary admin credentials to use LEXI, and were given instructions for potential experiments that they could create. The participants were asked to perform 5 tasks (Ti) in LEXI:

  1. 1.

    T1. Setting up two distinct agents for a between-subjects experiment that is aimed at comparing users’ behaviour towards the two agents.

  2. 2.

    T2. Setting up a registration form and a form for collecting self-reported data before and after the interaction.

  3. 3.

    T3. Setting up the experiment allocating agents and forms correctly.

  4. 4.

    T4. Testing the experiment by doing a test run, and sharing with a potential participant to collect data for one case.

  5. 5.

    T5. Download and examine the data collected.

The time taken for each task, measured in seconds, was recorded. Upon completion of each task, participants reported for their mental workload using the Task Load Index (TLX-raw) [29, 14]. Additionally, we asked for participants’ feedback on their experiences and any further comments or insights they wished to convey following each task. After all five tasks were completed, the participants evaluated LEXI’s usability via the System Usability Scale (SUS) [13]. Subsequently, we asked the participants whether they would like to use LEXI in future research, and what was their general impression of the tool. Finally, participants were asked to self-assess their technical and methodological expertise, and their familiarity with LLM. We also gathered qualitative data to gain a deeper understanding of how researchers use LEXI. We asked users to describe their experiences through a series of open-ended questions after each task, focusing on their ease of use and potential improvement. Furthermore, after completing all the tasks we asked participants open-ended questions addressing their overall usability and impression of LEXI, and how LEXI could support their research.

5.2 Quantitative Results

Our findings indicate that participants generally rated LEXI’s usability highly (M = 3.80, SD = .76, Med = 4.1, α𝛼\alphaitalic_α=.9), with no substantial difference based on users’ research backgrounds. Regarding mental workload, it was reported as minimal across all tasks, with no notable differences between the disciplines. Tasks 1 (M = 2.96, SD = 1.09, Med = 3.08) and 2 (M = 2.98, SD = .97, Med = 3.08) were observed to demand a marginally higher mental workload, possibly attributed to their longer duration and the increased number of steps involved, including the creation of prompts for building agents and coming up with questionnaire items. Specifically, participants with a computer science and engineering background recorded a slightly elevated mental workload for Task 5 (M = 3.04, SD = 1.59, Med = 2.83), which involved examining the collected data. However, these workload scores remained low and fell below the median score (see Figure 5). Concerning tasks’ duration, with the exception of Task 3, participants with a background in social and behavioual research generally took longer to complete most tasks. Tasks 1 (M = 885.17, SD = 770.86, Med = 549.09) and 2 (M = 569.16, SD = 399.60, Med = 392.02) were identified as the most time-intensive, likely due to the complexity and the multiple steps involved in designing agents and questionnaires, including the formulation of prompts and questionnaire items (see Figure 5). In terms of qualitative responses, most participants expressed that they found the tasks to be easy. Some addressed their need for further examples or additional instructions, especially for building questionnaires due to their limited experience with survey building tools. All except of one of the participants expressed that they would wish to use LEXI again for their research, and more than half of the participants expressed that they are excited about the opportunities such interface can provide for their research.

Refer to caption
Refer to caption
Figure 5: Left to right: (1) Tasks’ Mean Mental Workload by Participants’ Research Background. (2) Tasks’ Duration by Participants’ Research Background

5.3 Qualitative Results

For T1, participants positively addressed the intuitive layout and ease of setting up agents. One researcher noted, "It is easy to understand, especially navigated with the clear instructions". Another user added, "If this actually works well, and the two agents are distinguishable, it will be very useful for my research". This feedback highlights LEXI’s usable interface and its potential to mitigate current limitations in the field. During T2, users appreciated LEXI’s ability to collect data in a centralized manner, while addressing the intuitiveness of the tool’s design. For example, one participant remarked, "I like the layout, it’s very easy to add questions and resembles similar platforms for creating surveys like Google forms", while another mentioned, "I feel like in an actual questionnaire it would be very effective". This indicates that users found the tool accommodating and suitable for creating comprehensive questionnaires, an essential component of social and behavioural experimental designs in HAI. In T3, participants found the process of setting up experiments straightforward and efficient. One user expressed, "Activating and setting up the experiment was quick and easy". Another noted, "It was really convenient and easy to set-up experiments, I knew right away what I should be doing" underscoring the engaging nature of the setup process and the tool’s ability to facilitate complex experimental designs. For T4, feedback was positive regarding the tool’s performance and usability. A participant shared, "It turned out differently than I expected, as it was very smooth and efficient. I didn’t imagine how simple it would be to deploy an agent" Another comment highlighted, "The test experiment went off without a hitch", suggesting that the overall experience was positive and the tool performed well under testing conditions. For T5, users appreciated the straightforward data download and review process. One participant mentioned, "After a first second of trying to understand the interface, it was intuitive". Another stated, "Overall I think the data is in a nice format! I can already see how I would analyse it". This indicates that users found the data retrieval process efficient and the data presentation clear and useful for their research purposes.

Participants also provided general feedback on their overall impressions of LEXI’s fit for their research objectives. Many expressed a positive outlook, with the general feedback highlights LEXI’s broad utility in facilitating HAI experiments. Researchers noted that the tool’s comprehensive features transcend current practices by integrating LLM-powered artificial agents into experimental setups, significantly saving time and effort from establishing these themselves. Reflecting on this feedback in light of the participants’ backgrounds, the results shows that LEXI can effectively support researchers with varying levels of technical proficiency. Researchers with social and behavioural science background were stating that using LEXI could support their research to move from observational studies to more interaction-based experiments.

However, some users provided suggestions for improvement, reflecting their diverse backgrounds and levels of technical expertise. For example, a participant with a more technical background noted, "I would have liked to receive examples before starting", indicating a need for more guided instructions. Another user mentioned, "The interface of the forms is not clear" suggesting that visual aids might benefit those less familiar with the setups that are often practiced in experimental research. Due to their technical background, that might not adheres as strongly with behavioural experimentation research practices and methodology, it could be that these tasks were not perceived as intuitive compared to the way they were perceived by researchers with social and behavioural science background. Given that LEXI is open-source, the research community can contribute to these improvements, ensuring that the tool evolves to meet the diverse needs of its users - empirical HAI researchers. By encouraging collaboration and feedback from users of different scientific and technical backgrounds, LEXI can continue to enhance its usability and functionality.

6 Proof-of-concept Study

To validate and demonstrate LEXI’s capacity to collect high quality data in online behavioural experiments, we conducted a between-subjects study comparing two distinct agents demonstrating varying levels of empathetic communication. In this proof-of-concept study, we explored how social interactions with disembodied conversational agents influence users’ behaviour, perception, and mood, comparing empathetic communication to neutral communication. Empathy is addressed as cognitive empathy [20], specifically involving the engagement with, recognition and understanding of users’ emotions for supporting the user [9, 28], while neutrality focuses on engaging with factual objective information that is related to the content and events shared by the user [24, 28]. We built two distinct agents using LEXI that were powered with GPT-3.5-turbo, an empathetic agent (condition A), and a neutral agent (condition B). In both conditions, LEXI initiated the interaction in the following way - “Hi there! I’m Lexi, and I’m here to listen to you. Feel free to share whatever is on your mind, and I’m here to listen and help however I can. What’s been on your mind lately?”. Beyond the ’First Chat Sentence’, each experimental condition was operationalized with distinct prompts following previous studies simulating empathetic communication in artificial agents [6, 60, 12, 49, 40] to simulate clear differences of empathy and neutrality in the agents’ communication.

6.1 Methods

We recruited 100 individuals (Mage = 41.17, SD = 10.99, 46.3% identified as females) via Prolific who were randomly allocated to one of the two LEXI agents (51 participants in condition A). Since the interactions are text-based and in English, we recruited participants who reside in the UK, reporting English to be their first language. Participants received a payment of 4 GBP for 20 minutes of participation. The study was approved by the department’s ethics committee. Participants accessed the study via a link in Prolific to a Qualtrics page where they received information regarding their participation and signed an informed consent. Then, participants received instructions regarding their following interaction with LEXI. LEXI was embedded within the same page so that participants would not need to leave the Qualtrics page. When accessing LEXI, participants were asked to register a username for their participation, and answer a short demographic questionnaire reporting their age, identified gender, biological sex, marital status, and number of children. Participants were asked to report their mood using the 12-item Immediate Mood Scaler (IMS-12) [44], after which they started the interaction with LEXI by encountering the ’First Chat Sentence’ to which they could respond. The interaction continued until the participant decided to end it, clicking the ’Finish’ button, and reporting their mood once again (post-interaction) via the IMS-12 [44]. Then, a messaged popped up on the screen, telling them that they may continue with the remaining of the survey on Qualtrics. In the remaining questionnaires on Qualtrics, participants evaluated LEXI and the interaction (see section 6.2). Beyond self-reported data, we collected additional data from LEXI and the interaction logs. The experiment generated a total of 1507 message observations from users and 1607 message observations from agents.

6.2 Results

In terms of usability, participants answered SUS [13] reporting for a very high usability score (M = 4.5, SD = .42, α𝛼\alphaitalic_α = .83). Independent samples t-tests were conducted to evaluate the differences between the two conditions. In terms of the agents’ social perception, participants attributed greater agency (i.e., the extent to which the artificial agent displays the ability to plan and act independently [27]) to the empathetic agent (M=68.38, SD=21.09) compared to the neutral agent (M=63.14, SD=22.91), with a significant mean difference of -5.23, t=-4.60, p<.001𝑝.001p<.001italic_p < .001, d = -0.24. Moreover, participants perceived the empathetic agent (M=62.92,SD=26.49) to demonstrate higher sense of experience (i.e., the extent to which the artificial agent displays the ability to sense and feel [27]) over the neutral agent (M=52.11, SD=28.55), with a mean difference of -10.81, t=-7.60, p<.001𝑝.001p<.001italic_p < .001, d = -.39. Therefore, the empathetic agent was perceived as more autonomous and capable of experiencing users’ emotions, suggesting a more socially meaningful interaction experience.

Affective engagement in users’ behavioural responses towards the agents further supported these results, with participants writing longer messages to the empathetic agent (M=20.34, SD=19.36) compared to the neutral one (M=15.20, SD=14.06), which is a significant mean difference of -5.14, t=-5.94, p<.001𝑝.001p<.001italic_p < .001, d = -.31. Sentiment analysis of these messages reflected a more positive tone when interacting with the empathetic agent, as indicated by higher sentiment scores (M=.24, SD=.42 vs. M=.14, SD=.41), with a mean difference of -.10, t=-4.89, p<.001𝑝.001p<.001italic_p < .001, d = -.25.

Furthermore, the empathetic agent had a meaningful influence over participants’ mood. Improved mood following the interaction with the empathetic agent (M=5.24, SD=.93) was reported as higher than mood post interaction with the neutral agent (M=4.74, SD=1.22), with a mean difference of -.50, t=-8.98, p<.001𝑝.001p<.001italic_p < .001, d = -.46. Additionally, mood change before to after the interaction was more renounced for participants interacting with the empathetic agent (M=.60,SD=.83) than with the neutral agent (M=.37,SD=.80), with a mean difference of -.23, t=-5.37, p<.001𝑝.001p<.001italic_p < .001, d = -.28.

Finally, a binary logistic regression was conducted to evaluate the effects of agent type (empathetic vs. neutral) and number of words in message written by the agent on the likelihood that participants would like or dislike a message. The model was statistically significant, χ2superscript𝜒2\chi^{2}italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT(2) = 41.94, p<.001𝑝.001p<.001italic_p < .001, indicating that it was able to distinguish between participants who liked or dislike a message. The model explained 3.5% of the variance in liking a message and correctly classified 61.6% of cases. The number of words in a message written by the agent was a significant predictor of liking a message, with each additional word increasing the odds of liking a message by 2% (β𝛽\betaitalic_β = .02, p<.001𝑝.001p<.001italic_p < .001, Exp(β)𝐸𝑥𝑝𝛽Exp(\beta)italic_E italic_x italic_p ( italic_β ) = 1.02). The condition was not a significant predictor of liking a message (β𝛽\betaitalic_β = -.4, p = .982)

7 Discussion and Conclusions

As LLMs are increasingly applied in conversational artificial agents [21, 59], understanding the psychological mechanisms humans employ when communicating with these agents becomes crucial. It is essential to study and understand the behavioural factors influencing HAI, including stimuli, context of use, and user characteristics. With the advancement of LLM technology and its growing role in society, identifying the opportunities and challenges associated with integrating LLMs into these agents is imperative. More importantly, we must comprehend how such interactions affect human users. Employing such agents in ongoing social interactions via LEXI will simulate systematic and comparable desired applications, producing replicable evidence on people’s perceptions and behaviours towards these agents. These findings are particularly critical now, as we stand at a juncture of creating artificial agents that further fulfil their potential to communicate freely in variety of social domains.

As an open-source tool for rigorous, controlled experimentation, LEXI could contribute to the development of ethical guidelines that can inform the design and deployment of artificial agents in sensitive applications [8, 41, 32]. Research conducted via LEXI will provide crucial evidence for the benefits and potential risks of using artificial agents in social settings, informing the development of ethical guidelines for their responsible use. As the field continues to evolve, the ethical frameworks established through empirical research could be instrumental in guiding the responsible integration of LLMs into societal contexts. LEXI’s open-source nature and GUI stands as an ethical stance in itself, promoting transparency and accessibility. It enables a broad spectrum of researchers from diverse research backgrounds to engage with advanced LLM technologies and contribute to our research field, which democratizes the ability to contribute to the field’s understanding of AI’s impact on society [2]. This approach aligns with ethical principles of inclusivity, fairness, and the responsible dissemination of AI technology [18, 51]. Furthermore, LEXI’s support for open science practices, through its open-source accessibility, directly contributes to ethical research by promoting transparency, inclusivity, and collaborative advancements in HAI research. Nevertheless, it is important to consider that researchers need to maintain crucial data privacy principles in mind. As an open-source tool deployed by researchers and connects to LLM APIs, not a product or an LLM itself, it is researchers responsibility for maintaining their participants’ data privacy, similar to when using other data collection tools or LLMs for deploying agents in social and behavioural experiments.

While LEXI currently offers key features for conducting complex experiments, further enhancements are planned to afford researchers greater methodological control. We aim to refine the interface to allow for additional experimental conditions, within-subjects treatments with variable agents during interactions, and extra phases to integrate questionnaires during interactions for repeated measures designs. The prompt engineering interface will be extended so researchers could incorporate information in a variety of ways, including Retrieval-Augmented Generation (RAG), and adding rules for dialogue management. Whilst researchers can already utilise LEXI for long-term experiments, we also plan to augment LEXI’s capacity for handling external data sources and improving memory for longitudinal designs. Kee** pace with advancements in LLMs, we intend to integrate additional open-source LLMs (e.g., Hugging Face API) and allow researchers to connect their custom models to LEXI, thus facilitating tests with human users, employing different prompts and comparisons with other LLMs. A federated learning framework is to be introduced to foster transparency and collaborative development, utilising diverse datasets. This approach, alongside tools for model sharing and community collaboration, aims to create a versatile and insightful research environment. Most importantly, sharing LEXI as an open-source tool allows it to evolve with diverse inputs. This community-driven development should ensure continuous refinement of LEXI’s features and design, tailored to the evolving needs of our diverse multidisciplinary research community.

LEXI’s introduction provides a promising step towards bridging the gap in HAI empirical research, offering an accessible and controlled environment for studying social interactions with conversational artificial agents. While its current capabilities and the potential for future enhancements position it as a valuable tool for interdisciplinary research, ongoing development and community feedback, involvement and contribution will be crucial in realizing its full potential and addressing the challenges of integrating LLMs into affective and social domains of HAI research.

Acknowledgments

G. Laban and H. Gunes are supported by the EPSRC project ARoEQ under grant ref. EP/R030782/1.

References

  • [1]
  • Allen et al. [2019] Bibb Allen, Sheela Agarwal, Jayashree Kalpathy-Cramer, and Keith Dreyer. 2019. Democratizing AI. Journal of the American College of Radiology 16, 7 (7 2019), 961–963. https://doi.org/10.1016/J.JACR.2019.04.023
  • Araujo [2018] Theo Araujo. 2018. Living up to the chatbot hype: The influence of anthropomorphic design cues and communicative agency framing on conversational agent and company perceptions. Computers in Human Behavior 85 (2018), 183–189. https://doi.org/10.1016/j.chb.2018.03.051
  • Araujo [2020] Theo Araujo. 2020. Conversational Agent Research Toolkit: An alternative for creating and managing chatbots for experimental research. Computational Communication Research 2, 1 (2020), 35–51. https://doi.org/10.5117/CCR2020.1.002.ARAU
  • Araujo and Bol [2024] Theo Araujo and Nadine Bol. 2024. From speaking like a person to being personal: The effects of personalized, regular interactions with conversational agents. Computers in Human Behavior: Artificial Humans 2, 1 (1 2024), 100030. https://doi.org/10.1016/J.CHBAH.2023.100030
  • Axelsson et al. [2024] Minja Axelsson, Micol Spitale, and Hatice Gunes. 2024. "Oh, Sorry, I Think I Interrupted You": Designing Repair Strategies for Robotic Longitudinal Well-being Coaching. Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction (3 2024), 13–22. https://doi.org/10.1145/3610977.3634948
  • Banerjee et al. [2023] Debarag Banerjee, Pooja Singh, Arjun Avadhanam, and Saksham Srivastava. 2023. Benchmarking LLM powered Chatbots: Methods and Metrics. (8 2023). https://arxiv.longhoe.net/abs/2308.04624v1
  • Bejarano et al. [2022] Gissella Bejarano, Fan Li, Nick Ruijs, and Yuan Lu. 2022. Ethics & AI: A Systematic Review on Ethical Concerns and Related Strategies for Designing with AI in Healthcare. AI 4, 1 (12 2022), 28–53. https://doi.org/10.3390/AI4010003
  • Bellet and Maloney [1991] Paul S. Bellet and Michael J. Maloney. 1991. The Importance of Empathy as an Interviewing Skill in Medicine. JAMA 266, 13 (10 1991), 1831–1832. https://doi.org/10.1001/JAMA.1991.03470130111039
  • Ben-Zion et al. [2024] Ziv Ben-Zion, Kristin Witte, Akshay Kumar Jagadish, Or Duek, Ilan Harpaz-Rotem, Marie-Christine Khorsandian, Achim Burrer, Erich Seifritz, Philipp Homan, Eric Schulz, and Tobias Raphael Spiller. 2024. “Chat-GPT on the Couch”: Assessing and Alleviating State Anxiety in Large Language Models. (5 2024). https://doi.org/10.31234/OSF.IO/J7FWB
  • Bendig et al. [2019] Eileen Bendig, Benjamin Erb, Lea Schulze-Thuesing, and Harald Baumeister. 2019. The Next Generation: Chatbots in Clinical Psychology and Psychotherapy to Foster Mental Health – A Sco** Review. Verhaltenstherapie (2019), 1–13. https://doi.org/10.1159/000501812
  • Birmingham et al. [2022] Christopher Birmingham, Ashley Perez, and Maja Mataric. 2022. Perceptions of Cognitive and Affective Empathetic Statements by Socially Assistive Robots. ACM/IEEE International Conference on Human-Robot Interaction 2022-March (2022), 323–331. https://doi.org/10.1109/HRI53351.2022.9889386
  • Brooke [1996] John Brooke. 1996. SUS: A ’Quick and Dirty’ Usability Scale. In Usability Evaluation In Industry (1 ed.). CRC Press, 207–212. https://doi.org/10.1201/9781498710411-35
  • Bustamante and Spain [2008] Ernesto A. Bustamante and Randall D. Spain. 2008. Measurement Invariance of the Nasa TLX. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 3 (9 2008), 1522–1526. https://doi.org/10.1177/154193120805201946
  • Chase [2022] Harrison Chase. 2022. LangChain. https://github.com/langchain-ai/langchain
  • ChatBoost [2024] ChatBoost. 2024. ChatBoost - Enterprise class chatbots. https://chatboost.chat/
  • Chatfuel [2024] Chatfuel. 2024. Chatfuel | AI agents for automated sales | Meta’s partner. https://chatfuel.com/
  • Chi et al. [2021] Nicole Chi, Emma Lurie, and Deirdre K. Mulligan. 2021. Reconfiguring Diversity and Inclusion for AI Ethics. AIES 2021 - Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (7 2021), 447–457. https://doi.org/10.1145/3461702.3462622
  • Coda-Forno et al. [2023] Julian Coda-Forno, Kristin Witte, Akshay K. Jagadish, Marcel Binz, Zeynep Akata, and Eric Schulz. 2023. Inducing anxiety in large language models increases exploration and bias. (4 2023). https://arxiv.longhoe.net/abs/2304.11111v1
  • Davis [1983] Mark H. Davis. 1983. Measuring individual differences in empathy: Evidence for a multidimensional approach. Journal of Personality and Social Psychology 44, 1 (1983), 113–126. https://doi.org/10.1037/0022-3514.44.1.113
  • Dong et al. [2023] Xin Luna Dong, Seungwhan Moon, Yifan Ethan Xu, Kshitiz Malik, and Zhou Yu. 2023. Towards Next-Generation Intelligent Assistants Leveraging LLM Techniques. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (8 2023), 5792–5793. https://doi.org/10.1145/3580305.3599572
  • Ferrara [2023] Alessio Ferrara. 2023. Empowering emotional well-being through a LLM-based chatbot : a comparative study with the standard journaling technique. Ph. D. Dissertation. Politcnico Milano, MIlan, Italy. https://www.politesi.polimi.it/handle/10589/209293https://hdl.handle.net/10589/209293
  • Følstad et al. [2021] Asbjørn Følstad, Theo Araujo, Effie Lai-Chong Law, Petter Bae Brandtzaeg, Symeon Papadopoulos, Lea Reis, Marcos Baez, Guy Laban, Patrick McAllister, Carolin Ischen, Rebecca Wald, Fabio Catania, Raphael Meyer von Wolff, Sebastian Hobert, and Ewa Luger. 2021. Future directions for chatbot research: an interdisciplinary research agenda. Computing 2021 103 (2021), 2915–2942. https://doi.org/10.1007/S00607-021-01016-7
  • Gelso and Kanninen [2017] Charles J. Gelso and Katri M. Kanninen. 2017. Neutrality revisited: On the value of being neutral within an empathic atmosphere. Journal of Psychotherapy Integration 27, 3 (9 2017), 330–341. https://doi.org/10.1037/INT0000072
  • Glickman and Sharot [2022] Moshe Glickman and Tali Sharot. 2022. How human-AI feedback loops alter human perceptual, emotional and social judgements. (11 2022). https://doi.org/10.31219/OSF.IO/C4E7R
  • Google Cloud [2024] Google Cloud. 2024. Dialogflow | Google Cloud. https://cloud.google.com/dialogflow?hl=en
  • Gray et al. [2007] Heather M Gray, Kurt Gray, and Daniel M Wegner. 2007. Dimensions of Mind Perception. Science 315, 5812 (2 2007), 619 LP – 619. https://doi.org/10.1126/science.1134475
  • Hall et al. [2021] Judith A. Hall, Rachel Schwartz, and Fred Duong. 2021. How do laypeople define empathy? The Journal of Social Psychology 161, 1 (1 2021), 5–24. https://doi.org/10.1080/00224545.2020.1796567
  • Hart [2006] Sandra G. Hart. 2006. Nasa-Task Load Index (NASA-TLX); 20 Years Later. Proceedings of the Human Factors and Ergonomics Society Annual Meeting (10 2006), 904–908. https://doi.org/10.1177/154193120605000909
  • Hoermann et al. [2017] Simon Hoermann, Kathryn L. McCabe, David N. Milne, and Rafael A. Calvo. 2017. Application of Synchronous Text-Based Dialogue Systems in Mental Health Interventions: Systematic Review. J Med Internet Res 19, 8 (8 2017), e7023. https://doi.org/10.2196/JMIR.7023
  • Jeong; [2023] Cheonsu Jeong;. 2023. Generative AI service implementation using LLM application architecture: based on RAG model and LangChain framework. Journal of Intelligence and Information Systems 29, 4 (2023), 129–164. https://doi.org/10.13088/JIIS.2023.29.4.129
  • Jobin et al. [2019] Anna Jobin, Marcello Ienca, and Effy Vayena. 2019. The global landscape of AI ethics guidelines. Nature Machine Intelligence 2019 1:9 1, 9 (9 2019), 389–399. https://doi.org/10.1038/s42256-019-0088-2
  • Kim et al. [2023] Taewan Kim, Seolyeong Bae, Hyun Ah Kim, Su-woo Lee, Hwajung Hong, Chanmo Yang, and Young-Ho Kim. 2023. MindfulDiary: Harnessing Large Language Model to Support Psychiatric Patients’ Journaling. 1 (10 2023). https://doi.org/XXXXXXX.XXXXXXX
  • Kunkel et al. [1995] Klaus Kunkel, Maria Bannert, and Peter W. Fach. 1995. The influence of design decisions on the usability of direct manipulation user interfaces. Behaviour & Information Technology 14, 2 (1995), 93–106. https://doi.org/10.1080/01449299508914629
  • Laban [2021] Guy Laban. 2021. Perceptions of Anthropomorphism in a Chatbot Dialogue: The Role of Animacy and Intelligence. In Proceedings of the 9th International Conference on Human-Agent Interaction. ACM, New York, NY, USA, 305–310. https://doi.org/10.1145/3472307.3484686
  • Laban and Araujo [2020a] Guy Laban and Theo Araujo. 2020a. The Effect of Personalization Techniques in Users’ Perceptions of Conversational Recommender System. In IVA ’20: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA ’20). ACM, Virtual Event, Glasgow, Scotland UK, 3. https://doi.org/10.1145/3383652.3423890
  • Laban and Araujo [2020b] Guy Laban and Theo Araujo. 2020b. Working Together with Conversational Agents: The Relationship of Perceived Cooperation with Service Performance Evaluations. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 11970. https://doi.org/10.1007/978-3-030-39540-7{_}15
  • Laban and Araujo [2022] Guy Laban and Theo Araujo. 2022. Don’t Take it Personally: Resistance to Individually Targeted Recommendations from Conversational Recommender Agents. In HAI 2022 - Proceedings of the 10th Conference on Human-Agent Interaction. Association for Computing Machinery, New York, NY, USA, 57–66. https://doi.org/10.1145/3527188.3561929
  • Laban et al. [2021] Guy Laban, Jean-Noël George, Val Morrison, and Emily S. Cross. 2021. Tell me more! Assessing interactions with social robots from speech. Paladyn, Journal of Behavioral Robotics 12, 1 (2021), 136–159. https://doi.org/10.1515/pjbr-2021-0011
  • Laban et al. [2023] Guy Laban, Arvid Kappas, Val Morrison, and Emily S. Cross. 2023. Building Long-Term Human–Robot Relationships: Examining Disclosure, Perception and Well-Being Across Time. International Journal of Social Robotics (11 2023), 1–27. https://doi.org/10.1007/S12369-023-01076-Z
  • Lee et al. [2022] Minha Lee, Jaisie Sin, Guy Laban, Matthias Kraus, Leigh Clark, Martin Porcheron, Benjamin R Cowan, Asbjørn Følstad, Cosmin Munteanu, and Heloisa Candello. 2022. Ethics of Conversational User Interfaces. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, 1–7. https://doi.org/10.1145/3491101.3503699
  • Leng and Yuan [2023] Yan Leng and Yuan Yuan. 2023. Do LLM Agents Exhibit Social Behavior? (12 2023). https://arxiv.longhoe.net/abs/2312.15198v2
  • Manychat [2024] Manychat. 2024. Chat Marketing Made Easy with Manychat. https://manychat.com/
  • Nahum et al. [2017] Mor Nahum, Thomas M Van Vleet, Vikaas S Sohal, Julie J Mirzabekov, Vikram R Rao, Deanna L Wallace, Morgan B Lee, Heather Dawes, Alit Stark-Inbar, Joshua Thomas Jordan, Bruno Biagianti, Michael Merzenich, and Edward F Chang. 2017. Immediate Mood Scaler: Tracking Symptoms of Depression and Anxiety Using a Novel Mobile Mood Scale. JMIR Mhealth Uhealth 5, 4 (2017), e44. https://doi.org/10.2196/mhealth.6544
  • OfficelyAI [2024] OfficelyAI. 2024. Officely AI - Turn Your Data Into an AI Agents. https://www.officely.ai/
  • OneAI [2024] OneAI. 2024. Drive Growth with GPT Agent on Your Website | OneAI. https://oneai.com/
  • Open AI [2024] Open AI. 2024. Introducing GPTs. https://openai.com/blog/introducing-gpts
  • OpenAI API [2024] OpenAI API. 2024. Playground - OpenAI API. https://platform.openai.com/playground
  • Paiva et al. [2017] Ana Paiva, Iolanda Leite, Hana Boukricha, and Ipke Wachsmuth. 2017. Empathy in virtual agents and robots: A survey. ACM Transactions on Interactive Intelligent Systems 7, 3 (9 2017). https://doi.org/10.1145/2912150
  • Park and Hwan Lim [1999] Kyung S. Park and Chee Hwan Lim. 1999. A structured methodology for comparative evaluation of user interface designs using usability criteria and measures. International Journal of Industrial Ergonomics 23, 5-6 (3 1999), 379–389. https://doi.org/10.1016/S0169-8141(97)00059-0
  • Roche et al. [2022] Cathy Roche, P. J. Wall, and Dave Lewis. 2022. Ethics and diversity in artificial intelligence policies, strategies and initiatives. AI and Ethics 2022 3:4 3, 4 (10 2022), 1095–1115. https://doi.org/10.1007/S43681-022-00218-9
  • Saravia [2022] Elvis Saravia. 2022. Prompt Engineering Guide. Technical Report. https://github.com/dair-ai/Prompt-Engineering-Guide
  • Sathish et al. [2024] Vishwas Sathish, Hannah Lin, Aditya K Kamath, and Anish Nyayachavadi. 2024. LLeMpower: Understanding Disparities in the Control and Access of Large Language Models. arXiv preprint arXiv:2404.09356 (2024).
  • Sayis and Gunes [2024] Batuhan Sayis and Hatice Gunes. 2024. Technology-assisted Journal Writing for Improving Student Mental Wellbeing: Humanoid Robot vs. Voice Assistant. ACM/IEEE International Conference on Human-Robot Interaction (3 2024), 945–949. https://doi.org/10.1145/3610978.3640721
  • Sin et al. [2023] Jaisie Sin, Heloisa Candello, Leigh Clark, Benjamin R. Cowan, Minha Lee, Cosmin Munteanu, Martin Porcheron, Sarah Theres Völkel, Stacy Branham, Robin N. Brewer, Ana Paula Chaves, Razan Jaber, and Amanda Lazar. 2023. CUI@CHI: Inclusive Design of CUIs Across Modalities and Mobilities. Conference on Human Factors in Computing Systems - Proceedings (4 2023). https://doi.org/10.1145/3544549.3573820
  • Spitale et al. [2023] Micol Spitale, Minja Axelsson, and Hatice Gunes. 2023. VITA: A Multi-modal LLM-based System for Longitudinal, Autonomous, and Adaptive Robotic Mental Well-being Coaching. (12 2023). http://arxiv.longhoe.net/abs/2312.09740
  • Vaidyam et al. [2019] Aditya Nrusimha Vaidyam, Hannah Wisniewski, John David Halamka, Matcheri S. Kashavan, and John Blake Torous. 2019. Chatbots and Conversational Agents in Mental Health: A Review of the Psychiatric Landscape. Canadian journal of psychiatry. Revue canadienne de psychiatrie 64, 7 (7 2019), 456–464. https://doi.org/10.1177/0706743719828977
  • Warren-Smith et al. [2023] Gabriella Warren-Smith, Guy Laban, Emily-Marie Pacheco, and Emily S. Cross. 2023. Knowledge cues to human origins facilitate self-disclosure during interactions with chatbots. PsyArxiv (12 2023). https://doi.org/10.31234/OSF.IO/7QSDN
  • Xi et al. [2023] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie **, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuan**g Huang, and Tao Gui. 2023. The Rise and Potential of Large Language Model Based Agents: A Survey. (9 2023). https://arxiv.longhoe.net/abs/2309.07864v3
  • Yalcin [2019] Ozge Nilay Yalcin. 2019. Evaluating Empathy in Artificial Agents. 2019 8th International Conference on Affective Computing and Intelligent Interaction, ACII 2019 (9 2019), 290–296. https://doi.org/10.1109/ACII.2019.8925498
  • Zarouali et al. [2024] Brahim Zarouali, Theo Araujo, Jakob Ohme, and Claes de Vreese. 2024. Comparing Chatbots and Online Surveys for (Longitudinal) Data Collection: An Investigation of Response Characteristics, Data Quality, and User Evaluation. Communication Methods and Measures (1 2024). https://doi.org/10.1080/19312458.2022.2156489
  • Zhao et al. [2023] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, **hao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. (3 2023). https://arxiv.longhoe.net/abs/2303.18223v13
  • Zheng et al. [2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. (6 2023). https://arxiv.longhoe.net/abs/2306.05685v4