MentalAgora: A Gateway to Advanced Personalized Care in Mental Health through Multi-Agent Debating and Attribute Control

Yeonji Lee1           Sangjun Park1,2           Kyunghyun Cho3superscript3{}^{3^{*}}start_FLOATSUPERSCRIPT 3 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPT           **Yeong Bak1superscript1{}^{1^{*}}start_FLOATSUPERSCRIPT 1 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPT
1Sungkyunkwan University, Suwon, South Korea
2Upstage AI, South Korea
3New York University, NY, USA
[email protected], [email protected],
[email protected], [email protected]
Abstract

As mental health issues globally escalate, there is a tremendous need for advanced digital support systems. We introduce MentalAgora, a novel framework employing large language models enhanced by interaction between multiple agents for tailored mental health support. This framework operates through three stages: strategic debating, tailored counselor creation, and response generation, enabling the dynamic customization of responses based on individual user preferences and therapeutic needs. We conduct experiments utilizing a high-quality evaluation dataset TherapyTalk crafted with mental health professionals, showing that MentalAgora generates expert-aligned and user preference-enhanced responses. Our evaluations, including experiments and user studies, demonstrate that MentalAgora aligns with professional standards and effectively meets user preferences, setting a new benchmark for digital mental health interventions.111Code and Dataset: https://github.com/jennylee03/MentalAgora

MentalAgora: A Gateway to Advanced Personalized Care in Mental Health through Multi-Agent Debating and Attribute Control


Yeonji Lee1           Sangjun Park1,2           Kyunghyun Cho3superscript3{}^{3^{*}}start_FLOATSUPERSCRIPT 3 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPT           **Yeong Bak1superscript1{}^{1^{*}}start_FLOATSUPERSCRIPT 1 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPT 1Sungkyunkwan University, Suwon, South Korea 2Upstage AI, South Korea 3New York University, NY, USA [email protected], [email protected], [email protected], [email protected]


**footnotetext: Corresponding authors

1 Introduction

With 1 in every 8 people confirming they have experienced mental disorders (WHO, 2022), estimates suggest that nearly half of the global population will encounter mental health challenges at some point (McGrath et al., 2023). This trend underscores the need for effective and personalized support systems tailored to meet the needs of individuals (Lu et al., 2023; Chen et al., 2023a). Moreover, the rise of digital platforms has transformed communication, increasing the demand for online support and engagement. This shift highlights the importance of accessible and effective digital interventions.

Due to excellent communication skills, there are ongoing efforts to apply Large Language Models (LLMs) in the fields of sociology, psychology, and psychiatry (Sharma et al., 2023; Chiu et al., 2024; Li et al., 2023). Also, to leverage the benefits of collective intelligence, research into imitating human interactions is actively underway (Zhang et al., 2024; Chan et al., 2023) using multi-agent debating and simulation systems.

In the mental health domain, it is an ongoing topic to provide emotionally supportive responses for people having mental concerns (Hua et al., 2024). Recent advancements have improved their ability to generate empathetic and clinically relevant responses (Wei et al., 2021; Loh and Raamkumar, 2023; Shen et al., 2020). Despite these improvements, a single LLM still struggles to comprehensively understand users’ complex situations, limiting the depth and specificity of responses (Guo et al., 2024).

Addressing this gap, we propose a new framework, MentalAgora, adopting a multi-agent method in the mental health domain and supporting response generation. This framework leverages the concept of multi-agent interaction Chen et al. (2023b); Park et al. (2023), where each agent models different aspects, providing a richer interpretation. MentalAgora applies various counseling strategies as attributes of LLM agents, involving them in debating to generate the most suitable persona addressing user concerns. This persona prompt enables LLMs to provide the most appropriate responses for the user. To verify the effectiveness of MentalAgora, we create the TherapyTalk dataset. The automatic and human evaluation results show that MentalAgora outperforms other methods, achieving the highest alignment with expert responses and satisfying user preferences.

Furthermore, we collect actual user concerns to verify whether MentalAgora can provide assistance to real users, offering responses to their issues. As a result, MentalAgora demonstrates greater user satisfaction compared to all other baselines, highlighting its potential in digital mental health interventions.

Our contributions in this paper are manifold:

  • We develop a novel framework, MentalAgora, to facilitate dynamic and nuanced interactions via multi-agent debating, delivering mental health support based on user preferences.

  • We enhance attribute controllable generation to dynamically customize responses by allowing for a multifaceted counselor persona creation tailored to a wide range of user preferences.

  • We curate a high-quality dataset, TherapyTalk, in collaboration with mental health professionals to ensure responses are grounded in expert knowledge.

  • Our work showcases the potential utility of this approach in transforming mental health support, validated through comprehensive experiments and user study.

Refer to caption
Figure 1: MentalAgora framework overview. This diagram outlines the MentalAgora framework, showcasing its three stages: Strategic Debating, Tailored Counselor Creation, and Response Generation, which collectively enhance the creation of personalized therapeutic responses based on user-specific needs.

2 Related Work

In mental health support response generation, most research has been conducted with single-agent LLMs, which often yield general and static responses. Multi-agent methods can overcome these limitations, generating more dynamic and nuanced responses. However, their application in the mental health domain has been limited despite success in evaluation or reasoning. This is notable, as the intricate nature of mental health issues could benefit from the diverse perspectives and responses offered by multi-agent systems.

2.1 Mental Health Support Response Generation

The majority of these studies uses Transformers and LLMs to discern the user’s intent and emotions within the conversation. Following Rashkin et al. (2019)’s introduction of the empathetic dataset and task, subsequent studies have focused on generating and rewriting empathetic responses (Sharma et al., 2021; Mishra et al., 2023; Shen et al., 2020). These studies aim to enhance the detection of emotions and sentiments in user utterances (Zhan et al., 2022) and strive to attain greater empathy. Building on counselor reflection as a key verbal skill, Min et al. (2022) developed a system for analyzing these reflections, while Wang et al. (2023) leveraged user profiles and comment histories to generate personalized responses, using generative models conditioned on the questions and respondents’ data.

Sharma et al. (2023) introduced a cognitive therapeutic technique aimed at countering negative thoughts by replacing them with more optimistic reframed thoughts highlighting the capability of LLMs in the psychiatry field. Additionally, Liu et al. (2023) discusses the development of chatbots designed to enhance counseling interactions in terms of psychological support. These studies have laid the groundwork for support response generation, setting the stage for our research that incorporates multi-agent systems in this domain.

2.2 Controlled Text Generation

In the domain of controlled text generation, Zou et al. (2021) proposed “inverse prompting” to enhance control over text generation in poem generation and question answering. Han et al. (2022) focused on mimicking fictional characters through pseudo dialog prompting in capturing character styles. Similarly, Madotto et al. (2021) addressed continual learning in task-oriented dialogue systems with a benchmark featuring 37 domains and an efficient architectural method. Yang et al. (2023) presents a parameter-efficient approach for attribute-controlled generation, achieving improvements in attribute-specific text generation with minimal additional training parameters.

2.3 Multi-Agent Systems

Our approach draws inspiration from relevant literature in the fields of multi-agent debating and simulation framework. Chen et al. (2023b) proposes a framework that demonstrates how multi-agent systems refine chatbot dynamics, promoting adaptive responses. Shao et al. (2023) reveals the capability of fine-tuning models to emulate distinct personas, enriching the diversity of character simulations. Additionally, Chan et al. (2023) and Du et al. (2023) highlight the role of these systems in improving evaluative and reasoning capacities of models through structured debates. Recent findings suggest that the performance of LLMs can be significantly enhanced by increasing the number of agents Li et al. (2024). Through a simple sampling-and-voting method, it has been demonstrated that the efficacy of LLMs scales with the number of instantiated agents, an approach that is orthogonal to existing methods and suggests promising avenues for future research. In contrast to the common applications of multi-agent systems in evaluation and reasoning tasks, our research extends their utility to the mental health domain, addressing the intricate nature of psychological support with a richer, more context-sensitive dialogue framework.

3 MentalAgora Framework

To enhance the capability of LLMs to generate tailored support responses, we propose MentalAgora, a framework simulating debating sessions with LLMs imitating strategic counselors. Since the structure of MentalAgora can consider diverse perspectives, it provides more contextually relevant and personalized support responses. Our framework operates through three stages: strategic debating, tailored counselor creation, and support response generation.

3.1 Strategic Agents Setup

We select counseling strategies (Fall et al., 2022; Ellis, 1989) and compose attributes based on the strategies to ensure that each response is not only individually dynamic but also therapeutically appropriate. There are three attributes used: reframing, which involves altering perspectives for a more positive outlook; unconditional positive regard, emphasizing total acceptance and support of the user; and solution-focused approaches, which concentrate on the user’s strengths and goals (Zhou et al., 2023; Biyani et al., 2014). Each of these attributes is chosen for its proven efficacy in traditional counseling settings (Fall et al., 2022; Ellis, 1989). Appendix A provides detailed description for the attributes.

Building on these attributes, our counselor agents function based on their own attributes within a multi-agent role setup, utilizing prompting to incorporate core counseling strategies to analyze and address the concerns and needs of users.

3.2 Stage 1: Strategic Debating

In the strategic debating stage (Algorithm 1), an LLM functions as an independent agent representing each attribute through unique role settings reflecting the specific attribute. Each agent is instructed to evaluate the user’s status and concerns from its own perspective. The debate is also structured to determine the relative importance of each attribute in addressing the user’s concerns. Each agent ajsubscript𝑎𝑗a_{j}italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT generates a response based on the user’s post P𝑃Pitalic_P and the debating history D𝐷Ditalic_D. This process is repeated by the number of debating turns N𝑁Nitalic_N. Through this process, the debating history regarding the user post is obtained.

Algorithm 1 Strategic Debating Stage
0:  User post P𝑃Pitalic_P, Number of debating turns N𝑁Nitalic_N, Counselor agents A={a1,a2,a3}𝐴subscript𝑎1subscript𝑎2subscript𝑎3A=\{a_{1},a_{2},a_{3}\}italic_A = { italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT }
0:  Debating dialogue history D𝐷Ditalic_D
1:  D=[]𝐷D=[]italic_D = [ ] {Initialize dialogue history list}
2:  for i𝑖iitalic_i from 1 to N𝑁Nitalic_N do
3:     for each agent ajsubscript𝑎𝑗a_{j}italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in A𝐴Aitalic_A do
4:        Generate response Ri,ajsubscript𝑅𝑖subscript𝑎𝑗R_{i,a_{j}}italic_R start_POSTSUBSCRIPT italic_i , italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT for P𝑃Pitalic_P with ajsubscript𝑎𝑗a_{j}italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT referring to D𝐷Ditalic_D
5:        Add response Ri,ajsubscript𝑅𝑖subscript𝑎𝑗R_{i,a_{j}}italic_R start_POSTSUBSCRIPT italic_i , italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT to dialogue D𝐷Ditalic_D
6:     end for
7:  end for
8:  return  D𝐷Ditalic_D

3.3 Stage 2: Tailored Counselor Creation

Refer to caption
Figure 2: Prompt example for tailored counselor creation. The green texts are placeholders for actual data.

This stage involves creating a prompt for a tailored counselor to generate actual user responses. The LLM is instructed to reference the user’s posts and debating history to determine the degree of influence of each agent and to generate the counselor prompt. The degree of influence is expressed as a score between 1 and 3. Figure 2 manifests our input prompt for this stage.

3.4 Stage 3: Support Response Generation

Refer to caption
Figure 3: Prompt example for response generation. The green texts are placeholders for actual data.

In the final step, we instruct the LLM to write a response that provides the most appropriate support for the user’s post, with the prompt shown in Figure 3. At this time, the LLM is given the previously created counselor prompt as a persona. Since the prompt includes the influence scores for each attribute, the LLM can numerically reflect the importance of each attribute and also utilize the extensive user information embedded in debating history.

4 Datasets

It is necessary for user posts to contain their personal mental concerns along with their individual characteristic information. Additionally, for evaluating responses generated by the model, golden responses are required. These responses need to be tailored to each post, provide emotional empathy, and be clinically appropriate. First, we used the Counsel Chat dataset (Bertagnolli, 2020), which consists of user concerns and responses from mental health experts on an online mental health platform.

We also build our own dataset TherapyTalk. TherapyTalk consists of subreddit posts related to mental health concerns from Reddit and responses to these posts by mental health experts. Unlike Counsel Chat, TherapyTalk is additionally labeled by annotators how much each response includes the three attributes. These attribute levels are used in our analysis to demonstrate similarity between our approach and expert responses. We guide the experts to write consistent responses with guidelines. Appendix B describes the dataset construction process in detail.

TherapyTalk Counsel Chat Dataset
Model Method BLEU R-L BScore GM HM BLEU R-L BScore GM HM
GPT-4.0 SA 24.52 15.86 94.79 33.28 26.23 18.35 14.43 94.16 32.58 23.69
SAA 24.23 16.51 94.71 33.59 26.69 18.05 14.47 94.20 32.45 23.62
MAA 25.27 16.94 94.70 34.35 27.48 19.48 15.36 95.32 34.45 25.42
MentalAgora 28.59 16.50 95.31 35.56 28.28 19.53 15.40 95.14 34.45 25.45
GPT-3.5-turbo SA 21.18 15.09 94.83 31.18 24.19 15.99 14.08 94.04 29.18 20.63
SAA 21.38 15.34 94.45 31.41 24.48 15.04 13.77 94.14 28.55 19.83
MAA 22.95 15.25 94.77 32.13 25.06 18.83 14.30 94.19 31.65 22.78
MentalAgora 26.50 15.73 94.80 34.06 26.82 19.14 14.87 94.22 32.00 23.10
LLAMA-2-13b SA 21.11 15.36 93.25 31.15 24.35 10.98 13.95 93.99 28.28 19.50
SAA 10.52 15.01 94.08 24.58 17.41 08.76 12.80 94.06 26.22 15.91
MAA 18.70 15.52 94.59 30.17 23.35 16.35 14.27 94.15 31.17 21.92
MentalAgora 26.49 15.65 94.61 33.98 26.73 17.74 14.29 94.21 32.00 23.00
MentalAlpaca SA 18.42 15.33 94.96 29.93 23.07 10.74 14.36 93.22 27.89 19.29
SAA 08.49 13.65 95.12 22.26 14.88 09.28 12.52 93.29 25.76 16.88
MAA 18.70 15.52 94.59 30.17 23.35 12.36 13.43 94.18 29.58 20.08
MentalAgora 20.65 15.90 94.97 31.48 24.62 18.89 14.47 94.44 32.54 23.54
Table 1: Comparison of various LLM configurations across four model settings on two counseling datasets using BLEU, ROUGE-L (R-L), and BERTScore (BScore), which measure the quality of the generated responses compared to the ground truth. GM and HM represent geometric mean and harmonic mean, respectively, of the evaluation metrics. MentalAgora outperforms the other configurations for almost all metrics.
Method Cus. Sat. Pro. Rel. Und. Rank
SA 3.72 3.74 3.51 3.41 3.38 2.68
SAA 3.90 3.76 3.46 3.71 3.33 2.74
MAA 3.92 3.96 3.87 3.68 3.65 2.49
MentalAgora 4.03 3.99 3.99 3.82 3.81 1.80
Table 2: Human evaluation results on Customization (Cus.), Satisfaction (Sat.), Professionalism (Pro.), Relevance (Rel.), and Understanding (Und.), scored on a scale of 1 to 5, along with their overall preference ranks. All differences between our framework and the baselines are statistically significant with a p𝑝pitalic_p-value less than 0.05, except for those indicated with \dagger.

5 Experiments

We conduct two experiments to evaluate MentalAgora, focusing on its effectiveness and controllability. We measure the framework’s capacity to tailor responses that meet professional mental health care criteria and align with individual user preferences. Also, we examine our framework’s controllability that indicating how well it reflects each attribute score provided as input into responses.

Models

To investigate the robustness and stability of MentalAgora, we extend our research to include tests across various LLMs. The models are included GPT-3.5-turbo (Ouyang et al., 2022), GPT-4.0 (OpenAI, 2024), LLaMA-2-13B (Touvron et al., 2023) and Mentalalpaca (Xu et al., 2024). Each of these models represents different architectures and capacities, providing a comprehensive basis for evaluating the framework’s performance.

5.1 Tailored Support Response Generation

The investigation aims to validate the efficacy of our framework in providing mental health support grounded in expert principles and tailored to individual user needs. The baseline settings used for the comparison are as follows.

  • Single Agent (SA) generates responses without any specific attributes or iterative debating, serving as our baseline for fundamental response generation capabilities.

  • Single Agent Attributes (SAA) enhances the single agent setup by incorporating attributes into the response generation, enabling us to compare the adaptability of a single agent in terms of ensuring customized response.

  • Multi Agents Attributes (MAA) employs a multi-agent system but excludes the iterative debating component. This setup aims to assess the impact of debating history in our framework.

Automatic Evaluation

Table 1 shows the automatic evaluation results. We utilize two benchmark datasets: TherapyTalk and a counseling utterance generation dataset (Bertagnolli, 2020). We evaluated the expert alignment of the methods by comparing the annotated responses of experts in the dataset across various metrics. Evaluations using metrics such as Unigram BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004), BERTScore Zhang* et al. (2020), Geometric mean (GM), and Harmonic mean (HM) show that MentalAgora achieves superior performance across the board, enhancing the alignment and relevance of responses.

Regarding the impact of configurations, SAA improves the performance compared to SA setup. Also, MentalAgora leads to better performance than MAA, supporting the value of interactive debate elements in response quality. LLaMA-2-13B and MentalAlpaca show variability in performance, with notable declines in SAA configuration. This may indicate difficulties in attribute integration by a single agent.

Human Evaluation

We also evaluate the generated responses by three human judgments. This evaluation involves a specialized form with guidelines and criteria to assess three baselines and MentalAgora responses. Evaluators score the responses on 5 likert scale based on following criteria detailed in Appendix C. For the ranking task, evaluators are asked to order the responses from most to least effective based on their overall attractiveness and effectiveness assessment, identifying which response is most directly benefit and best-satisfied user preferences.

Table 2 indicates the results of human evaluation. MentalAgora excels in customizing responses across various dimensions. It scores the highest in all the criteria, confirming MentalAgora’s ability to adapt responses to individual user needs and maintain a high level of contextually appropriate engagement considering users’ states. The significant lead in the overall ranking highlights the framework’s effectiveness and an advantage over the single agent setups or non-debating setups. The inter-annotator reliability, assessed using Fleiss’ kappa (Fleiss, 1971), is 0.558 overall.

LLM as a judge evaluation

In addition, we employ an LLM as a judge approach (Zheng et al., 2023). This method involves using GPT-4.0-turbo to assess the generated responses using a 5-point Likert scale, enabling us to cross-validate the human judgments with automatic evaluations. Table 5 shows the results that MentalAgora outperforms other methods in all aspects.

5.2 Attribute Controllable Generation

In Section 5.1, we verify our framework’s effectiveness in aligning with expert responses. Building on this, we conduct two further investigations. The first assesses whether our framework maintains this expert alignment even in terms of attribute scoring, thereby supporting its robustness in expert alignment. The second experiment, a control test, checks if these attribute scores are correctly represented in the responses. Through these two steps, we verify precise attribute classification and effective attribute exploitation skills, respectively. Unlike Section 5.1, configuration SA is omitted because it does not employ attributes.

Automatic Evaluation

We utilize MentalRoBERTa Ji et al. (2022a) fine-tuned on TherapyTalk dataset to predict attribute score of responses. To assess the controllability, we estimate scores of responses from each method and human experts, enabling to compare attribute levels in responses.

Refer to caption
Figure 4: Automatic evaluation results of overall attribute control using Mean Absolute Error (MAE) which demonstrates the differences between scores from expert responses in TherapyTalk and generated responses across various LLMs and methods. MentalAgora outperforms all other configurations.
Refer to caption
Figure 5: Automatic evaluation results of overall attribute control across LLMs and methods using MAE which measures the difference of attribute scores between generated responses and given input scores. MentalAgora outperforms the other configurations compared.

In Figure 4, the attribute levels of the methods are compared to ones from human experts by calculating MAE. MentalAgora shows the best performance, indicating its superior attribute integration capabilities. This suggests that MentalAgora is effective in matching the attribute scores of expert responses, reflecting a high level of precision in response generation.

Figure 5 examines models’ internal controllability of how much they can provide responses matching the input attribute scores in tailored counselor prompts. MAE is difference between model input attribute scores and predicted scores of their output response. The majority of methods exhibit higher MAE values, highlighting challenges in attribute control. MentalAgora stands out in this evaluation, which shows its effectiveness in reflecting the input scores into the generated responses.

6 Analysis

We conduct an ablation study focusing on how the inclusion and exclusion of specific attribute agents affect the generated responses in terms of controllability and interpretability. For a more in-depth exploration of the experimental results, Table 7 provides a comparative analysis influence of the inclusion or removal of specific attributes on responses.

6.1 Ablation Experimental Setup

These are additional experimental setups implemented to further assess the controllability and interpretability of attribute integration.

  • Removal Attributes Setup removes one attribute at a time for all possible combinations of two attributes to evaluate how the absence of each influences overall response performance and expert alignment.

  • Uniform Attribute Setup configures all agents to utilize the same attribute, allowing us to analyze how dominant the selected attribute becomes in the response generation process.

By adjusting the attributes in these specified setups, we demonstrate the framework’s ability to control and reflect these attributes in its outputs.

Method BLEU R-L BScore
MentalAgora 28.59 16.50 95.31
- Reframing 26.46 15.09 94.73
- Solution 26.62 15.28 94.74
- Regard 25.91 14.57 94.71
Reframing only 25.78 14.76 94.68
Solution only 25.30 14.37 94.72
Regard only 26.50 15.19 94.72
Table 3: Ablation study to evaluate the impact of each attribute by removing one and observing changes in task performance and alignment with desired outcomes. MentalAgora, which combines three attributes, outperforms other methods.
Method Reframing Solution Regard Total Diff.
MentalAgora -0.01 -0.03 +0.02 0.06
- Reframing -0.17 +0.20 -0.02 0.39
- Solution +0.10 -0.33 +0.09 0.52
- Regard +0.29 +0.26 -0.51 1.06
Reframing only +0.63 -0.14 +0.05 0.69
Solution only -0.31 +0.40 -0.03 0.74
Regard only -0.08 +0.10 +0.10 0.28
Table 4: Comparison of methods with human experts’ scores. Values represent the difference from human scores. MentalAgora shows higher alignment with human scores than the other methods.

6.2 Results and Analysis

Expert-alignment Performance

The performance evaluation in Table 3 employs the same metrics and dataset as outlined in Table 1. For overall performance, utilizing all attributes in MentalAgora configuration yields the highest performance metrics, indicating optimal framework operation. In contrast, selective removal of attributes decreases performance metrics, suggesting that balanced attribute configurations are crucial for maintaining performance.

Attribute Controllability

Table 4 utilized expert response attribute scores from TherapyTalk dataset. The deltas calculated represent the mean of score differences between expert responses and those generated by the framework with ablation settings, with the total difference computed as the absolute sum of these deltas. We observe that removing a specific attribute results in a divergence from human alignment and incorporating these attributes helps in aligning closer to expert-like responses. Also, focusing on a single attribute fails to align with human responses, resulting in lower overall scores.

Additionally, we observe in our experiment that when only one attribute is used, the corresponding score increases, and when a specific attribute is excluded, the score decreases. In either case, these changes lead to a divergence from human alignment. This observation suggests that the framework’s effective integration of various counseling strategies, represented as attributes in response generation, aligns well with expert views, supporting the notion that a diverse mix of counseling strategies enhances the comprehensiveness of handling multiple aspects of a user’s condition Corey (2016); Norcross and Goldfried (2005).

7 User Study

We verify the effectiveness of our responses through the datasets in various ways, but we cannot determine if the users who wrote the posts are really satisfied more than other baselines. We conduct a user study to assess if our method is effective in addressing the concerns of real people with actual problems. We engage 18 participants in this study. Each participant is asked to provide a personal concern in a detailed format. Plus, we ask users to select which of the three attributes they most expected to be included in the response.

Refer to caption
Figure 6: User study results show the preference rankings for different configurations: MentalAgora, User Attribute, Other 2 attributes, and Random Attribute. The x-axis represents the rankings (1 to 4) while the y-axis shows the percentage of participants who ranked the method at that position. Users prefer MentalAgora over the other baselines.

7.1 User Study Setup

Participants receive four types of responses to their concerns, each generated by differently configured versions of MentalAgora.

  • User Attribute generates responses based on a single attribute chosen by the user to observe user preference for responses aligned with their direct choices.

  • Other 2 Attributes use two attributes not selected by the user to see the preference changes when the user-selected attribute is excluded.

  • Random Attribute employs a randomly selected attribute except the attribute chosen by the user to assess the impact of responses that diverge from the user’s preferences.

7.2 Results and Analysis

Figure 6 demonstrates the user study results highlighting a clear preference trend towards MentalAgora, which ranked as the most favored across all choices.

User Preference

The preference rankings favor MentalAgora, which consistently secures the top position. This reflects a high level of user preference for responses that integrate all attributes, suggesting that The response from MentalAgora is not only well aligned with experts but also consistent with the user’s preferences. The results illustrate the impact of attribute configuration on user preferences. While MentalAgora is most preferred, responses based on user attribute setup show moderate preferences. Random and other 2 attributes are notably less preferred, emphasizing the importance of employing user-relevant attributes in achieving user satisfaction.

Variability in Preference

We note that incorporating user attribute configurations results in a reduction in variance among the participant preferences. This finding indicates that knowing user preferences allows responses that align closely with user expectations and exhibit more consistency in satisfying diverse user needs. However, while MentalAgora tends to rank highly in terms of user preference, it is also associated with a slightly higher variance. This suggests that MentalAgora frequently delivers highly preferred responses, but the degree of satisfaction can vary among different users. This variability highlights that MentalAgora, effectively meets the preferences of many users without direct user preference input  Gilbert and Orlans (2011).

8 Conclusion and Future Work

Our framework generates expert-aligned and user preference-enhanced responses through a multi-agent system. Our experiments confirm the framework’s capability to integrate complex user needs and maintain robust performance across different LLMs, showcasing its potential for real-world applications. Moving forward, we aim to enhance MentalAgora by incorporating a wider range of attributes to increase diversity. Additionally, we plan to validate the framework’s effectiveness in therapy by deploying it in actual clinical environments and assessing its impact on therapeutic outcomes.

Limitations

Our study currently utilizes three distinct attributes based on well-known counseling methodologies. However, the use of counseling techniques is highly variable, dependent on individual preferences and specific issues Tompkins et al. (2013); Swift et al. (2018). This diversity necessitates the expansion of our framework to include a wider range of attributes to better cater to the varied needs of users. Additionally, we conduct our experiments with extensive text inputs that represent complex personal concerns, providing comprehensive answers that cover various aspects of the issues presented. This method effectively allows for a deep analysis of response capabilities within single posts. However, it lacks the capability to capture the dynamics typical of multi-turn interactions between a user and a virtual counselor. Future research should look to include interactive settings that facilitate real-time exchanges to enhance the realism and effectiveness of our model. Our dataset, which is composed of sets of three posts per virtual user, is designed to support such expansions, enabling more detailed studies on interactive counseling scenarios.

Ethics Statement

Data Use and Confidentiality

Our research creates and uses the TherapyTalk dataset, which comprises anonymized posts containing personal information. We strictly prohibit using this data to identify individuals or for any purpose outside the intended scope of research. All examples of user posts used in this paper are rephrased. Compliance with licensing agreements governing the use of this dataset is mandatory.

IRB Approval

The primary objective of our research is to mitigate emotional stress and offer potential solutions for those with mental health concerns. It is crucial to understand that our methods do not replace professional medical diagnosis or treatment. The research includes a user study that has been reviewed and approved by the Institutional Review Board (IRB), ensuring that all procedures adhere to the ethical standards required for human subject research.222IRB approval number: SKKU 2024-05-027

Ethical Use

This study employs techniques to direct the responses of LLMs by adjusting attributes. These methods are designed to improve the relevance and supportiveness of responses within therapeutic contexts. It is critical to state that using these techniques to negatively affect individuals’ mental health is strictly prohibited.

References

Appendix A Counseling Attributes

We have choose three attributes to differentiate each LLM agent: reframing, unconditional positive regard, and solution-focused. These three attributes are derived from the representative characteristics of cognitive therapy (Beck, 1976), person-centered therapy (Raskin and Rogers, 2005), and solution-focused therapy (Bannink, 2007), respectively.

In the context of cognitive therapy, cognitive reframing is referred to as cognitive restructuring. cognitive restructuring involves challenging and changing negative thoughts. It helps individuals recognize and avoid cognitive distortions—such as overgeneralization or magnifying negative behaviors—and replace them with more constructive thinking. Through this process, depressed clients can access rational thinking that is typically blocked by their cognitive distortions. In our experiment, reframing is set as an attribute to turn negative thoughts positive.

Person-centered therapy, also known as client-centered or Rogerian therapy, is based on the humanistic perspective, emphasizing the inherent worth of individuals and their potential for personal growth. Unconditional positive regard is a core principle and one of the three foundational conditions for therapeutic change, along with empathy and congruence in person-centered therapy. Unconditional positive regard refers to the therapist’s acceptance and support of a client regardless of what the client says or does. This acceptance is not contingent on the client’s behavior or characteristics, meaning it is "unconditional." In our experiment, unconditional positive regard attribute functions to provide respect, recognition, and empathy towards users.

Solution-focused brief therapy is a short-term, goal-oriented therapeutic approach that focuses on hel** clients find solutions to their problems rather than dwelling on the problems themselves. In our experiment, the solution-focused attribute is used to provide the best method for users to quickly solve problems.

Appendix B TherapyTalk Dataset Creation

We design a new support response dataset TherapyTalk annotated by human mental health experts. We initially utilized Reddit Mental Health Dataset (Low et al., 2020) comprised of user posts from subreddits related to psychological concerns. This dataset was then processed using a similarity-based methodology to structure multiple posts into cohesive units. Subsequently, mental health professionals, drawing upon their extensive background and domain knowledge, manually crafted motivational responses. This intricate process was applied to a total of 97 user posts.

  1. 1.

    Post Collecting Reddit Mental Health Dataset (Low et al., 2020) contains a wealth of posts related to various concerns. We sample some of these posts to use as the basis for our dataset.

  2. 2.

    Post Filtering For our experiment, posts should contain sufficient lengths of content related to mental health problems, including personal anecdotes or reflections on oneself. Thus, we filtered posts to be within a range of 500 to 2,000 characters each.

  3. 3.

    Post Clustering To create a dataset with a wider and more diverse range of information, we group multiple posts into a single example. To ensure that unrelated posts are not combined, we utilize clustering techniques. We utilize MentalRoberta (Ji et al., 2022b) to map the content of each post into the embeddings, and subsequently cluster them according to the K-nearest neighbor technique.

  4. 4.

    Example Composition We randomly select one post and then identify the ten closest candidate posts. From these candidate posts, we randomly choose two, treating a total of three posts as a single example.

  5. 5.

    Example Validation Despite grou** posts based on similarity, conflicts in the details of each post’s content may still arise. To prevent this, we employ five human annotators who are experts in psychology and distribute the dataset among them. Four are working as professional counseling therapists, and one is a graduate student in psychology. They label any inconsistencies between posts. We remove all invalid examples.

  6. 6.

    Response Tagging We instruct the experts to write responses to valid examples. We guide them to compose empathetic and helpful answers, tailored to reflect the unique characteristics of the virtual user.

  7. 7.

    Attribute Estimation We employed other annotators to label how much the responses utilize the attributes of reframing, regard, and solution. These labels inform which method the experts select and how much the counseling strategy should be applied for each post. The attribute scores are used to develop the scoring model. To capture a broader relationship between responses and scores, attribute scores are also labeled for artificially generated responses to each post.

Finally, each example consists of three posts containing similar concerns, and tailored and empathetic responses from experts that can provide assistance. Through this process, we construct a dataset comprising a total of 97 examples. We utilize this dataset as our test set for evaluation.

Post 1 I often feel inferior to those around me. When I’m with my friends, I don’t feel cool enough. This feeling is particularly strong when I talk to girls; I always feel inadequate, which causes me to pull away and stop interacting. I’ve never been good at talking or making conversation, leading to awkward situations, which I believe is the root of the problem. I’m not sure what I hope to gain from sharing this, but I thought some of you might relate.
Post 2 Is it common to feel like you have different personalities with different people? I’m a 19-year-old male, and lately, I’ve noticed that I unknowingly project different aspects of myself to different people. It’s making me feel like I have split personalities. For example, some friends believe I spend all day playing video games and partying, while others think I’m deeply into art. Some see me as very introverted, while others view me as outgoing. I feel like no one truly understands who I am except for myself, and each friend group only sees certain sides of me. This is really starting to affect me, as I’m unsure of who I really am or what I want when I’m alone. Is this normal?
Post 3 Socializing and interacting with others feels overwhelming and drains my energy. Do others find it easier than I do, like they’re playing a complex game while I struggle with something much simpler? No matter how I approach it, I often feel like people see me as strange or unconventional. I find it confusing, whether it’s initiating conversations or navigating the complexities of social norms like when it’s appropriate to invite someone based on various factors. It’s frustrating and it’s starting to really get to me. All I want is to feel comfortable talking to people.
Response I understand how "being social" can be a lot of stress as we often have to switch on and off our personal aspects regardless of how we feel at that moment. There are also many things we have to consider before, while, and sometimes even after a conversation with others. I feel this thought overflows in our mind not just when a person is more likely to be an inferior type, but when a person loses himself or herself in a conversation. We are often way too sensitive about how other people portray us - we want to be "good". We are then slowly encroached by the thought, ’Who am I?’ I know it is not easy to refrain from this chain of thought, but remember that it is actually your brain that judges you first, not the others!
Figure 7: An example of our dataset TherapyTalk. The contents of the user post are not original and are rephrased.

Figure 7 is an example from TherapyTalk. The response follows our guidelines by incorporating information specific to each post. It includes emotional empathy towards the author and provides simple tips to help improve the author’s issues.

Appendix C Human Evaluation Guideline and Criteria

We employ three human annotators for the evaluation. These annotators possess knowledge of psychology. We ask them to rate the responses on a scale of 1 to 5 based on the following criteria:

  • Understanding measuring how well the response interprets and reflects the user’s emotional state and experiences.

  • Relevance assessing the pertinence of the response to the user’s specific issues.

  • Professionalism examining the integration of three strategies (Reframing, Solution-focused approaches, and Positive regard).

  • Customization determining the personalization of the response to the user’s unique situation.

  • Satisfaction rating how well the response aligns with the user’s preferences and expectations.

We instruct the annotators about the goals of our study and evaluation, including explanations of the confidentiality measures in place. Additionally, we explain the content warnings associated with the sensitive material they will encounter, ensuring they are prepared to handle potentially distressing content responsibly. We also compensate the annotators at a rate of approximately $12 per hour. The explanation and analysis of the results are provided in Section 5.1.

Appendix D User Study Guideline

We conduct a user study to evaluate our response generation framework, recruiting 18 participants from a diverse demographic including university students and working professionals to cover a broad range of personal concerns (Section 7). Participants provide their concerns through a Google form and receive approximately four tailored responses in English within a week. They are then asked to select the response that best aligns with their preferences and expectations based on their personal criteria. Participants are compensated with approximately $8 for their participation. We ensure that all participants are thoroughly informed about the study’s objectives and the confidentiality measures in place. Content warnings are provided to prepare them for potentially sensitive material they might encounter during the evaluation. Additionally, participants agree to terms of use, which include detailed information on how their data will be utilized in the study.

Appendix E Additional Evaluation Results

Table 5 presents the results of the LLM Evaluation. The analysis of these results and the details of the related experiments are discussed in Section 5.1.

Tailored Response Dataset Counsel Chat Dataset
Model Method Cus. Sat. Pro. Rel. Und. Cus. Sat. Pro. Rel. Und.
GPT-4-0125 SA 3.88 4.00 4.13 4.19 4.25 4.36 4.27 4.64 4.27 4.00
SAA 3.63 4.00 4.00 3.88 4.16 4.09 4.09 4.36 4.27 4.00
MAA 4.06 4.19 3.69 4.25 4.35 4.10 4.10 4.27 4.27 4.33
MentalAgora 4.69 4.75 4.25 4.94 5.00 4.36 4.55 4.82 4.73 4.33
GPT-3.5-turbo SA 3.56 3.78 4.11 3.89 3.78 4.36 4.27 4.64 4.27 4.00
SAA 3.33 3.11 3.56 3.56 3.33 4.09 4.09 4.36 4.27 4.00
MAA 3.78 3.78 4.33 4.00 4.00 4.09 4.09 4.27 4.27 4.33
MentalAgora 4.33 4.22 4.89 4.67 4.67 4.36 4.55 4.82 4.73 4.33
LLAMA-2-13b SA 1.70 1.80 1.90 1.80 1.90 3.25 3.50 3.88 3.50 3.50
SAA 1.90 2.00 2.00 1.90 2.20 2.62 2.62 2.75 2.62 2.75
MAA 1.80 1.80 1.90 1.80 1.80 4.00 4.00 4.00 4.25 4.25
MentalAgora 2.30 2.20 2.20 2.40 2.30 4.62 4.88 4.75 4.62 4.88
MentalAlpaca SA 3.92 4.00 4.08 4.08 3.92 3.25 3.50 3.88 3.50 3.50
SAA 2.15 2.15 2.38 2.08 2.23 2.62 2.62 2.75 2.62 2.75
MAA 3.92 3.77 4.23 3.92 4.08 4.62 4.88 4.75 4.62 4.88
MentalAgora 4.31 4.15 4.31 4.38 4.23 4.00 4.00 4.00 4.25 4.25
Table 5: This table showcases the scores assigned by GPT4-turbo, serving as a judge using a 5-point Likert scale across two evaluation datasets. The evaluation criteria include Customization (Cus.), Satisfaction (Sat.), Professionalism (Pro.), Relevance (Rel.), and Understanding (Und.). Scores range from 1 (Poor) to 5 (Excellent).

Appendix F Prompts

Figure 8 and Figure 9 provide examples of the role-setting prompts for each agent and the prompts used for debating in our framework.

Refer to caption
Figure 8: Prompt examples of each agent role setup using in Strategic Agents Setup of Section 3.
Refer to caption
Figure 9: Prompt examples of simulating debating session between agents using in Strategic Debating of Section 3. The green slots are replaced with actual data before generating.

Appendix G Examples

Table 6 presents a case study comparing MentalAgora with other configurations, focusing on the inclusion of specific counseling attributes. The responses are color-coded to highlight these attributes: red indicates Reframing, orange indicates Unconditional Positive Regard, and blue indicates Solution-Focused. By systematically removing each attribute from the responses, we can analyze the impact of their absence or presence. The figure is an example of the experimental results discussed in Section 5. The post input is not the actual user post and is paraphrased.

Post Whenever I go on a date with someone new, I get so anxious beforehand that I feel sick, and it’s becoming increasingly frustrating. This started in the past year, ever since I began exploring the dating scene. I keep telling myself that it’s no big deal, it’s just one date, I don’t have to see the person again if I don’t want to, and I won’t be pressured into anything. But despite these reassurances, on the day of the date, I can’t control the anxiety. I get a stomach ache that feels like intense butterflies, which makes me nauseous. All I can think about is how sick I feel and how I don’t want to show up to the date like this. I want to enjoy it, but I can’t make the feeling stop. Normally, I manage my anxiety well in social situations, like meeting groups of people or friends of friends, but my brain seems to interpret dates differently. It’s really frustrating.
Method Response
MentalAgora It sounds like you’re facing significant anxiety with dating, a common challenge yet uniquely intense for each person. Your capacity to handle social situations well typically, yet feeling overwhelmed by dating, highlights how personal and intricate anxiety triggers can be. Seeing these feelings as your body’s way of signaling a deep desire for connection and personal growth can be empowering. Anxiety, in its essence, isn’t your enemy but a part of you that’s asking for attention and care. Let’s try reframing this experience: What if these intense feelings are not barriers but signposts, guiding you towards understanding what you truly seek in relationships and within yourself? This perspective doesn’t downplay your discomfort but acknowledges your strength and resilience in facing these emotions. Practically speaking, before your next date, try this strategy: Allocate a specific time to focus on your concerns, say 10 minutes, to acknowledge and address them. Then, shift your focus to positive affirmations and visualize a successful and enjoyable date experience.
Removing Reframing It sounds like you’ve found yourself in a really tough spot, where your excitement for new connections is overshadowed by substantial anxiety. It’s perfectly okay to feel this way, and your experiences are valid. Dating can indeed trigger deeper anxieties, and it’s commendable that you’re seeking ways to navigate this. Together, let’s explore strategies to help manage these feelings. First, consider grounding techniques before your dates, such as deep breathing or visualizing a positive outcome to help ease your nervous system. It might also be beneficial before the date to engage in an activity you find relaxing, whether it’s listening to music, a short walk, or reading something enjoyable. Additionally, setting smaller, manageable goals for each date can gradually build your confidence.
Removing Positive-regard It sounds like you’re experiencing intense anxiety specific to dating, which is both distressing and frustrating, especially when you feel confident in other social situations. Let’s explore two paths to manage this: first, practical steps to reduce anxiety in the moment, and second, shifting your perspective on dating to reduce long-term stress. For immediate relief, consider establishing a pre-date routine that calms you. This could be deep breathing exercises, a short walk, listening to your favorite music, or anything else that grounds you. Additionally, setting up a safe call or text with a friend during or after the date can provide a comforting safety net. In terms of reframing your perspective, it might help to see each date not as an evaluation of your worth.
Table 6: Case study of MentalAgora and other configurations. The responses highlight the inclusion of specific counseling attributes: red indicates Reframing, orange indicates Unconditional Positive Regard, and blue indicates Solution-Focused. By removing each attribute from the responses, we can observe how the absence or presence of each attribute impacts the overall response. For instance, in the absence of Reframing, the response may lack a shift in perspective, while the absence of Unconditional Positive Regard might result in less empathy and acceptance. Conversely, some responses might inadvertently emphasize other attributes more strongly when one is removed.
Removing Solution-focused It sounds like you’ve encountered quite a challenging hurdle in your dating experiences, and it’s understandable how this level of anxiety can feel overwhelming. What stands out is your awareness and introspection about your feelings. in your dating experiences, and it’s understandable how this level of anxiety can feel overwhelming. What stands out is your awareness and introspection about your feelings. That’s a significant first step. With the nervousness manifesting physically, it’s clear this response isn’t something you can simply talk yourself out of, and that’s okay. Your feelings are valid, and it’s important to acknowledge them without judgment. Incorporating a blend of acceptance and cognitive reframing might offer some relief. Begin by affirming that it’s perfectly normal to feel anxious about dating—it’s a sign of your desire for meaningful connections. Then, let’s gently explore these feelings. What specific thoughts contribute to your anxiety? Can we challenge them or look at them differently? Maybe we can gradually shift the perspective from fearing.
Reframing only It sounds like you’re experiencing intense pre-date anxiety, and I hear how frustrating and challenging this is for you. Your feelings are valid, and it’s common to feel anxious about dating. Consider trying to reframe your thoughts about dating. Instead of viewing each date as a high-stakes event, think of it as an opportunity to meet someone new and learn about them, with no pressure on the outcome. To manage your physiological symptoms, deep breathing exercises can be helpful. Try taking deep, slow breaths before your date to calm your nervous system. Additionally, grounding techniques, like noticing five things you can see, four things you can touch, three things you can hear, two things you can smell, and one thing you can taste, can also bring you back to the present moment and help reduce anxiety.
Positive-regard only It sounds like you’re experiencing a lot of anxiety around dating, which is understandable given it’s a relatively new experience for you. Your feelings are completely valid, and it’s important to remember you’re not alone in feeling this way. Dating can indeed be nerve-wracking, but there are ways to manage the anxiety you’re feeling. First, applying Cognitive-Behavioral Techniques (CBT) might help you challenge and reframe any negative thoughts you have about dating. For instance, instead of thinking, "This date will go poorly because I’m anxious," try to think, "I’m anxious because I’m trying something new, and that’s okay. I can handle it. Incorporating Mindfulness-Based Stress Reduction (MBSR) techniques can also be beneficial.
Solution-focused only I understand how frustrating and challenging it can feel to experience such intense anxiety before a date, especially when you’re eager to enjoy these experiences. The physical symptoms you’re describing are common manifestations of anxiety, and it’s important to acknowledge them while also working towards managing their impact. Let’s start by addressing the immediate symptoms with some practical strategies. Deep breathing exercises, mindfulness, or even a brief walk can be helpful to soothe your nervous system. Also, preparing a small ‘ritual’ before your dates—like listening to a favorite song or having a reassuring talk with a friend—can serve as a grounding mechanism. Now, for a more lasting strategy, consider reflecting on past situations where you’ve successfully managed your anxiety. What skills did you utilize then that could be applicable now? Incrementally setting goals for each date can also build your confidence.
Table 7: Case study of MentalAgora and other configurations. The responses highlight the inclusion of specific counseling attributes: red indicates Reframing, orange indicates Unconditional Positive Regard, and blue indicates Solution-Focused. By removing each attribute from the responses, we can observe how the absence or presence of each attribute impacts the overall response. For instance, in the absence of Reframing, the response may lack a shift in perspective, while the absence of Unconditional Positive Regard might result in less empathy and acceptance. Conversely, some responses might inadvertently emphasize other attributes more strongly when one is removed.