MentalAgora: A Gateway to Advanced Personalized Care in Mental Health through Multi-Agent Debating and Attribute Control

Yeonji Lee¹, Sangjun Park¹,², Kyunghyun Cho³*, JinYeong Bak¹*

¹Sungkyunkwan University, Suwon, South Korea
²Upstage AI, South Korea
³New York University, NY, USA

Abstract

As mental health issues globally escalate, there is a tremendous need for advanced digital support systems. We introduce MentalAgora, a novel framework employing large language models enhanced by interaction between multiple agents for tailored mental health support. This framework operates through three stages: strategic debating, tailored counselor creation, and response generation, enabling the dynamic customization of responses based on individual user preferences and therapeutic needs. We conduct experiments utilizing a high-quality evaluation dataset TherapyTalk crafted with mental health professionals, showing that MentalAgora generates expert-aligned and user preference-enhanced responses. Our evaluations, including experiments and user studies, demonstrate that MentalAgora aligns with professional standards and effectively meets user preferences, setting a new benchmark for digital mental health interventions.

Code and Dataset: https://github.com/jennylee03/MentalAgora


* Corresponding authors

1 Introduction

With 1 in every 8 people confirming they have experienced mental disorders (WHO, 2022), estimates suggest that nearly half of the global population will encounter mental health challenges at some point (McGrath et al., 2023). This trend underscores the need for effective and personalized support systems tailored to meet the needs of individuals (Lu et al., 2023; Chen et al., 2023a). Moreover, the rise of digital platforms has transformed communication, increasing the demand for online support and engagement. This shift highlights the importance of accessible and effective digital interventions.

Owing to their strong communication abilities, Large Language Models (LLMs) are increasingly applied in sociology, psychology, and psychiatry (Sharma et al., 2023; Chiu et al., 2024; Li et al., 2023). In addition, to leverage the benefits of collective intelligence, research on imitating human interactions with multi-agent debating and simulation systems is actively underway (Zhang et al., 2024; Chan et al., 2023).

In the mental health domain, providing emotionally supportive responses to people with mental health concerns remains an active research topic (Hua et al., 2024). Recent advancements have improved the ability of LLMs to generate empathetic and clinically relevant responses (Wei et al., 2021; Loh and Raamkumar, 2023; Shen et al., 2020). Despite these improvements, a single LLM still struggles to comprehensively understand users’ complex situations, limiting the depth and specificity of its responses (Guo et al., 2024).

Addressing this gap, we propose a new framework, MentalAgora, which adopts a multi-agent method for support response generation in the mental health domain. This framework leverages the concept of multi-agent interaction (Chen et al., 2023b; Park et al., 2023), where each agent models different aspects, providing a richer interpretation. MentalAgora applies various counseling strategies as attributes of LLM agents and involves the agents in a debate to generate the persona best suited to addressing the user’s concerns. This persona prompt enables LLMs to provide the most appropriate responses for the user. To verify the effectiveness of MentalAgora, we create the TherapyTalk dataset. The automatic and human evaluation results show that MentalAgora outperforms other methods, achieving the highest alignment with expert responses and satisfying user preferences.

Furthermore, we collect actual user concerns to verify whether MentalAgora can provide assistance to real users, offering responses to their issues. As a result, MentalAgora demonstrates greater user satisfaction compared to all other baselines, highlighting its potential in digital mental health interventions.

Our contributions in this paper are manifold:

• We develop a novel framework, MentalAgora, to facilitate dynamic and nuanced interactions via multi-agent debating, delivering mental health support based on user preferences.
• We enhance attribute-controllable generation to dynamically customize responses by allowing the creation of a multifaceted counselor persona tailored to a wide range of user preferences.
• We curate a high-quality dataset, TherapyTalk, in collaboration with mental health professionals to ensure responses are grounded in expert knowledge.
• Our work showcases the potential utility of this approach in transforming mental health support, validated through comprehensive experiments and a user study.

Figure 1: MentalAgora framework overview. This diagram outlines the MentalAgora framework, showcasing its three stages: Strategic Debating, Tailored Counselor Creation, and Response Generation, which collectively enhance the creation of personalized therapeutic responses based on user-specific needs.

2 Related Work

In mental health support response generation, most research has been conducted with single-agent LLMs, which often yield general and static responses. Multi-agent methods can overcome these limitations, generating more dynamic and nuanced responses. However, their application in the mental health domain has been limited despite success in evaluation or reasoning. This is notable, as the intricate nature of mental health issues could benefit from the diverse perspectives and responses offered by multi-agent systems.

2.1 Mental Health Support Response Generation

Most studies in this area use Transformers and LLMs to discern the user’s intent and emotions within the conversation. Following the introduction of the empathetic dataset and task by Rashkin et al. (2019), subsequent studies have focused on generating and rewriting empathetic responses (Sharma et al., 2021; Mishra et al., 2023; Shen et al., 2020). These studies aim to enhance the detection of emotions and sentiments in user utterances (Zhan et al., 2022) and strive to attain greater empathy. Building on counselor reflection as a key verbal skill, Min et al. (2022) developed a system for analyzing these reflections, while Wang et al. (2023) leveraged user profiles and comment histories to generate personalized responses, using generative models conditioned on the questions and respondents’ data.

Sharma et al. (2023) introduced a cognitive therapeutic technique aimed at countering negative thoughts by replacing them with more optimistic reframed thoughts, highlighting the capability of LLMs in the psychiatry field. Additionally, Liu et al. (2023) discuss the development of chatbots designed to enhance counseling interactions in terms of psychological support. These studies have laid the groundwork for support response generation, setting the stage for our research, which incorporates multi-agent systems in this domain.

2.2 Controlled Text Generation

In the domain of controlled text generation, Zou et al. (2021) proposed “inverse prompting” to enhance control over text generation in poem generation and question answering. Han et al. (2022) focused on mimicking fictional characters through pseudo dialog prompting to capture character styles. Similarly, Madotto et al. (2021) addressed continual learning in task-oriented dialogue systems with a benchmark featuring 37 domains and an efficient architectural method. Yang et al. (2023) present a parameter-efficient approach for attribute-controlled generation, achieving improvements in attribute-specific text generation with minimal additional training parameters.

2.3 Multi-Agent Systems

Our approach draws inspiration from relevant literature on multi-agent debating and simulation frameworks. Chen et al. (2023b) propose a framework that demonstrates how multi-agent systems refine chatbot dynamics, promoting adaptive responses. Shao et al. (2023) reveal the capability of fine-tuned models to emulate distinct personas, enriching the diversity of character simulations. Additionally, Chan et al. (2023) and Du et al. (2023) highlight the role of these systems in improving the evaluative and reasoning capacities of models through structured debates. Recent findings suggest that the performance of LLMs can be significantly enhanced by increasing the number of agents (Li et al., 2024). Through a simple sampling-and-voting method, it has been demonstrated that the efficacy of LLMs scales with the number of instantiated agents, an approach that is orthogonal to existing methods and suggests promising avenues for future research. In contrast to the common applications of multi-agent systems in evaluation and reasoning tasks, our research extends their utility to the mental health domain, addressing the intricate nature of psychological support with a richer, more context-sensitive dialogue framework.

3 MentalAgora Framework

To enhance the capability of LLMs to generate tailored support responses, we propose MentalAgora, a framework simulating debating sessions with LLMs imitating strategic counselors. Since the structure of MentalAgora can consider diverse perspectives, it provides more contextually relevant and personalized support responses. Our framework operates through three stages: strategic debating, tailored counselor creation, and support response generation.

3.1 Strategic Agents Setup

We select counseling strategies (Fall et al., 2022; Ellis, 1989) and compose attributes based on the strategies to ensure that each response is not only individually dynamic but also therapeutically appropriate. We use three attributes: reframing, which involves altering perspectives for a more positive outlook; unconditional positive regard, which emphasizes total acceptance and support of the user; and solution-focused approaches, which concentrate on the user’s strengths and goals (Zhou et al., 2023; Biyani et al., 2014). Each of these attributes is chosen for its proven efficacy in traditional counseling settings (Fall et al., 2022; Ellis, 1989). Appendix A provides a detailed description of the attributes.

Building on these attributes, our counselor agents function based on their own attributes within a multi-agent role setup, utilizing prompting to incorporate core counseling strategies to analyze and address the concerns and needs of users.

3.2 Stage 1: Strategic Debating

In the strategic debating stage (Algorithm 1), an LLM functions as an independent agent for each attribute, with a unique role setting that reflects that attribute. Each agent is instructed to evaluate the user’s status and concerns from its own perspective. The debate is also structured to determine the relative importance of each attribute in addressing the user’s concerns. Each agent $a_{j}$ generates a response based on the user’s post $P$ and the debating history $D$. This process is repeated for $N$ debating turns, which yields the debating history for the user post.

Algorithm 1 Strategic Debating Stage

Require: User post $P$, number of debating turns $N$, counselor agents $A=\{a_{1},a_{2},a_{3}\}$
Ensure: Debating dialogue history $D$
1: $D=[]$ {Initialize dialogue history list}
2: for $i$ from 1 to $N$ do
3:     for each agent $a_{j} \in A$ do
4:         Generate response $R_{i,a_{j}}$ for $P$ with $a_{j}$ referring to $D$
5:         Add response $R_{i,a_{j}}$ to dialogue $D$
6:     end for
7: end for
8: return $D$
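The loop in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Agent` signature, the `make_stub_agent` helper, and the example post are all hypothetical, and the stub returns a fixed string where a real agent would call an LLM with its role prompt.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical agent signature: (user_post, debate_history) -> response text.
Agent = Callable[[str, List[Tuple[str, str]]], str]


def strategic_debate(
    post: str, agents: Dict[str, Agent], num_turns: int
) -> List[Tuple[str, str]]:
    """Run N debating turns; return the history D as (agent_name, response) pairs."""
    history: List[Tuple[str, str]] = []        # D = []
    for _ in range(num_turns):                 # for i from 1 to N
        for name, agent in agents.items():     # for each agent a_j in A
            response = agent(post, history)    # R_{i,a_j} from P and D
            history.append((name, response))   # add R_{i,a_j} to D
    return history


def make_stub_agent(attribute: str) -> Agent:
    """Stand-in for an LLM call (assumption); it only reports how much history it saw."""
    def agent(post: str, history: List[Tuple[str, str]]) -> str:
        return f"[{attribute}] view after {len(history)} prior responses"
    return agent


agents = {
    name: make_stub_agent(name)
    for name in ("reframing", "unconditional positive regard", "solution-focused")
}
debate = strategic_debate("I feel overwhelmed at work.", agents, num_turns=2)
```

Because every agent reads the shared history before responding, later responses can react to earlier ones, which is what distinguishes debating from independent sampling.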

3.3 Stage 2: Tailored Counselor Creation

This stage creates a prompt for a tailored counselor that will generate the actual response to the user. The LLM is instructed to reference the user’s post and the debating history to determine the degree of influence of each agent and to generate the counselor prompt. The degree of influence is expressed as a score between 1 and 3. Figure 2 shows our input prompt for this stage.
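As a rough illustration of how influence scores might be turned into a persona prompt, the sketch below validates one score per attribute and assembles a short prompt. The template wording, the function name `build_counselor_prompt`, and the assumption that scores are integers are all hypothetical; in the framework itself the LLM produces both the scores and the counselor prompt from the prompt in Figure 2.

```python
from typing import Dict

ATTRIBUTES = ("reframing", "unconditional positive regard", "solution-focused")


def build_counselor_prompt(scores: Dict[str, int]) -> str:
    """Compose a persona prompt from per-attribute influence scores (1 to 3)."""
    for name in ATTRIBUTES:
        if name not in scores:
            raise ValueError(f"missing score for attribute: {name}")
        if scores[name] not in (1, 2, 3):
            raise ValueError(f"score for {name} must be 1, 2, or 3")
    # Hypothetical template; the paper's actual prompt is given in its figures.
    lines = ["You are a counselor who combines the following strategies."]
    for name in ATTRIBUTES:
        lines.append(f"- {name}: influence {scores[name]} of 3")
    return "\n".join(lines)


prompt = build_counselor_prompt(
    {"reframing": 2, "unconditional positive regard": 3, "solution-focused": 1}
)
```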

3.4 Stage 3: Support Response Generation

In the final step, we instruct the LLM to write a response that provides the most appropriate support for the user’s post, with the prompt shown in Figure 3. At this time, the LLM is given the previously created counselor prompt as a persona. Since the prompt includes the influence scores for each attribute, the LLM can numerically reflect the importance of each attribute and also utilize the extensive user information embedded in debating history.

4 Datasets

It is necessary for user posts to contain their personal mental concerns along with their individual characteristic information. Additionally, for evaluating responses generated by the model, golden responses are required. These responses need to be tailored to each post, provide emotional empathy, and be clinically appropriate. First, we used the Counsel Chat dataset (Bertagnolli, 2020), which consists of user concerns and responses from mental health experts on an online mental health platform.

We also build our own dataset, TherapyTalk. TherapyTalk consists of posts from mental health subreddits on Reddit and responses to these posts written by mental health experts. Unlike Counsel Chat, TherapyTalk is additionally labeled by annotators with how strongly each response reflects each of the three attributes. These attribute levels are used in our analysis to demonstrate the similarity between our approach and expert responses. We provide the experts with guidelines so that their responses are consistent. Appendix B describes the dataset construction process in detail.

		TherapyTalk					Counsel Chat Dataset
Model	Method	BLEU	R-L	BScore	GM	HM	BLEU	R-L	BScore	GM	HM
GPT-4.0	SA	24.52	15.86	94.79	33.28	26.23	18.35	14.43	94.16	32.58	23.69
	SAA	24.23	16.51	94.71	33.59	26.69	18.05	14.47	94.20	32.45	23.62
	MAA	25.27	16.94	94.70	34.35	27.48	19.48	15.36	95.32	34.45	25.42
	MentalAgora	28.59	16.50	95.31	35.56	28.28	19.53	15.40	95.14	34.45	25.45
GPT-3.5-turbo	SA	21.18	15.09	94.83	31.18	24.19	15.99	14.08	94.04	29.18	20.63
	SAA	21.38	15.34	94.45	31.41	24.48	15.04	13.77	94.14	28.55	19.83
	MAA	22.95	15.25	94.77	32.13	25.06	18.83	14.30	94.19	31.65	22.78
	MentalAgora	26.50	15.73	94.80	34.06	26.82	19.14	14.87	94.22	32.00	23.10
LLaMA-2-13B	SA	21.11	15.36	93.25	31.15	24.35	10.98	13.95	93.99	28.28	19.50
	SAA	10.52	15.01	94.08	24.58	17.41	08.76	12.80	94.06	26.22	15.91
	MAA	18.70	15.52	94.59	30.17	23.35	16.35	14.27	94.15	31.17	21.92
	MentalAgora	26.49	15.65	94.61	33.98	26.73	17.74	14.29	94.21	32.00	23.00
MentalAlpaca	SA	18.42	15.33	94.96	29.93	23.07	10.74	14.36	93.22	27.89	19.29
	SAA	08.49	13.65	95.12	22.26	14.88	09.28	12.52	93.29	25.76	16.88
	MAA	18.70	15.52	94.59	30.17	23.35	12.36	13.43	94.18	29.58	20.08
	MentalAgora	20.65	15.90	94.97	31.48	24.62	18.89	14.47	94.44	32.54	23.54

Table 1: Comparison of various LLM configurations across four model settings on two counseling datasets using BLEU, ROUGE-L (R-L), and BERTScore (BScore), which measure the quality of the generated responses compared to the ground truth. GM and HM represent geometric mean and harmonic mean, respectively, of the evaluation metrics. MentalAgora outperforms the other configurations for almost all metrics.

Method	Cus.	Sat.	Pro.	Rel.	Und.	Rank
SA	3.72	3.74	3.51	3.41	3.38	2.68
SAA	3.90	3.76	3.46	3.71^†	3.33	2.74
MAA	3.92	3.96^†	3.87^†	3.68^†	3.65	2.49
MentalAgora	4.03	3.99	3.99	3.82	3.81	1.80

Table 2: Human evaluation results on Customization (Cus.), Satisfaction (Sat.), Professionalism (Pro.), Relevance (Rel.), and Understanding (Und.), scored on a scale of 1 to 5, along with their overall preference ranks. All differences between our framework and the baselines are statistically significant with a $p$-value less than 0.05, except for those indicated with $\dagger$.

5 Experiments

We conduct two experiments to evaluate MentalAgora, focusing on its effectiveness and controllability. We measure the framework’s capacity to tailor responses that meet professional mental health care criteria and align with individual user preferences. We also examine the framework’s controllability, that is, how well its responses reflect each attribute score provided as input.

Models

To investigate the robustness and stability of MentalAgora, we extend our research to include tests across various LLMs. The models include GPT-3.5-turbo (Ouyang et al., 2022), GPT-4.0 (OpenAI, 2024), LLaMA-2-13B (Touvron et al., 2023), and MentalAlpaca (Xu et al., 2024). Each of these models represents different architectures and capacities, providing a comprehensive basis for evaluating the framework’s performance.

5.1 Tailored Support Response Generation

The investigation aims to validate the efficacy of our framework in providing mental health support grounded in expert principles and tailored to individual user needs. The baseline settings used for the comparison are as follows.

• Single Agent (SA) generates responses without any specific attributes or iterative debating, serving as our baseline for fundamental response generation capabilities.
• Single Agent Attributes (SAA) enhances the single-agent setup by incorporating attributes into the response generation, enabling us to assess how well a single agent can produce customized responses.
• Multi Agents Attributes (MAA) employs a multi-agent system but excludes the iterative debating component. This setup aims to assess the impact of debating history in our framework.

Automatic Evaluation

Table 1 shows the automatic evaluation results. We utilize two benchmark datasets: TherapyTalk and a counseling utterance generation dataset (Bertagnolli, 2020). We evaluate the expert alignment of each method by comparing its generated responses with the expert-written responses in the dataset across various metrics. Evaluations using Unigram BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004), BERTScore (Zhang et al., 2020), geometric mean (GM), and harmonic mean (HM) show that MentalAgora achieves the best performance on almost all metrics, enhancing the alignment and relevance of responses.
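The text does not spell out which quantities enter the GM and HM columns. The reported values are consistent with taking the geometric and harmonic means of BLEU, ROUGE-L, and BERTScore within a row, which is the assumption behind the short sketch below; the function names are illustrative.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """n-th root of the product, computed in log space for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values: Sequence[float]) -> float:
    """n divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


# GPT-4.0, SA, TherapyTalk row of Table 1: BLEU, ROUGE-L, BERTScore.
row = (24.52, 15.86, 94.79)
gm = geometric_mean(row)  # about 33.28, matching the GM column
hm = harmonic_mean(row)   # about 26.23, matching the HM column
```

Both means are pulled toward the smallest component, so a method cannot reach a high GM or HM through a high BERTScore alone.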

Regarding the impact of configurations, SAA slightly improves on the SA setup for the GPT models on TherapyTalk (for example, GM rises from 33.28 to 33.59 with GPT-4.0). MentalAgora also performs better than MAA, supporting the value of interactive debate for response quality. LLaMA-2-13B and MentalAlpaca show variability in performance, with notable declines in the SAA configuration. This may indicate that a single agent has difficulty integrating multiple attributes.

Human Evaluation

We also have three human judges evaluate the generated responses. This evaluation uses a specialized form with guidelines and criteria to assess the responses of the three baselines and MentalAgora. Evaluators score the responses on a 5-point Likert scale based on the criteria detailed in Appendix C. For the ranking task, evaluators are asked to order the responses from most to least effective based on their overall assessment of attractiveness and effectiveness, identifying which response most directly benefits the user and best satisfies user preferences.

Table 2 indicates the results of human evaluation. MentalAgora excels in customizing responses across various dimensions. It scores the highest in all the criteria, confirming MentalAgora’s ability to adapt responses to individual user needs and maintain a high level of contextually appropriate engagement considering users’ states. The significant lead in the overall ranking highlights the framework’s effectiveness and an advantage over the single agent setups or non-debating setups. The inter-annotator reliability, assessed using Fleiss’ kappa (Fleiss, 1971), is 0.558 overall.
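For readers unfamiliar with Fleiss’ kappa, the sketch below computes it from a table of per-item category counts using the standard definition. It is an illustration only: the toy ratings are hypothetical, and the paper does not describe how the scores were arranged when the reported value of 0.558 was computed.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Share of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed pairwise agreement per item, averaged over items.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical toy data: 4 responses, 3 judges, Likert categories 1 to 5.
toy_counts = [
    [0, 0, 0, 3, 0],
    [0, 0, 1, 2, 0],
    [0, 0, 0, 1, 2],
    [0, 3, 0, 0, 0],
]
kappa = fleiss_kappa(toy_counts)
```

Note that this treats Likert scores as unordered categories, so a disagreement between 4 and 5 counts the same as one between 1 and 5.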

LLM as a judge evaluation

In addition, we employ an LLM-as-a-judge approach (Zheng et al., 2023). This method uses GPT-4.0-turbo to assess the generated responses on a 5-point Likert scale, enabling us to cross-validate the human judgments with automatic evaluations. Table 5 shows that MentalAgora outperforms the other methods in all aspects.

5.2 Attribute Controllable Generation

In Section 5.1, we verify our framework’s effectiveness in aligning with expert responses. Building on this, we conduct two further investigations. The first assesses whether our framework maintains this expert alignment even in terms of attribute scoring, thereby supporting its robustness in expert alignment. The second experiment, a control test, checks if these attribute scores are correctly represented in the responses. Through these two steps, we verify precise attribute classification and effective attribute exploitation skills, respectively. Unlike Section 5.1, configuration SA is omitted because it does not employ attributes.

Automatic Evaluation

We utilize MentalRoBERTa (Ji et al., 2022a) fine-tuned on the TherapyTalk dataset to predict the attribute scores of responses. To assess controllability, we estimate the scores of responses from each method and from human experts, which enables us to compare the attribute levels in the responses.

In Figure 4, the attribute levels of the methods are compared to those of human experts by calculating the mean absolute error (MAE). MentalAgora shows the best performance, indicating its superior attribute integration capabilities. This suggests that MentalAgora is effective in matching the attribute scores of expert responses, reflecting a high level of precision in response generation.

Figure 5 examines each model’s internal controllability, that is, how closely its responses match the input attribute scores given in the tailored counselor prompt. Here, MAE is the difference between the attribute scores given as input and the scores predicted for the output response. Most of the other methods exhibit higher MAE values, highlighting the challenge of attribute control. MentalAgora stands out in this evaluation, showing its effectiveness in reflecting the input scores in the generated responses.
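The MAE used for controllability can be sketched as follows. The scores below are hypothetical, and averaging jointly over responses and attributes is an assumption, since the text does not say whether MAE is reported per attribute or pooled.

```python
from typing import Dict, List

ATTRIBUTES = ("reframing", "solution", "regard")


def attribute_mae(
    reference: List[Dict[str, float]], predicted: List[Dict[str, float]]
) -> float:
    """Mean absolute error pooled over all responses and attributes."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty score lists of equal length")
    total = sum(
        abs(ref[a] - pred[a])
        for ref, pred in zip(reference, predicted)
        for a in ATTRIBUTES
    )
    return total / (len(reference) * len(ATTRIBUTES))


# Hypothetical input scores (1 to 3) and classifier predictions for two responses.
input_scores = [
    {"reframing": 3, "solution": 1, "regard": 2},
    {"reframing": 1, "solution": 2, "regard": 3},
]
predicted_scores = [
    {"reframing": 2.5, "solution": 1.0, "regard": 2.0},
    {"reframing": 1.0, "solution": 3.0, "regard": 3.0},
]
mae = attribute_mae(input_scores, predicted_scores)  # (0.5 + 1.0) / 6 = 0.25
```

The same function covers both comparisons in this section: pass expert scores as `reference` for expert alignment, or the prompt's input scores for internal controllability.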

6 Analysis

We conduct an ablation study focusing on how the inclusion and exclusion of specific attribute agents affect the generated responses in terms of controllability and interpretability. For a more in-depth exploration of the experimental results, Table 7 provides a comparative analysis of how including or removing specific attributes influences the responses.

6.1 Ablation Experimental Setup

These are additional experimental setups implemented to further assess the controllability and interpretability of attribute integration.

• Removal Attributes Setup removes one attribute at a time, covering all possible combinations of two attributes, to evaluate how the absence of each attribute influences overall response performance and expert alignment.
• Uniform Attribute Setup configures all agents to utilize the same attribute, allowing us to analyze how dominant the selected attribute becomes in the response generation process.

By adjusting the attributes in these specified setups, we demonstrate the framework’s ability to control and reflect these attributes in its outputs.
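The two ablation setups can be enumerated programmatically. The sketch below is illustrative: the short attribute names follow the column headers of the ablation tables, while the function name and the choice of three agents in the uniform setup are assumptions.

```python
from itertools import combinations
from typing import Dict, Tuple

ATTRIBUTES = ("reframing", "solution", "regard")


def ablation_setups(num_agents: int = 3) -> Dict[str, Tuple[str, ...]]:
    """Map each setup name to the attributes assigned to its agents."""
    setups: Dict[str, Tuple[str, ...]] = {"MentalAgora": ATTRIBUTES}
    # Removal setup: every pair of attributes, i.e. exactly one attribute dropped.
    for pair in combinations(ATTRIBUTES, 2):
        dropped = next(a for a in ATTRIBUTES if a not in pair)
        setups[f"- {dropped}"] = pair
    # Uniform setup: all agents share the same attribute (agent count assumed).
    for attr in ATTRIBUTES:
        setups[f"{attr} only"] = (attr,) * num_agents
    return setups


setups = ablation_setups()
```

This yields seven configurations in total, matching the seven rows of the ablation tables.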

Method	BLEU	R-L	BScore
MentalAgora	28.59	16.50	95.31
- Reframing	26.46	15.09	94.73
- Solution	26.62	15.28	94.74
- Regard	25.91	14.57	94.71
Reframing only	25.78	14.76	94.68
Solution only	25.30	14.37	94.72
Regard only	26.50	15.19	94.72

Table 3: Ablation study to evaluate the impact of each attribute by removing one and observing changes in task performance and alignment with desired outcomes. MentalAgora, which combines three attributes, outperforms other methods.

Method	Reframing	Solution	Regard	Total Diff.
MentalAgora	-0.01	-0.03	+0.02	0.06
- Reframing	-0.17	+0.20	-0.02	0.39
- Solution	+0.10	-0.33	+0.09	0.52
- Regard	+0.29	+0.26	-0.51	1.06
Reframing only	+0.63	-0.14	+0.05	0.69
Solution only	-0.31	+0.40	-0.03	0.74
Regard only	-0.08	+0.10	+0.10	0.28

Table 4: Comparison of methods with human experts’ scores. Values represent the difference from human scores. MentalAgora shows higher alignment with human scores than the other methods.

6.2 Results and Analysis

Expert-alignment Performance

The performance evaluation in Table 3 employs the same metrics and dataset as outlined in Table 1. For overall performance, utilizing all attributes in MentalAgora configuration yields the highest performance metrics, indicating optimal framework operation. In contrast, selective removal of attributes decreases performance metrics, suggesting that balanced attribute configurations are crucial for maintaining performance.

Attribute Controllability

Table 4 uses the expert response attribute scores from the TherapyTalk dataset. Each delta is the mean score difference between the expert responses and the responses generated under a given ablation setting, and the total difference is the sum of the absolute values of these deltas. We observe that removing a specific attribute causes a divergence from human alignment, while incorporating all attributes brings the responses closer to expert-like ones. Focusing on a single attribute likewise fails to align with human responses, resulting in larger total differences.
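As a worked example of the total difference, the sketch below sums the absolute deltas for two rows copied from Table 4. The function name is illustrative, and reading "absolute sum" as the sum of absolute values is an interpretation that reproduces the totals of these two rows.

```python
from typing import Dict


def total_difference(deltas: Dict[str, float]) -> float:
    """Sum of the absolute per-attribute deltas."""
    return sum(abs(d) for d in deltas.values())


# Two rows copied from Table 4.
mentalagora = {"reframing": -0.01, "solution": -0.03, "regard": 0.02}
without_regard = {"reframing": 0.29, "solution": 0.26, "regard": -0.51}

total_full = round(total_difference(mentalagora), 2)          # 0.06
total_no_regard = round(total_difference(without_regard), 2)  # 1.06
```

Taking absolute values before summing prevents positive and negative deltas from cancelling, so a setting that overshoots on one attribute and undershoots on another is still penalized.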

Additionally, we observe that when only one attribute is used, the corresponding score increases, and when a specific attribute is excluded, its score decreases. In either case, these changes lead to a divergence from human alignment. This observation suggests that the framework’s integration of various counseling strategies, represented as attributes in response generation, aligns well with expert views, supporting the notion that a diverse mix of counseling strategies helps address multiple aspects of a user’s condition more comprehensively (Corey, 2016; Norcross and Goldfried, 2005).

7 User Study

We have verified the effectiveness of our responses on the datasets in various ways, but these evaluations cannot tell us whether the users who wrote the posts would actually be more satisfied with our responses than with those of the baselines. We therefore conduct a user study to assess whether our method is effective in addressing the concerns of real people with actual problems. We engage 18 participants in this study. Each participant is asked to provide a personal concern in a detailed format. In addition, we ask participants to select which of the three attributes they most expect to be included in the response.

7.1 User Study Setup

Participants receive four types of responses to their concerns, each generated by a differently configured version of MentalAgora: the full framework and the three variants below.

• User Attribute generates responses based on the single attribute chosen by the user, to observe user preference for responses aligned with their direct choices.
• Other 2 Attributes uses the two attributes not selected by the user, to see how preferences change when the user-selected attribute is excluded.
• Random Attribute employs a randomly selected attribute other than the one chosen by the user, to assess the impact of responses that diverge from the user’s preferences.
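The configurations above can be derived mechanically from the attribute a participant selects. The following sketch is illustrative only: the function name and short attribute labels are hypothetical, and including the full three-attribute MentalAgora setting as the fourth response type is an assumption based on the results section.

```python
import random
from typing import Dict, Optional, Tuple

ATTRIBUTES = ("reframing", "solution", "regard")


def user_study_configs(
    chosen: str, rng: Optional[random.Random] = None
) -> Dict[str, Tuple[str, ...]]:
    """Return the attribute set used by each response type for one participant."""
    if chosen not in ATTRIBUTES:
        raise ValueError(f"unknown attribute: {chosen}")
    rng = rng or random.Random()
    others = tuple(a for a in ATTRIBUTES if a != chosen)
    return {
        "MentalAgora": ATTRIBUTES,                   # assumed fourth response type
        "User Attribute": (chosen,),                 # only the user's choice
        "Other 2 Attributes": others,                # everything except the choice
        "Random Attribute": (rng.choice(others),),   # one non-chosen attribute
    }


configs = user_study_configs("regard", random.Random(0))
```

Passing a seeded `random.Random` keeps the random assignment reproducible across runs.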

7.2 Results and Analysis

Figure 6 demonstrates the user study results highlighting a clear preference trend towards MentalAgora, which ranked as the most favored across all choices.

User Preference

The preference rankings favor MentalAgora, which consistently secures the top position. This reflects a high level of user preference for responses that integrate all attributes, suggesting that the response from MentalAgora is not only well aligned with experts but also consistent with the user’s preferences. The results illustrate the impact of attribute configuration on user preferences. While MentalAgora is most preferred, responses from the User Attribute setup are moderately preferred. The Random Attribute and Other 2 Attributes setups are notably less preferred, emphasizing the importance of employing user-relevant attributes in achieving user satisfaction.

Variability in Preference

We note that incorporating the user-selected attribute reduces the variance among participant preferences. This finding indicates that knowing user preferences allows responses that align closely with user expectations and exhibit more consistency in satisfying diverse user needs. However, while MentalAgora tends to rank highly in terms of user preference, it is also associated with a slightly higher variance. This suggests that MentalAgora frequently delivers highly preferred responses, but the degree of satisfaction can vary among different users. Taken together, these results indicate that MentalAgora effectively meets the preferences of many users without direct user preference input (Gilbert and Orlans, 2011).

8 Conclusion and Future Work

Our framework generates expert-aligned and user preference-enhanced responses through a multi-agent system. Our experiments confirm the framework’s capability to integrate complex user needs and maintain robust performance across different LLMs, showcasing its potential for real-world applications. Moving forward, we aim to enhance MentalAgora by incorporating a wider range of attributes to increase diversity. Additionally, we plan to validate the framework’s effectiveness in therapy by deploying it in actual clinical environments and assessing its impact on therapeutic outcomes.

Limitations

Our study currently utilizes three distinct attributes based on well-known counseling methodologies. However, the use of counseling techniques is highly variable, dependent on individual preferences and specific issues (Tompkins et al., 2013; Swift et al., 2018). This diversity necessitates the expansion of our framework to include a wider range of attributes to better cater to the varied needs of users. Additionally, we conduct our experiments with extensive text inputs that represent complex personal concerns, providing comprehensive answers that cover various aspects of the issues presented. This method effectively allows for a deep analysis of response capabilities within single posts. However, it lacks the capability to capture the dynamics typical of multi-turn interactions between a user and a virtual counselor. Future research should include interactive settings that facilitate real-time exchanges to enhance the realism and effectiveness of our model. Our dataset, which is composed of sets of three posts per virtual user, is designed to support such expansions, enabling more detailed studies on interactive counseling scenarios.

Ethics Statement

Data Use and Confidentiality

Our research creates and uses the TherapyTalk dataset, which comprises anonymized posts containing personal information. We strictly prohibit using this data to identify individuals or for any purpose outside the intended scope of research. All examples of user posts used in this paper are rephrased. Compliance with licensing agreements governing the use of this dataset is mandatory.

IRB Approval

The primary objective of our research is to mitigate emotional stress and offer potential solutions for those with mental health concerns. It is crucial to understand that our methods do not replace professional medical diagnosis or treatment. The research includes a user study that has been reviewed and approved by the Institutional Review Board (IRB), ensuring that all procedures adhere to the ethical standards required for human subject research (IRB approval number: SKKU 2024-05-027).

Ethical Use

This study employs techniques to direct the responses of LLMs by adjusting attributes. These methods are designed to improve the relevance and supportiveness of responses within therapeutic contexts. It is critical to state that using these techniques to negatively affect individuals’ mental health is strictly prohibited.

References

Bannink (2007) F. P. Bannink. 2007. Solution-focused brief therapy. Journal of Contemporary Psychotherapy: On the Cutting Edge of Modern Developments in Psychotherapy, 37(2):87–94.
Beck (1976) Aaron T. Beck. 1976. Cognitive therapy and the emotional disorders. International Universities Press, Oxford, England.
Bertagnolli (2020) Nicolas Bertagnolli. 2020. Counsel chat: Bootstrapping high-quality therapy data.
Biyani et al. (2014) Prakhar Biyani, Cornelia Caragea, Prasenjit Mitra, and John Yen. 2014. Identifying emotional and informational support in online health communities. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 827–836, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
Chan et al. (2023) Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. Preprint, arXiv:2308.07201.
Chen et al. (2023a) Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, Defu Lian, and Enhong Chen. 2023a. When large language models meet personalization: Perspectives of challenges and opportunities. Preprint, arXiv:2307.16376.
Chen et al. (2023b) Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023b. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. Preprint, arXiv:2308.10848.
Chiu et al. (2024) Yu Ying Chiu, Ashish Sharma, Inna Wanyin Lin, and Tim Althoff. 2024. A computational framework for behavioral assessment of llm therapists. Preprint, arXiv:2401.00820.
Corey (2016) Gerald Corey. 2016. Theory and Practice of Counseling and Psychotherapy. Cengage Learning, Boston.
Du et al. (2023) Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. Preprint, arXiv:2305.14325.
Ellis (1989) Albert Ellis. 1989. Rational-emotive therapy and cognitive behavior therapy: Similarities and differences. Cognitive Therapy and Research, 13(5):325–340.
Fall et al. (2022) Kevin A. Fall, Janice Miner Holden, and Andre Marquis. 2022. Theoretical Models of Counseling and Psychotherapy, 4 edition. Routledge.
Fleiss (1971) Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.
Gilbert and Orlans (2011) Maria Gilbert and Vanja Orlans. 2011. Integrative Therapy: 100 Key Points and Techniques. Routledge, London.
Guo et al. (2024) Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. Preprint, arXiv:2402.01680.
Han et al. (2022) Seungju Han, Beomsu Kim, Jin Yong Yoo, Seokjun Seo, Sangbum Kim, Enkhbayar Erdenee, and Buru Chang. 2022. Meet your favorite character: Open-domain chatbot mimicking fictional characters with only a few utterances. Preprint, arXiv:2204.10825.
Hua et al. (2024) Yining Hua, Fenglin Liu, Kailai Yang, Zehan Li, Yi han Sheu, Peilin Zhou, Lauren V. Moran, Sophia Ananiadou, and Andrew Beam. 2024. Large language models in mental health care: a scoping review. Preprint, arXiv:2401.02984.
Ji et al. (2022a) Shaoxiong Ji, Tianlin Zhang, Luna Ansari, Jie Fu, Prayag Tiwari, and Erik Cambria. 2022a. MentalBERT: Publicly available pretrained language models for mental healthcare. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 7184–7190, Marseille, France. European Language Resources Association.
Ji et al. (2022b) Shaoxiong Ji, Tianlin Zhang, Luna Ansari, Jie Fu, Prayag Tiwari, and Erik Cambria. 2022b. MentalBERT: Publicly Available Pretrained Language Models for Mental Healthcare. In Proceedings of LREC.
Li et al. (2023) Huao Li, Yu Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Charles Lewis, and Katia Sycara. 2023. Theory of mind for multi-agent collaboration via large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 180–192, Singapore. Association for Computational Linguistics.
Li et al. (2024) Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. 2024. More agents is all you need. Preprint, arXiv:2402.05120.
Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Liu et al. (2023) June M. Liu, Donghao Li, He Cao, Tianhe Ren, Zeyi Liao, and Jiamin Wu. 2023. Chatcounselor: A large language models for mental health support. Preprint, arXiv:2309.15461.
Loh and Raamkumar (2023) Siyuan Brandon Loh and Aravind Sesagiri Raamkumar. 2023. Harnessing large language models’ empathetic response generation capabilities for online mental health counselling support. Preprint, arXiv:2310.08017.
Low et al. (2020) Daniel M Low, Laurie Rumker, John Torous, Guillermo Cecchi, Satrajit S Ghosh, and Tanya Talkar. 2020. Natural language processing reveals vulnerable mental health support groups and heightened health anxiety on reddit during covid-19: Observational study. Journal of medical Internet research, 22(10):e22635.
Lu et al. (2023) Zhenyi Lu, Wei Wei, Xiaoye Qu, Xian-Ling Mao, Dangyang Chen, and Jixiong Chen. 2023. Miracle: Towards personalized dialogue generation with latent-space multiple personal attribute control. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5933–5957, Singapore. Association for Computational Linguistics.
Madotto et al. (2021) Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu, Zhou Yu, Eunjoon Cho, Pascale Fung, and Zhiguang Wang. 2021. Continual learning in task-oriented dialogue systems. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7452–7467, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
McGrath et al. (2023) John J McGrath, Ali Al-Hamzawi, Jordi Alonso, Yasmin Altwaijri, Laura H Andrade, Evelyn J Bromet, Ronny Bruffaerts, José Miguel Caldas de Almeida, Stephanie Chardoul, Wai Tat Chiu, Louisa Degenhardt, Olga V Demler, Finola Ferry, Oye Gureje, Josep Maria Haro, Elie G Karam, Georges Karam, Salma M Khaled, Viviane Kovess-Masfety, Marta Magno, Maria Elena Medina-Mora, Jacek Moskalewicz, Fernando Navarro-Mateu, Daisuke Nishi, Oleguer Plana-Ripoll, José Posada-Villa, Charlene Rapsey, Nancy A Sampson, Juan Carlos Stagnaro, Dan J Stein, Margreet ten Have, Yolanda Torres, Cristian Vladescu, Peter W Woodruff, Zahari Zarkov, Ronald C Kessler, Sergio Aguilar-Gaxiola, Ali Al-Hamzawi, Jordi Alonso, Yasmin A. Altwaijri, Laura Helena Andrade, Lukoye Atwoli, Corina Benjet, Evelyn J. Bromet, Ronny Bruffaerts, Brendan Bunting, José Miguel Caldas de Almeida, Graça Cardoso, Stephanie Chardoul, Alfredo H. Cía, Louisa Degenhardt, Giovanni De Girolamo, Oye Gureje, Josep Maria Haro, Meredith G. Harris, Hristo Hinkov, Chi yi Hu, Peter De Jonge, Aimee N. Karam, Elie G. Karam, Georges Karam, Alan E. Kazdin, Norito Kawakami, Ronald C. Kessler, Andrzej Kiejna, Viviane Kovess-Masfety, John J. McGrath, Maria Elena Medina-Mora, Jacek Moskalewicz, Fernando Navarro-Mateu, Daisuke Nishi, Marina Piazza, José Posada-Villa, Kate M. Scott, Juan Carlos Stagnaro, Dan J. Stein, Margreet Ten Have, Yolanda Torres, Maria Carmen Viana, Daniel V. Vigo, Cristian Vladescu, David R. Williams, Peter Woodruff, Bogdan Wojtyniak, Miguel Xavier, and Alan M. Zaslavsky. 2023. Age of onset and cumulative risk of mental disorders: a cross-national analysis of population surveys from 29 countries. The Lancet Psychiatry, 10(9):668–681.
Min et al. (2022) Do June Min, Verónica Pérez-Rosas, Kenneth Resnicow, and Rada Mihalcea. 2022. PAIR: Prompt-aware margIn ranking for counselor reflection scoring in motivational interviewing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 148–158, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Mishra et al. (2023) Kshitij Mishra, Priyanshu Priya, and Asif Ekbal. 2023. PAL to lend a helping hand: Towards building an emotion adaptive polite and empathetic counseling conversational agent. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12254–12271, Toronto, Canada. Association for Computational Linguistics.
Norcross and Goldfried (2005) John C. Norcross and Marvin R. Goldfried. 2005. Handbook of Psychotherapy Integration. Oxford University Press, New York.
OpenAI (2024) OpenAI. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Preprint, arXiv:2203.02155.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA. Association for Computational Linguistics.
Park et al. (2023) Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. Preprint, arXiv:2304.03442.
Rashkin et al. (2019) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: a new benchmark and dataset. Preprint, arXiv:1811.00207.
Raskin and Rogers (2005) Nathaniel J. Raskin and Carl R. Rogers. 2005. Person-centered therapy., pages 130–165. Current psychotherapies, 7th ed., instr. ed. Thomson Brooks/Cole Publishing Co, Belmont, CA, US.
Shao et al. (2023) Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. Character-LLM: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, Singapore. Association for Computational Linguistics.
Sharma et al. (2021) Ashish Sharma, Inna W. Lin, Adam S. Miner, David C. Atkins, and Tim Althoff. 2021. Towards facilitating empathic conversations in online mental health support: A reinforcement learning approach. Preprint, arXiv:2101.07714.
Sharma et al. (2023) Ashish Sharma, Kevin Rushton, Inna Lin, David Wadden, Khendra Lucas, Adam Miner, Theresa Nguyen, and Tim Althoff. 2023. Cognitive reframing of negative thoughts through human-language model interaction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9977–10000, Toronto, Canada. Association for Computational Linguistics.
Shen et al. (2020) Siqi Shen, Charles Welch, Rada Mihalcea, and Verónica Pérez-Rosas. 2020. Counseling-style reflection generation using generative pretrained transformers with augmented context. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 10–20, 1st virtual meeting. Association for Computational Linguistics.
Swift et al. (2018) Joshua K. Swift, Jennifer L. Callahan, Michael Cooper, and Susannah Parkin. 2018. The impact of accommodating client preference in psychotherapy: A meta-analysis. Journal of Clinical Psychology, 74(11):1925–1937.
Tompkins et al. (2013) Kelley Tompkins, Joshua Swift, and Jennifer Callahan. 2013. Working with clients by incorporating their preferences. Psychotherapy (Chicago, Ill.), 50:279–83.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
Wang et al. (2023) Hongru Wang, Rui Wang, Fei Mi, Yang Deng, Zezhong Wang, Bin Liang, Ruifeng Xu, and Kam-Fai Wong. 2023. Cue-CoT: Chain-of-thought prompting for responding to in-depth dialogue questions with LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12047–12064, Singapore. Association for Computational Linguistics.
Wei et al. (2021) Wei Wei, Jiayi Liu, Xianling Mao, Guibing Guo, Feida Zhu, Pan Zhou, and Yuchong Hu. 2021. Emotion-aware chat machine: Automatic emotional response generation for human-like emotional interaction. Preprint, arXiv:2106.03044.
WHO (2022) World Health Organization WHO. 2022. Mental health and covid-19: Early evidence of the pandemic’s impact. https://iris.who.int/bitstream/handle/10665/352189/WHO-2019-nCoV-Sci-Brief-Mental-health-2022.1-eng.pdf?sequence=1.
Xu et al. (2024) Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K. Dey, and Dakuo Wang. 2024. Mental-llm: Leveraging large language models for mental health prediction via online text data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(1):1–32.
Yang et al. (2023) Kexin Yang, Dayiheng Liu, Wenqiang Lei, Baosong Yang, Mingfeng Xue, Boxing Chen, and Jun Xie. 2023. Tailor: A soft-prompt-based approach to attribute-based controlled text generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 410–427, Toronto, Canada. Association for Computational Linguistics.
Zhan et al. (2022) Hongli Zhan, Tiberiu Sosea, Cornelia Caragea, and Junyi Jessy Li. 2022. Why do you feel this way? summarizing triggers of emotions in social media posts. Preprint, arXiv:2210.12531.
Zhang et al. (2024) Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, and Shumin Deng. 2024. Exploring collaboration mechanisms for llm agents: A social psychology view. Preprint, arXiv:2310.02124.
Zhang* et al. (2020) Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Preprint, arXiv:2306.05685.
Zhou et al. (2023) Jinfeng Zhou, Zhuang Chen, Bo Wang, and Minlie Huang. 2023. Facilitating multi-turn emotional support conversation with positive emotion elicitation: A reinforcement learning approach. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1714–1729, Toronto, Canada. Association for Computational Linguistics.
Zou et al. (2021) Xu Zou, Da Yin, Qingyang Zhong, Hongxia Yang, Zhilin Yang, and Jie Tang. 2021. Controllable generation from pre-trained language models via inverse prompting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, page 2450–2460, New York, NY, USA. Association for Computing Machinery.

Appendix A Counseling Attributes

We choose three attributes to differentiate each LLM agent: reframing, unconditional positive regard, and solution-focused. These three attributes are derived from the representative characteristics of cognitive therapy (Beck, 1976), person-centered therapy (Raskin and Rogers, 2005), and solution-focused therapy (Bannink, 2007), respectively.

In the context of cognitive therapy, cognitive reframing is referred to as cognitive restructuring. Cognitive restructuring involves challenging and changing negative thoughts. It helps individuals recognize and avoid cognitive distortions (such as overgeneralization or magnifying negative behaviors) and replace them with more constructive thinking. Through this process, depressed clients can access rational thinking that is typically blocked by their cognitive distortions. In our experiment, reframing is set as an attribute that turns negative thoughts into positive ones.

Person-centered therapy, also known as client-centered or Rogerian therapy, is based on the humanistic perspective, emphasizing the inherent worth of individuals and their potential for personal growth. Unconditional positive regard is a core principle and one of the three foundational conditions for therapeutic change, along with empathy and congruence in person-centered therapy. Unconditional positive regard refers to the therapist’s acceptance and support of a client regardless of what the client says or does. This acceptance is not contingent on the client’s behavior or characteristics, meaning it is "unconditional." In our experiment, unconditional positive regard attribute functions to provide respect, recognition, and empathy towards users.

Solution-focused brief therapy is a short-term, goal-oriented therapeutic approach that focuses on helping clients find solutions to their problems rather than dwelling on the problems themselves. In our experiment, the solution-focused attribute is used to provide the best method for users to quickly solve problems.

Appendix B TherapyTalk Dataset Creation

We design TherapyTalk, a new support response dataset annotated by human mental health experts. We start from the Reddit Mental Health Dataset (Low et al., 2020), which comprises user posts from subreddits related to psychological concerns. We then process this dataset with a similarity-based methodology that structures multiple posts into cohesive units. Subsequently, mental health professionals, drawing upon their extensive background and domain knowledge, manually craft motivational responses. This intricate process was applied to a total of 97 user posts.

1. Post Collecting. The Reddit Mental Health Dataset (Low et al., 2020) contains a wealth of posts related to various concerns. We sample some of these posts to use as the basis for our dataset.
2. Post Filtering. For our experiment, posts should contain sufficient lengths of content related to mental health problems, including personal anecdotes or reflections on oneself. Thus, we filtered posts to be within a range of 500 to 2,000 characters each.
3. Post Clustering. To create a dataset with a wider and more diverse range of information, we group multiple posts into a single example. To ensure that unrelated posts are not combined, we utilize clustering techniques. We utilize MentalRoberta (Ji et al., 2022b) to map the content of each post into the embeddings, and subsequently cluster them according to the K-nearest neighbor technique.
4. Example Composition. We randomly select one post and then identify the ten closest candidate posts. From these candidate posts, we randomly choose two, treating a total of three posts as a single example.
5. Example Validation. Despite grouping posts based on similarity, conflicts in the details of each post’s content may still arise. To prevent this, we employ five human annotators who are experts in psychology and distribute the dataset among them. Four are working as professional counseling therapists, and one is a graduate student in psychology. They label any inconsistencies between posts. We remove all invalid examples.
6. Response Tagging. We instruct the experts to write responses to valid examples. We guide them to compose empathetic and helpful answers, tailored to reflect the unique characteristics of the virtual user.
7. Attribute Estimation. We employed other annotators to label how much the responses utilize the attributes of reframing, regard, and solution. These labels inform which method the experts select and how much the counseling strategy should be applied for each post. The attribute scores are used to develop the scoring model. To capture a broader relationship between responses and scores, attribute scores are also labeled for artificially generated responses to each post.

Finally, each example consists of three posts containing similar concerns, and tailored and empathetic responses from experts that can provide assistance. Through this process, we construct a dataset comprising a total of 97 examples. We utilize this dataset as our test set for evaluation.
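
The filtering and example-composition steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used to build TherapyTalk: the function names are hypothetical, the embeddings are assumed to be precomputed vectors (for instance from MentalRoberta), and cosine similarity is an assumed choice, since the similarity measure is not specified here.

```python
import math
import random


def filter_posts(posts, min_chars=500, max_chars=2000):
    """Keep posts whose character length lies in [min_chars, max_chars]."""
    return [p for p in posts if min_chars <= len(p) <= max_chars]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compose_example(embeddings, rng, k=10, n_pick=2):
    """Pick a random anchor post, take its k nearest posts, sample n_pick of them.

    Returns the indices of the posts forming one example, anchor first.
    Requires at least n_pick other posts to sample from.
    """
    anchor = rng.randrange(len(embeddings))
    others = [i for i in range(len(embeddings)) if i != anchor]
    # Rank the remaining posts by decreasing similarity to the anchor.
    others.sort(key=lambda i: cosine(embeddings[anchor], embeddings[i]), reverse=True)
    candidates = others[:k]
    return [anchor] + rng.sample(candidates, n_pick)
```

Calling `compose_example(embeddings, random.Random(0))` returns three distinct indices, which mirrors the three-post examples described above; validation by human experts would still follow as a separate manual step.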

Post 1	I often feel inferior to those around me. When I’m with my friends, I don’t feel cool enough. This feeling is particularly strong when I talk to girls; I always feel inadequate, which causes me to pull away and stop interacting. I’ve never been good at talking or making conversation, leading to awkward situations, which I believe is the root of the problem. I’m not sure what I hope to gain from sharing this, but I thought some of you might relate.
Post 2	Is it common to feel like you have different personalities with different people? I’m a 19-year-old male, and lately, I’ve noticed that I unknowingly project different aspects of myself to different people. It’s making me feel like I have split personalities. For example, some friends believe I spend all day playing video games and partying, while others think I’m deeply into art. Some see me as very introverted, while others view me as outgoing. I feel like no one truly understands who I am except for myself, and each friend group only sees certain sides of me. This is really starting to affect me, as I’m unsure of who I really am or what I want when I’m alone. Is this normal?
Post 3	Socializing and interacting with others feels overwhelming and drains my energy. Do others find it easier than I do, like they’re playing a complex game while I struggle with something much simpler? No matter how I approach it, I often feel like people see me as strange or unconventional. I find it confusing, whether it’s initiating conversations or navigating the complexities of social norms like when it’s appropriate to invite someone based on various factors. It’s frustrating and it’s starting to really get to me. All I want is to feel comfortable talking to people.
Response	I understand how "being social" can be a lot of stress as we often have to switch on and off our personal aspects regardless of how we feel at that moment. There are also many things we have to consider before, while, and sometimes even after a conversation with others. I feel this thought overflows in our mind not just when a person is more likely to be an inferior type, but when a person loses himself or herself in a conversation. We are often way too sensitive about how other people portray us - we want to be "good". We are then slowly encroached by the thought, ’Who am I?’ I know it is not easy to refrain from this chain of thought, but remember that it is actually your brain that judges you first, not the others!

Figure 7: An example of our dataset TherapyTalk. The contents of the user post are not original and are rephrased.

Figure 7 is an example from TherapyTalk. The response follows our guidelines by incorporating information specific to each post. It includes emotional empathy towards the author and provides simple tips to help improve the author’s issues.

Appendix C Human Evaluation Guideline and Criteria

We employ three human annotators for the evaluation. These annotators possess knowledge of psychology. We ask them to rate the responses on a scale of 1 to 5 based on the following criteria:

• Understanding: measuring how well the response interprets and reflects the user’s emotional state and experiences.
• Relevance: assessing the pertinence of the response to the user’s specific issues.
• Professionalism: examining the integration of three strategies (Reframing, Solution-focused approaches, and Positive regard).
• Customization: determining the personalization of the response to the user’s unique situation.
• Satisfaction: rating how well the response aligns with the user’s preferences and expectations.

We instruct the annotators about the goals of our study and evaluation, including explanations of the confidentiality measures in place. Additionally, we explain the content warnings associated with the sensitive material they will encounter, ensuring they are prepared to handle potentially distressing content responsibly. We also compensate the annotators at a rate of approximately $12 per hour. The explanation and analysis of the results are provided in Section 5.1.
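
As an illustration of how ratings on the five criteria could be aggregated, the sketch below averages each criterion across annotators. The data layout (one dictionary per annotator) and the function name are assumptions made for this example; the aggregation actually used in our analysis may differ.

```python
from statistics import mean

CRITERIA = (
    "Understanding",
    "Relevance",
    "Professionalism",
    "Customization",
    "Satisfaction",
)


def average_ratings(ratings):
    """Average 1-5 ratings per criterion across annotators.

    `ratings` holds one dict per annotator, mapping criterion name to score.
    """
    for annotator in ratings:
        for criterion in CRITERIA:
            score = annotator[criterion]
            if not 1 <= score <= 5:
                raise ValueError(f"{criterion} rating {score} is outside the 1-5 scale")
    # One mean per criterion, taken over all annotators.
    return {c: mean(a[c] for a in ratings) for c in CRITERIA}
```

With three annotators, each returned value is the mean of three scores for one response, which keeps the per-criterion breakdown rather than collapsing everything into a single number.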

Appendix D User Study Guideline

We conduct a user study to evaluate our response generation framework, recruiting 18 participants from a diverse demographic including university students and working professionals to cover a broad range of personal concerns (Section 7). Participants provide their concerns through a Google form and receive approximately four tailored responses in English within a week. They are then asked to select the response that best aligns with their preferences and expectations based on their personal criteria. Participants are compensated with approximately $8 for their participation. We ensure that all participants are thoroughly informed about the study’s objectives and the confidentiality measures in place. Content warnings are provided to prepare them for potentially sensitive material they might encounter during the evaluation. Additionally, participants agree to terms of use, which include detailed information on how their data will be utilized in the study.

Appendix E Additional Evaluation Results

Table 5 presents the results of the LLM Evaluation. The analysis of these results and the details of the related experiments are discussed in Section 5.1.

		Tailored Response Dataset					Counsel Chat Dataset
Model	Method	Cus.	Sat.	Pro.	Rel.	Und.	Cus.	Sat.	Pro.	Rel.	Und.
GPT-4-0125	SA	3.88	4.00	4.13	4.19	4.25	4.36	4.27	4.64	4.27	4.00
	SAA	3.63	4.00	4.00	3.88	4.16	4.09	4.09	4.36	4.27	4.00
	MAA	4.06	4.19	3.69	4.25	4.35	4.10	4.10	4.27	4.27	4.33
	MentalAgora	4.69	4.75	4.25	4.94	5.00	4.36	4.55	4.82	4.73	4.33
GPT-3.5-turbo	SA	3.56	3.78	4.11	3.89	3.78	4.36	4.27	4.64	4.27	4.00
	SAA	3.33	3.11	3.56	3.56	3.33	4.09	4.09	4.36	4.27	4.00
	MAA	3.78	3.78	4.33	4.00	4.00	4.09	4.09	4.27	4.27	4.33
	MentalAgora	4.33	4.22	4.89	4.67	4.67	4.36	4.55	4.82	4.73	4.33
LLAMA-2-13b	SA	1.70	1.80	1.90	1.80	1.90	3.25	3.50	3.88	3.50	3.50
	SAA	1.90	2.00	2.00	1.90	2.20	2.62	2.62	2.75	2.62	2.75
	MAA	1.80	1.80	1.90	1.80	1.80	4.00	4.00	4.00	4.25	4.25
	MentalAgora	2.30	2.20	2.20	2.40	2.30	4.62	4.88	4.75	4.62	4.88
MentalAlpaca	SA	3.92	4.00	4.08	4.08	3.92	3.25	3.50	3.88	3.50	3.50
	SAA	2.15	2.15	2.38	2.08	2.23	2.62	2.62	2.75	2.62	2.75
	MAA	3.92	3.77	4.23	3.92	4.08	4.62	4.88	4.75	4.62	4.88
	MentalAgora	4.31	4.15	4.31	4.38	4.23	4.00	4.00	4.00	4.25	4.25

Table 5: This table showcases the scores assigned by GPT4-turbo, serving as a judge using a 5-point Likert scale across two evaluation datasets. The evaluation criteria include Customization (Cus.), Satisfaction (Sat.), Professionalism (Pro.), Relevance (Rel.), and Understanding (Und.). Scores range from 1 (Poor) to 5 (Excellent).
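
To compare methods at a glance, one could average the five criteria for each method. The sketch below does this for the GPT-4-0125 rows on the first dataset, with the scores copied from Table 5. The unweighted mean is an illustrative summary of our own choosing here, not a metric reported in the table.

```python
# Scores for GPT-4-0125 on the first dataset, copied from Table 5
# (order: Cus., Sat., Pro., Rel., Und.).
scores = {
    "SA": [3.88, 4.00, 4.13, 4.19, 4.25],
    "SAA": [3.63, 4.00, 4.00, 3.88, 4.16],
    "MAA": [4.06, 4.19, 3.69, 4.25, 4.35],
    "MentalAgora": [4.69, 4.75, 4.25, 4.94, 5.00],
}


def mean_score(values):
    """Unweighted mean over the five criteria (an illustrative summary only)."""
    return sum(values) / len(values)


means = {method: round(mean_score(vals), 3) for method, vals in scores.items()}
best = max(means, key=means.get)  # method with the highest mean score
```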

Appendix F Prompts

Figure 8 and Figure 9 provide examples of the role-setting prompts for each agent and the prompts used for debating in our framework.

Appendix G Examples

Table 6 presents a case study comparing MentalAgora with other configurations, focusing on the inclusion of specific counseling attributes. The responses are color-coded to highlight these attributes: red indicates Reframing, orange indicates Unconditional Positive Regard, and blue indicates Solution-Focused. By systematically removing each attribute from the responses, we can analyze the impact of their absence or presence. The table is an example of the experimental results discussed in Section 5. The input post is a paraphrase, not the actual user post.

Post	Whenever I go on a date with someone new, I get so anxious beforehand that I feel sick, and it’s becoming increasingly frustrating. This started in the past year, ever since I began exploring the dating scene. I keep telling myself that it’s no big deal, it’s just one date, I don’t have to see the person again if I don’t want to, and I won’t be pressured into anything. But despite these reassurances, on the day of the date, I can’t control the anxiety. I get a stomach ache that feels like intense butterflies, which makes me nauseous. All I can think about is how sick I feel and how I don’t want to show up to the date like this. I want to enjoy it, but I can’t make the feeling stop. Normally, I manage my anxiety well in social situations, like meeting groups of people or friends of friends, but my brain seems to interpret dates differently. It’s really frustrating.
Method	Response
MentalAgora	It sounds like you’re facing significant anxiety with dating, a common challenge yet uniquely intense for each person. Your capacity to handle social situations well typically, yet feeling overwhelmed by dating, highlights how personal and intricate anxiety triggers can be. Seeing these feelings as your body’s way of signaling a deep desire for connection and personal growth can be empowering. Anxiety, in its essence, isn’t your enemy but a part of you that’s asking for attention and care. Let’s try reframing this experience: What if these intense feelings are not barriers but signposts, guiding you towards understanding what you truly seek in relationships and within yourself? This perspective doesn’t downplay your discomfort but acknowledges your strength and resilience in facing these emotions. Practically speaking, before your next date, try this strategy: Allocate a specific time to focus on your concerns, say 10 minutes, to acknowledge and address them. Then, shift your focus to positive affirmations and visualize a successful and enjoyable date experience.
Removing Reframing	It sounds like you’ve found yourself in a really tough spot, where your excitement for new connections is overshadowed by substantial anxiety. It’s perfectly okay to feel this way, and your experiences are valid. Dating can indeed trigger deeper anxieties, and it’s commendable that you’re seeking ways to navigate this. Together, let’s explore strategies to help manage these feelings. First, consider grounding techniques before your dates, such as deep breathing or visualizing a positive outcome to help ease your nervous system. It might also be beneficial before the date to engage in an activity you find relaxing, whether it’s listening to music, a short walk, or reading something enjoyable. Additionally, setting smaller, manageable goals for each date can gradually build your confidence.
Removing Positive-regard	It sounds like you’re experiencing intense anxiety specific to dating, which is both distressing and frustrating, especially when you feel confident in other social situations. Let’s explore two paths to manage this: first, practical steps to reduce anxiety in the moment, and second, shifting your perspective on dating to reduce long-term stress. For immediate relief, consider establishing a pre-date routine that calms you. This could be deep breathing exercises, a short walk, listening to your favorite music, or anything else that grounds you. Additionally, setting up a safe call or text with a friend during or after the date can provide a comforting safety net. In terms of reframing your perspective, it might help to see each date not as an evaluation of your worth.

Table 6: Case study of MentalAgora and other configurations. The responses highlight the inclusion of specific counseling attributes: red indicates Reframing, orange indicates Unconditional Positive Regard, and blue indicates Solution-Focused. By removing each attribute from the responses, we can observe how the absence or presence of each attribute impacts the overall response. For instance, in the absence of Reframing, the response may lack a shift in perspective, while the absence of Unconditional Positive Regard might result in less empathy and acceptance. Conversely, some responses might inadvertently emphasize other attributes more strongly when one is removed.

Removing Solution-focused	It sounds like you’ve encountered quite a challenging hurdle in your dating experiences, and it’s understandable how this level of anxiety can feel overwhelming. What stands out is your awareness and introspection about your feelings. That’s a significant first step. With the nervousness manifesting physically, it’s clear this response isn’t something you can simply talk yourself out of, and that’s okay. Your feelings are valid, and it’s important to acknowledge them without judgment. Incorporating a blend of acceptance and cognitive reframing might offer some relief. Begin by affirming that it’s perfectly normal to feel anxious about dating—it’s a sign of your desire for meaningful connections. Then, let’s gently explore these feelings. What specific thoughts contribute to your anxiety? Can we challenge them or look at them differently? Maybe we can gradually shift the perspective from fearing.
Reframing only	It sounds like you’re experiencing intense pre-date anxiety, and I hear how frustrating and challenging this is for you. Your feelings are valid, and it’s common to feel anxious about dating. Consider trying to reframe your thoughts about dating. Instead of viewing each date as a high-stakes event, think of it as an opportunity to meet someone new and learn about them, with no pressure on the outcome. To manage your physiological symptoms, deep breathing exercises can be helpful. Try taking deep, slow breaths before your date to calm your nervous system. Additionally, grounding techniques, like noticing five things you can see, four things you can touch, three things you can hear, two things you can smell, and one thing you can taste, can also bring you back to the present moment and help reduce anxiety.
Positive-regard only	It sounds like you’re experiencing a lot of anxiety around dating, which is understandable given it’s a relatively new experience for you. Your feelings are completely valid, and it’s important to remember you’re not alone in feeling this way. Dating can indeed be nerve-wracking, but there are ways to manage the anxiety you’re feeling. First, applying Cognitive-Behavioral Techniques (CBT) might help you challenge and reframe any negative thoughts you have about dating. For instance, instead of thinking, "This date will go poorly because I’m anxious," try to think, "I’m anxious because I’m trying something new, and that’s okay. I can handle it." Incorporating Mindfulness-Based Stress Reduction (MBSR) techniques can also be beneficial.
Solution-focused only	I understand how frustrating and challenging it can feel to experience such intense anxiety before a date, especially when you’re eager to enjoy these experiences. The physical symptoms you’re describing are common manifestations of anxiety, and it’s important to acknowledge them while also working towards managing their impact. Let’s start by addressing the immediate symptoms with some practical strategies. Deep breathing exercises, mindfulness, or even a brief walk can be helpful to soothe your nervous system. Also, preparing a small ‘ritual’ before your dates—like listening to a favorite song or having a reassuring talk with a friend—can serve as a grounding mechanism. Now, for a more lasting strategy, consider reflecting on past situations where you’ve successfully managed your anxiety. What skills did you utilize then that could be applicable now? Incrementally setting goals for each date can also build your confidence.

Table 7: Case study of MentalAgora and other configurations. The responses highlight the inclusion of specific counseling attributes: red indicates Reframing, orange indicates Unconditional Positive Regard, and blue indicates Solution-Focused. By removing each attribute from the responses, we can observe how the absence or presence of each attribute impacts the overall response. For instance, in the absence of Reframing, the response may lack a shift in perspective, while the absence of Unconditional Positive Regard might result in less empathy and acceptance. Conversely, some responses might inadvertently emphasize other attributes more strongly when one is removed.