Z. Cui1,†, C. Lei2,3,†, W. Wu4, Y. Duan2,3, D. Qu2,3, J. Wu1, R. Chen2,3,∗, C. Zhang1,∗

† Co-first author. ∗ Corresponding author.

Spontaneous Speech-Based Suicide Risk Detection
Using Whisper and Large Language Models

Abstract

The early detection of suicide risk is important because it enables intervention to prevent potential suicide attempts. This paper studies the automatic detection of suicide risk based on spontaneous speech from adolescents, and collects a Mandarin dataset with 15 hours of suicide speech from more than a thousand adolescents aged ten to eighteen for our experiments. To leverage the diverse acoustic and linguistic features embedded in spontaneous speech, both the Whisper speech model and textual large language models (LLMs) are used for suicide risk detection. Both all-parameter finetuning and parameter-efficient finetuning approaches are used to adapt the pre-trained models for suicide risk detection, and multiple audio-text fusion approaches are evaluated to combine the representations of Whisper and the LLM. The proposed system achieves a detection accuracy of 0.807 and an F1-score of 0.846 on the test set with 119 subjects, indicating promising potential for real suicide risk detection applications.

Index Terms: Suicide Risk, Large Language Model, Whisper

1 Introduction

Suicide is a serious global public health issue with more than 0.7 million people dying from suicide every year around the world [1]. Suicide is also a leading cause of death among young individuals aged ten to twenty-four [2, 3]. Early detection and intervention are crucial and serve as the most effective ways to prevent potential suicide attempts.

Diagnosis of suicide risk is challenging since there is no single clinical characterisation of a suicidal individual and it relies heavily on the ability, desire and honesty of a patient to communicate their symptoms, moods or cognitions [4]. Automatic suicide risk detection has been explored based on clinical interviews [5], questionnaires [6], health records [7], suicide notes [8, 9] and social media content [10, 11]. A growing body of research has shown that speech is useful for detecting individuals at high risk of suicide [4, 12, 13, 14]. Speech contains both semantic and non-semantic information and can be measured cheaply, remotely, non-invasively and non-intrusively. As an individual becomes pre-suicidal, their speech undergoes discernible changes in its source features (e.g. jitter, shimmer), prosodic features (e.g. F0), formant features, and spectral features [4], and tends to contain more disfluency, with more hesitations and speech errors [13]. In terms of semantics, suicidal speech tends to have different top words from non-suicidal speech [14].

Foundation models are large-scale models pre-trained on vast amounts of data across a wide range of domains, which can gain specialised capabilities through finetuning or transfer learning. Foundation models have produced superior performance on many speech processing tasks such as automatic speech recognition [15], emotion recognition [16], and detection of mental disorders and cognitive diseases such as depression [17] and Alzheimer’s disease [18, 19]. In particular, large language models (LLMs) have been developing rapidly and have had a large impact on various research fields, including medical-related tasks [20, 21, 22]. Despite their success, the use of large foundation models for automatic suicide risk detection has not been studied.

This work investigates the use of large foundation models for speech-based automatic suicide risk detection. Because no suicide speech dataset is publicly available, a dataset containing spontaneous speech on suicide-related topics is collected. The dataset covers over 1000 adolescents aged 10 to 18. Both speech foundation models and LLMs are applied to detect suicide risk. We first investigate different tuning strategies for speech and text models separately, including all-parameter finetuning (APFT) and parameter-efficient finetuning (PEFT) [23]. Then, different fusion methods are tested to combine speech and text modalities. Furthermore, we analyse the impact of various prompt formats on the LLMs’ performance. To the best of our knowledge, this is the first work that applies speech foundation models and LLMs to suicide risk detection.

The rest of the paper is organised as follows. Section 2 describes the dataset collected in this study. Section 3 introduces the proposed method for automatic suicide risk detection. The experimental setup is shown in Section 4. Section 5 presents the results and discussions. We conclude in Section 6.

2 Suicide Data

The dataset was collected from 47 primary and secondary schools in Guangdong, China. The interviewing team consists of volunteers majoring in relevant fields such as psychology and education, who received project-specific training to ensure the professionalism of the interviews. Based on city-wide screening data, we selected at-risk students according to their history of suicide or self-harm behaviour. Additionally, we included some non-risk students as a control group for the interviews.

The interview consists of informed consent, cognitive tests (including the Animal Fluency Test, Digit Span Test, Short-term Memory Test and Long-term Memory Test), the Mini International Neuropsychiatric Interview for Children and Adolescents (MINI-KID) [24], and a speech interview including question answering, reading, etc. The interviews were conducted in quiet classrooms within the schools, using uniformly standardised recording devices to ensure recording quality and primarily capturing the voices of the participants. The ethical aspects of the project’s interviews have been reviewed and approved by the Medical Ethics Committee of the Tsinghua University Technology Ethics Committee.

We collected recordings of 1179 adolescents aged 10 to 18. All recordings were conducted in Mandarin Chinese. Based on the MINI-KID results, we obtained the suicide risk level of all 1179 participants, of whom 631 (53.5%) are at suicide risk. Detailed statistics regarding age and gender can be found in Table 1.

In this paper, we utilised the audio recordings of participants’ self-introductions, elicited by the question “How would you describe yourself?”. This question was selected because it draws a diverse range of responses from different participants. Beyond paralinguistic cues, these responses also contain valuable semantic information that can be used for classification. The total length of the recordings used in this paper is 905 minutes. The recordings were segmented into utterances using a voice activity detection module from the FunASR toolkit [25]. Pauses between utterances were removed. The dataset was divided into a train-dev-test split with a ratio of 8:1:1, with details shown in Table 1. The distributions of age, gender, and suicide risk are roughly the same across the three groups.
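The split procedure itself is not described beyond the 8:1:1 ratio and the roughly matched distributions. As a minimal sketch, assuming a simple per-label stratified shuffle (the function name `stratified_split`, the seed, and the rounding rule are all hypothetical, and only the risk label is balanced here, not age or gender), such a split could look like this:

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(8, 1, 1), seed=0):
    """Split subject indices into train/dev/test with similar label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    total = sum(ratios)
    train, dev, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n = len(indices)
        n_train = round(n * ratios[0] / total)
        n_dev = round(n * ratios[1] / total)
        train.extend(indices[:n_train])
        dev.extend(indices[n_train:n_train + n_dev])
        test.extend(indices[n_train + n_dev:])  # remainder goes to test
    return train, dev, test


# Label counts taken from the text: 631 at-risk (1) out of 1179 participants.
labels = [1] * 631 + [0] * (1179 - 631)
train, dev, test = stratified_split(labels)
print(len(train), len(dev), len(test))
```

Because of rounding, the subset sizes produced by this sketch need not equal the sizes reported in Table 1.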

Table 1: Statistics about age, gender and train-dev-test split of the collected dataset.
Age                    10 to 12   13 to 15   16 to 18
                       257        546        376
Gender                 Male       Female
                       761        428
Train-Dev-Test Split   Train      Dev        Test
                       944        116        119

3 Automatic Suicide Risk Detection

The proposed pipeline is illustrated in Figure 1. The system contains two branches. In the audio branch, the speech recording is fed into a speech foundation model to extract acoustic embeddings. In the text branch, an automatic speech recognition model (i.e. a Whisper-Large model [26]) first transcribes the input speech recording. The transcriptions are then encoded by a text foundation model to obtain text embeddings. The acoustic and text embeddings are then fused for the detection of suicide risk. Different foundation models, finetuning strategies, and fusion methods are investigated.

3.1 Speech foundation model: Whisper

A Whisper-Large-v3 model [26] (https://huggingface.co/openai/whisper-large-v3) is used to extract acoustic embeddings. The Whisper-Large-v3 model is a Transformer-based encoder-decoder model pre-trained on 1 million hours of weakly labelled audio and 4 million hours of pseudo-labelled audio. The training data consists of diverse audio sources from various environments, recording conditions, speakers, and languages. The diversity and scale of the training data empower the Whisper model to generate high-quality audio embeddings, enabling robust representation of audio content. The encoder part of the Whisper-Large-v3 model, which contains 640M parameters, is used in this study for audio feature extraction.

Figure 1: Overall pipeline. Modal fusion on acoustic feature and text feature is performed, followed by a classifier for detection of suicide risk.

3.2 Large Language Models

For text foundation models, Baichuan2 [27] and Qwen1.5 [28] are compared. Both are Transformer-based LLMs, and the 7B-parameter version of each is used in this study. Baichuan2 was open-sourced by Baichuan Intelligence Inc. and is pre-trained on a corpus of 2.6 trillion tokens. Qwen1.5 was released by Alibaba Cloud and is pre-trained on 2.4 trillion tokens.

3.3 Finetuning strategy

Both speech and text foundation models are finetuned on the training set, which adapts the models to better recognise patterns and features relevant to identifying suicide risk. Both APFT and PEFT are investigated. During APFT, all parameters of the foundation model are updated. For PEFT, LoRA finetuning [23] is utilised: the weights of the pre-trained model are frozen and a set of trainable rank decomposition matrices is injected into each layer of the Transformer architecture, which greatly reduces the number of trainable parameters. With PEFT, there are only 3M trainable parameters for the Whisper encoder and about 8M trainable parameters for each 7B LLM.
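To make the reduction in trainable parameters concrete, the sketch below counts the parameters that LoRA adds to one weight matrix: a rank-r update of a d_out x d_in matrix uses two factors with r(d_in + d_out) parameters in total. The hidden size, number of layers, number of adapted matrices per layer, and rank are hypothetical values chosen for illustration; the rank and target modules actually used are not stated in the text.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors added to one d_out x d_in weight."""
    return rank * (d_in + d_out)


# Hypothetical configuration (not taken from the paper).
hidden_size = 4096       # width of each adapted square projection
num_layers = 32          # number of Transformer layers
adapted_per_layer = 2    # e.g. two attention projections per layer
rank = 16                # LoRA rank

per_matrix = lora_trainable_params(hidden_size, hidden_size, rank)
total_lora = per_matrix * adapted_per_layer * num_layers
frozen_in_adapted = hidden_size * hidden_size * adapted_per_layer * num_layers

print(total_lora)  # 8388608
print(f"share of the adapted matrices: {total_lora / frozen_in_adapted:.4%}")
```

Under these assumed settings the count is of the same order as the roughly 8M figure quoted for the 7B LLMs, but other combinations of rank and target modules give similar totals.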

3.4 Fusion methods

Figure 2: Model structure of the two fusion methods: (a) concatenation fusion; (b) in-context fusion. For concatenation, the speech and text features are extracted separately and pooled before being concatenated. For in-context fusion, the speech embedding is mapped to the LLM’s hidden space and concatenated with the embedded tokens before being fed into the decoder layers of the LLM.

Two types of fusion approaches are compared: concatenation fusion and in-context fusion. The detailed structure of the two types of fusion is shown in Figure 2(a) and Figure 2(b).

In the concatenation-based approach, temporal pooling is first applied to both the acoustic embedding and the text embedding. Mean pooling is adopted for models based on the Transformer encoder (i.e. the Whisper encoder), while the last hidden state is used for models based on the Transformer decoder (i.e. the LLMs). The last state is chosen because the decoder’s auto-regressive structure lets the final step draw on all preceding steps, so it encapsulates the most information. The pooled speech embedding and text embedding are then concatenated and fed into a fully connected layer for binary classification of suicide risk.
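The concatenation fusion described above can be sketched with plain Python lists standing in for tensors. This is an illustrative toy rather than the implementation used in the paper: the helper names, the tiny dimensions, and the choice of two output logits for the binary decision are assumptions.

```python
def mean_pool(frames):
    """Average a sequence of frame vectors over time (encoder output)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def last_state(states):
    """Take the final hidden state of the sequence (decoder-only LLM output)."""
    return list(states[-1])


def linear(x, weight, bias):
    """Fully connected layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def concat_fusion_logits(speech_frames, text_states, weight, bias):
    """Pool each modality, concatenate the pooled vectors, then classify."""
    fused = mean_pool(speech_frames) + last_state(text_states)
    return linear(fused, weight, bias)


# Toy inputs: 2 speech frames of size 2, 2 text states of size 3.
speech_frames = [[1.0, 3.0], [3.0, 5.0]]
text_states = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
weight = [[1.0, 0.0, 0.0, 0.0, 0.0],   # 2 outputs x (2 + 3) fused inputs
          [0.0, 0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]

logits = concat_fusion_logits(speech_frames, text_states, weight, bias)
print(logits)  # [2.0, 3.0]
```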

In the in-context fusion method, the speech embedding is fed into the LLM as in-context information. The speech embedding first passes through a fully connected layer, mapping it from the speech model’s hidden space to the LLM’s hidden space. The transcriptions are tokenized and encoded by the embedding layer of the text foundation model; the resulting token embeddings are then concatenated with the mapped speech embedding and fed into the Transformer decoder layers. The classifier also consists of a fully connected layer, which produces the binary diagnostic result.
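A shape-level sketch of how the in-context input sequence might be assembled is given below, again with Python lists in place of tensors. Placing the mapped speech vectors before the token embeddings, and keeping one vector per speech frame, are assumptions; the text does not specify the order or whether the speech embedding is pooled first.

```python
def project(frames, weight, bias):
    """Map each speech vector to the LLM hidden size with a fully connected layer."""
    return [
        [sum(w * v for w, v in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


def in_context_inputs(speech_frames, token_embeddings, weight, bias):
    """Concatenate mapped speech vectors and token embeddings along the sequence axis."""
    mapped = project(speech_frames, weight, bias)
    return mapped + token_embeddings  # assumed order: speech first, then text


# Toy sizes: speech hidden size 2, LLM hidden size 3.
speech_frames = [[1.0, 2.0], [3.0, 4.0]]
token_embeddings = [[0.5, 0.5, 0.5]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2 projection
bias = [0.0, 0.0, 0.0]

sequence = in_context_inputs(speech_frames, token_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 3 3
```

The resulting sequence would then be passed through the decoder layers, with the classifier applied on top.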

4 Experiment Setup

4.1 Baselines

The proposed method is compared to the following three baselines:

  • “eGeMAPs+SVM”: a system consisting of hand-crafted features and a traditional machine learning classifier. The eGeMAPS feature set [29] is a carefully selected collection of acoustic parameters designed to capture emotions, states, and traits in human speech. The eGeMAPS features are extracted with the openSMILE toolkit [30] and used as input to an SVM classifier.

  • “W2V2+BERT”: a system that uses the Wav2Vec2-XLSR-53 model [31] (https://huggingface.co/facebook/wav2vec2-large-xlsr-53) for acoustic embeddings and the BERT-Base-Chinese model [32] (https://huggingface.co/google-bert/bert-base-chinese) for text embeddings, fused by concatenation.

  • “Qwen-few-shot”: 5-shot learning with the Qwen1.5-7B-Chat model, which is obtained from the Qwen1.5-7B model by post-training with both supervised finetuning and direct preference optimisation for alignment.

4.2 Implementation Details

The system was implemented in PyTorch using the Huggingface framework. Speed perturbation was applied when finetuning the speech foundation model. The Sox toolkit (https://sourceforge.net/projects/sox) was used to alter the speed to 0.9 and 1.1 times the original while keeping the pitch unchanged, given the important role of pitch in suicide risk detection. During the finetuning stage, utterances were segmented into chunks with a window length of 30 seconds and a window shift of 10 seconds. The PEFT toolkit [33] was utilised for LoRA finetuning. The foundation models were finetuned for three epochs and the classifier was trained for ten epochs. For APFT, the learning rate was selected from $1\times 10^{-6}$, $3\times 10^{-6}$, and $1\times 10^{-5}$ based on the validation set performance. For PEFT, a learning rate of $1\times 10^{-4}$ was used. For training of the classifier, the learning rate was set to $3\times 10^{-4}$. A linear scheduler was utilised for learning rate adjustment, with a warm-up period that covered 20% of the total steps.
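Two of the settings above, the 30-second window with a 10-second shift and the linear schedule with 20% warm-up, can be sketched as follows. Both functions are illustrative assumptions: the handling of the final, shorter chunk and the linear decay to zero after warm-up are not specified in the text, and the names `chunk_bounds` and `linear_warmup_lr` are hypothetical.

```python
def chunk_bounds(duration, window=30.0, shift=10.0):
    """Return (start, end) times in seconds of sliding-window chunks."""
    if duration <= window:
        return [(0.0, duration)]
    bounds = []
    start = 0.0
    while start + window < duration:
        bounds.append((start, start + window))
        start += shift
    bounds.append((start, duration))  # last chunk ends at the recording end
    return bounds


def linear_warmup_lr(step, total_steps, peak_lr, warmup_ratio=0.2):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)


print(chunk_bounds(55.0))
print([linear_warmup_lr(s, 100, 1e-4) for s in (0, 10, 20, 60, 100)])
```

For a 55-second recording this yields four overlapping chunks, and the learning rate peaks at step 20 of 100.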

To ensure the reliability of experimental results and reduce the impact of randomness, we conducted five independent experiments for each setting using different random seeds. The average and standard deviation of these five results are reported.
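The reporting convention can be sketched as below. The five accuracies are hypothetical placeholders, not results from the paper, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not say which was used.

```python
from statistics import mean, stdev


def summarise(accuracies):
    """Return the mean and sample standard deviation over seeds."""
    return mean(accuracies), stdev(accuracies)


# Hypothetical accuracies from five seeds (illustration only).
seed_accuracies = [0.75, 0.77, 0.76, 0.78, 0.74]
avg, std = summarise(seed_accuracies)
print(f"{avg:.3f} ± {std:.3f}")  # 0.760 ± 0.016
```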

5 Results and Discussion

The detection results of the baseline methods on the collected dataset are shown in Table 2. The accuracy of “eGeMAPs+SVM” on the test set is 0.537. For “W2V2+BERT”, the average accuracy over 5 seeds is 0.694. As for “Qwen-few-shot”, the model failed to give a response for 45 of the 119 samples in the test set. Among the 74 samples with a clear response, the accuracy is 0.689. Counting the samples without a response as errors, the overall accuracy on the test set is 0.429.
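The two “Qwen-few-shot” figures are consistent with each other if samples without a response are counted as errors. The short check below makes the arithmetic explicit; the number of correct answers (51) is inferred from the reported 0.689 and is not stated in the text.

```python
total_samples = 119
no_response = 45
answered = total_samples - no_response   # 74 samples with a clear response

# Inferred count of correct answers from the reported accuracy of 0.689.
correct = round(0.689 * answered)

acc_answered = correct / answered        # accuracy among answered samples
acc_overall = correct / total_samples    # non-responses counted as errors

print(correct, round(acc_answered, 3), round(acc_overall, 3))  # 51 0.689 0.429
```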

Table 2: Results of the baselines. “eGeMAPs+SVM” stands for eGeMAPS features with an SVM classifier. For “W2V2+BERT”, the average accuracy over 5 seeds is reported. The “Qwen-few-shot” accuracy is the overall result, covering both samples with a clear response and those for which the model failed to give a response.
Method Accuracy
eGeMAPs+SVM 0.537
W2V2+BERT 0.694
Qwen-few-shot 0.429

5.1 Comparison of Different Tuning and Fusion Strategies

Table 3: Results combining different tuning and fusion strategies. “CC” stands for concatenation fusion. “IC” stands for in-context fusion. The accuracy is reported as mean ± standard deviation. “-” marks a modality or fusion step that is not used.
No.  Speech model  Speech tuning  Text model  Text tuning  Fusion  Accuracy
1    Whisper       APFT           -           -            -       0.758 ± 0.029
2    Whisper       PEFT           -           -            -       0.633 ± 0.036
3    -             -              Baichuan2   APFT         -       0.763 ± 0.015
4    -             -              Baichuan2   PEFT         -       0.698 ± 0.043
5    -             -              Qwen1.5     APFT         -       0.675 ± 0.034
6    -             -              Qwen1.5     PEFT         -       0.737 ± 0.024
7    Whisper       APFT           Baichuan2   APFT         CC      0.768 ± 0.021
8    Whisper       PEFT           Baichuan2   PEFT         CC      0.718 ± 0.040
9    Whisper       APFT           Qwen1.5     APFT         CC      0.710 ± 0.029
10   Whisper       PEFT           Qwen1.5     PEFT         CC      0.752 ± 0.014
11   Whisper       APFT           Qwen1.5     APFT         IC      0.730 ± 0.033
12   Whisper       APFT           Qwen1.5     PEFT         IC      0.743 ± 0.026
13   Whisper       PEFT           Qwen1.5     PEFT         IC      0.694 ± 0.019

The classification results of systems using different tuning and fusion strategies are compared in Table 3. For LLMs, Baichuan2-7B-Base and Qwen1.5-7B were compared. For tuning strategies, APFT and PEFT were investigated. PEFT trains less than 0.2% of the parameters updated in APFT. Regarding modal fusion, single-modality systems, concatenation fusion and in-context fusion were studied.

Comparing APFT and PEFT (No. 1 & 2, No. 3 & 4, No. 5 & 6), APFT produced much better results on the Whisper and Baichuan2 models, while PEFT gave the better result on the Qwen1.5 model.

When applying concatenation fusion with APFT (No. 1, 3 & 7), despite the good individual results of Whisper APFT and Baichuan2 APFT, fusing the two models yielded little improvement. This may be because both models, having undergone APFT, converged to similar feature representations, reducing the complementarity of features that effective fusion relies on.

However, concatenation fusion with PEFT (No. 2, 4 & 8 and No. 2, 6 & 10) produced better performance than the individual PEFT models, which indicates that speech and text models with PEFT can extract complementary information that improves classification. This may result from PEFT altering the models’ pre-trained attributes only conservatively, owing to the limited number of trainable parameters involved. With the Baichuan2 model, however, although the PEFT fusion improved over the individual PEFT models, its accuracy is still about 5 percentage points lower than that of its APFT counterpart. This indicates that, in some cases, APFT can achieve more in-depth model optimisation and thereby higher performance, even though this may sacrifice the complementarity of features. It also shows that selecting an appropriate training strategy is important for different pre-trained models and downstream tasks.

Regarding the in-context fusion results, the comparison between Whisper features with APFT and PEFT (No. 12 & 13) suggests that APFT enhances the ability of the Whisper model to learn task-specific information. In addition, finetuning the LLM with speech features as input may prevent the LLM from optimising towards the same feature space as the speech features. However, the worse performance of in-context fusion compared with concatenation fusion indicates that training LLMs simply with finetuned speech embeddings may not be sufficient, and implies a need to explore more powerful fusion strategies.

Table 4: The ensemble result of our best systems. The best accuracy and F1-score of the single and ensemble systems are listed.
System No. 7 10 12 Ensemble
Best Accuracy 0.790 0.772 0.781 0.807
Best F1-score 0.832 0.820 0.815 0.846

Finally, we built an ensemble of models by voting. The best accuracy and F1-score are reported in Table 4. The index numbers refer to the system numbers in Table 3. The ensemble produced the best accuracy of 0.807 and the best F1-score of 0.846.
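A minimal sketch of ensembling by voting is shown below, assuming hard majority voting over the binary predictions of three systems and treating the at-risk class as positive when computing the F1-score. The voting details are not given in the text, and the predictions and labels are toy values for illustration only.

```python
def majority_vote(predictions):
    """predictions: one list of 0/1 labels per system; returns the voted labels."""
    n_systems = len(predictions)
    return [1 if 2 * sum(votes) > n_systems else 0 for votes in zip(*predictions)]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy predictions from three systems on four subjects (illustration only).
system_a = [1, 0, 1, 1]
system_b = [1, 1, 0, 1]
system_c = [0, 0, 1, 1]
labels = [1, 0, 0, 1]

voted = majority_vote([system_a, system_b, system_c])
print(voted, accuracy(labels, voted), f1_score(labels, voted))  # [1, 0, 1, 1] 0.75 0.8
```

With an odd number of systems and binary labels, ties cannot occur, so no tie-breaking rule is needed.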

Table 5: Specific prompt settings used in the analysis of prompt influence. The original prompts were in Chinese and are translated to English in the table.
No-prompt   Only original transcription as input.
Prompt 1    Determine suicide risk.
Prompt 2    You are a very good psychologist, the above is an interview about suicide risk, please determine whether there is a risk of suicide.

5.2 Analysis of Prompt Influence when Finetuning

Previous studies have demonstrated the efficacy of prompt engineering for the performance of LLMs [34, 35]. By following complex instructions, LLMs have achieved remarkable few-shot and zero-shot generalisation. When finetuning LLMs, we assume that appropriate prompts can facilitate better optimisation towards task-specific objectives.

We examined the impact of prompts on the performance of LLMs for the suicide risk detection task. The Qwen1.5-7B model and the Qwen1.5-7B-Chat model, which has undergone instruction tuning, were compared. Three prompt settings were compared, as shown in Table 5. The first uses only the original transcription as input. The second gives the LLM a brief overview of the task, referred to as “Prompt 1” in the table. The third is a detailed prompt that provides more information about the role of the LLM and a description of the task, referred to as “Prompt 2” in the table. Other settings are the same as No. 12 in Table 3.
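The three settings can be sketched as a small input-building helper. The placement of the prompt after the transcription is an assumption suggested by the wording of Prompt 2 (“the above is an interview”), the newline separator and the names `PROMPTS` and `build_input` are hypothetical, and the English strings stand in for the original Chinese prompts.

```python
PROMPTS = {
    "no_prompt": "",
    "prompt_1": "Determine suicide risk.",
    "prompt_2": (
        "You are a very good psychologist, the above is an interview "
        "about suicide risk, please determine whether there is a risk of suicide."
    ),
}


def build_input(transcription, setting):
    """Return the text fed to the LLM for a given prompt setting."""
    prompt = PROMPTS[setting]
    if not prompt:
        return transcription
    return f"{transcription}\n{prompt}"  # assumed: prompt appended after the text


example = "I would describe myself as a quiet person."  # hypothetical transcription
for setting in PROMPTS:
    print(repr(build_input(example, setting)))
```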

Table 6: The impact of different prompt settings with LoRA-finetuned Qwen1.5-7B and Qwen1.5-7B-Chat.
Model             Prompt      Accuracy
Qwen1.5-7B        No-prompt   0.703 ± 0.046
Qwen1.5-7B        Prompt 1    0.743 ± 0.026
Qwen1.5-7B        Prompt 2    0.718 ± 0.044
Qwen1.5-7B-Chat   No-prompt   0.706 ± 0.029
Qwen1.5-7B-Chat   Prompt 1    0.727 ± 0.027
Qwen1.5-7B-Chat   Prompt 2    0.703 ± 0.024

The results are reported in Table 6. For both models, Prompt 1 outperformed No-prompt. This may be because Prompt 1 gives the model brief information about the downstream task, enabling it to focus on task-relevant aspects and enhancing its optimisation towards the specific task. Interestingly, Prompt 1 also outperformed Prompt 2, which provides more detailed information. One possible explanation is that detailed descriptions are essential for guiding an LLM that has not been finetuned, given its general-purpose nature. In contrast, finetuning adapts the model to the specific task, so detailed prompts become less useful: they reduce the information density of the input and may thus impair model performance. Moreover, comparing the Base and Chat versions, the Chat model exhibits a reduction of about 1.5% in average accuracy. This indicates that model alignment can lead to a performance reduction when finetuning an LLM for a downstream task, even with detailed prompts.

In conclusion, this analysis suggests that when finetuning LLMs for downstream tasks, a brief prompt about the task is beneficial and sufficient. Additionally, compared to aligned models, base models perform better when finetuning is needed.

6 Conclusions

This paper proposes a framework for spontaneous speech-based suicide risk detection using speech and language foundation models. Both the Whisper model and LLMs were studied and different finetuning strategies and audio-text modality fusion methods were evaluated. The proposed system achieves an accuracy of 0.807 and an F1-score of 0.846, showing promising potential for real application of early suicide risk detection.

7 Acknowledgement

We appreciate the efforts of Linshanjie Da, Ya**g Sun, and Zeming Zhang in organizing volunteers for offline voice data collection. We are grateful to Yuhao He for his contributions to the design of the voice handbook, and to Juan Wang for her suggestions on voice handbook design and her support during the offline interview.

References

  • [1] World Health Organization, “Suicide worldwide in 2019: Global health estimates,” 2021.
  • [2] B. Shain, P. K. Braverman, W. P. Adelman, E. M. Alderman, C. C. Breuner, D. A. Levine, A. V. Marcell, R. F. O’Brien et al., “Suicide and suicide attempts in adolescents,” Pediatrics, vol. 138, no. 1, p. e20161420, 2016.
  • [3] M. Orri, S. Scardera, L. C. Perret, D. Bolanis, C. Temcheff, J. R. Séguin, M. Boivin, G. Turecki, R. E. Tremblay, S. M. Côté et al., “Mental health problems and risk of suicidal ideation and attempts in adolescents,” Pediatrics, vol. 146, no. 1, p. e20193823, 2020.
  • [4] N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, and T. F. Quatieri, “A review of depression and suicide risk assessment using speech analysis,” Speech Communication, vol. 71, pp. 10–49, 2015.
  • [5] S. Dhelim, L. Chen, H. Ning, and C. Nugent, “Artificial intelligence for suicide assessment using audiovisual cues: A review,” Artificial Intelligence Review, vol. 56, no. 6, pp. 5591–5618, 2023.
  • [6] J. Oh, K. Yun, J.-H. Hwang, and J.-H. Chae, “Classification of suicide attempts through a machine learning algorithm based on multiple systemic psychiatric scales,” Frontiers in Psychiatry, vol. 8, p. 192, 2017.
  • [7] C. Su, R. Aseltine, R. Doshi, K. Chen, S. C. Rogers, and F. Wang, “Machine learning for suicide risk prediction in children and adolescents with electronic health records,” Translational Psychiatry, vol. 10, no. 1, p. 413, 2020.
  • [8] T. Zhang, A. M. Schoene, and S. Ananiadou, “Automatic identification of suicide notes with a Transformer-based deep learning model,” Internet Interventions, vol. 25, p. 100422, 2021.
  • [9] S. Ghosh, A. Ekbal, and P. Bhattacharyya, “A multitask framework to detect depression, sentiment and multi-label emotion from suicide notes,” Cognitive Computation, pp. 1–20, 2022.
  • [10] A. Roy, K. Nikolitch, R. McGinn, S. Jinah, W. Klement, and Z. A. Kaminsky, “A machine learning approach predicts future risk to suicidal ideation from social media data,” NPJ Digital Medicine, vol. 3, no. 1, p. 78, 2020.
  • [11] Q. Cheng, T. M. Li, C.-L. Kwok, T. Zhu, and P. S. Yip, “Assessing suicide risk and emotional distress in Chinese social media: A text mining and machine learning study,” Journal of Medical Internet Research, vol. 19, no. 7, p. e243, 2017.
  • [12] S. Scherer, J. Pestian, and L.-P. Morency, “Investigating the speech characteristics of suicidal adolescents,” in Proc. ICASSP, Vancouver, 2013.
  • [13] B. Stasak, J. Epps, H. T. Schatten, I. W. Miller, E. M. Provost, and M. F. Armey, “Read speech voice quality and disfluency in individuals with recent suicidal ideation or suicide attempt,” Speech Communication, vol. 132, pp. 10–20, 2021.
  • [14] A. Belouali, S. Gupta, V. Sourirajan, J. Yu, N. Allen, A. Alaoui, M. A. Dutton, and M. J. Reinhard, “Acoustic and language analysis of speech for suicidal ideation among US veterans,” BioData Mining, vol. 14, no. 1, pp. 1–17, 2021.
  • [15] X. Chang, T. Maekaku, P. Guo, J. Shi, Y.-J. Lu, A. Subramanian, T. Wang, S.-w. Yang et al., “An exploration of self-supervised pretrained representations for end-to-end speech recognition,” in Proc. ASRU, Cartagena, 2021.
  • [16] E. Morais, R. Hoory, W. Zhu, I. Gat, M. Damasceno, and H. Aronowitz, “Speech emotion recognition using self-supervised features,” in Proc. ICASSP, Singapore, 2022.
  • [17] W. Wu, C. Zhang, and P. Woodland, “Self-supervised representations in speech-based depression detection,” in Proc. ICASSP, Rhodes Island, 2023.
  • [18] Z. Cui, W. Wu, W.-Q. Zhang, J. Wu, and C. Zhang, “Transferring speech-generic and depression-specific knowledge for Alzheimer’s disease detection,” in Proc. ASRU, Taipei, 2023.
  • [19] H. Meng, B. Mak, M.-W. Mak, H. Fung, X. Gong, T. Kwok, X. Liu, V. Mok, P. Wong, J. Woo et al., “Integrated and enhanced pipeline system to support spoken language analytics for screening neurocognitive disorders,” in Proc. Interspeech, Dublin, 2023.
  • [20] W. Fan, Z. Zhao, J. Li, Y. Liu, X. Mei, Y. Wang, J. Tang, and Q. Li, “Recommender systems in the era of large language models (LLMs),” arXiv preprint arXiv:2307.02046, 2023.
  • [21] A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, and D. S. W. Ting, “Large language models in medicine,” Nature Medicine, vol. 29, no. 8, pp. 1930–1940, 2023.
  • [22] M. Sadeghi, B. Egger, R. Agahi, R. Richer, K. Capito, L. H. Rupp, L. Schindler-Gmelch, M. Berking, and B. M. Eskofier, “Exploring the capabilities of a language model-only approach for depression detection in text data,” in Proc. BHI, Pittsburgh, 2023.
  • [23] E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “LoRA: Low-rank adaptation of large language models,” in Proc. ICLR, 2021.
  • [24] D. V. Sheehan, K. H. Sheehan, R. D. Shytle, J. Janavs, Y. Bannon, J. E. Rogers, K. M. Milo, S. L. Stock, and B. Wilkinson, “Reliability and validity of the mini international neuropsychiatric interview for children and adolescents (MINI-KID),” The Journal of Clinical Psychiatry, vol. 71, no. 3, p. 17393, 2010.
  • [25] Z. Gao, Z. Li, J. Wang, H. Luo, X. Shi, M. Chen, Y. Li, L. Zuo, Z. Du, Z. Xiao, and S. Zhang, “FunASR: A fundamental end-to-end speech recognition toolkit,” in Proc. Interspeech, Dublin, 2023.
  • [26] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, Hawaii, 2023.
  • [27] Baichuan, “Baichuan 2: Open large-scale language models,” arXiv preprint arXiv:2309.10305, 2023.
  • [28] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
  • [29] F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan et al., “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2015.
  • [30] F. Eyben, M. Wöllmer, and B. Schuller, “openSMILE: The Munich versatile and fast open-source audio feature extractor,” in Proc. MM, Firenze, 2010.
  • [31] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in Proc. Interspeech, Brno, 2021.
  • [32] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional Transformers for language understanding,” in Proc. NAACL-HLT, Minneapolis, 2019.
  • [33] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan, “PEFT: State-of-the-art parameter-efficient fine-tuning methods,” https://github.com/huggingface/peft, 2022.
  • [34] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
  • [35] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan, F. Huang, and H. Chen, “Reasoning with language model prompting: A survey,” in Proc. ACL, Toronto, 2023.