Showing 1–25 of 25 results for author: Leong, C

  1. arXiv:2406.13960  [pdf, other]

    cs.CL cs.AI

    Evolving to be Your Soulmate: Personalized Dialogue Agents with Dynamically Adapted Personas

    Authors: Yi Cheng, Wenge Liu, Kaishuai Xu, Wenjun Hou, Yi Ouyang, Chak Tou Leong, Xian Wu, Yefeng Zheng

    Abstract: Previous research on persona-based dialogue agents typically preset the agent's persona before deployment, which remains static thereafter. In this paper, we take a step further and explore a new paradigm called Self-evolving Personalized Dialogue Agents (SPDA), where the agent continuously evolves during the conversation to better align with the user's anticipation by dynamically adapting its per…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Work in progress

  2. arXiv:2405.16229  [pdf, other]

    cs.CL cs.CR

    No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks

    Authors: Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, Wenjie Li

    Abstract: The existing safety alignment of Large Language Models (LLMs) is found fragile and could be easily attacked through different strategies, such as through fine-tuning on a few harmful examples or manipulating the prefix of the generation results. However, the attack mechanisms of these strategies are still underexplored. In this paper, we ask the following question: while these approaches c…

    Submitted 25 May, 2024; originally announced May 2024.

    Comments: work in progress

  3. arXiv:2403.01811  [pdf, other]

    cs.CL

    Enhancing Multi-Domain Automatic Short Answer Grading through an Explainable Neuro-Symbolic Pipeline

    Authors: Felix Künnecke, Anna Filighera, Colin Leong, Tim Steuer

    Abstract: Grading short answer questions automatically with interpretable reasoning behind the grading decision is a challenging goal for current transformer approaches. Justification cue detection, in combination with logical reasoners, has shown a promising direction for neuro-symbolic architectures in ASAG. But, one of the main challenges is the requirement of annotated justification cues in the students…

    Submitted 19 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  4. arXiv:2402.06967  [pdf, other]

    cs.CL cs.AI

    Instruct Once, Chat Consistently in Multiple Rounds: An Efficient Tuning Framework for Dialogue

    Authors: Jian Wang, Chak Tou Leong, Jiashuo Wang, Dongding Lin, Wenjie Li, Xiao-Yong Wei

    Abstract: Tuning language models for dialogue generation has been a prevalent paradigm for building capable dialogue agents. Yet, traditional tuning narrowly views dialogue generation as resembling other language generation tasks, ignoring the role disparities between two speakers and the multi-round interactive process that dialogues ought to be. Such a manner often leads to unsatisfactory chat consistency…

    Submitted 30 May, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

    Comments: Accepted by ACL 2024

  5. arXiv:2401.05928  [pdf, other]

    cs.CL

    Mitigating Unhelpfulness in Emotional Support Conversations with Multifaceted AI Feedback

    Authors: Jiashuo Wang, Chunpu Xu, Chak Tou Leong, Wenjie Li, Jing Li

    Abstract: An emotional support conversation system aims to alleviate users' emotional distress and assist them in addressing their challenges. To generate supportive responses, it is critical to consider multiple factors such as empathy, support strategies, and response coherence, as established in prior methods. Nonetheless, previous models occasionally generate unhelpful responses, which intend to provide…

    Submitted 17 June, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: ACL 2024 Findings

  6. arXiv:2312.11792  [pdf, other]

    cs.CL

    COOPER: Coordinating Specialized Agents towards a Complex Dialogue Goal

    Authors: Yi Cheng, Wenge Liu, Jian Wang, Chak Tou Leong, Yi Ouyang, Wenjie Li, Xian Wu, Yefeng Zheng

    Abstract: In recent years, there has been a growing interest in exploring dialogues with more complex goals, such as negotiation, persuasion, and emotional support, which go beyond traditional service-focused dialogue systems. Apart from the requirement for much more sophisticated strategic reasoning and communication skills, a significant challenge of these tasks lies in the difficulty of objectively measu…

    Submitted 18 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI 2024

  7. arXiv:2311.12488  [pdf, other]

    eess.AS cs.SD

    Adapting pretrained speech model for Mandarin lyrics transcription and alignment

    Authors: Jun-You Wang, Chon-In Leong, Yu-Chen Lin, Li Su, Jyh-Shing Roger Jang

    Abstract: The tasks of automatic lyrics transcription and lyrics alignment have witnessed significant performance improvements in the past few years. However, most of the previous works only focus on English in which large-scale datasets are available. In this paper, we address lyrics transcription and alignment of polyphonic Mandarin pop music in a low-resource setting. To deal with the data scarcity issue…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: Accepted by ASRU 2023

  8. arXiv:2311.10174  [pdf, other]

    cs.CL

    JWSign: A Highly Multilingual Corpus of Bible Translations for more Diversity in Sign Language Processing

    Authors: Shester Gueuwou, Sophie Siake, Colin Leong, Mathias Müller

    Abstract: Advancements in sign language processing have been hindered by a lack of sufficient data, impeding progress in recognition, translation, and production tasks. The absence of comprehensive sign language datasets across the world's sign languages has widened the gap in this field, resulting in a few sign languages being studied more than others, making this research area extremely skewed mostly towa…

    Submitted 16 November, 2023; originally announced November 2023.

    Comments: EMNLP 2023 (Findings)

  9. arXiv:2310.09618  [pdf]

    cs.CL

    Moral consensus and divergence in partisan language use

    Authors: Nakwon Rim, Marc G. Berman, Yuan Chang Leong

    Abstract: Polarization has increased substantially in political discourse, contributing to a widening partisan divide. In this paper, we analyzed large-scale, real-world language use in Reddit communities (294,476,146 comments) and in news outlets (6,749,781 articles) to uncover psychological dimensions along which partisan language is divided. Using word embedding models that captured semantic associations…

    Submitted 14 October, 2023; originally announced October 2023.

    Comments: 43 pages, 14 figures

  10. arXiv:2310.09573  [pdf, other]

    cs.CL

    Self-Detoxifying Language Models via Toxification Reversal

    Authors: Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, Wenjie Li

    Abstract: Language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (PLMs) for safer deployment. Existing methods can be roughly categorized as finetuning-based and decoding-based. However, the former is often resource-intensive, while the latter relies on additional components and potentially compromises the generation fluency. In this…

    Submitted 14 October, 2023; originally announced October 2023.

    Comments: Accepted by EMNLP 2023 main conference

  11. arXiv:2310.07397  [pdf, other]

    cs.CL cs.AI

    Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation

    Authors: Jian Wang, Yi Cheng, Dongding Lin, Chak Tou Leong, Wenjie Li

    Abstract: Target-oriented dialogue systems, designed to proactively steer conversations toward predefined targets or accomplish specific system-side goals, are an exciting area in conversational AI. In this work, by formulating a <dialogue act, topic> pair as the conversation target, we explore a novel problem of personalized target-oriented dialogue by considering personalization during the target accompli…

    Submitted 13 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP-2023 main conference

  12. Language Models Can Learn Exceptions to Syntactic Rules

    Authors: Cara Su-Yi Leong, Tal Linzen

    Abstract: Artificial neural networks can generalize productively to novel contexts. Can they also learn exceptions to those productive rules? We explore this question using the case of restrictions on English passivization (e.g., the fact that "The vacation lasted five days" is grammatical, but "*Five days was lasted by the vacation" is not). We collect human acceptability judgments for passive sentences wi…

    Submitted 9 June, 2023; originally announced June 2023.

    Comments: Accepted to SCiL 2023

  13. arXiv:2304.09919  [pdf, other]

    cs.CL cs.AI

    The eBible Corpus: Data and Model Benchmarks for Bible Translation for Low-Resource Languages

    Authors: Vesa Akerman, David Baines, Damien Daspit, Ulf Hermjakob, Taeho Jang, Colin Leong, Michael Martin, Joel Mathew, Jonathan Robie, Marcus Schwarting

    Abstract: Efficiently and accurately translating a corpus into a low-resource language remains a challenge, regardless of the strategies employed, whether manual, automated, or a combination of the two. Many Christian organizations are dedicated to the task of translating the Holy Bible into languages that lack a modern translation. Bible translation (BT) work is currently underway for over 3000 extremely l…

    Submitted 19 April, 2023; originally announced April 2023.

  14. arXiv:2303.16985  [pdf, other]

    cs.CL cs.AI

    Adapting to the Low-Resource Double-Bind: Investigating Low-Compute Methods on Low-Resource African Languages

    Authors: Colin Leong, Herumb Shandilya, Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Joel Mathew, Abdul-Hakeem Omotayo, Oreen Yousuf, Zainab Akinjobi, Chris Chinenye Emezue, Shamsudeen Muhammad, Steven Kolawole, Younwoo Choi, Tosin Adewumi

    Abstract: Many natural language processing (NLP) tasks make use of massively pre-trained language models, which are computationally expensive. However, access to high computational resources added to the issue of data scarcity of African languages constitutes a real barrier to research experiments on these languages. In this work, we explore the applicability of low-compute approaches such as language adapt…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: Accepted to AfricaNLP workshop at ICLR2023

  15. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  16. arXiv:2210.14712  [pdf, other]

    cs.CL cs.AI

    Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks

    Authors: Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham Owodunni, Daniel Whitenack

    Abstract: We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages ac…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: 14 pages, 1 figure, 3 tables, accepted to and presented at EMNLP 2022

    Journal ref: EMNLP 2022

  17. arXiv:2208.01897  [pdf, other]

    cs.CV

    Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition

    Authors: Mei Chee Leong, Haosong Zhang, Hui Li Tan, Liyuan Li, Joo Hwee Lim

    Abstract: Fine-grained action recognition is a challenging task in computer vision. As fine-grained datasets have small inter-class variations in spatial and temporal space, fine-grained action recognition model requires good temporal reasoning and discrimination of attribute action semantics. Leveraging on CNN's ability in capturing high level spatial-temporal feature representations and Transformer's mode…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: The Ninth Workshop on Fine-Grained Visual Categorization (FGVC9) @ CVPR2022

  18. arXiv:2207.03546  [pdf, other]

    eess.AS cs.CL cs.SD

    BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus

    Authors: Josh Meyer, David Ifeoluwa Adelani, Edresson Casanova, Alp Öktem, Daniel Whitenack, Julian Weber, Salomon Kabongo, Elizabeth Salesky, Iroro Orife, Colin Leong, Perez Ogayo, Chris Emezue, Jonathan Mukiibi, Salomey Osei, Apelete Agbolo, Victor Akinode, Bernard Opoku, Samuel Olanrewaju, Jesujoba Alabi, Shamsuddeen Muhammad

    Abstract: BibleTTS is a large, high-quality, open speech dataset for ten languages spoken in Sub-Saharan Africa. The corpus contains up to 86 hours of aligned, studio quality 48kHz single speaker recordings per language, enabling the development of high-quality text-to-speech models. The ten languages represented are: Akuapem Twi, Asante Twi, Chichewa, Ewe, Hausa, Kikuyu, Lingala, Luganda, Luo, and Yoruba.…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: Accepted to INTERSPEECH 2022

  19. arXiv:2205.02022  [pdf, other]

    cs.CL

    A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

    Authors: David Ifeoluwa Adelani, Jesujoba Oluwadara Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Chinenye Emezue, Colin Leong, Michael Beukman, Shamsuddeen Hassan Muhammad, Guyo Dub Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ayoade Ajibade, Tunde Oluwaseyi Ajayi, et al. (20 additional authors not shown)

    Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models…

    Submitted 22 August, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

    Comments: Accepted to NAACL 2022 (added evaluation data for amh, kin, nya, sna, xho)

  20. arXiv:2201.10066  [pdf, other]

    cs.CL cs.DB

    Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

    Authors: Angelina McMillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco De Toni, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Zeerak Talat, Daniel van Strien, Yacine Jernite

    Abstract: In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficie…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: 8 pages plus appendix and references

  21. Joint Learning On The Hierarchy Representation for Fine-Grained Human Action Recognition

    Authors: Mei Chee Leong, Hui Li Tan, Haosong Zhang, Liyuan Li, Feng Lin, Joo Hwee Lim

    Abstract: Fine-grained human action recognition is a core research topic in computer vision. Inspired by the recently proposed hierarchy representation of fine-grained actions in FineGym and SlowFast network for action recognition, we propose a novel multi-task network which exploits the FineGym hierarchy representation to achieve effective joint learning and prediction for fine-grained human action recogni…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: Camera ready for IEEE ICIP 2021

    Journal ref: 2021 IEEE International Conference on Image Processing (ICIP)

  22. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    Authors: Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, et al. (27 additional authors not shown)

    Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system…

    Submitted 21 February, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted at TACL; pre-MIT Press publication version

    Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 50-72

  23. arXiv:1911.13248  [pdf, other]

    cs.HC cs.LG

    To Trust, or Not to Trust? A Study of Human Bias in Automated Video Interview Assessments

    Authors: Chee Wee Leong, Katrina Roohr, Vikram Ramanarayanan, Michelle P. Martin-Raugh, Harrison Kell, Rutuja Ubale, Yao Qian, Zydrune Mladineo, Laura McCulla

    Abstract: Supervised systems require human labels for training. But, are humans themselves always impartial during the annotation process? We examine this question in the context of automated assessment of human behavioral tasks. Specifically, we investigate whether human ratings themselves can be trusted at their face value when scoring video-based structured interviews, and whether such ratings can impact…

    Submitted 27 November, 2019; originally announced November 2019.

    Comments: ICCV Workshop on Interpreting and Explaining Visual Artificial Intelligence Models, Seoul, South Korea, 2019

    ACM Class: I.2.0; H.1.2

  24. arXiv:1710.03394  [pdf]

    cs.SE eess.SY

    Incorporating Epistemic Uncertainty into the Safety Assurance of Socio-Technical Systems

    Authors: Chris Leong, Tim Kelly, Rob Alexander

    Abstract: In system development, epistemic uncertainty is an ever-present possibility when reasoning about the causal factors during hazard analysis. Such uncertainty is common when complicated systems interact with one another, and it is dangerous because it impairs hazard analysis and thus increases the chance of overlooking unsafe situations. Uncertainty around causation thus needs to be managed well. Un…

    Submitted 9 October, 2017; originally announced October 2017.

    Comments: In Proceedings CREST 2017, arXiv:1710.02770

    Journal ref: EPTCS 259, 2017, pp. 56-71

  25. Predicting proximity with ambient mobile sensors for non-invasive health diagnostics

    Authors: Sylvester Olubolu Orimaye, Foo Chuan Leong, Chen Hui Lee, Eddy Cheng Han Ng

    Abstract: Modern smart phones are becoming helpful in the areas of Internet-Of-Things (IoT) and ambient health intelligence. By learning data from several mobile sensors, we detect nearness of the human body to a mobile device in a three-dimensional space with no physical contact with the device for non-invasive health diagnostics. We show that the human body generates wave patterns that interact with other…

    Submitted 9 December, 2015; originally announced December 2015.

    Comments: Accepted and presented at the 12th IEEE Malaysia International Conference on Communications, 23-25 November, 2015, Kuching, Sarawak, Malaysia