
Showing 1–50 of 216 results for author: Hsu, W

  1. arXiv:2406.13578  [pdf, other]

    cs.CL

    Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration

    Authors: Han-Cheng Yu, Yu-An Shih, Kin-Man Law, Kai-Yu Hsieh, Yu-Chen Cheng, Hsin-Chih Ho, Zih-An Lin, Wen-Chuan Hsu, Yao-Chung Fan

    Abstract: In this paper, we tackle the task of distractor generation (DG) for multiple-choice questions. Our study introduces two key designs. First, we propose retrieval augmented pretraining, which involves refining the language model pretraining to align it more closely with the downstream task of DG. Second, we explore the integration of knowledge graphs to enhance the performance of DG. Throug…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Findings at ACL 2024

  2. arXiv:2406.10923  [pdf, other]

    cs.CV cs.CL cs.LG

    Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

    Authors: Hung-Ting Su, Chun-Tong Chao, Ya-Ching Hsu, Xudong Lin, Yulei Niu, Hung-Yi Lee, Winston H. Hsu

    Abstract: Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reaso…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: Project page: https://ander1119.github.io/TiM

  3. arXiv:2406.09272  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

    Authors: Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman

    Abstract: Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinat…

    Submitted 20 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Project page: https://vision.cs.utexas.edu/projects/action2sound

  4. arXiv:2406.07777  [pdf, other]

    cs.LG

    Unifying Interpretability and Explainability for Alzheimer's Disease Progression Prediction

    Authors: Raja Farrukh Ali, Stephanie Milani, John Woods, Emmanuel Adenij, Ayesha Farooq, Clayton Mansel, Jeffrey Burns, William Hsu

    Abstract: Reinforcement learning (RL) has recently shown promise in predicting Alzheimer's disease (AD) progression due to its unique ability to model domain knowledge. However, it is not clear which RL algorithms are well-suited for this task. Furthermore, these methods are not inherently explainable, limiting their applicability in real-world clinical scenarios. Our work addresses these two important ques…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Previous versions accepted to NeurIPS 2023's XAIA and AAAI 2024's XAI4DRL workshops

  5. arXiv:2406.06251  [pdf, other]

    eess.AS cs.CL

    Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

    Authors: Chung-Ming Chien, Andros Tjandra, Apoorv Vyas, Matt Le, Bowen Shi, Wei-Ning Hsu

    Abstract: As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained one…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024

  6. arXiv:2406.04377  [pdf, other]

    eess.IV cs.LG

    Combining Graph Neural Network and Mamba to Capture Local and Global Tissue Spatial Relationships in Whole Slide Images

    Authors: Ruiwen Ding, Kha-Dinh Luong, Erika Rodriguez, Ana Cristina Araujo Lemos da Silva, William Hsu

    Abstract: In computational pathology, extracting spatial features from gigapixel whole slide images (WSIs) is a fundamental task, but due to their large size, WSIs are typically segmented into smaller tiles. A critical aspect of this analysis is aggregating information from these tiles to make predictions at the WSI level. We introduce a model that combines a message-passing graph neural network (GNN) with…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  7. arXiv:2406.00761  [pdf, other]

    cs.LG cs.AI

    Shared-unique Features and Task-aware Prioritized Sampling on Multi-task Reinforcement Learning

    Authors: Po-Shao Lin, Jia-Fong Yeh, Yi-Ting Chen, Winston H. Hsu

    Abstract: We observe that current state-of-the-art (SOTA) methods suffer from the performance imbalance issue when performing multi-task reinforcement learning (MTRL) tasks. While these methods may achieve impressive performance on average, they perform extremely poorly on a few tasks. To address this, we propose a new and effective method called STARS, which consists of two novel strategies: a shared-uniqu…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: The first two authors contribute equally

  8. arXiv:2405.18357  [pdf, other]

    cs.CL

    Faithful Logical Reasoning via Symbolic Chain-of-Thought

    Authors: Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, Wynne Hsu

    Abstract: While the recent Chain-of-Thought (CoT) technique enhances the reasoning ability of large language models (LLMs) with the theory of mind, it might still struggle in handling logical reasoning that relies much on symbolic expressions and rigid deducing rules. To strengthen the logical reasoning capability of LLMs, we propose a novel Symbolic Chain-of-Thought, namely SymbCoT, a fully LLM-based frame…

    Submitted 11 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Accepted by ACL 2024 (main proceeding)

  9. arXiv:2405.17507  [pdf, other]

    cs.LG cs.AI cs.NI

    Enhancing Sustainable Urban Mobility Prediction with Telecom Data: A Spatio-Temporal Framework Approach

    Authors: ChungYi Lin, Shen-Lung Tung, Hung-Ting Su, Winston H. Hsu

    Abstract: Traditional traffic prediction, limited by the scope of sensor data, falls short in comprehensive traffic management. Mobile networks offer a promising alternative using network activity counts, but these lack crucial directionality. Thus, we present the TeltoMob dataset, featuring undirected telecom counts and corresponding directional flows, to predict directional mobility flows on roadways. To…

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: 8 Figures, 5 Tables. Just accepted by IJCAI (to appear)

  10. arXiv:2405.16545  [pdf, other]

    cs.RO

    VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation

    Authors: Kuo-Han Hung, Pang-Chi Lo, Jia-Fong Yeh, Han-Yuan Hsu, Yi-Ting Chen, Winston H. Hsu

    Abstract: We study reward models for long-horizon manipulation tasks by learning from action-free videos and language instructions, which we term the visual-instruction correlation (VIC) problem. Recent advancements in cross-modality modeling have highlighted the potential of reward modeling through visual and language correlations. However, existing VIC methods face challenges in learning rewards for long-…

    Submitted 26 May, 2024; originally announced May 2024.

  11. arXiv:2405.13237  [pdf]

    eess.IV cs.CV

    Spatial Matching of 2D Mammography Images and Specimen Radiographs: Towards Improved Characterization of Suspicious Microcalcifications

    Authors: Noor Nakhaei, Chrysostomos Marasinou, Akinyinka Omigbodun, Nina Capiro, Bo Li, Anne Hoyt, William Hsu

    Abstract: Accurate characterization of suspicious microcalcifications is critical to determine whether these calcifications are associated with invasive disease. Our overarching objective is to enable the joint characterization of microcalcifications and surrounding breast tissue using mammography images and digital histopathology images. Towards this goal, we investigate a template matching-based approach…

    Submitted 21 May, 2024; originally announced May 2024.

    Journal ref: Medical Imaging 2021: Computer-Aided Diagnosis (Vol. 11597, pp. 511-516). SPIE

  12. arXiv:2405.11478  [pdf, other]

    cs.CV eess.IV

    Unsupervised Image Prior via Prompt Learning and CLIP Semantic Guidance for Low-Light Image Enhancement

    Authors: Igor Morawski, Kai He, Shusil Dangi, Winston H. Hsu

    Abstract: Currently, low-light conditions present a significant challenge for machine cognition. In this paper, rather than optimizing models by assuming that human and machine cognition are correlated, we use zero-reference low-light enhancement to improve the performance of downstream task models. We propose to improve the zero-reference low-light enhancement method by leveraging the rich visual-linguisti…

    Submitted 19 May, 2024; originally announced May 2024.

    Comments: Accepted to CVPR 2024 Workshop NTIRE: New Trends in Image Restoration and Enhancement workshop and Challenges

  13. arXiv:2405.08586  [pdf, other]

    cs.CV

    Cross-Domain Feature Augmentation for Domain Generalization

    Authors: Yingnan Liu, Yingtian Zou, Rui Qiao, Fusheng Liu, Mong Li Lee, Wynne Hsu

    Abstract: Domain generalization aims to develop models that are robust to distribution shifts. Existing methods focus on learning invariance across domains to enhance model robustness, and data augmentation has been widely used to learn invariant predictors, with most methods performing augmentation in the input space. However, augmentation in the input space has limited diversity whereas in the feature spa…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: Accepted to the 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024); Code is available at https://github.com/NancyQuris/XDomainMix

  14. arXiv:2404.09956  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

    Authors: Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria

    Abstract: Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models…

    Submitted 16 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

    Comments: https://github.com/declare-lab/tango

  15. arXiv:2403.18330  [pdf, other]

    cs.CV cs.LG

    Tracking-Assisted Object Detection with Event Cameras

    Authors: Ting-Kang Yen, Igor Morawski, Shusil Dangi, Kai He, Chung-Yi Lin, Jia-Fong Yeh, Hung-Ting Su, Winston Hsu

    Abstract: Event-based object detection has recently garnered attention in the computer vision community due to the exceptional properties of event cameras, such as high dynamic range and no motion blur. However, feature asynchronism and sparsity cause invisible objects due to no relative motion to the camera, posing a significant challenge in the task. Prior works have studied various memory mechanisms to p…

    Submitted 27 March, 2024; originally announced March 2024.

  16. arXiv:2403.14402  [pdf, other]

    cs.SD cs.CL eess.AS

    XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

    Authors: HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

    Abstract: Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-v…

    Submitted 21 March, 2024; originally announced March 2024.

  17. arXiv:2403.12991  [pdf, other]

    cs.CV cs.LG

    Tel2Veh: Fusion of Telecom Data and Vehicle Flow to Predict Camera-Free Traffic via a Spatio-Temporal Framework

    Authors: ChungYi Lin, Shen-Lung Tung, Hung-Ting Su, Winston H. Hsu

    Abstract: Vehicle flow, a crucial indicator for transportation, is often limited by detector coverage. With the advent of extensive mobile network coverage, we can leverage mobile user activities, or cellular traffic, on roadways as a proxy for vehicle flow. However, as counts of cellular traffic may not directly align with vehicle flow due to data from various user types, we present a new task: predicting…

    Submitted 5 March, 2024; originally announced March 2024.

    Comments: 4 pages, 5 figures, 4 tables. Accepted by WWW'24, to appear

  18. arXiv:2403.06392  [pdf, other]

    cs.LG

    Towards Robust Out-of-Distribution Generalization Bounds via Sharpness

    Authors: Yingtian Zou, Kenji Kawaguchi, Yingnan Liu, Jiashuo Liu, Mong-Li Lee, Wynne Hsu

    Abstract: Generalizing to out-of-distribution (OOD) data or unseen domain, termed OOD generalization, still lacks appropriate theoretical guarantees. Canonical OOD bounds focus on different distance measurements between source and target domains but fail to consider the optimization property of the learned model. As empirically shown in recent work, the sharpness of learned minima influences OOD generalizat…

    Submitted 10 March, 2024; originally announced March 2024.

    Comments: 40 pages, 9 figures, ICLR 2024 Spotlight Presentation

  19. arXiv:2403.03170  [pdf, other]

    cs.MM cs.AI cs.CL cs.CV cs.CY

    SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection

    Authors: Peng Qi, Zehong Yan, Wynne Hsu, Mong Li Lee

    Abstract: Misinformation is a prevalent societal issue due to its potential high risks. Out-of-context (OOC) misinformation, where authentic images are repurposed with false text, is one of the easiest and most effective ways to mislead audiences. Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments, which is essential for debunking misinformation. W…

    Submitted 5 March, 2024; originally announced March 2024.

    Comments: To appear in CVPR 2024

  20. arXiv:2402.03860  [pdf, other]

    cs.RO

    AED: Adaptable Error Detection for Few-shot Imitation Policy

    Authors: Jia-Fong Yeh, Kuo-Han Hung, Pang-Chi Lo, Chi-Ming Chung, Tsung-Han Wu, Hung-Ting Su, Yi-Ting Chen, Winston H. Hsu

    Abstract: We introduce a new task called Adaptable Error Detection (AED), which aims to identify behavior errors in few-shot imitation (FSI) policies based on visual observations in novel environments. The potential to cause serious damage to surrounding areas limits the application of FSI policies in real-world scenarios. Thus, a robust system is necessary to notify operators when FSI policies are inconsis…

    Submitted 25 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  21. arXiv:2401.07781  [pdf, other]

    cs.CV

    Towards A Better Metric for Text-to-Video Generation

    Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou

    Abstract: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However…

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Project page: https://showlab.github.io/T2VScore/

  22. arXiv:2401.03138  [pdf, other]

    cs.LG cs.AI

    TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling

    Authors: ChungYi Lin, Shen-Lung Tung, Hung-Ting Su, Winston H. Hsu

    Abstract: To address the limitations of traffic prediction from location-bound detectors, we present Geographical Cellular Traffic (GCT) flow, a novel data source that leverages the extensive coverage of cellular traffic to capture mobility patterns. Our extensive analysis validates its potential for transportation. Focusing on vehicle-related GCT flow prediction, we propose a graph neural network that inte…

    Submitted 6 January, 2024; originally announced January 2024.

    Comments: 7 pages, 7 figures, 4 tables. Accepted by AAAI-24-IAAI, to appear

  23. arXiv:2312.15821  [pdf, other]

    cs.SD cs.LG eess.AS

    Audiobox: Unified Audio Generation with Natural Language Prompts

    Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu

    Abstract: Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in sever…

    Submitted 25 December, 2023; originally announced December 2023.

  24. arXiv:2311.02772  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

    Authors: Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel

    Abstract: In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing convolutional modules with self-attention modules. They achieve state-of-the-art performance on ASR with top efficiency. We first show that employing these speech tr…

    Submitted 8 February, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

    Comments: 5 pages; accepted to Self-supervision in Audio, Speech and Beyond (SASB) workshop in ICASSP24

  25. arXiv:2311.02332  [pdf, other]

    cs.LG cs.CV

    Multimodal Machine Learning in Image-Based and Clinical Biomedicine: Survey and Prospects

    Authors: Elisa Warner, Joonsang Lee, William Hsu, Tanveer Syeda-Mahmood, Charles Kahn, Olivier Gevaert, Arvind Rao

    Abstract: Machine learning (ML) applications in medical artificial intelligence (AI) systems have shifted from traditional and statistical methods to increasing application of deep learning models. This survey navigates the current landscape of multimodal ML, focusing on its profound impact on medical image analysis and clinical decision support systems. Emphasizing challenges and innovations in addressing…

    Submitted 19 January, 2024; v1 submitted 4 November, 2023; originally announced November 2023.

  26. arXiv:2310.16338  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Generative Pre-training for Speech with Flow Matching

    Authors: Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

    Abstract: Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there…

    Submitted 25 March, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  27. arXiv:2310.13615  [pdf, other]

    cs.CL

    Three Questions Concerning the Use of Large Language Models to Facilitate Mathematics Learning

    Authors: An-Zi Yen, Wei-Ling Hsu

    Abstract: Due to the remarkable language understanding and generation abilities of large language models (LLMs), their use in educational applications has been explored. However, little work has been done on investigating the pedagogical ability of LLMs in helping students to learn mathematics. In this position paper, we discuss the challenges associated with employing LLMs to enhance students' mathematical…

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: Accepted by EMNLP 2023 Findings

  28. arXiv:2310.08715  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Toward Joint Language Modeling for Speech Units and Text

    Authors: Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli

    Abstract: Speech and text are two major forms of human language. The research community has been focusing on mapping speech to text or vice versa for many years. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers to transform co…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: EMNLP findings 2023

  29. arXiv:2310.03821  [pdf, other]

    cs.CV cs.RO

    WLST: Weak Labels Guided Self-training for Weakly-supervised Domain Adaptation on 3D Object Detection

    Authors: Tsung-Lin Tsou, Tsung-Han Wu, Winston H. Hsu

    Abstract: In the field of domain adaptation (DA) on 3D object detection, most of the work is dedicated to unsupervised domain adaptation (UDA). Yet, without any target annotations, the performance gap between the UDA approaches and the fully-supervised approach is still noticeable, which is impractical for real-world applications. On the other hand, weakly-supervised domain adaptation (WDA) is an underexplo…

    Submitted 7 February, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Comments: Accepted to ICRA 2024. Code is available at https://github.com/jacky121298/WLST

  30. arXiv:2309.17020  [pdf, other]

    eess.AS cs.SD

    Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

    Authors: Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed

    Abstract: Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TT…

    Submitted 4 June, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ASRU 2023 SPARKS Workshop

  31. arXiv:2308.05725  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

    Authors: Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux

    Abstract: Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthes…

    Submitted 10 August, 2023; originally announced August 2023.

  32. arXiv:2308.03243  [pdf, other]

    cs.LG

    Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change

    Authors: Chien Cheng Chyou, Hung-Ting Su, Winston H. Hsu

    Abstract: Adversarial robustness poses a critical challenge in the deployment of deep learning models for real-world applications. Traditional approaches to adversarial training and supervised detection rely on prior knowledge of attack types and access to labeled training data, which is often impractical. Existing unsupervised adversarial detection methods identify whether the target model works properly,…

    Submitted 6 August, 2023; originally announced August 2023.

    Comments: AdvML in ICML 2023 code:https://github.com/CycleBooster/Unsupervised-adversarial-detection-without-extra-model

  33. arXiv:2307.13069  [pdf, other]

    cs.CV cs.LG

    General-Purpose Multi-Modal OOD Detection Framework

    Authors: Viet Duong, Qiong Wu, Zhengyi Zhou, Eric Zavesky, Jiahe Chen, Xiangzhou Liu, Wen-Ling Hsu, Huajie Shao

    Abstract: Out-of-distribution (OOD) detection identifies test samples that differ from the training data, which is critical to ensuring the safety and reliability of machine learning (ML) systems. While a plethora of methods have been developed to detect uni-modal OOD samples, only a few have focused on multi-modal OOD detection. Current contrastive learning-based methods primarily study multi-modal OOD det…

    Submitted 24 July, 2023; originally announced July 2023.

  34. arXiv:2306.15687  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu

    Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative…

    Submitted 19 October, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

    Comments: Accepted to NeurIPS 2023

  35. arXiv:2305.19011  [pdf, other]

    eess.AS cs.CL cs.LG

    MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models

    Authors: Yu-Hsiang Wang, Huang-Yu Chen, Kai-Wei Chang, Winston Hsu, Hung-yi Lee

    Abstract: SUPERB was proposed to evaluate the generalizability of self-supervised learning (SSL) speech models across various tasks. However, it incurs high computational costs due to the large datasets and diverse tasks. In this paper, we introduce MiniSUPERB, a lightweight benchmark that efficiently evaluates SSL speech models with comparable results to SUPERB but lower computational costs significantly.…

    Submitted 14 November, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted to IEEE ASRU 2023

  36. arXiv:2305.13516  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling Speech Technology to 1,000+ Languages

    Authors: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

    Abstract: Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on…

    Submitted 22 May, 2023; originally announced May 2023.

  37. arXiv:2305.10005  [pdf, other]

    cs.CL

    DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

    Authors: Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass

    Abstract: In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR) which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with…

    Submitted 16 January, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

  38. arXiv:2304.11112  [pdf]

    cs.NI physics.optics

    Adaptive beamforming for optical wireless communication via fiber modal control

    Authors: Chao Li, Yiwen Zhang, Xinda Yan, Yuzhe Wang, Xuebing Zhang, Jian Cui, Lei Zhu, Juhao Li, Zilun Li, Shaohua Yu, Zizheng Cao, A. M. J. Koonen, Chia Wei Hsu

    Abstract: High-speed optical wireless communication can address the exponential growth in data traffic. Adaptive beamforming customized for the target location is crucial, but existing solutions such as liquid crystal spatial light modulators and microelectromechanical systems require costly micro/nano manufacturing, delicate alignment, and a high degree of mechanical stability. These challenges reflect the…

    Submitted 26 April, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

    Comments: 17 pages, 7 figures

  39. arXiv:2304.03754  [pdf, other]

    cs.CL cs.CV

    Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

    Authors: Hung-Ting Su, Yulei Niu, Xudong Lin, Winston H. Hsu, Shih-Fu Chang

    Abstract: Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video. Existing question synthesis methods pre-trained question generation (QG) systems on reading comprehension datasets with text descriptions as inputs. However, QG models only learn to ask association questions (e.g., "what is someone doing...") and result in inferior pe…

    Submitted 7 April, 2023; originally announced April 2023.

    Comments: CVPR 2023 Workshop L3D-IVU

  40. arXiv:2303.16637  [pdf, other]

    cs.CV

    MuRAL: Multi-Scale Region-based Active Learning for Object Detection

    Authors: Yi-Syuan Liou, Tsung-Han Wu, Jia-Fong Yeh, Wen-Chin Chen, Winston H. Hsu

    Abstract: Obtaining a large-scale labeled object detection dataset can be costly and time-consuming, as it involves annotating images with bounding boxes and class labels. Thus, some specialized active learning methods have been proposed to reduce the cost by selecting either coarse-grained samples or fine-grained instances from unlabeled data for labeling. However, the former approaches suffer from redundant…

    Submitted 29 March, 2023; originally announced March 2023.

  41. arXiv:2303.13752  [pdf, other

    cs.LG cs.CV eess.IV

    Leveraging Old Knowledge to Continually Learn New Classes in Medical Images

    Authors: Evelyn Chee, Mong Li Lee, Wynne Hsu

    Abstract: Class-incremental continual learning is a core step towards developing artificial intelligence systems that can continuously adapt to changes in the environment by learning new concepts without forgetting those previously learned. This is especially needed in the medical domain where continually learning from new incoming data is required to classify an expanded set of diseases. In this work, we f…

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: Accepted to AAAI23

  42. arXiv:2303.11131  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech

    Authors: Maryam Fazel-Zarandi, Wei-Ning Hsu

    Abstract: Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to more acoustic/linguistic domains, languages, and modalities, these investigations are limited to single-source speech with one primary speaker in the recording. This paper presents Cocktail HuBERT, a self-super…

    Submitted 20 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  43. arXiv:2303.04027  [pdf, other

    cs.MM cs.RO

    BIRD-PCC: Bi-directional Range Image-based Deep LiDAR Point Cloud Compression

    Authors: Chia-Sheng Liu, Jia-Fong Yeh, Hao Hsu, Hung-Ting Su, Ming-Sui Lee, Winston H. Hsu

    Abstract: The large amount of data collected by LiDAR sensors brings the issue of LiDAR point cloud compression (PCC). Previous works on LiDAR PCC have used range image representations and followed the predictive coding paradigm to create a basic prototype of a coding framework. However, their prediction methods give an inaccurate result due to the negligence of invalid pixels in range images and the omissi…

    Submitted 8 March, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  44. arXiv:2303.00628  [pdf, ps, other

    cs.CL eess.AS

    MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

    Authors: Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino, Changhan Wang

    Abstract: We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation providing 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X translation as well as 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translati…

    Submitted 7 March, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

  45. arXiv:2302.13571  [pdf, other

    cs.LG cs.AI

    FLAG: Fast Label-Adaptive Aggregation for Multi-label Classification in Federated Learning

    Authors: Shih-Fang Chang, Benny Wei-Yun Hsu, Tien-Yu Chang, Vincent S. Tseng

    Abstract: Federated learning aims to share private data to maximize the data utility without privacy leakage. Previous federated learning research mainly focuses on multi-class classification problems. However, multi-label classification is a crucial research problem close to real-world data properties. Nevertheless, a limited number of federated learning studies explore this research problem. Existing stud…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: 16 pages, 6 figures, and 2 tables

  46. arXiv:2302.06419  [pdf, other

    eess.AS cs.AI cs.CL

    AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

    Authors: Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli

    Abstract: Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec which addresses these challenges and builds audio-visual representations based on pr…

    Submitted 21 January, 2024; v1 submitted 9 February, 2023; originally announced February 2023.

    Comments: 2023 ASRU

  47. arXiv:2301.03728  [pdf, other

    cs.CL cs.AI cs.LG

    Scaling Laws for Generative Mixed-Modal Language Models

    Authors: Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer

    Abstract: Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modaliti…

    Submitted 9 January, 2023; originally announced January 2023.

  48. arXiv:2301.00652  [pdf, other

    eess.AS cs.CL

    Efficient Speech Representation Learning with Low-Bit Quantization

    Authors: Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Abdelrahman Mohamed

    Abstract: With the development of hardware for machine learning, newer models often come at the cost of both increased sizes and computational complexity. In an effort to improve the efficiency of these models, we apply and investigate recent quantization techniques on speech representation learning models. The quantization techniques were evaluated on the SUPERB benchmark. On the ASR task, with aggressive qu…

    Submitted 14 December, 2022; originally announced January 2023.

    Comments: 7 pages

  49. arXiv:2212.11565  [pdf, other

    cs.CV

    Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

    Authors: Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-a…

    Submitted 17 March, 2023; v1 submitted 22 December, 2022; originally announced December 2022.

    Comments: Preprint

  50. arXiv:2212.11377  [pdf, other

    eess.AS cs.CV cs.LG cs.SD

    ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement

    Authors: Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi

    Abstract: Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of…

    Submitted 21 December, 2022; originally announced December 2022.