Showing 1–50 of 145 results for author: Raj, B

Searching in archive cs.
  1. arXiv:2406.16850  [pdf, other]

    cs.CV cs.RO

    From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking

    Authors: Xiaohao Xu, Tianyi Zhang, Sibo Wang, Xiang Li, Yongqi Chen, Ye Li, Bhiksha Raj, Matthew Johnson-Roberson, Xiaonan Huang

    Abstract: Embodied agents require robust navigation systems to operate in unstructured environments, making the robustness of Simultaneous Localization and Mapping (SLAM) models critical to embodied agent autonomy. While real-world datasets are invaluable, simulation-based benchmarks offer a scalable approach for robustness evaluations. However, the creation of a challenging and controllable noisy world wit…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: 50 pages. arXiv admin note: substantial text overlap with arXiv:2402.08125

  2. arXiv:2406.09750  [pdf, other]

    cs.CV cs.AI

    ControlVAR: Exploring Controllable Visual Autoregressive Modeling

    Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Zhe Lin, Rita Singh, Bhiksha Raj

    Abstract: Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs), especially in tasks like control-to-image generation. However, challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs. This paper introduces ControlVAR, a novel…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 24 pages, 19 figures, 4 tables

  3. arXiv:2406.01432  [pdf, other]

    cs.CV

    ED-SAM: An Efficient Diffusion Sampling Approach to Domain Generalization in Vision-Language Foundation Models

    Authors: Thanh-Dat Truong, Xin Li, Bhiksha Raj, Jackson Cothren, Khoa Luu

    Abstract: The Vision-Language Foundation Model has recently shown outstanding performance in various perception learning tasks. The outstanding performance of the vision-language model mainly relies on large-scale pre-training datasets and different data augmentation techniques. However, the domain generalization problem of the vision-language foundation model needs to be addressed. This problem has limited…

    Submitted 3 June, 2024; originally announced June 2024.

  4. arXiv:2406.01429  [pdf, other]

    cs.CV

    EAGLE: Efficient Adaptive Geometry-based Learning in Cross-view Understanding

    Authors: Thanh-Dat Truong, Utsav Prabhu, Dongyi Wang, Bhiksha Raj, Susan Gauch, Jeyamkondan Subbiah, Khoa Luu

    Abstract: Unsupervised Domain Adaptation has been an efficient approach to transferring the semantic segmentation model across data distributions. Meanwhile, the recent Open-vocabulary Semantic Scene understanding based on large-scale vision language models is effective in open-set settings because it can learn diverse concepts and categories. However, these prior methods fail to generalize across different…

    Submitted 3 June, 2024; originally announced June 2024.

  5. arXiv:2405.20494  [pdf, other]

    cs.CV cs.AI cs.LG

    Slight Corruption in Pre-training Data Makes Better Diffusion Models

    Authors: Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj

    Abstract: Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audios, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pair…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: 50 pages, 33 figures, 4 tables

  6. arXiv:2405.14855  [pdf, other]

    cs.CV cs.AI

    Synergistic Global-space Camera and Human Reconstruction from Videos

    Authors: Yizhou Zhao, Tuanfeng Y. Wang, Bhiksha Raj, Min Xu, Jimei Yang, Chun-Hao Paul Huang

    Abstract: Remarkable strides have been made in reconstructing static scenes or human bodies from monocular videos. Yet, the two problems have largely been approached independently, without much synergy. Most visual SLAM methods can only reconstruct camera trajectories and scene structures up to scale, while most HMR methods reconstruct human meshes in metric scale but fall short in reasoning with cameras an…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: CVPR 2024

  7. arXiv:2405.01207  [pdf, ps, other]

    cs.LG cs.CR cs.SD eess.AS

    Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features

    Authors: Francisco Teixeira, Karla Pizzi, Raphael Olivier, Alberto Abad, Bhiksha Raj, Isabel Trancoso

    Abstract: Membership Inference (MI) poses a substantial privacy threat to the training data of Automatic Speech Recognition (ASR) systems, while also offering an opportunity to audit these models with regard to user data. This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models. To the best of our knowledge, this appr…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: Trustworthy Speech Processing, Satellite Workshop at ICASSP 2024

  8. arXiv:2403.06869  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Learning with Noisy Foundation Models

    Authors: Hao Chen, Jindong Wang, Zihan Wang, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama, Bhiksha Raj

    Abstract: Foundation models are usually pre-trained on large-scale datasets and then adapted to downstream tasks through tuning. However, the large-scale pre-training datasets, often inaccessible or too expensive to handle, can contain label noise that may adversely affect the generalization of the model and pose unexpected risks. This paper stands out as the first work to comprehensively understand and ana…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: 18 pages, 10 figures, 6 tables, preprint. arXiv admin note: substantial text overlap with arXiv:2309.17002

  9. arXiv:2403.04924  [pdf, other]

    cs.CV

    $\text{R}^2$-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations

    Authors: Xiang Li, Kai Qiu, Jinglu Wang, Xiaohao Xu, Rita Singh, Kashu Yamazak, Hao Chen, Xiaonan Huang, Bhiksha Raj

    Abstract: Referring perception, which aims at grounding visual objects with multimodal referring guidance, is essential for bridging the gap between humans, who provide instructions, and the environment where intelligent systems perceive. Despite progress in this field, the robustness of referring perception models (RPMs) against disruptive perturbations is not well explored. This work thoroughly assesses t…

    Submitted 7 March, 2024; originally announced March 2024.

    Comments: Code and dataset will be released at https://github.com/lxa9867/r2bench

  10. arXiv:2402.11452  [pdf, other]

    cs.CL

    AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition

    Authors: Zhaorun Chen, Zhuokai Zhao, Zhihong Zhu, Ruiqi Zhang, Xiang Li, Bhiksha Raj, Huaxiu Yao

    Abstract: Recent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment. To address this challenge, in this paper, we propose a novel self-supervised framework AutoPRM that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. Spec…

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: 17 pages, 4 figures, 11 tables

  11. arXiv:2402.10427  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Evaluating and Improving Continual Learning in Spoken Language Understanding

    Authors: Muqiao Yang, Xiang Li, Umberto Cappellazzo, Shinji Watanabe, Bhiksha Raj

    Abstract: Continual learning has emerged as an increasingly important challenge across various tasks, including Spoken Language Understanding (SLU). In SLU, its objective is to effectively handle the emergence of new concepts and evolving environments. The evaluation of continual learning algorithms typically involves assessing the model's stability, plasticity, and generalizability as fundamental aspects o…

    Submitted 15 February, 2024; originally announced February 2024.

  12. arXiv:2402.09585  [pdf, other]

    cs.SD eess.AS

    Domain Adaptation for Contrastive Audio-Language Models

    Authors: Soham Deshmukh, Rita Singh, Bhiksha Raj

    Abstract: Audio-Language Models (ALM) aim to be general-purpose audio models by providing zero-shot capabilities at test time. The zero-shot performance of ALM improves by using suitable text prompts for each domain. The text prompts are usually hand-crafted through an ad-hoc process and lead to a drop in ALM generalization and out-of-distribution performance. Existing approaches to improve domain performan…

    Submitted 14 February, 2024; originally announced February 2024.

  13. arXiv:2402.08125  [pdf, other]

    cs.RO cs.AI cs.CV cs.MM

    Customizable Perturbation Synthesis for Robust SLAM Benchmarking

    Authors: Xiaohao Xu, Tianyi Zhang, Sibo Wang, Xiang Li, Yongqi Chen, Ye Li, Bhiksha Raj, Matthew Johnson-Roberson, Xiaonan Huang

    Abstract: Robustness is a crucial factor for the successful deployment of robots in unstructured environments, particularly in the domain of Simultaneous Localization and Mapping (SLAM). Simulation-based benchmarks have emerged as a highly scalable approach for robustness evaluation compared to real-world data collection. However, crafting a challenging and controllable noisy world with diverse perturbation…

    Submitted 12 February, 2024; originally announced February 2024.

    Comments: 40 pages

  14. arXiv:2402.01922  [pdf, other]

    cs.LG cs.AI

    A General Framework for Learning from Weak Supervision

    Authors: Hao Chen, Jindong Wang, Lei Feng, Xiang Li, Yidong Wang, Xing Xie, Masashi Sugiyama, Rita Singh, Bhiksha Raj

    Abstract: Weakly supervised learning generally faces challenges in applicability to various scenarios with diverse weak supervision and in scalability due to the complexity of existing algorithms, thereby hindering the practical deployment. This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm. Central to GLWS is an Expectation-Maximization (EM) formulati…

    Submitted 5 June, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: 24 pages, 20 tables, 9 figures

  15. arXiv:2402.01909  [pdf, other]

    cs.LG cs.AI cs.CY

    On Catastrophic Inheritance of Large Foundation Models

    Authors: Hao Chen, Bhiksha Raj, Xing Xie, Jindong Wang

    Abstract: Large foundation models (LFMs) are claiming incredible performances. Yet great concerns have been raised about their mythic and uninterpreted potentials not only in machine learning, but also in various other disciplines. In this position paper, we propose to identify a neglected issue deeply rooted in LFMs: Catastrophic Inheritance, describing the weaknesses and limitations inherited from biased…

    Submitted 2 February, 2024; originally announced February 2024.

  16. arXiv:2402.00282  [pdf, other]

    eess.AS cs.SD

    PAM: Prompting Audio-Language Models for Audio Quality Assessment

    Authors: Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang

    Abstract: While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calcu…

    Submitted 31 January, 2024; originally announced February 2024.

  17. arXiv:2401.06806  [pdf, ps, other]

    cs.CL cs.AI

    AugSumm: towards generalizable speech summarization using synthetic labels from large language model

    Authors: Jee-weon Jung, Roshan Sharma, William Chen, Bhiksha Raj, Shinji Watanabe

    Abstract: Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech. Given variations in information captured and phrasing, recordings can be summarized in multiple ways. Therefore, it is more reasonable to consider a probabilistic distribution of all potential summaries rather than a single summary. However, conventional SSUM models are mostly trained and evaluated with a si…

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: This work has been submitted to the IEEE ICASSP for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. 5 pages

  18. arXiv:2311.15965  [pdf, other]

    cs.CV

    FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding

    Authors: Thanh-Dat Truong, Utsav Prabhu, Bhiksha Raj, Jackson Cothren, Khoa Luu

    Abstract: Continual Learning in semantic scene segmentation aims to continually learn new unseen classes in dynamic environments while maintaining previously learned knowledge. Prior studies focused on modeling the catastrophic forgetting and background shift challenges in continual learning. However, fairness, another major challenge that causes unfair predictions leading to low performance among major and…

    Submitted 9 May, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

  19. arXiv:2311.15080  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM cs.SD eess.AS

    Weakly-Supervised Audio-Visual Segmentation

    Authors: Shentong Mo, Bhiksha Raj

    Abstract: Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied a comprehensive manually designed architecture with countless pixel-wise accurate masks as supervision. However, these pixel-level masks are expensive and not available in all cases. In this work, we aim to simplify the supervision as the instance-level annotat…

    Submitted 25 November, 2023; originally announced November 2023.

  20. Token Prediction as Implicit Classification to Identify LLM-Generated Text

    Authors: Yutian Chen, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, Bhiksha Raj

    Abstract: This paper introduces a novel approach for identifying the possible large language models (LLMs) involved in text generation. Instead of adding an additional classification layer to a base LM, we reframe the classification task as a next-token prediction task and directly fine-tune the base LM to perform it. We utilize the Text-to-Text Transfer Transformer (T5) model as the backbone for our experi…

    Submitted 15 November, 2023; originally announced November 2023.

    Comments: EMNLP 2023, Main Conference

  21. arXiv:2310.09449  [pdf, other]

    cs.CV cs.LG

    Pairwise Similarity Learning is SimPLE

    Authors: Yandong Wen, Weiyang Liu, Yao Feng, Bhiksha Raj, Rita Singh, Adrian Weller, Michael J. Black, Bernhard Schölkopf

    Abstract: In this paper, we focus on a general yet important learning problem, pairwise similarity learning (PSL). PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification. The goal of PSL is to learn a pairwise similarity function assigning a higher similarity score to positive pairs (i.e., a pair of samples w…

    Submitted 13 October, 2023; originally announced October 2023.

    Comments: Published in ICCV 2023 (Project page: https://simple.is.tue.mpg.de/)

  22. arXiv:2310.07161  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms

    Authors: Joseph Konan, Ojas Bhargave, Shikhar Agnihotri, Shuo Han, Yunyang Zeng, Ankit Shah, Bhiksha Raj

    Abstract: Within the ambit of VoIP (Voice over Internet Protocol) telecommunications, the complexities introduced by acoustic transformations merit rigorous analysis. This research, rooted in the exploration of proprietary sender-side denoising effects, meticulously evaluates platforms such as Google Meets and Zoom. The study draws upon the Deep Noise Suppression (DNS) 2020 dataset, ensuring a structured ex…

    Submitted 21 November, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

  23. arXiv:2310.04445  [pdf, other]

    cs.CL cs.AI cs.LG

    LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model

    Authors: Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh

    Abstract: It has been shown that Large Language Model (LLM) alignments can be circumvented by appending specially crafted attack suffixes with harmful queries to elicit harmful responses. To conduct attacks against private target models whose characterization is unknown, public models can be used as proxies to fashion the attack, with successful attacks being transferred from public proxies to private targe…

    Submitted 21 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

  24. arXiv:2310.02699  [pdf, other]

    eess.AS cs.AI

    Continual Contrastive Spoken Language Understanding

    Authors: Umberto Cappellazzo, Enrico Fini, Muqiao Yang, Daniele Falavigna, Alessio Brutti, Bhiksha Raj

    Abstract: Recently, neural networks have shown impressive progress across diverse fields, with speech processing being no exception. However, recent breakthroughs in this area require extensive offline training using large datasets and tremendous computing resources. Unfortunately, these models struggle to retain their previously acquired knowledge when learning new tasks continually, and retraining from sc…

    Submitted 4 June, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted to ACL Findings 2024

  25. arXiv:2310.02298  [pdf, other]

    cs.SD cs.AI eess.AS

    Prompting Audios Using Acoustic Properties For Emotion Representation

    Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh

    Abstract: Emotions lie on a continuum, but current models treat emotions as a finite valued discrete variable. This representation does not capture the diversity in the expression of emotion. To better represent emotions we propose the use of natural language descriptions (or prompts). In this work, we address the challenge of automatically generating these prompts and training a model to better learn emoti…

    Submitted 6 December, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2211.07737

  26. arXiv:2310.00900  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models

    Authors: Muqiao Yang, Chunlei Zhang, Yong Xu, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu

    Abstract: Speech enhancement aims to improve the quality of speech signals in terms of quality and intelligibility, and speech editing refers to the process of editing the speech according to specific user needs. In this paper, we propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner. Specifically, by p…

    Submitted 2 October, 2023; originally announced October 2023.

  27. arXiv:2310.00808  [pdf, other]

    cs.CV

    Completing Visual Objects via Bridging Generation and Segmentation

    Authors: Xiang Li, Yinpeng Chen, Chung-Ching Lin, Hao Chen, Kai Hu, Rita Singh, Bhiksha Raj, Lijuan Wang, Zicheng Liu

    Abstract: This paper presents a novel approach to object completion, with the primary goal of reconstructing a complete object from its partially visible components. Our method, named MaskComp, delineates the completion process through iterative stages of generation and segmentation. In each iteration, the object mask is provided as an additional condition to boost image generation, and, in return, the gene…

    Submitted 2 February, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

  28. arXiv:2310.00706  [pdf, other]

    cs.CL cs.SD eess.AS

    Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech

    Authors: Dareen Alharthi, Roshan Sharma, Hira Dhamyal, Soumi Maiti, Bhiksha Raj, Rita Singh

    Abstract: Modern speech synthesis systems have improved significantly, with synthetic speech being indistinguishable from real speech. However, efficient and holistic evaluation of synthetic speech still remains a significant challenge. Human evaluation using Mean Opinion Score (MOS) is ideal, but inefficient due to high costs. Therefore, researchers have developed auxiliary automatic metrics like Word Erro…

    Submitted 1 October, 2023; originally announced October 2023.

  29. arXiv:2310.00132  [pdf, other]

    cs.CV

    QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition

    Authors: Xiang Li, Jinglu Wang, Xiaohao Xu, Xiulian Peng, Rita Singh, Yan Lu, Bhiksha Raj

    Abstract: Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in videos according to their associated acoustic cues. With multiple sound sources and background disturbances involved, establishing robust correspondences between audio and visual contents poses unique challenges due to (1) complex entanglement across sound sources and (2) frequent changes in the occurrence…

    Submitted 19 April, 2024; v1 submitted 29 September, 2023; originally announced October 2023.

  30. arXiv:2309.17002  [pdf, other]

    cs.LG cs.AI cs.CV

    Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks

    Authors: Hao Chen, Jindong Wang, Ankit Shah, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama, Bhiksha Raj

    Abstract: Pre-training on large-scale datasets and then fine-tuning on downstream tasks have become a standard practice in deep learning. However, pre-training data often contain label noise that may adversely affect the generalization of the model. This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks. More specifically, through extensive…

    Submitted 11 March, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ICLR 2024 Spotlight

  31. arXiv:2309.13227  [pdf, other]

    cs.LG cs.SD eess.AS

    Importance of negative sampling in weak label learning

    Authors: Ankit Shah, Fuyu Tang, Zelin Ye, Rita Singh, Bhiksha Raj

    Abstract: Weak-label learning is a challenging task that requires learning from data "bags" containing positive and negative instances, but only the bag labels are known. The pool of negative instances is usually larger than positive instances, thus making selecting the most informative negative instance critical for performance. Such a selection strategy for negative instances from each bag is an open prob…

    Submitted 22 September, 2023; originally announced September 2023.

  32. arXiv:2309.07372  [pdf, other]

    eess.AS cs.SD

    Training Audio Captioning Models without Audio

    Authors: Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang

    Abstract: Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an a…

    Submitted 13 September, 2023; originally announced September 2023.

  33. arXiv:2308.03956  [pdf, other]

    cs.LG cs.NE

    Fixed Inter-Neuron Covariability Induces Adversarial Robustness

    Authors: Muhammad Ahmed Shah, Bhiksha Raj

    Abstract: The vulnerability to adversarial perturbations is a major flaw of Deep Neural Networks (DNNs) that raises question about their reliability when in real-world scenarios. On the other hand, human perception, which DNNs are supposed to emulate, is highly robust to such perturbations, indicating that there may be certain features of the human perception that make it robust but are not represented in t…

    Submitted 7 August, 2023; originally announced August 2023.

  34. arXiv:2308.00854  [pdf, other]

    cs.CV cs.AI

    Training on Foveated Images Improves Robustness to Adversarial Attacks

    Authors: Muhammad A. Shah, Bhiksha Raj

    Abstract: Deep neural networks (DNNs) have been shown to be vulnerable to adversarial attacks -- subtle, perceptually indistinguishable perturbations of inputs that change the response of the model. In the context of vision, we hypothesize that an important contributor to the robustness of human visual perception is constant exposure to low-fidelity visual stimuli in our peripheral vision. To investigate th…

    Submitted 1 August, 2023; originally announced August 2023.

  35. arXiv:2307.13953  [pdf, other]

    cs.CV cs.SD eess.AS

    The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features

    Authors: Liao Qu, Xianwei Zou, Xiang Li, Yandong Wen, Rita Singh, Bhiksha Raj

    Abstract: This work unveils the enigmatic link between phonemes and facial features. Traditional studies on voice-face correlations typically involve using a long period of voice input, including generating face images from voices and reconstructing 3D face meshes from voices. However, in situations like voice-based crimes, the available voice evidence may be short and limited. Additionally, from a physiolo…

    Submitted 26 July, 2023; originally announced July 2023.

    Comments: Interspeech 2023

  36. arXiv:2307.13948  [pdf, other]

    cs.CV cs.SD eess.AS

    Rethinking Voice-Face Correlation: A Geometry View

    Authors: Xiang Li, Yandong Wen, Muqiao Yang, Jinglu Wang, Rita Singh, Bhiksha Raj

    Abstract: Previous works on voice-face matching and voice-guided face synthesis demonstrate strong correlations between voice and face, but mainly rely on coarse semantic cues such as gender, age, and emotion. In this paper, we aim to investigate the capability of reconstructing the 3D facial shape from voice from a geometry perspective without any semantic information. We propose a voice-anthropometric mea…

    Submitted 26 July, 2023; originally announced July 2023.

    Comments: ACM Multimedia 2023

  37. arXiv:2307.08217  [pdf, other]

    cs.CL cs.SD eess.AS

    BASS: Block-wise Adaptation for Speech Summarization

    Authors: Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita Singh, Bhiksha Raj

    Abstract: End-to-end speech summarization has been shown to improve performance over cascade baselines. However, such models are difficult to train on very large inputs (dozens of minutes or hours) owing to compute restrictions and are hence trained with truncated model inputs. Truncation leads to poorer models, and a solution to this problem rests in block-wise modeling, i.e., processing a portion of the i…

    Submitted 16 July, 2023; originally announced July 2023.

    Comments: Accepted at Interspeech 2023

  38. arXiv:2306.09613  [pdf, other]

    cs.CV

    UTOPIA: Unconstrained Tracking Objects without Preliminary Examination via Cross-Domain Adaptation

    Authors: Pha Nguyen, Kha Gia Quach, John Gauch, Samee U. Khan, Bhiksha Raj, Khoa Luu

    Abstract: Multiple Object Tracking (MOT) aims to find bounding boxes and identities of targeted objects in consecutive video frames. While fully-supervised MOT methods have achieved high accuracy on existing datasets, they cannot generalize well on a newly obtained dataset or a new unseen domain. In this work, we first address the MOT problem from the cross-domain point of view, imitating the process of new…

    Submitted 16 June, 2023; originally announced June 2023.

  39. arXiv:2305.19406  [pdf, other]

    cs.CV

    PaintSeg: Training-free Segmentation via Painting

    Authors: Xiang Li, Chung-Ching Lin, Yinpeng Chen, Zicheng Liu, Jinglu Wang, Bhiksha Raj

    Abstract: The paper introduces PaintSeg, a new unsupervised method for segmenting objects without any training. We propose an adversarial masked contrastive painting (AMCP) process, which creates a contrast between the original image and a painted image in which a masked area is painted using off-the-shelf generative models. During the painting process, inpainting and outpainting are alternated, with the fo…

    Submitted 4 June, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

  40. arXiv:2305.15700  [pdf, other]

    cs.CV

    Fairness Continual Learning Approach to Semantic Scene Understanding in Open-World Environments

    Authors: Thanh-Dat Truong, Hoang-Quan Nguyen, Bhiksha Raj, Khoa Luu

    Abstract: Continual semantic segmentation aims to learn new classes while maintaining the information from the previous classes. Although prior studies have shown impressive progress in recent years, the fairness concern in the continual semantic segmentation needs to be better addressed. Meanwhile, fairness is one of the most vital factors in deploying the deep learning model, especially in human-related o…

    Submitted 1 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted to NeurIPS 2023

  41. arXiv:2305.12715  [pdf, other]

    cs.LG cs.AI cs.CV

    Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations

    Authors: Hao Chen, Ankit Shah, Jindong Wang, Ran Tao, Yidong Wang, Xing Xie, Masashi Sugiyama, Rita Singh, Bhiksha Raj

    Abstract: Learning with reduced labeling standards, such as noisy label, partial label, and multiple label candidates, which we generically refer to as \textit{imprecise} labels, is a commonplace challenge in machine learning tasks. Previous methods tend to propose specific designs for every emerging imprecise label configuration, which is usually unsustainable when multiple configurations of imprecision co…

    Submitted 29 September, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: 29 pages, 3 figures, 16 tables, preprint

  42. arXiv:2305.07969  [pdf, other]

    cs.CL

    GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content

    Authors: Yutian Chen, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, Bhiksha Raj

    Abstract: This paper presents a novel approach for detecting ChatGPT-generated vs. human-written text using language models. To this end, we first collected and released a pre-processed dataset named OpenGPTText, which consists of rephrased content generated using ChatGPT. We then designed, implemented, and trained two different models for text classification, using Robustly Optimized BERT Pretraining Appro…

    Submitted 17 May, 2023; v1 submitted 13 May, 2023; originally announced May 2023.

  43. arXiv:2304.02135  [pdf, other]

    cs.CV

    FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding

    Authors: Thanh-Dat Truong, Ngan Le, Bhiksha Raj, Jackson Cothren, Khoa Luu

    Abstract: Although Domain Adaptation in Semantic Scene Segmentation has shown impressive improvement in recent years, the fairness concerns in the domain adaptation have yet to be well defined and addressed. In addition, fairness is one of the most critical aspects when deploying the segmentation models into human-related real-world applications, e.g., autonomous driving, as any unfair predictions could inf…

    Submitted 4 April, 2023; originally announced April 2023.

    Comments: Accepted to CVPR'23

  44. arXiv:2303.09048  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    Improving Perceptual Quality, Intelligibility, and Acoustics on VoIP Platforms

    Authors: Joseph Konan, Ojas Bhargave, Shikhar Agnihotri, Hojeong Lee, Ankit Shah, Shuo Han, Yunyang Zeng, Amanda Shu, Haohui Liu, Xuankai Chang, Hamza Khalid, Minseon Gwak, Kawon Lee, Minjeong Kim, Bhiksha Raj

    Abstract: In this paper, we present a method for fine-tuning models trained on the Deep Noise Suppression (DNS) 2020 Challenge to improve their performance on Voice over Internet Protocol (VoIP) applications. Our approach involves adapting the DNS 2020 models to the specific acoustic characteristics of VoIP communications, which includes distortion and artifacts caused by compression, transmission, and plat…

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: Under review at European Association for Signal Processing. 5 pages

  45. arXiv:2303.03591  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Approach to Learning Generalized Audio Representation Through Batch Embedding Covariance Regularization and Constant-Q Transforms

    Authors: Ankit Shah, Shuyi Chen, Kejun Zhou, Yue Chen, Bhiksha Raj

    Abstract: General-purpose embedding is highly desirable for few-shot even zero-shot learning in many application scenarios, including audio tasks. In order to understand representations better, we conducted a thorough error analysis and visualization of HEAR 2021 submission results. Inspired by the analysis, this work experiments with different front-end audio preprocessing methods, including Constant-Q Tra…

    Submitted 6 March, 2023; originally announced March 2023.

    Comments: Technical report, 10 pages

  46. arXiv:2302.09719  [pdf, ps, other]

    eess.AS cs.SD

    Synergy between human and machine approaches to sound/scene recognition and processing: An overview of ICASSP special session

    Authors: Laurie M. Heller, Benjamin Elizalde, Bhiksha Raj, Soham Deshmukh

    Abstract: Machine Listening, as usually formalized, attempts to perform a task that is, from our perspective, fundamentally human-performable, and performed by humans. Current automated models of Machine Listening vary from purely data-driven approaches to approaches imitating human systems. In recent years, the most promising approaches have been hybrid in that they have used data-driven approaches informe…

    Submitted 23 February, 2023; v1 submitted 19 February, 2023; originally announced February 2023.

    Comments: 4 pages. Summary of Special Session planned for 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://2023.ieeeicassp.org/ Second version has corrected spelling of an author's name

  47. arXiv:2302.08095  [pdf, other]

    cs.SD cs.CL eess.AS

    PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement

    Authors: Muqiao Yang, Joseph Konan, David Bick, Yunyang Zeng, Shuo Han, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

    Abstract: Despite rapid advancement in recent years, current speech enhancement models often produce speech that differs in perceptual quality from real clean speech. We propose a learning objective that formalizes differences in perceptual quality, by using domain knowledge of acoustic-phonetics. We identify temporal acoustic parameters -- such as spectral tilt, spectral flux, shimmer, etc. -- that are non…

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: Accepted at ICASSP 2023

  48. arXiv:2302.08088  [pdf, other]

    cs.CL cs.SD eess.AS

    TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement

    Authors: Yunyang Zeng, Joseph Konan, Shuo Han, David Bick, Muqiao Yang, Anurag Kumar, Shinji Watanabe, Bhiksha Raj

    Abstract: Speech enhancement models have greatly progressed in recent years, but still show limits in perceptual quality of their speech outputs. We propose an objective for perceptual quality based on temporal acoustic parameters. These are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable…

    Submitted 15 February, 2023; originally announced February 2023.

    Comments: Accepted at ICASSP 2023

  49. arXiv:2301.10921  [pdf, other]

    cs.LG cs.AI cs.CV

    SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning

    Authors: Hao Chen, Ran Tao, Yue Fan, Yidong Wang, Jindong Wang, Bernt Schiele, Xing Xie, Bhiksha Raj, Marios Savvides

    Abstract: The critical challenge of Semi-Supervised Learning (SSL) is how to effectively leverage the limited labeled data and massive unlabeled data to improve the model's generalization performance. In this paper, we first revisit the popular pseudo-labeling methods via a unified sample weighting formulation and demonstrate the inherent quantity-quality trade-off problem of pseudo-labeling with thresholdi…

    Submitted 15 March, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

    Comments: Accepted by ICLR 2023

  50. arXiv:2301.00891  [pdf, other]

    cs.CL

    Understanding Political Polarisation using Language Models: A dataset and method

    Authors: Samiran Gode, Supreeth Bare, Bhiksha Raj, Hyungon Yoo

    Abstract: Our paper aims to analyze political polarization in the US political system using Language Models, and thereby help candidates make an informed decision. The availability of this information will help voters understand their candidates' views on the economy, healthcare, education and other social issues. Our main contributions are a dataset extracted from Wikipedia that spans the past 120 years and a L…

    Submitted 2 January, 2023; originally announced January 2023.