Skip to main content

Showing 1–50 of 343 results for author: Chang, X

Searching in archive cs. Search in all archives.
.
  1. arXiv:2407.00837  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Towards Robust Speech Representation Learning for Thousands of Languages

    Authors: William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, **chuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 millio… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: 20 pages

  2. arXiv:2406.12723  [pdf, other

    cs.LG

    BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

    Authors: Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang

    Abstract: As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establish several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by includin… ▽ More

    Submitted 24 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

  3. arXiv:2406.11579  [pdf, other

    cs.CV

    Duoduo CLIP: Efficient 3D Understanding with Multi-View Images

    Authors: Han-Hung Lee, Yiming Zhang, Angel X. Chang

    Abstract: We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point-clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models to facilitate fine-tuning with 3D data. Our approach not only shows better generalization compared to existing point cloud methods, but also reduces GPU requirement… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  4. arXiv:2406.08641  [pdf, ps, other

    cs.SD cs.CL eess.AS

    ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

    Authors: Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, **chuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe

    Abstract: ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB~2.0, which is a ne… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  5. arXiv:2406.07725  [pdf, ps, other

    cs.SD eess.AS

    The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

    Authors: Xuankai Chang, Jiatong Shi, **chuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin **

    Abstract: Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge,… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: This manuscript has been accepted by Interspeech2024

  6. arXiv:2406.06999  [pdf, other

    cs.CV

    Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection

    Authors: Junfei Yi, Jianxu Mao, Tengfei Liu, Mingjie Li, Hanyu Gu, Hui Zhang, Xiaojun Chang, Yaonan Wang

    Abstract: Knowledge distillation (KD) is a widely adopted and effective method for compressing models in object detection tasks. Particularly, feature-based distillation methods have shown remarkable performance. Existing approaches often ignore the uncertainty in the teacher model's knowledge, which stems from data noise and imperfect training. This limits the student model's ability to learn latent knowle… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

  7. arXiv:2406.02990  [pdf, other

    cs.CV

    Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification

    Authors: Gexin Huang, Chenfei Wu, Mingjie Li, Xiaojun Chang, Ling Chen, Ying Sun, Shen Zhao, Xiaodan Liang, Liang Lin

    Abstract: Predicting genetic mutations from whole slide images is indispensable for cancer diagnosis. However, existing work training multiple binary classification models faces two challenges: (a) Training multiple binary classifiers is inefficient and would inevitably lead to a class imbalance problem. (b) The biological relationships among genes are overlooked, which limits the prediction performance. To… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: 16 pages, 8 figures, and 3 tables

  8. arXiv:2405.17761  [pdf, other

    cs.LG math.OC

    Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient

    Authors: Hao Di, Haishan Ye, Yueling Zhang, Xiangyu Chang, Guang Dai, Ivor W. Tsang

    Abstract: Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating convergence rates of first-order (FO) and zeroth-order (ZO) optimization methods. However, in composite optimization problems, ZO methods encounter an additional variance called the coordinate-wise variance, which stems from the random gradient estimation. To reduce this variance, prior works require… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

  9. arXiv:2405.17537  [pdf, other

    cs.AI cs.CL cs.CV

    BIOSCAN-CLIP: Bridging Vision and Genomics for Biodiversity Monitoring at Scale

    Authors: ZeMing Gong, Austin T. Wang, Joakim Bruslund Haurum, Scott C. Lowe, Graham W. Taylor, Angel X. Chang

    Abstract: Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for the taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, DNA barcodes, and textual data in a unified embedding space. This allows for… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 16 pages with 9 figures

  10. arXiv:2405.10255  [pdf, other

    cs.CV cs.RO

    When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

    Authors: Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, **dong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu

    Abstract: As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context lear… ▽ More

    Submitted 16 May, 2024; originally announced May 2024.

  11. arXiv:2405.06747  [pdf, other

    cs.SD cs.LG eess.AS

    Music Emotion Prediction Using Recurrent Neural Networks

    Authors: Xinyu Chang, Xiangyu Zhang, Haoruo Zhang, Yulu Ran

    Abstract: This study explores the application of recurrent neural networks to recognize emotions conveyed in music, aiming to enhance music recommendation systems and support therapeutic interventions by tailoring music to fit listeners' emotional states. We utilize Russell's Emotion Quadrant to categorize music into four distinct emotional regions and develop models capable of accurately predicting these c… ▽ More

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: 15 pages, 13 figures

  12. arXiv:2405.05010  [pdf, other

    cs.CV

    ${M^2D}$NeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields

    Authors: Ning Wang, Lefei Zhang, Angel X Chang

    Abstract: Neural fields (NeRF) have emerged as a promising approach for representing continuous 3D scenes. Nevertheless, the lack of semantic encoding in NeRFs poses a significant challenge for scene decomposition. To address this challenge, we present a single model, Multi-Modal Decomposition NeRF (${M^2D}$NeRF), that is capable of both text-based and visual patch-based edits. Specifically, we use multi-mo… ▽ More

    Submitted 8 May, 2024; originally announced May 2024.

  13. arXiv:2405.04046  [pdf

    cs.CR

    MBCT: A Monero-Based Covert Transmission Approach with On-chain Dynamic Session Key Negotiation

    Authors: Zhenshuai Yue, Haoran Zhu, Xiaolin Chang, Jelena Mišić, Vojislav B. Mišić, Junchao Fan

    Abstract: Traditional covert transmission (CT) approaches have been hindering CT application while blockchain technology offers new avenue. Current blockchain-based CT approaches require off-chain negotiation of critical information and often overlook the dynamic session keys updating, which increases the risk of message and key leakage. Additionally, in some approaches the covert transactions exhibit obvio… ▽ More

    Submitted 7 May, 2024; originally announced May 2024.

  14. arXiv:2405.01030  [pdf

    cs.CR cs.SE

    Towards Trust Proof for Secure Confidential Virtual Machines

    Authors: **gkai Mao, Haoran Zhu, Junchao Fan, Lin Li, Xiaolin Chang

    Abstract: The Virtual Machine (VM)-based Trusted-Execution-Environment (TEE) technology, like AMD Secure-Encrypted-Virtualization (SEV), enables the establishment of Confidential VMs (CVMs) to protect data privacy. But CVM lacks ways to provide the trust proof of its running state, degrading the user confidence of using CVM. The technology of virtual Trusted Platform Module (vTPM) can be used to generate tr… ▽ More

    Submitted 2 May, 2024; originally announced May 2024.

  15. arXiv:2404.15189  [pdf, other

    cs.AI

    Text2Grasp: Grasp synthesis by text prompts of object gras** parts

    Authors: Xiaoyun Chang, Yi Sun

    Abstract: The hand plays a pivotal role in human ability to grasp and manipulate objects and controllable grasp synthesis is the key for successfully performing downstream tasks. Existing methods that use human intention or task-level language as control signals for gras** inherently face ambiguity. To address this challenge, we propose a grasp synthesis method guided by text prompts of object gras** pa… ▽ More

    Submitted 9 April, 2024; originally announced April 2024.

  16. arXiv:2404.14886  [pdf, other

    cs.LG

    GCEPNet: Graph Convolution-Enhanced Expectation Propagation for Massive MIMO Detection

    Authors: Qincheng Lu, Sitao Luan, Xiao-Wen Chang

    Abstract: Massive MIMO (multiple-input multiple-output) detection is an important topic in wireless communication and various machine learning based methods have been developed recently for this task. Expectation propagation (EP) and its variants are widely used for MIMO detection and have achieved the best performance. However, EP-based solvers fail to capture the correlation between unknown variables, lea… ▽ More

    Submitted 23 April, 2024; originally announced April 2024.

  17. arXiv:2404.09385  [pdf, other

    eess.AS cs.CL eess.SP

    A Large-Scale Evaluation of Speech Foundation Models

    Authors: Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee

    Abstract: The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work,… ▽ More

    Submitted 29 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The Arxiv version is preferred

  18. arXiv:2404.07847  [pdf, other

    cs.CV

    The Effectiveness of a Simplified Model Structure for Crowd Counting

    Authors: Lei Chen, Xinghang Gao, Fei Chao, Xiang Chang, Chih Min Lin, Xingen Gao, Shaopeng Lin, Hongyi Zhang, Juqiang Lin

    Abstract: In the field of crowd counting research, many recent deep learning based methods have demonstrated robust capabilities for accurately estimating crowd sizes. However, the enhancement in their performance often arises from an increase in the complexity of the model structure. This paper discusses how to construct high-performance crowd counting models using only simple structures. We proposes the F… ▽ More

    Submitted 18 June, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

  19. arXiv:2404.05657  [pdf, other

    cs.CV

    MLP Can Be A Good Transformer Learner

    Authors: Sihao Lin, Pumeng Lyu, Dongrui Liu, Tao Tang, Xiaodan Liang, Andy Song, Xiaojun Chang

    Abstract: Self-attention mechanism is the key of the Transformer but often criticized for its computation demands. Previous token pruning works motivate their methods from the view of computation redundancy but still need to load the full network and require same memory costs. This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective remo… ▽ More

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: efficient transformer

  20. arXiv:2404.03384  [pdf, other

    cs.CV

    LongVLM: Efficient Long Video Understanding via Large Language Models

    Authors: Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang

    Abstract: Empowered by Large Language Models (LLMs), recent advancements in VideoLLMs have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, making computational and memory costs affordable. Despite successfully providing an overall comprehension of video content, existing VideoLLMs still… ▽ More

    Submitted 10 April, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

  21. arXiv:2403.16116  [pdf, other

    cs.CV cs.AI cs.LG

    Self-Supervised Multi-Frame Neural Scene Flow

    Authors: Dongrui Liu, Daqi Liu, Xueqian Li, Sihao Lin, Hongwei xie, Bing Wang, Xiaojun Chang, Lei Chu

    Abstract: Neural Scene Flow Prior (NSFP) and Fast Neural Scene Flow (FNSF) have shown remarkable adaptability in the context of large out-of-distribution autonomous driving. Despite their success, the underlying reasons for their astonishing generalization capabilities remain unclear. Our research addresses this gap by examining the generalization capabilities of NSFP through the lens of uniform stability,… ▽ More

    Submitted 24 March, 2024; originally announced March 2024.

  22. arXiv:2403.16097  [pdf, other

    cs.AI cs.LO cs.SE

    Can Language Models Pretend Solvers? Logic Code Simulation with LLMs

    Authors: Minyu Chen, Guoqiang Li, Ling-I Wu, Ruibang Liu, Yuxin Su, Xi Chang, Jianxin Xue

    Abstract: Transformer-based large language models (LLMs) have demonstrated significant potential in addressing logic problems. capitalizing on the great capabilities of LLMs for code-related activities, several frameworks leveraging logical solvers for logic reasoning have been proposed recently. While existing research predominantly focuses on viewing LLMs as natural language logic solvers or translators,… ▽ More

    Submitted 28 March, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

    Comments: 12 pages, 8 figures

  23. arXiv:2403.14174  [pdf, other

    cs.CV

    Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding

    Authors: **g**g Hu, Dan Guo, Kun Li, Zhan Si, Xun Yang, Xiaojun Chang, Meng Wang

    Abstract: Inspired by the activity-silent and persistent activity mechanisms in human visual perception biology, we design a Unified Static and Dynamic Network (UniSDNet), to learn the semantic association between the video and text/audio queries in a cross-modal environment for efficient video grounding. For static modeling, we devise a novel residual structure (ResMLP) to boost the global comprehensive in… ▽ More

    Submitted 21 March, 2024; originally announced March 2024.

  24. arXiv:2403.13289  [pdf, other

    cs.CV

    Text-to-3D Shape Generation

    Authors: Han-Hung Lee, Manolis Savva, Angel X. Chang

    Abstract: Recent years have seen an explosion of work and interest in text-to-3D shape generation. Much of the progress is driven by advances in 3D representations, large-scale pretraining and representation learning for text and image data enabling generative AI models, and differentiable rendering. Computational systems that can perform text-to-3D shape generation have captivated the popular imagination a… ▽ More

    Submitted 20 March, 2024; originally announced March 2024.

  25. arXiv:2403.12301  [pdf, other

    cs.CV

    R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding

    Authors: Qirui Wu, Sonia Raychaudhuri, Daniel Ritchie, Manolis Savva, Angel X Chang

    Abstract: We introduce the Reality-linked 3D Scenes (R3DS) dataset of synthetic 3D scenes mirroring the real-world scene arrangements from Matterport3D panoramas. Compared to prior work, R3DS has more complete and densely populated scenes with objects linked to real-world observations in panoramas. R3DS also provides an object support hierarchy, and matching object sets (e.g., same chairs around a dining ta… ▽ More

    Submitted 18 March, 2024; originally announced March 2024.

  26. arXiv:2403.11519  [pdf, other

    cs.CR

    Efficient and Privacy-Preserving Federated Learning based on Full Homomorphic Encryption

    Authors: Yuqi Guo, Lin Li, Zhongxiang Zheng, Hanrui Yun, Ruoyan Zhang, Xiaolin Chang, Zhixuan Gao

    Abstract: Since the first theoretically feasible full homomorphic encryption (FHE) scheme was proposed in 2009, great progress has been achieved. These improvements have made FHE schemes come off the paper and become quite useful in solving some practical problems. In this paper, we propose a set of novel Federated Learning Schemes by utilizing the latest homomorphic encryption technologies, so as to improv… ▽ More

    Submitted 18 March, 2024; originally announced March 2024.

  27. arXiv:2403.08568  [pdf, other

    cs.CV cs.LG

    Consistent Prompting for Rehearsal-Free Continual Learning

    Authors: Zhanxin Gao, Jun Cen, Xiaobin Chang

    Abstract: Continual learning empowers models to adapt autonomously to the ever-changing environment or data streams without forgetting old knowledge. Prompt-based approaches are built on frozen pre-trained models to learn the task-specific prompts and classifiers efficiently. Existing prompt-based methods are inconsistent between training and testing, limiting their effectiveness. Two types of inconsistency… ▽ More

    Submitted 14 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR2024

  28. arXiv:2403.07376  [pdf, other

    cs.CV cs.AI cs.CL cs.RO

    NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

    Authors: Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han, Hang Xu, Xiaojun Chang, Xiaodan Liang

    Abstract: Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. However, their predominant use in an offlin… ▽ More

    Submitted 12 March, 2024; originally announced March 2024.

  29. arXiv:2403.04161  [pdf, other

    cs.LG cs.CV cs.NE

    SWAP-NAS: Sample-Wise Activation Patterns for Ultra-fast NAS

    Authors: Yameng Peng, Andy Song, Haytham M. Fayek, Vic Ciesielski, Xiaojun Chang

    Abstract: Training-free metrics (a.k.a. zero-cost proxies) are widely used to avoid resource-intensive neural network training, especially in Neural Architecture Search (NAS). Recent studies show that existing training-free metrics have several limitations, such as limited correlation and poor generalisation across different search spaces and tasks. Hence, we propose Sample-Wise Activation Patterns and its… ▽ More

    Submitted 24 June, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

    Comments: ICLR2024 Spotlight

  30. arXiv:2403.01475  [pdf, other

    cs.LG cs.AI cs.SI

    Representation Learning on Heterophilic Graph with Directional Neighborhood Attention

    Authors: Qincheng Lu, Jiaqi Zhu, Sitao Luan, Xiao-Wen Chang

    Abstract: Graph Attention Network (GAT) is one of the most popular Graph Neural Network (GNN) architecture, which employs the attention mechanism to learn edge weights and has demonstrated promising performance in various applications. However, since it only incorporates information from immediate neighborhood, it lacks the ability to capture long-range and global graph information, leading to unsatisfactor… ▽ More

    Submitted 3 March, 2024; originally announced March 2024.

  31. arXiv:2403.01326  [pdf, other

    cs.CV

    DNA Family: Boosting Weight-Sharing NAS with Block-Wise Supervisions

    Authors: Guangrun Wang, Changlin Li, Liuchun Yuan, Jiefeng Peng, Xiaoyu Xian, Xiaodan Liang, Xiaojun Chang, Liang Lin

    Abstract: Neural Architecture Search (NAS), aiming at automatically designing neural architectures by machines, has been considered a key step toward automatic machine learning. One notable NAS branch is the weight-sharing NAS, which significantly improves search efficiency and allows NAS algorithms to run on ordinary computers. Despite receiving high expectations, this category of methods suffers from low… ▽ More

    Submitted 2 March, 2024; originally announced March 2024.

    Comments: T-PAMI

  32. arXiv:2402.18086  [pdf, other

    cs.CV

    Generalizable Two-Branch Framework for Image Class-Incremental Learning

    Authors: Chao Wu, Xiaobin Chang, Ruixuan Wang

    Abstract: Deep neural networks often severely forget previously learned knowledge when learning new knowledge. Various continual learning (CL) methods have been proposed to handle such a catastrophic forgetting issue from different perspectives and achieved substantial improvements. In this paper, a novel two-branch continual learning framework is proposed to further enhance most existing CL methods. Specif… ▽ More

    Submitted 13 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: 5 pages,3 figures,accepted by ICASSP 2024

  33. arXiv:2402.16021  [pdf, other

    cs.CL cs.AI cs.CV eess.AS

    TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

    Authors: Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro

    Abstract: The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, whe… ▽ More

    Submitted 25 February, 2024; originally announced February 2024.

  34. "It Must Be Gesturing Towards Me": Gesture-Based Interaction between Autonomous Vehicles and Pedestrians

    Authors: Xiang Chang, Zihe Chen, Xiaoyan Dong, Yuxin Cai, Tingmin Yan, Haolin Cai, Zherui Zhou, Guyue Zhou, Jiangtao Gong

    Abstract: Interacting with pedestrians understandably and efficiently is one of the toughest challenges faced by autonomous vehicles (AVs) due to the limitations of current algorithms and external human-machine interfaces (eHMIs). In this paper, we design eHMIs based on gestures inspired by the most popular method of interaction between pedestrians and human drivers. Eight common gestures were selected to c… ▽ More

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: 26 pages,22 figures

    MSC Class: H.5.2

    Journal ref: CHI2024

  35. MatchNAS: Optimizing Edge AI in Sparse-Label Data Contexts via Automating Deep Neural Network Porting for Mobile Deployment

    Authors: Hongtao Huang, Xiaojun Chang, Wen Hu, Lina Yao

    Abstract: Recent years have seen the explosion of edge intelligence with powerful Deep Neural Networks (DNNs). One popular scheme is training DNNs on powerful cloud servers and subsequently porting them to mobile devices after being lightweight. Conventional approaches manually specialized DNNs for various edge platforms and retrain them with real-world data. However, as the number of platforms increases, t… ▽ More

    Submitted 20 February, 2024; originally announced February 2024.

  36. arXiv:2402.08769  [pdf, other

    cs.LG cs.DC

    FLASH: Federated Learning Across Simultaneous Heterogeneities

    Authors: Xiangyu Chang, Sk Miraj Ahmed, Srikanth V. Krishnamurthy, Basak Guler, Ananthram Swami, Samet Oymak, Amit K. Roy-Chowdhury

    Abstract: The key premise of federated learning (FL) is to train ML models across a diverse set of data-owners (clients), without exchanging local data. An overarching challenge to this date is client heterogeneity, which may arise not only from variations in data distribution, but also in data quality, as well as compute/communication latency. An integrated view of these diverse and concurrent sources of h… ▽ More

    Submitted 13 February, 2024; originally announced February 2024.

  37. arXiv:2401.16658  [pdf, ps, other

    cs.CL eess.AS

    OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

    Authors: Yifan Peng, **chuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

    Abstract: Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder archite… ▽ More

    Submitted 16 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted at INTERSPEECH 2024. Webpage: https://www.wavlab.org/activities/2024/owsm/

  38. arXiv:2401.15815  [pdf, ps, other

    eess.SP cs.CE math.OC

    Success probability of the $L_0$-regularized box-constrained Babai point and column permutation strategies

    Authors: Xiao-Wen Chang, Yingzi XU

    Abstract: We consider the success probability of the $L_0$-regularized box-constrained Babai point, which is a suboptimal solution to the $L_0$-regularized box-constrained integer least squares problem and can be used for MIMO detection. First, we derive formulas for the success probability of both $L_0$-regularized and unregularized box-constrained Babai points. Then we investigate the properties of the… ▽ More

    Submitted 28 January, 2024; originally announced January 2024.

    Comments: 37 pages, 1 figure including 2 subfigures

  39. arXiv:2401.04130  [pdf, other

    cs.LG cs.AI

    Plug-and-Play Transformer Modules for Test-Time Adaptation

    Authors: Xiangyu Chang, Sk Miraj Ahmed, Srikanth V. Krishnamurthy, Basak Guler, Ananthram Swami, Samet Oymak, Amit K. Roy-Chowdhury

    Abstract: Parameter-efficient tuning (PET) methods such as LoRA, Adapter, and Visual Prompt Tuning (VPT) have found success in enabling adaptation to new domains by tuning small modules within a transformer model. However, the number of domains encountered during test time can be very large, and the data is usually unlabeled. Thus, adaptation to new domains is challenging; it is also impractical to generate… ▽ More

    Submitted 8 February, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

  40. arXiv:2401.00405  [pdf, other

    cs.CV

    Generalizing Single-View 3D Shape Retrieval to Occlusions and Unseen Objects

    Authors: Qirui Wu, Daniel Ritchie, Manolis Savva, Angel X. Chang

    Abstract: Single-view 3D shape retrieval is a challenging task that is increasingly important with the growth of available 3D data. Prior work that has studied this task has not focused on evaluating how realistic occlusions impact performance, and how shape retrieval methods generalize to scenarios where either the target 3D shape database contains unseen shapes, or the input image contains unseen objects.… ▽ More

    Submitted 31 December, 2023; originally announced January 2024.

  41. arXiv:2312.16478  [pdf, other

    cs.LG

    Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

    Authors: Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Xiaojun Chang, **gdong Wang

    Abstract: Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet are automatically harvested for training. However, it inevitably includes mismatched pairs, \ie, noisy correspondences, undermining supervision reliability and degrading performance. Current methods leverage deep ne… ▽ More

    Submitted 27 December, 2023; originally announced December 2023.

  42. arXiv:2312.15867  [pdf, other

    cs.CL cs.CR

    Punctuation Matters! Stealthy Backdoor Attack for Language Models

    Authors: Xuan Sheng, Zhicheng Li, Zhaoyang Han, Xiangmao Chang, Piji Li

    Abstract: Recent studies have pointed out that natural language processing (NLP) models are vulnerable to backdoor attacks. A backdoored model produces normal outputs on the clean samples while performing improperly on the texts with triggers that the adversary injects. However, previous studies on textual backdoor attack pay little attention to stealthiness. Moreover, some attack methods even cause grammat… ▽ More

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: NLPCC 2023

  43. arXiv:2312.13746  [pdf, other

    cs.CV

    Video Recognition in Portrait Mode

    Authors: Mingfei Han, Linjie Yang, Xiaojie **, Jiashi Feng, Xiaojun Chang, Heng Wang

    Abstract: The creation of new datasets often presents new challenges for video recognition and can inspire novel ideas while addressing these challenges. While existing datasets mainly comprise landscape mode videos, our paper seeks to introduce portrait mode videos to the research community and highlight the unique challenges associated with this video format. With the growing popularity of smartphones and… ▽ More

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: See mingfei.info/PMV for data and code information

  44. arXiv:2312.10300  [pdf, other

    cs.CV

    Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

    Authors: Mingfei Han, Linjie Yang, Xiaojun Chang, Heng Wang

    Abstract: A short clip of video may contain progression of multiple events and an interesting story line. A human need to capture both the event in every shot and associate them together to understand the story behind it. In this work, we present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic und… ▽ More

    Submitted 18 December, 2023; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: See https://mingfei.info/shot2story for updates and more information

  45. arXiv:2312.02226  [pdf, other

    cs.CV

    Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition

    Authors: Chengyou Jia, Minnan Luo, Xiaojun Chang, Zhuohang Dang, Mingfei Han, Mengmeng Wang, Guang Dai, Sizhe Dang, **gdong Wang

    Abstract: Exploring open-vocabulary video action recognition is a promising venture, which aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal in… ▽ More

    Submitted 3 December, 2023; originally announced December 2023.

  46. arXiv:2311.10101  [pdf, other

    cs.CR cs.DS cs.LG stat.ML stat.OT

    Gaussian Differential Privacy on Riemannian Manifolds

    Authors: Yangdi Jiang, Xiaotian Chang, Yi Liu, Lei Ding, Linglong Kong, Bei Jiang

    Abstract: We develop an advanced approach for extending Gaussian Differential Privacy (GDP) to general Riemannian manifolds. The concept of GDP stands out as a prominent privacy definition that strongly warrants extension to manifold settings, due to its central limit properties. By harnessing the power of the renowned Bishop-Gromov theorem in geometric analysis, we propose a Riemannian Gaussian distributio… ▽ More

    Submitted 8 November, 2023; originally announced November 2023.

  47. arXiv:2311.04150   

    cs.HC

    What Makes a Fantastic Passenger-Car Driver in Urban Contexts?

    Authors: Yueteng Yu, Zhijie Yi, Xinyu Yang, Mengdi Chu, Junrong Lu, Xiang Chang, Yiyao Liu, **gli Qin, Ye **, Jialin Song, Xingrui Gu, Jirui Yuan, Guyue Zhou, Jiangtao Gong

    Abstract: The accurate evaluation of the quality of driving behavior is crucial for optimizing and implementing autonomous driving technology in practice. However, there is no comprehensive understanding of good driving behaviors currently. In this paper, we sought to understand driving behaviors from the perspectives of both drivers and passengers. We invited 10 expert drivers and 14 novice drivers to comp… ▽ More

    Submitted 12 November, 2023; v1 submitted 7 November, 2023; originally announced November 2023.

    Comments: Part of the content of the paper will be modified. One of the authors has recommended its withdrawal due to personal reasons

  48. arXiv:2311.02401  [pdf, other

    cs.LG

    BarcodeBERT: Transformers for Biodiversity Analysis

    Authors: Pablo Millan Arias, Niousha Sadjadi, Monireh Safari, ZeMing Gong, Austin T. Wang, Scott C. Lowe, Joakim Bruslund Haurum, Iuliia Zarubiieva, Dirk Steinke, Lila Kari, Angel X. Chang, Graham W. Taylor

    Abstract: Understanding biodiversity is a global challenge, in which DNA barcodes - short snippets of DNA that cluster by species - play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across… ▽ More

    Submitted 4 November, 2023; originally announced November 2023.

    Comments: Main text: 5 pages, Total: 9 pages, 2 figures, accepted at the 4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023)

  49. arXiv:2311.01686  [pdf, other

    cs.CV cs.LG

    Disentangled Representation Learning with Transmitted Information Bottleneck

    Authors: Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Jihong Wang, Xiaojun Chang, **gdong Wang, Qinghua Zheng

    Abstract: Encoding only the task-related information from the raw data, \ie, disentangled representation learning, can greatly contribute to the robustness and generalizability of models. Although significant advances have been made by regularizing the information in representations with information theory, two major challenges remain: 1) the representation compression inevitably leads to performance drop;… ▽ More

    Submitted 2 November, 2023; originally announced November 2023.

  50. arXiv:2310.18954  [pdf, other

    cs.CV cs.AI

    Mask Propagation for Efficient Video Semantic Segmentation

    Authors: Yuetian Weng, Mingfei Han, Haoyu He, Mingjie Li, Lina Yao, Xiaojun Chang, Bohan Zhuang

    Abstract: Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence. Prior work in this field has demonstrated promising results by extending image semantic segmentation models to exploit temporal relationships across video frames; however, these approaches often incur significant computational costs. In this paper, we propose an efficient mask propagation frame… ▽ More

    Submitted 29 October, 2023; originally announced October 2023.

    Comments: NeurIPS 2023