Showing 1–50 of 107 results for author: Ni, Z

  1. arXiv:2406.12712  [pdf, other]

    cs.CV

    Self-Localized Collaborative Perception

    Authors: Zhenyang Ni, Zixing Lei, Yifan Lu, Dingju Wang, Chen Feng, Yanfeng Wang, Siheng Chen

    Abstract: Collaborative perception has garnered considerable attention due to its capacity to address several inherent challenges in single-agent perception, including occlusion and out-of-range issues. However, existing collaborative perception systems heavily rely on precise localization systems to establish a consistent spatial coordinate system between agents. This reliance makes them susceptible to lar…

    Submitted 18 June, 2024; originally announced June 2024.

  2. arXiv:2406.08377  [pdf, other]

    cs.CV

    DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

    Authors: Juncheng Wu, Zhangkai Ni, Hanli Wang, Wenhan Yang, Yuyin Zhou, Shiqi Wang

    Abstract: Image deep features extracted by pre-trained networks are known to contain rich and informative representations. In this paper, we present Deep Degradation Response (DDR), a method to quantify changes in image deep features under varying degradation conditions. Specifically, our approach facilitates flexible and adaptive degradation, enabling the controlled synthesis of image degradation through t…

    Submitted 12 June, 2024; originally announced June 2024.

  3. arXiv:2406.05478  [pdf, other]

    cs.CV cs.AI

    Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

    Authors: Zanlin Ni, Yulin Wang, Renping Zhou, Jiayi Guo, Jinyi Hu, Zhiyuan Liu, Shiji Song, Yuan Yao, Gao Huang

    Abstract: The field of image synthesis is currently flourishing due to the advancements in diffusion models. While diffusion models have been successful, their computational intensity has prompted the pursuit of more efficient alternatives. As a representative work, non-autoregressive Transformers (NATs) have been recognized for their rapid generation. However, a major drawback of these models is their infe…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: Accepted by CVPR2024

  4. arXiv:2406.04660  [pdf, other]

    eess.AS cs.SD

    URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

    Authors: Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

    Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generaliza…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

  5. arXiv:2406.04295  [pdf, other]

    cs.CV

    Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

    Authors: Jiayi Guo, Junhao Zhao, Chunjiang Ge, Chaoqun Du, Zanlin Ni, Shiji Song, Humphrey Shi, Gao Huang

    Abstract: Test-time adaptation (TTA) aims to enhance the performance of source-domain pretrained models when tested on unknown shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. Recently, diffusion-driven TTA methods have demonstrated strong performance by using an unconditiona…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: GitHub: https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment

  6. arXiv:2406.03287  [pdf, other]

    cs.NE cs.CL cs.LG

    SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms

    Authors: Xingrun Xing, Zheng Zhang, Ziyi Ni, Shitao Xiao, Yiming Ju, Siqi Fan, Yequan Wang, Jiajun Zhang, Guoqi Li

    Abstract: Towards energy-efficient artificial intelligence similar to the human brain, the bio-inspired spiking neural networks (SNNs) have advantages of biological plausibility, event-driven sparsity, and binary activation. Recently, large-scale language models exhibit promising generalization capability, making it a valuable issue to explore more general spike-driven models. However, the binary spikes in…

    Submitted 5 June, 2024; originally announced June 2024.

  7. arXiv:2406.02560  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Less Peaky and More Accurate CTC Forced Alignment by Label Priors

    Authors: Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur

    Abstract: Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior for CTC and improve its suitability for forced alignment generation, by leve…

    Submitted 15 June, 2024; v1 submitted 22 April, 2024; originally announced June 2024.

    Comments: Accepted by ICASSP 2024. Github repo: https://github.com/huangruizhe/audio/tree/aligner_label_priors

  8. arXiv:2406.00627  [pdf, other]

    cs.CL

    Prompt Framework for Role-playing: Generation and Evaluation

    Authors: Xun Liu, Zhengwei Ni

    Abstract: Large language models (LLM) have demonstrated remarkable abilities in generating natural language, understanding user instruction, and mimicking human language use. These capabilities have garnered considerable interest in applications such as role-playing. However, the process of collecting individual role scripts (or profiles) data and manually evaluating the performance can be costly. We introd…

    Submitted 2 June, 2024; originally announced June 2024.

  9. arXiv:2405.19765  [pdf, other]

    cs.CV cs.AI

    Towards Unified Multi-granularity Text Detection with Interactive Attention

    Authors: Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang

    Abstract: Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce "Detect Any Text" (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: ICML 2024

  10. arXiv:2405.18790  [pdf, other]

    cs.CV cs.MM eess.IV

    Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics

    Authors: Zhangkai Ni, Yue Liu, Keyan Ding, Wenhan Yang, Hanli Wang, Shiqi Wang

    Abstract: Deep learning-based methods have significantly influenced the blind image quality assessment (BIQA) field, however, these methods often require training using large amounts of human rating data. In contrast, traditional knowledge-based methods are cost-effective for training but face challenges in effectively extracting features aligned with human visual perception. To bridge these gaps, we propos…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Accepted to IEEE Transactions on Multimedia 2024

  11. arXiv:2405.06525  [pdf, other]

    cs.CV

    Semantic and Spatial Adaptive Pixel-level Classifier for Semantic Segmentation

    Authors: Xiaowen Ma, Zhenliang Ni, Xinghao Chen

    Abstract: Vanilla pixel-level classifiers for semantic segmentation are based on a certain paradigm, involving the inner product of fixed prototypes obtained from the training set and pixel features in the test image. This approach, however, encounters significant limitations, i.e., feature deviation in the semantic domain and information loss in the spatial domain. The former struggles with large intra-cla…

    Submitted 10 May, 2024; originally announced May 2024.

  12. arXiv:2405.06228  [pdf, other]

    cs.CV

    Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation

    Authors: Zhenliang Ni, Xinghao Chen, Yingjie Zhai, Yehui Tang, Yunhe Wang

    Abstract: Semantic segmentation is an important task for many applications but it is still quite challenging to achieve advanced performance with limited computational costs. In this paper, we present CGRSeg, an efficient yet competitive segmentation framework based on context-guided spatial feature reconstruction. A Rectangular Self-Calibration Module is carefully designed for spatial feature reconstructio…

    Submitted 9 May, 2024; originally announced May 2024.

  13. arXiv:2405.02965  [pdf, other]

    cs.AI cs.RO

    Robust Collaborative Perception without External Localization and Clock Devices

    Authors: Zixing Lei, Zhenyang Ni, Ruize Han, Shuo Tang, Dingju Wang, Chen Feng, Siheng Chen, Yanfeng Wang

    Abstract: A consistent spatial-temporal coordination across multiple agents is fundamental for collaborative perception, which seeks to improve perception abilities through information exchange among agents. To achieve this spatial-temporal alignment, traditional methods depend on external devices to provide localization and clock signals. However, hardware-generated signals could be vulnerable to noise and…

    Submitted 31 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

    Comments: 6pages, accepted to ICRA 2024

  14. arXiv:2404.12916  [pdf, other]

    cs.CR

    Physical Backdoor Attack can Jeopardize Driving with Vision-Large-Language Models

    Authors: Zhenyang Ni, Rui Ye, Yuxi Wei, Zhen Xiang, Yanfeng Wang, Siheng Chen

    Abstract: Vision-Large-Language-models(VLMs) have great application prospects in autonomous driving. Despite the ability of VLMs to comprehend and make decisions in complex scenarios, their integration into safety-critical autonomous driving systems poses serious security risks. In this paper, we propose BadVLMDriver, the first backdoor attack against VLMs for autonomous driving that can be launched in prac…

    Submitted 22 April, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

  15. arXiv:2404.09574  [pdf, other]

    cs.LG cs.AI

    Predicting and Analyzing Pedestrian Crossing Behavior at Unsignalized Crossings

    Authors: Chi Zhang, Janis Sprenger, Zhongjun Ni, Christian Berger

    Abstract: Understanding and predicting pedestrian crossing behavior is essential for enhancing automated driving and improving driving safety. Predicting gap selection behavior and the use of zebra crossing enables driving systems to proactively respond and prevent potential conflicts. This task is particularly challenging at unsignalized crossings due to the ambiguous right of way, requiring pedestrians to…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: 8 pages, 10 figures, 4 tables. Accepted in 2024 IEEE Intelligent Vehicles Symposium (IV)

    MSC Class: 68T40; 68T45 ACM Class: I.2.10

  16. arXiv:2403.17898  [pdf, other]

    cs.CV

    Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians

    Authors: Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, Bo Dai

    Abstract: The recent 3D Gaussian splatting (3D-GS) has shown remarkable rendering fidelity and efficiency compared to NeRF-based neural scene representations. While demonstrating the potential for real-time rendering, 3D-GS encounters rendering bottlenecks in large scenes with complex details due to an excessive number of Gaussian primitives located within the viewing frustum. This limitation is particularl…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Project page: https://city-super.github.io/octree-gs/

  17. arXiv:2403.11703  [pdf, other]

    cs.CV cs.AI

    LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

    Authors: Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang

    Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in t…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Preprint

  18. arXiv:2403.08203  [pdf, other]

    q-bio.NC cs.LG eess.IV

    Learnable Community-Aware Transformer for Brain Connectome Analysis with Token Clustering

    Authors: Yanting Yang, Beidi Zhao, Zhuohao Ni, Yize Zhao, Xiaoxiao Li

    Abstract: Neuroscientific research has revealed that the complex brain network can be organized into distinct functional communities, each characterized by a cohesive group of regions of interest (ROIs) with strong interconnections. These communities play a crucial role in comprehending the functional organization of the brain and its implications for neurological conditions, including Autism Spectrum Disor…

    Submitted 12 March, 2024; originally announced March 2024.

  19. arXiv:2403.04326  [pdf, other]

    eess.SY cs.AI cs.LG

    Edge-based Parametric Digital Twins for Intelligent Building Indoor Climate Modeling

    Authors: Zhongjun Ni, Chi Zhang, Magnus Karlsson, Shaofang Gong

    Abstract: Digital transformation in the built environment generates vast data for developing data-driven models to optimize building operations. This study presents an integrated solution utilizing edge computing, digital twins, and deep learning to enhance the understanding of climate in buildings. Parametric digital twins, created using an ontology, ensure consistent data representation across diverse ser…

    Submitted 7 March, 2024; originally announced March 2024.

    Comments: 8 pages, 8 figures, accepted in the 20th IEEE International Conference on Factory Communication Systems

    MSC Class: 68T07 ACM Class: I.5.4

  20. arXiv:2402.18192  [pdf, other]

    cs.CV eess.IV

    Misalignment-Robust Frequency Distribution Loss for Image Transformation

    Authors: Zhangkai Ni, Juncheng Wu, Zian Wang, Wenhan Yang, Hanli Wang, Lin Ma

    Abstract: This paper aims to address a common challenge in deep learning-based image transformation methods, such as image enhancement and super-resolution, which heavily rely on precisely aligned paired datasets with pixel-level alignments. However, creating precisely aligned paired images presents significant challenges and hinders the advancement of methods trained on such data. To overcome this challeng…

    Submitted 28 February, 2024; originally announced February 2024.

    Comments: Accepted to Computer Vision and Pattern Recognition Conference (CVPR) 2024

  21. arXiv:2401.09686  [pdf, other]

    eess.AS cs.SD

    An Empirical Study on the Impact of Positional Encoding in Transformer-based Monaural Speech Enhancement

    Authors: Qiquan Zhang, Meng Ge, Hongxu Zhu, Eliathamby Ambikairajah, Qi Song, Zhaoheng Ni, Haizhou Li

    Abstract: Transformer architecture has enabled recent progress in speech enhancement. Since Transformers are position-agnostic, positional encoding is the de facto standard component used to enable Transformers to distinguish the order of elements in a sequence. However, it remains unclear how positional encoding exactly impacts speech enhancement based on Transformer architectures. In this paper, we perform…

    Submitted 13 February, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  22. arXiv:2401.06411  [pdf, other]

    cs.ET

    An Efficient and Scalable Clocking Assignment Algorithm for Multi-Threaded Multi-Phase Single Flux Quantum Circuits

    Authors: Robert S. Aviles, Xi Li, Lei Lu, Zhaorui Ni, Peter A. Beerel

    Abstract: A key distinguishing feature of single flux quantum (SFQ) circuits is that each logic gate is clocked. This feature forces the introduction of path-balancing flip-flops to ensure proper synchronization of inputs at each gate. This paper proposes a polynomial time complexity approximation algorithm for clocking assignments that minimizes the insertion of path balancing buffers for multi-threaded mu…

    Submitted 12 January, 2024; originally announced January 2024.

  23. arXiv:2312.14199  [pdf, other]

    cs.CR

    Report on 2023 CyberTraining PI Meeting, 26-27 September 2023

    Authors: Geoffrey Fox, Mary P Thomas, Sajal Bhatia, Marisa Brazil, Nicole M Gasparini, Venkatesh Mohan Merwade, Henry J. Neeman, Jeff Carver, Henri Casanova, Vipin Chaudhary, Dirk Colbry, Lonnie Crosby, Prasun Dewan, Jessica Eisma, Nicole M Gasparini, Ahmed Irfan, Kate Kaehey, Qianqian Liu, Zhen Ni, Sushil Prasad, Apan Qasem, Erik Saule, Prabha Sundaravadivel, Karen Tomko

    Abstract: This document describes a two-day meeting held for the Principal Investigators (PIs) of NSF CyberTraining grants. The report covers invited talks, panels, and six breakout sessions. The meeting involved over 80 PIs and NSF program managers (PMs). The lessons recorded in detail in the report are a wealth of information that could help current and future PIs, as well as NSF PMs, understand the futur…

    Submitted 28 December, 2023; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: 38 pages, 3 main sections and 2 Appendix sections, 2 figures, 19 tables; updated version: author corrections

  24. arXiv:2312.09095  [pdf, other]

    cs.CV

    ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field

    Authors: Zhangkai Ni, Peiqi Yang, Wenhan Yang, Hanli Wang, Lin Ma, Sam Kwong

    Abstract: Neural Radiance Fields (NeRF) have demonstrated impressive potential in synthesizing novel views from dense input, however, their effectiveness is challenged when dealing with sparse input. Existing approaches that incorporate additional depth or semantic supervision can alleviate this issue to an extent. However, the process of supervision collection is not only costly but also potentially inaccu…

    Submitted 14 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

  25. arXiv:2312.08264  [pdf, other]

    eess.SP cs.LG physics.ao-ph

    Kunyu: A High-Performing Global Weather Model Beyond Regression Losses

    Authors: Zekun Ni

    Abstract: Over the past year, data-driven global weather forecasting has emerged as a new alternative to traditional numerical weather prediction. This innovative approach yields forecasts of comparable accuracy at a tiny fraction of computational costs. Regrettably, as far as I know, existing models exclusively rely on regression losses, producing forecasts with substantial blurring. Such blurring, althoug…

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: 12 pages, 5 figures

  26. arXiv:2312.06568  [pdf, other]

    cs.LG cs.AI cs.CR

    Sparse but Strong: Crafting Adversarially Robust Graph Lottery Tickets

    Authors: Subhajit Dutta Chowdhury, Zhiyu Ni, Qingyuan Peng, Souvik Kundu, Pierluigi Nuzzo

    Abstract: Graph Lottery Tickets (GLTs), comprising a sparse adjacency matrix and a sparse graph neural network (GNN), can significantly reduce the inference latency and compute footprint compared to their dense counterparts. Despite these benefits, their performance against adversarial structure perturbations remains to be fully explored. In this work, we first investigate the resilience of GLTs against dif…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: Accepted at NeurIPS 2023 GLFrontiers Workshop

  27. arXiv:2312.05966  [pdf, other]

    cs.LG cs.CV

    Fake It Till Make It: Federated Learning with Consensus-Oriented Generation

    Authors: Rui Ye, Yaxin Du, Zhenyang Ni, Siheng Chen, Yanfeng Wang

    Abstract: In federated learning (FL), data heterogeneity is one key bottleneck that causes model divergence and limits performance. Addressing this, existing methods often regard data heterogeneity as an inherent property and propose to mitigate its adverse effects by correcting models. In this paper, we seek to break this inherent property by generating data to complement the original dataset to fundamenta…

    Submitted 10 December, 2023; originally announced December 2023.

    Comments: 27 pages

  28. arXiv:2312.04410  [pdf, other]

    cs.CV

    Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

    Authors: Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, Humphrey Shi

    Abstract: Recently, diffusion models have made remarkable progress in text-to-image (T2I) generation, synthesizing images with high fidelity and diverse contents. Despite this advancement, latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves benef…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: GitHub: https://github.com/SHI-Labs/Smooth-Diffusion

  29. arXiv:2311.01092  [pdf, other]

    cs.CV

    Learning A Multi-Task Transformer Via Unified And Customized Instruction Tuning For Chest Radiograph Interpretation

    Authors: Lijian Xu, Ziyu Ni, Xinglong Liu, Xiaosong Wang, Hongsheng Li, Shaoting Zhang

    Abstract: The emergence of multi-modal deep learning models has made significant impacts on clinical applications in the last decade. However, the majority of models are limited to single-tasking, without considering disease diagnosis is indeed a multi-task procedure. Here, we demonstrate a unified transformer model specifically designed for multi-modal clinical tasks by incorporating customized instruction…

    Submitted 3 March, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

  30. arXiv:2311.00897  [pdf, other]

    cs.SD cs.CL eess.AS

    On The Open Prompt Challenge In Conditional Audio Generation

    Authors: Ernie Chang, Sidd Srinivasan, Mahi Luthra, Pin-Jie Lin, Varun Nagaraja, Forrest Iandola, Zechun Liu, Zhaoheng Ni, Changsheng Zhao, Yangyang Shi, Vikas Chandra

    Abstract: Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text. However, commercializing audio generation is challenging as user-input prompts are often under-specified when compared to text descriptions used to train TTA models. In this work, we treat TTA models as a ``blackbox'' and address the user prompt challenge with two ke…

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: 5 pages, 3 figures, 4 tables

  31. arXiv:2310.20496  [pdf, other]

    cs.LG

    BasisFormer: Attention-based Time Series Forecasting with Learnable and Interpretable Basis

    Authors: Zelin Ni, Hang Yu, Shizhan Liu, Jianguo Li, Weiyao Lin

    Abstract: Bases have become an integral part of modern deep learning-based models for time series forecasting due to their ability to act as feature extractors or future references. To be effective, a basis must be tailored to the specific set of time series data and exhibit distinct correlation with each time series within the set. However, current state-of-the-art methods are limited in their ability to s…

    Submitted 18 January, 2024; v1 submitted 31 October, 2023; originally announced October 2023.

    Comments: NeurIPS 2023(poster)

  32. arXiv:2310.19069  [pdf, other]

    cs.LG cs.DC

    Efficient Cluster Selection for Personalized Federated Learning: A Multi-Armed Bandit Approach

    Authors: Zhou Ni, Morteza Hashemi

    Abstract: Federated learning (FL) offers a decentralized training approach for machine learning models, prioritizing data privacy. However, the inherent heterogeneity in FL networks, arising from variations in data distribution, size, and device capabilities, poses challenges in user federation. Recognizing this, Personalized Federated Learning (PFL) emphasizes tailoring learning processes to individual dat…

    Submitted 29 October, 2023; originally announced October 2023.

  33. arXiv:2310.17864  [pdf, other]

    eess.AS cs.SD

    TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

    Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

    Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's devel…

    Submitted 26 October, 2023; originally announced October 2023.

  34. arXiv:2310.00746  [pdf, other]

    cs.CL cs.AI

    RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models

    Authors: Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Stephen W. Huang, Jie Fu, Junran Peng

    Abstract: The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters. However, the closed-source nature of state-of-the-art LLMs and their general-purpose training limit role-playing optimization. In this paper, we introduce RoleLLM, a framework to benchmark, elicit, and enhance role-p…

    Submitted 18 June, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: 30 pages, repo at https://github.com/InteractiveNLP-Team/RoleLLM-public

  35. arXiv:2309.10537  [pdf, other]

    eess.AS cs.MM cs.SD

    FoleyGen: Visually-Guided Audio Generation

    Authors: Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

    Abstract: Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between the high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we…

    Submitted 19 September, 2023; originally announced September 2023.

  36. arXiv:2309.08804  [pdf, other]

    eess.AS cs.SD

    Stack-and-Delay: a new codebook pattern for music generation

    Authors: Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra

    Abstract: In language modeling based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either in an auto-regressive manner or in parallel, depending on the codebook patterns. In particular, flattening the codebooks represents the highest quality decoding strategy, while being notoriously slow. To this end, we propose a novel stack-and-delay…

    Submitted 15 September, 2023; originally announced September 2023.

  37. arXiv:2309.08773  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Enhance audio generation controllability through representation similarity regularization

    Authors: Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra

    Abstract: This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the model leverages input from both textual and audio token representations to predict subsequent audio tokens. However, the current configuration lacks explicit regula…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: 5 pages

  38. arXiv:2309.07988  [pdf, other]

    cs.LG cs.AR cs.SD eess.AS

    Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition

    Authors: Yang Li, Liangzhen Lai, Yuan Shangguan, Forrest N. Iandola, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

    Abstract: Transformer-based models excel in speech recognition. Existing efforts to optimize Transformer inference, typically for long-context applications, center on simplifying attention score calculations. However, streaming speech recognition models usually process a limited number of tokens each time, making attention score calculation less of a bottleneck. Instead, the bottleneck lies in the linear pr…

    Submitted 18 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

  39. arXiv:2309.07726  [pdf, other]

    cs.RO

    GRID: Scene-Graph-based Instruction-driven Robotic Task Planning

    Authors: Zhe Ni, Xiaoxin Deng, Cong Tai, Xinyue Zhu, Qinghongbing Xie, Weihang Huang, Xiang Wu, Long Zeng

    Abstract: Recent works have shown that Large Language Models (LLMs) can facilitate the grounding of instructions for robotic task planning. Despite this progress, most existing works have primarily focused on utilizing raw images to aid LLMs in understanding environmental information. However, this approach not only limits the scope of observation but also typically necessitates extensive multimodal data co…

    Submitted 10 March, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: 8 pages, 10 figures

  40. arXiv:2308.02552  [pdf, other]

    cs.CV

    Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion

    Authors: Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian

    Abstract: Owing to the unrestricted nature of the content in the training data, large text-to-image diffusion models, such as Stable Diffusion (SD), are capable of generating images with potentially copyrighted or dangerous content based on corresponding textual concepts information. This includes specific intellectual property (IP), human faces, and various artistic styles. However, Negative Prompt, a wide…

    Submitted 7 August, 2023; v1 submitted 1 August, 2023; originally announced August 2023.

    Journal ref: ACM MM 2023

  41. A Knowledge-enhanced Two-stage Generative Framework for Medical Dialogue Information Extraction

    Authors: Zefa Hu, Ziyi Ni, Jing Shi, Shuang Xu, Bo Xu

    Abstract: This paper focuses on term-status pair extraction from medical dialogues (MD-TSPE), which is essential in diagnosis dialogue systems and the automatic scribe of electronic medical records (EMRs). In the past few years, works on MD-TSPE have attracted increasing research attention, especially after the remarkable progress made by generative methods. However, these generative methods output a whole…

    Submitted 19 February, 2024; v1 submitted 30 July, 2023; originally announced July 2023.

    Comments: Published in Machine Intelligence Research, https://link.springer.com/article/10.1007/s11633-023-1461-5

  42. arXiv:2306.06672  [pdf, other]

    cs.CL cs.AI eess.AS

    Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute

    Authors: William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) has led to great strides in speech processing. However, the resources needed to train these models has become prohibitively large as they continue to scale. Currently, only a few groups with substantial resources are capable of creating SSL models, which harms reproducibility. In this work, we optimize HuBERT SSL to fit in academic constraints. We reproduce HuBERT in…

    Submitted 11 June, 2023; originally announced June 2023.

    Comments: Accepted at INTERSPEECH 2023

  43. Human-Object Interaction Prediction in Videos through Gaze Following

    Authors: Zhifan Ni, Esteve Valls Mascaró, Hyemin Ahn, Dongheui Lee

    Abstract: Understanding the human-object interactions (HOIs) from a video is essential to fully comprehend a visual scene. This line of research has been addressed by detecting HOIs from images and lately from videos. However, the video-based HOI anticipation task in the third-person view remains understudied. In this paper, we design a framework to detect current HOIs and anticipate future HOIs in videos.…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: Accepted by CVIU https://doi.org/10.1016/j.cviu.2023.103741

  44. arXiv:2305.19972  [pdf, other]

    eess.AS cs.AI cs.CL

    VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition

    Authors: Ziyi Ni, Minglun Han, Feilong Chen, Linghui Meng, Jing Shi, Pin Lv, Bo Xu

    Abstract: Enhancing automatic speech recognition (ASR) performance by leveraging additional multimodal information has shown promising results in previous studies. However, most of these works have primarily focused on utilizing visual cues derived from human lip motions. In fact, context-dependent visual and linguistic cues can also benefit in many scenarios. In this paper, we first propose ViLaS (Vision a…

    Submitted 18 December, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: Accepted to ICASSP 2024

  45. arXiv:2305.19107  [pdf, ps, other]

    cs.CV

    Voxel2Hemodynamics: An End-to-end Deep Learning Method for Predicting Coronary Artery Hemodynamics

    Authors: Ziyu Ni, Linda Wei, Lijian Xu, Simon Yu, Qing Xia, Hongsheng Li, Shaoting Zhang

    Abstract: Local hemodynamic forces play an important role in determining the functional significance of coronary arterial stenosis and understanding the mechanism of coronary disease progression. Computational fluid dynamics (CFD) have been widely performed to simulate hemodynamics non-invasively from coronary computed tomography angiography (CCTA) images. However, accurate computational analysis is still l…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: 8 pages

  46. arXiv:2305.13516  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling Speech Technology to 1,000+ Languages

    Authors: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

    Abstract: Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on…

    Submitted 22 May, 2023; originally announced May 2023.

  47. arXiv:2305.08541  [pdf, other]

    cs.SD eess.AS

    Ripple sparse self-attention for monaural speech enhancement

    Authors: Qiquan Zhang, Hongxu Zhu, Qi Song, Xinyuan Qian, Zhaoheng Ni, Haizhou Li

    Abstract: The use of Transformer represents a recent success in speech enhancement. However, as its core component, self-attention suffers from quadratic complexity, which is computationally prohibited for long speech recordings. Moreover, it allows each time frame to attend to all time frames, neglecting the strong local correlations of speech signals. This study presents a simple yet effective sparse self…

    Submitted 15 May, 2023; originally announced May 2023.

    Comments: 5 pages, ICASSP 2023 published

  48. arXiv:2305.07437  [pdf, other]

    cs.LG cs.CV

    Continual Vision-Language Representation Learning with Off-Diagonal Information

    Authors: Zixuan Ni, Longhui Wei, Siliang Tang, Yueting Zhuang, Qi Tian

    Abstract: Large-scale multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training. However, these samples are always collected continuously in real scenarios. This paper discusses the feasibility of continual CLIP training using streaming data. Unlike continual learning based on self-supervised learning methods for pure images, which is empirical…

    Submitted 1 June, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Journal ref: ICML 2023

  49. arXiv:2305.05898  [pdf, other]

    cs.AI cs.MA cs.NE

    Mixture of personality improved Spiking actor network for efficient multi-agent cooperation

    Authors: Xiyun Li, Ziyi Ni, Jingqing Ruan, Linghui Meng, Jing Shi, Tielin Zhang, Bo Xu

    Abstract: Adaptive human-agent and agent-agent cooperation are becoming more and more critical in the research area of multi-agent reinforcement learning (MARL), where remarked progress has been made with the help of deep neural networks. However, many established algorithms can only perform well during the learning paradigm but exhibit poor generalization during cooperation with other unseen partners. The…

    Submitted 10 May, 2023; originally announced May 2023.

    Comments: 20 pages, 7 figures

  50. Leveraging Deep Learning and Digital Twins to Improve Energy Performance of Buildings

    Authors: Zhongjun Ni, Chi Zhang, Magnus Karlsson, Shaofang Gong

    Abstract: Digital transformation in buildings accumulates massive operational data, which calls for smart solutions to utilize these data to improve energy performance. This study has proposed a solution, namely Deep Energy Twin, for integrating deep learning and digital twins to better understand building energy use and identify the potential for improving energy efficiency. Ontology was adopted to create…

    Submitted 16 May, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

    Comments: 6 pages, 5 figures, accepted in the 3rd IEEE International Conference on Industrial Electronics for Sustainable Energy Systems

    MSC Class: 68T07; ACM Class: I.5.4