Showing 1–50 of 512 results for author: Song, Z

  1. arXiv:2407.01079  [pdf, ps, other]

    stat.ML cs.AI cs.LG

    On Statistical Rates and Provably Efficient Criteria of Latent Diffusion Transformers (DiTs)

    Authors: Jerry Yao-Chieh Hu, Weimin Wu, Zhuoru Li, Zhao Song, Han Liu

    Abstract: We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs) under the low-dimensional linear latent space assumption. Statistically, we study the universal approximation and sample complexity of the DiTs score function, as well as the distribution recovery property of the initial data. Specifically, under mild data assumptions, we deri…

    Submitted 1 July, 2024; originally announced July 2024.

  2. arXiv:2407.00658  [pdf, other]

    cs.RO

    A Fast Online Omnidirectional Quadrupedal Jumping Framework Via Virtual-Model Control and Minimum Jerk Trajectory Generation

    Authors: Linzhu Yue, Lingwei Zhang, Zhitao Song, Hongbo Zhang, Jinhu Dong, Xuanqi Zeng, Yun-Hui Liu

    Abstract: Exploring the limits of quadruped robot agility, particularly in the context of rapid and real-time planning and execution of omnidirectional jump trajectories, presents significant challenges due to the complex dynamics involved, especially when considering significant impulse contacts. This paper introduces a new framework to enable fast, omnidirectional jumping capabilities for quadruped robots…

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: IROS 2024 paper, 7 pages, 8 figures

    MSC Class: 68T40; ACM Class: I.2.9

  3. arXiv:2406.17841  [pdf, other]

    quant-ph cs.AI

    Probing many-body Bell correlation depth with superconducting qubits

    Authors: Ke Wang, Weikang Li, Shibo Xu, Mengyao Hu, Jiachen Chen, Yaozu Wu, Chuanyu Zhang, Feitong Jin, Xuhao Zhu, Yu Gao, Ziqi Tan, Aosai Zhang, Ning Wang, Yiren Zou, Tingting Li, Fanhao Shen, Jiarun Zhong, Zehang Bao, Zitian Zhu, Zixuan Song, Jinfeng Deng, Hang Dong, Xu Zhang, Pengfei Zhang, Wenjie Jiang, et al. (10 additional authors not shown)

    Abstract: Quantum nonlocality describes a stronger form of quantum correlation than that of entanglement. It refutes Einstein's belief of local realism and is among the most distinctive and enigmatic features of quantum mechanics. It is a crucial resource for achieving quantum advantages in a variety of practical applications, ranging from cryptography and certified random number generation via self-testing…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 11 pages, 6 figures + 14 pages, 6 figures

  4. arXiv:2406.16382  [pdf, other]

    cs.CL

    UNO Arena for Evaluating Sequential Decision-Making Capability of Large Language Models

    Authors: Zhanyue Qin, Haochuan Wang, Deyuan Liu, Ziyang Song, Cunhang Fan, Zhao Lv, Jinlin Wu, Zhen Lei, Zhiying Tu, Dianhui Chu, Xiaoyan Yu, Dianbo Sui

    Abstract: Sequential decision-making refers to algorithms that take into account the dynamics of the environment, where early decisions affect subsequent decisions. With large language models (LLMs) demonstrating powerful capabilities between tasks, we can't help but ask: Can Current LLMs Effectively Make Sequential Decisions? In order to answer this question, we propose the UNO Arena based on the card game…

    Submitted 24 June, 2024; originally announced June 2024.

  5. arXiv:2406.15762  [pdf, other]

    cs.LG stat.ML

    Rethinking the Diffusion Models for Numerical Tabular Data Imputation from the Perspective of Wasserstein Gradient Flow

    Authors: Zhichao Chen, Haoxuan Li, Fangyikang Wang, Odin Zhang, Hu Xu, Xiaoyu Jiang, Zhihuan Song, Eric H. Wang

    Abstract: Diffusion models (DMs) have gained attention in Missing Data Imputation (MDI), but there remain two long-neglected issues to be addressed: (1). Inaccurate Imputation, which arises from inherently sample-diversification-pursuing generative process of DMs. (2). Difficult Training, which stems from intricate design required for the mask matrix in model training stage. To address these concerns within…

    Submitted 22 June, 2024; originally announced June 2024.

  6. arXiv:2406.14036  [pdf, other]

    cs.LG cs.AI cs.CL

    Toward Infinite-Long Prefix in Transformer

    Authors: Jiuxiang Gu, Yingyu Liang, Zhenmei Shi, Zhao Song, Chiwun Yang

    Abstract: Prompting and contextual-based fine-tuning methods, which we call Prefix Learning, have been proposed to enhance the performance of language models on various downstream tasks that can match full parameter fine-tuning. There remains a limited theoretical understanding of how these methods work. In this paper, we aim to relieve this limitation by studying the learning ability of Prefix Learning fro…

    Submitted 20 June, 2024; originally announced June 2024.

  7. arXiv:2406.13664  [pdf, other]

    cs.AI

    Root-KGD: A Novel Framework for Root Cause Diagnosis Based on Knowledge Graph and Industrial Data

    Authors: Jiyu Chen, Jinchuan Qian, Xinmin Zhang, Zhihuan Song

    Abstract: With the development of intelligent manufacturing and the increasing complexity of industrial production, root cause diagnosis has gradually become an important research direction in the field of industrial fault diagnosis. However, existing research methods struggle to effectively combine domain knowledge and industrial data, failing to provide accurate, online, and reliable root cause diagnosis…

    Submitted 19 June, 2024; originally announced June 2024.

  8. arXiv:2406.12726  [pdf, other]

    cs.SD cs.AI eess.AS

    ED-sKWS: Early-Decision Spiking Neural Networks for Rapid, and Energy-Efficient Keyword Spotting

    Authors: Zeyang Song, Qianhui Liu, Qu Yang, Yizhou Peng, Haizhou Li

    Abstract: Keyword Spotting (KWS) is essential in edge computing requiring rapid and energy-efficient responses. Spiking Neural Networks (SNNs) are well-suited for KWS for their efficiency and temporal capacity for speech. To further reduce the latency and energy consumption, this study introduces ED-sKWS, an SNN-based KWS model with an early-decision mechanism that can stop speech processing and output the…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  9. arXiv:2406.11546  [pdf, other]

    eess.AS cs.CL cs.SD

    GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

    Authors: Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

    Abstract: The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired spee…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Under review

  10. arXiv:2406.10907  [pdf, other]

    cs.CV

    SparseDet: A Simple and Effective Framework for Fully Sparse LiDAR-based 3D Object Detection

    Authors: Lin Liu, Ziying Song, Qiming Xia, Feiyang Jia, Caiyan Jia, Lei Yang, Hongyu Pan

    Abstract: LiDAR-based sparse 3D object detection plays a crucial role in autonomous driving applications due to its computational efficiency advantages. Existing methods either use the features of a single central voxel as an object proxy, or treat an aggregated cluster of foreground points as an object proxy. However, the former lacks the ability to aggregate contextual information, resulting in insufficie…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2401.02702

  11. arXiv:2406.06858  [pdf, other]

    cs.LG cs.DC

    FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

    Authors: Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

    Abstract: Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation…

    Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

  12. arXiv:2406.06619  [pdf, other]

    eess.AS cs.AI cs.CL

    LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR

    Authors: Zheshu Song, Jianheng Zhuo, Yifan Yang, Ziyang Ma, Shixiong Zhang, Xie Chen

    Abstract: Recent years have witnessed significant progress in multilingual automatic speech recognition (ASR), driven by the emergence of end-to-end (E2E) models and the scaling of multilingual datasets. Despite that, two main challenges persist in multilingual ASR: language interference and the incorporation of new languages without degrading the performance of the existing ones. This paper proposes LoRA-W…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, conference

  13. arXiv:2406.05974  [pdf, other]

    eess.IV cs.CV

    Inter-slice Super-resolution of Magnetic Resonance Images by Pre-training and Self-supervised Fine-tuning

    Authors: Xin Wang, Zhiyun Song, Yitao Zhu, Sheng Wang, Lichi Zhang, Dinggang Shen, Qian Wang

    Abstract: In clinical practice, 2D magnetic resonance (MR) sequences are widely adopted. While individual 2D slices can be stacked to form a 3D volume, the relatively large slice spacing can pose challenges for both image visualization and subsequent analysis tasks, which often require isotropic voxel spacing. To reduce slice spacing, deep-learning-based super-resolution techniques are widely investigated.…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: ISBI 2024

  14. arXiv:2406.05766  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment

    Authors: Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Jiangbin Zheng, Kaicheng Yu, Wanyu Chen, Stan Z. Li

    Abstract: Multimodal fusion breaks through the barriers between diverse modalities and has already yielded numerous impressive performances. However, in various specialized fields, it is struggling to obtain sufficient alignment data for the training process, which seriously limits the use of previously elegant models. Thus, semi-supervised learning attempts to achieve multimodal alignment with fewer matche…

    Submitted 9 June, 2024; originally announced June 2024.

  15. arXiv:2406.03136  [pdf, ps, other]

    cs.LG cs.AI cs.CC stat.ML

    Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models

    Authors: Jerry Yao-Chieh Hu, Maojiang Su, En-Jui Kuo, Zhao Song, Han Liu

    Abstract: We study the computational limits of Low-Rank Adaptation (LoRA) update for finetuning transformer-based models using fine-grained complexity theory. Our key observation is that the existence of low-rank decompositions within the gradient computation of LoRA adaptation leads to possible algorithmic speedup. This allows us to (i) identify a phase transition behavior and (ii) prove the existence of n…

    Submitted 5 June, 2024; originally announced June 2024.

  16. arXiv:2406.01124  [pdf, other]

    cs.LG cs.CL

    Latent Logic Tree Extraction for Event Sequence Explanation from LLMs

    Authors: Zitao Song, Chao Yang, Chaojie Wang, Bo An, Shuang Li

    Abstract: Modern high-stakes systems, such as healthcare or robotics, often generate vast streaming event sequences. Our goal is to design an efficient, plug-and-play tool to elicit logic tree-based explanations from Large Language Models (LLMs) to provide customized insights into each observed event sequence. Built on the temporal point process model for events, our method employs the likelihood function a…

    Submitted 28 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

  17. arXiv:2405.19647  [pdf, other]

    cs.LG

    FTS: A Framework to Find a Faithful TimeSieve

    Authors: Songning Lai, Ninghui Feng, Haochen Sui, Ze Ma, Hao Wang, Zichen Song, Hang Zhao, Yutao Yue

    Abstract: The field of time series forecasting has garnered significant attention in recent years, prompting the development of advanced models like TimeSieve, which demonstrates impressive performance. However, an analysis reveals certain unfaithfulness issues, including high sensitivity to random seeds and minute input noise perturbations. Recognizing these challenges, we embark on a quest to define the c…

    Submitted 29 May, 2024; originally announced May 2024.

  18. arXiv:2405.19265  [pdf, other]

    cs.CL

    AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

    Authors: Zifan Song, Yudong Wang, Wenwei Zhang, Kuikun Liu, Chengqi Lyu, Demin Song, Qipeng Guo, Hang Yan, Dahua Lin, Kai Chen, Cairong Zhao

    Abstract: Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enh…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Preprint with 20 pages and 20 figures. Source code and models at https://github.com/InternLM/AlchemistCoder

  19. arXiv:2405.18014  [pdf, other]

    cs.AI

    Coupled Mamba: Enhanced Multi-modal Fusion with Coupled State Space Model

    Authors: Wenbing Li, Hang Zhou, Junqing Yu, Zikai Song, Wei Yang

    Abstract: The essence of multi-modal fusion lies in exploiting the complementary information inherent in diverse modalities. However, prevalent fusion methods rely on traditional neural architectures and are inadequately equipped to capture the dynamics of interactions across modalities, particularly in presence of complex intra- and inter-modality correlations. Recent advancements in State Space Models (SS…

    Submitted 29 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

  20. arXiv:2405.17719  [pdf, other]

    cs.CV

    EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?

    Authors: Boshen Xu, Ziheng Wang, Yang Du, Zhinan Song, Sipeng Zheng, Qin Jin

    Abstract: Egocentric video-language pretraining is a crucial paradigm to advance the learning of egocentric hand-object interactions (EgoHOI). Despite the great success on existing testbeds, these benchmarks focus more on closed-set visual concepts or limited scenarios. Due to the occurrence of diverse EgoHOIs in the real world, we propose an open-vocabulary benchmark named EgoHOIBench to reveal the diminis…

    Submitted 3 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

    Comments: Code: https://github.com/xuboshen/EgoNCEpp

  21. arXiv:2405.16873  [pdf, other]

    cs.CV

    ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection

    Authors: Ziying Song, Feiyang Jia, Hongyu Pan, Yadan Luo, Caiyan Jia, Guoxin Zhang, Lin Liu, Yang Ji, Lei Yang, Li Wang

    Abstract: In the field of 3D object detection tasks, fusing heterogeneous features from LiDAR and camera sensors into a unified Bird's Eye View (BEV) representation is a widely adopted paradigm. However, existing methods are often compromised by imprecise sensor calibration, resulting in feature misalignment in LiDAR-camera BEV fusion. Moreover, such inaccuracies result in errors in depth estimation for the…

    Submitted 5 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

  22. arXiv:2405.16848  [pdf, other]

    cs.CV

    A re-calibration method for object detection with multi-modal alignment bias in autonomous driving

    Authors: Zhihang Song, Lihui Peng, Jianming Hu, Danya Yao, Yi Zhang

    Abstract: Multi-modal object detection in autonomous driving has achieved great breakthroughs due to the usage of fusing complementary information from different sensors. The calibration in fusion between sensors such as LiDAR and camera is always supposed to be precise in previous work. However, in reality, calibration matrices are fixed when the vehicles leave the factory, but vibration, bumps, and data l…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 10 pages, 6 figures

  23. arXiv:2405.16418  [pdf, other]

    cs.LG cs.AI cs.CV

    Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective

    Authors: Jiuxiang Gu, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou

    Abstract: Diffusion models have made rapid progress in generating high-quality samples across various domains. However, a theoretical understanding of the Lipschitz continuity and second momentum properties of the diffusion process is still lacking. In this paper, we bridge this gap by providing a detailed examination of these smoothness properties for the case where the target data distribution is a mixtur…

    Submitted 25 May, 2024; originally announced May 2024.

  24. arXiv:2405.16411  [pdf, other]

    cs.LG cs.AI cs.CL

    Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers

    Authors: Jiuxiang Gu, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou

    Abstract: Tensor Attention, a multi-view attention that is able to capture high-order correlations among multiple modalities, can overcome the representational limitations of classical matrix attention. However, the $\Omega(n^3)$ time complexity of tensor attention poses a significant obstacle to its practical implementation in transformers, where $n$ is the input sequence length. In this work, we prove that the…

    Submitted 25 May, 2024; originally announced May 2024.

  25. arXiv:2405.15705  [pdf, other]

    cs.AR eess.SY

    Sums: Sniffing Unknown Multiband Signals under Low Sampling Rates

    Authors: Jinbo Peng, Zhe Chen, Zheng Lin, Haoxuan Yuan, Zihan Fang, Lingzhong Bao, Zihang Song, Ying Li, Jing Ren, Yue Gao

    Abstract: Due to sophisticated deployments of all kinds of wireless networks (e.g., 5G, Wi-Fi, Bluetooth, LEO satellite, etc.), multiband signals distribute in a large bandwidth (e.g., from 70 MHz to 8 GHz). Consequently, for network monitoring and spectrum sharing applications, a sniffer for extracting physical layer information, such as structure of packet, with low sampling rate (especially, sub-Nyquist…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 12 pages, 9 figures

  26. arXiv:2405.15542  [pdf, other]

    cs.NI cs.DC cs.LG eess.SP

    SATSense: Multi-Satellite Collaborative Framework for Spectrum Sensing

    Authors: Haoxuan Yuan, Zhe Chen, Zheng Lin, Jinbo Peng, Zihan Fang, Yuhang Zhong, Zihang Song, Yue Gao

    Abstract: Low Earth Orbit satellite Internet has recently been deployed, providing worldwide service with non-terrestrial networks. With the large-scale deployment of both non-terrestrial and terrestrial networks, limited spectrum resources will not be allocated enough. Consequently, dynamic spectrum sharing is crucial for their coexistence in the same spectrum, where accurate spectrum sensing is essential.…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 13 pages, 16 figures

  27. arXiv:2405.15289  [pdf, other]

    cs.CV

    Learning Invariant Causal Mechanism from Vision-Language Models

    Authors: Zeen Song, Siyu Zhao, Xingyu Zhang, Jiangmeng Li, Changwen Zheng, Wenwen Qiang

    Abstract: Pre-trained large-scale models have become a major research focus, but their effectiveness is limited in real-world applications due to diverse data distributions. In contrast, humans excel at decision-making across various domains by learning reusable knowledge that remains invariant despite environmental changes in a complex world. Although CLIP, as a successful vision-language pre-trained model…

    Submitted 24 May, 2024; originally announced May 2024.

  28. arXiv:2405.15232  [pdf, other]

    cs.CV cs.CL

    DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception

    Authors: Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui

    Abstract: The development of large language models (LLMs) has significantly advanced the emergence of large multimodal models (LMMs). While LMMs have achieved tremendous success by promoting the synergy between multimodal comprehension and creation, they often face challenges when confronted with out-of-distribution data. This is primarily due to their reliance on image encoders trained to encode images int…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 25 pages

  29. arXiv:2405.14093  [pdf, other]

    cs.RO cs.CL cs.CV

    A Survey on Vision-Language-Action Models for Embodied AI

    Authors: Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King

    Abstract: Deep learning has demonstrated remarkable success across many domains, including computer vision, natural language processing, and reinforcement learning. Representative artificial neural networks in these fields span convolutional neural networks, Transformers, and deep Q-networks. Built upon unimodal neural networks, numerous multi-modal models have been introduced to address a range of tasks su…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 15 pages, a survey of vision-language-action models

  30. arXiv:2405.08205  [pdf, other]

    cs.LG

    Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates

    Authors: Zhenqiao Song, Yunlong Zhao, Wenxian Shi, Wengong Jin, Yang Yang, Lei Li

    Abstract: Enzymes are genetically encoded biocatalysts capable of accelerating chemical reactions. How can we automatically design functional enzymes? In this paper, we propose EnzyGen, an approach to learn a unified model to design enzymes across all functional families. Our key idea is to generate an enzyme's amino acid sequence and their three-dimensional (3D) coordinates based on functionally important…

    Submitted 17 June, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

  31. arXiv:2405.06693  [pdf, other]

    q-bio.BM cs.LG

    SurfPro: Functional Protein Design Based on Continuous Surface

    Authors: Zhenqiao Song, Tinglin Huang, Lei Li, Wengong Jin

    Abstract: How can we design proteins with desired functions? We are motivated by a chemical intuition that both geometric structure and biochemical properties are critical to a protein's function. In this paper, we propose SurfPro, a new method to generate functional proteins given a desired surface and its associated biochemical properties. SurfPro comprises a hierarchical encoder that progressively models…

    Submitted 17 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

  32. arXiv:2405.06003  [pdf, ps, other]

    stat.ML cs.LG

    Binary Hypothesis Testing for Softmax Models and Leverage Score Models

    Authors: Yeqi Gao, Yuzhou Gu, Zhao Song

    Abstract: Softmax distributions are widely used in machine learning, including Large Language Models (LLMs) where the attention unit uses softmax distributions. We abstract the attention unit as the softmax model, where given a vector input, the model produces an output drawn from the softmax distribution (which depends on the vector input). We consider the fundamental problem of binary hypothesis testing i…

    Submitted 9 May, 2024; originally announced May 2024.

  33. arXiv:2405.05219  [pdf, other]

    cs.LG cs.AI

    Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers

    Authors: Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin

    Abstract: Large Language Models (LLMs) have profoundly changed the world. Their self-attention mechanism is the key to the success of transformers in LLMs. However, the quadratic computational cost $O(n^2)$ to the length $n$ input sequence is the notorious obstacle for further improvement and scalability in the longer context. In this work, we leverage the convolution-like structure of attention matrices to…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: 55 pages

  34. arXiv:2405.03251  [pdf, ps, other]

    cs.LG cs.AI

    Exploring the Frontiers of Softmax: Provable Optimization, Applications in Diffusion Model, and Beyond

    Authors: Jiuxiang Gu, Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song

    Abstract: The softmax activation function plays a crucial role in the success of large language models (LLMs), particularly in the self-attention mechanism of the widely adopted Transformer architecture. However, the underlying learning dynamics that contribute to the effectiveness of softmax remain largely unexplored. As a step towards better understanding, this paper provides a theoretical study of the op…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: 53 pages

  35. arXiv:2405.01920  [pdf]

    cs.CV

    Lightweight Change Detection in Heterogeneous Remote Sensing Images with Online All-Integer Pruning Training

    Authors: Chengyang Zhang, Weiming Li, Gang Li, Huina Song, Zhaohui Song, Xueqian Wang, Antonio Plaza

    Abstract: Detection of changes in heterogeneous remote sensing images is vital, especially in response to emergencies like earthquakes and floods. Current homogenous transformation-based change detection (CD) methods often suffer from high computation and memory costs, which are not friendly to edge-computation devices like onboard CD devices at satellites. To address this issue, this paper proposes a new l…

    Submitted 3 May, 2024; originally announced May 2024.

  36. arXiv:2405.01053  [pdf, other]

    cs.LG cs.AI

    Explicitly Modeling Universality into Self-Supervised Learning

    Authors: Jingyao Wang, Wenwen Qiang, Zeen Song, Lingyu Si, Jiangmeng Li, Changwen Zheng, Bing Su

    Abstract: The goal of universality in self-supervised learning (SSL) is to learn universal representations from unlabeled data and achieve excellent performance on all samples and tasks. However, these methods lack explicit modeling of the universality in the learning objective, and the related theoretical understanding remains limited. This may cause models to overfit in data-scarce situations and generali…

    Submitted 23 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: 28 pages, submitted to ICML24 with 7766

  37. arXiv:2405.00734  [pdf, other]

    eess.SP cs.AI cs.LG

    EEG-MACS: Manifold Attention and Confidence Stratification for EEG-based Cross-Center Brain Disease Diagnosis under Unreliable Annotations

    Authors: Zhenxi Song, Ruihan Qin, Huixia Ren, Zhen Liang, Yi Guo, Min Zhang, Zhiguo Zhang

    Abstract: Cross-center data heterogeneity and annotation unreliability significantly challenge the intelligent diagnosis of diseases using brain signals. A notable example is the EEG-based diagnosis of neurodegenerative diseases, which features subtler abnormal neural dynamics typically observed in small-group settings. To advance this area, in this work, we introduce a transferable framework employing Mani…

    Submitted 29 April, 2024; originally announced May 2024.

  38. arXiv:2404.18074  [pdf, other]

    cs.AI cs.HC

    MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot

    Authors: Zirui Song, Yaohang Li, Meng Fang, Zhenhao Chen, Zecheng Shi, Yuan Huang, Ling Chen

    Abstract: Autonomous virtual agents are often limited by their singular mode of interaction with real-world environments, restricting their versatility. To address this, we propose the Multi-Modal Agent Collaboration framework (MMAC-Copilot), a framework utilizes the collective expertise of diverse agents to enhance interaction ability with operating systems. The framework introduces a team collaboration ch…

    Submitted 4 May, 2024; v1 submitted 28 April, 2024; originally announced April 2024.

    Comments: In processing

  39. arXiv:2404.17811  [pdf]

    cs.RO

    Efficient Bi-manipulation using RGBD Multi-model Fusion based on Attention Mechanism

    Authors: Jian Shen, Jiaxin Huang, Zhigong Song

    Abstract: Dual-arm robots have great application prospects in intelligent manufacturing due to their human-like structure when deployed with advanced intelligence algorithm. However, the previous visuomotor policy suffers from perception deficiencies in environments where features of images are impaired by the various conditions, such as abnormal lighting, occlusion and shadow etc. The Focal CVAE framework…

    Submitted 27 April, 2024; originally announced April 2024.

    Comments: 14 pages, 5 figures

  40. arXiv:2404.15954  [pdf, other]

    cs.IR cs.LG

    Mixed Supervised Graph Contrastive Learning for Recommendation

    Authors: Weizhi Zhang, Liangwei Yang, Zihe Song, Henry Peng Zou, Ke Xu, Yuanjie Zhu, Philip S. Yu

    Abstract: Recommender systems (RecSys) play a vital role in online platforms, offering users personalized suggestions amidst vast information. Graph contrastive learning aims to learn from high-order collaborative filtering signals with unsupervised augmentation on the user-item bipartite graph, which predominantly relies on the multi-task learning framework involving both the pair-wise recommendation loss…

    Submitted 25 April, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

  41. arXiv:2404.15592  [pdf, other]

    cs.CV cs.AI cs.CL cs.IR cs.LG

    ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction

    Authors: Henry Peng Zou, Vinay Samuel, Yue Zhou, Weizhi Zhang, Liancheng Fang, Zihe Song, Philip S. Yu, Cornelia Caragea

    Abstract: Existing datasets for attribute value extraction (AVE) predominantly focus on explicit attribute values while neglecting the implicit ones, lack product images, are often not publicly available, and lack an in-depth human inspection across diverse domains. To address these limitations, we present ImplicitAVE, the first, publicly available multimodal dataset for implicit attribute value extraction.…

    Submitted 23 April, 2024; originally announced April 2024.

  42. arXiv:2404.14688  [pdf, other]

    cs.LG cs.AI cs.CE math.DS math.NA

    FMint: Bridging Human Designed and Data Pretrained Models for Differential Equation Foundation Model

    Authors: Zezheng Song, Jiaxin Yuan, Haizhao Yang

    Abstract: In this paper, we propose a pre-trained foundation model FMint (Foundation Model based on Initialization), designed to speed up large-scale simulations of various differential equations with high accuracy via error correction. Human-designed simulation algorithms excel at capturing the fundamental physics of engineering problems, but often need to balan…

    Submitted 22 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

  43. arXiv:2404.13785  [pdf, ps, other]

    cs.LG

    How to Inverting the Leverage Score Distribution?

    Authors: Zhihang Li, Zhao Song, Weixin Wang, Junze Yin, Zheng Yu

    Abstract: Leverage score is a fundamental problem in machine learning and theoretical computer science. It has extensive applications in regression analysis, randomized algorithms, and neural network inversion. Despite leverage scores are widely used as a tool, in this paper, we study a novel problem, namely the inverting leverage score problem. We analyze to invert the leverage score distributions back to…

    Submitted 21 April, 2024; originally announced April 2024.

  44. arXiv:2404.13544  [pdf, other]

    cs.CR

    Faster Post-Quantum TLS 1.3 Based on ML-KEM: Implementation and Assessment

    Authors: Jieyu Zheng, Haoliang Zhu, Yifan Dong, Zhenyu Song, Zhenhao Zhang, Yafang Yang, Yunlei Zhao

    Abstract: TLS is extensively utilized for secure data transmission over networks. However, with the advent of quantum computers, the security of TLS based on traditional public-key cryptography is under threat. To counter quantum threats, it is imperative to integrate post-quantum algorithms into TLS. Most PQ-TLS research focuses on integration and evaluation, but few studies address the improvement of PQ-T…

    Submitted 22 April, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

    Comments: update the title

  45. arXiv:2404.12777  [pdf, other]

    cs.CV

    EfficientGS: Streamlining Gaussian Splatting for Large-Scale High-Resolution Scene Representation

    Authors: Wenkai Liu, Tao Guan, Bin Zhu, Lili Ju, Zikai Song, Dan Li, Yuesong Wang, Wei Yang

    Abstract: In the domain of 3D scene representation, 3D Gaussian Splatting (3DGS) has emerged as a pivotal technology. However, its application to large-scale, high-resolution scenes (exceeding 4k$\times$4k pixels) is hindered by the excessive computational requirements for managing a large number of Gaussians. Addressing this, we introduce 'EfficientGS', an advanced approach that optimizes 3DGS for high-res…

    Submitted 19 April, 2024; originally announced April 2024.

  46. arXiv:2404.10253  [pdf, other]

    cs.DC

    Kilometer-Level Coupled Modeling Using 40 Million Cores: An Eight-Year Journey of Model Development

    Authors: Xiaohui Duan, Yuxuan Li, Zhao Liu, Bin Yang, Juepeng Zheng, Haohuan Fu, Shaoqing Zhang, Shiming Xu, Yang Gao, Wei Xue, Di Wei, Xiaojing Lv, Lifeng Yan, Haopeng Huang, Haitian Lu, Lingfeng Wan, Haoran Lin, Qixin Chang, Chenlin Li, Quanjie He, Zeyu Song, Xuantong Wang, Yangyang Yu, Xilong Fan, Zhaopeng Qu, et al. (16 additional authors not shown)

    Abstract: With current and future leading systems adopting heterogeneous architectures, adapting existing models for heterogeneous supercomputers is of urgent need for improving model resolution and reducing modeling uncertainty. This paper presents our three-week effort on porting a complex earth system model, CESM 2.2, to a 40-million-core Sunway supercomputer. Taking a non-intrusive approach that tries t…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: 18 pages, 13 figures

  47. arXiv:2404.09790  [pdf, other]

    cs.CV

    NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and Results

    Authors: Zheng Chen, Zongwei Wu, Eduard Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, Hongyuan Yu, Cheng Wan, Yuxin Hong, Zhijuan Huang, Yajun Zou, Yuan Huang, Jiamin Lin, Bingnan Han, Xianyu Guan, Yongsheng Yu, Daoan Zhang, Xuanwu Yin, Kunlong Zuo, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, et al. (63 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge i…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: NTIRE 2024 webpage: https://cvlai.net/ntire/2024. Code: https://github.com/zhengchen1999/NTIRE2024_ImageSR_x4

  48. arXiv:2404.08195  [pdf, other]

    cs.CV

    Tackling Ambiguity from Perspective of Uncertainty Inference and Affinity Diversification for Weakly Supervised Semantic Segmentation

    Authors: Zhiwei Yang, Yucong Meng, Kexue Fu, Shuo Wang, Zhijian Song

    Abstract: Weakly supervised semantic segmentation (WSSS) with image-level labels intends to achieve dense tasks without laborious annotations. However, due to the ambiguous contexts and fuzzy regions, the performance of WSSS, especially the stages of generating Class Activation Maps (CAMs) and refining pseudo masks, widely suffers from ambiguity while being barely noticed by previous literature. In this wor…

    Submitted 11 April, 2024; originally announced April 2024.

  49. arXiv:2404.06668  [pdf]

    cs.LG cs.AI physics.ao-ph

    Forecasting the Future with Future Technologies: Advancements in Large Meteorological Models

    Authors: Hailong Shu, Yue Wang, Weiwei Song, Huichuang Guo, Zhen Song

    Abstract: The field of meteorological forecasting has undergone a significant transformation with the integration of large models, especially those employing deep learning techniques. This paper reviews the advancements and applications of these models in weather prediction, emphasizing their role in transforming traditional forecasting methods. Models like FourCastNet, Pangu-Weather, GraphCast, ClimaX, and…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: 5 pages

  50. arXiv:2404.02690  [pdf, other]

    cs.LG cs.AI cs.CL

    Attention is Naturally Sparse with Gaussian Distributed Input

    Authors: Yichuan Deng, Zhao Song, Chiwun Yang

    Abstract: The computational intensity of Large Language Models (LLMs) is a critical bottleneck, primarily due to the $O(n^2)$ complexity of the attention mechanism in transformer architectures. Addressing this, sparse attention emerges as a key innovation, aiming to reduce computational load while maintaining model performance. This study presents a rigorous theoretical analysis of the sparsity in attention…

    Submitted 3 April, 2024; originally announced April 2024.