Showing 1–50 of 204 results for author: Zhao, F

  1. arXiv:2406.19645

    cs.NE

    Directly Training Temporal Spiking Neural Network with Sparse Surrogate Gradient

    Authors: Yang Li, Feifei Zhao, Dongcheng Zhao, Yi Zeng

    Abstract: Brain-inspired Spiking Neural Networks (SNNs) have attracted much attention due to their event-based computing and energy-efficient features. However, the spiking all-or-none nature has prevented direct training of SNNs for various applications. The surrogate gradient (SG) algorithm has recently enabled spiking neural networks to shine in neuromorphic hardware. However, introducing surrogate gradi…

    Submitted 28 June, 2024; originally announced June 2024.

  2. arXiv:2406.19632

    cs.CV

    PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation

    Authors: Deyi Ji, Wenwei Jin, Hongtao Lu, Feng Zhao

    Abstract: The ascension of Unmanned Aerial Vehicles (UAVs) in various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter as they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these is…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: IJCAI 2024

  3. arXiv:2406.19030

    cs.CV

    Using diffusion model as constraint: Empower Image Restoration Network Training with Diffusion Model

    Authors: Jiangtong Tan, Feng Zhao

    Abstract: Image restoration has made marvelous progress with the advent of deep learning. Previous methods usually rely on designing powerful network architecture to elevate performance, however, the natural visual effect of the restored results is limited by color and texture distortions. Besides the visual perceptual quality, the semantic perception recovery is an important but often overlooked perspectiv…

    Submitted 27 June, 2024; originally announced June 2024.

  4. arXiv:2406.19008

    cs.DS

    VertiMRF: Differentially Private Vertical Federated Data Synthesis

    Authors: Fangyuan Zhao, Zitao Li, Xuebin Ren, Bolin Ding, Shusen Yang, Yaliang Li

    Abstract: Data synthesis is a promising solution to share data for various downstream analytic tasks without exposing raw data. However, without a theoretical privacy guarantee, a synthetic dataset would still leak some sensitive information. Differential privacy is thus widely adopted to safeguard data synthesis by strictly limiting the released information. This technique is advantageous yet presents sign…

    Submitted 27 June, 2024; originally announced June 2024.

  5. arXiv:2406.14965

    cs.NI

    Energy-Aware Random Access Networks: Connection-Based versus Packet-Based

    Authors: Anshan Yuan, Fangming Zhao, Xinghua Sun

    Abstract: Characterizing and comparing the optimal energy efficiency in energy-aware machine-to-machine (M2M) random access networks remains a challenge due to the distributed nature of the access behavior of nodes. To address this issue, this letter focuses on the energy efficiency limits of two typical random access schemes, i.e., connection-based Aloha and packet-based Aloha, based on which we conducted…

    Submitted 21 June, 2024; originally announced June 2024.

  6. arXiv:2406.14795

    cs.RO eess.SY

    Design and Control of a Low-cost Non-backdrivable End-effector Upper Limb Rehabilitation Device

    Authors: Fulan Li, Yunfei Guo, Wenda Xu, Weide Zhang, Fangyun Zhao, Baiyu Wang, Huaguang Du, Chengkun Zhang

    Abstract: This paper presents the development of an upper limb end-effector based rehabilitation device for stroke patients, offering assistance or resistance along any 2-dimensional trajectory during physical therapy. It employs a non-backdrivable ball-screw-driven mechanism for enhanced control accuracy. The control system features three novel algorithms: First, the Implicit Euler velocity control algorit…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: 12 pages, 15 figures

  7. arXiv:2406.12030

    cs.CV cs.AI cs.CL

    SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model

    Authors: Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, Jing Shao

    Abstract: The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To addr…

    Submitted 17 June, 2024; originally announced June 2024.

  8. arXiv:2406.10475

    cs.CV

    Discrete Latent Perspective Learning for Segmentation and Detection

    Authors: Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, Jieping Ye

    Abstract: In this paper, we address the challenge of Perspective-Invariant Learning in machine learning and computer vision, which involves enabling a network to understand images from varying perspectives to achieve consistent semantic interpretation. While standard approaches rely on the labor-intensive collection of multi-view images or limited data augmentation techniques, we propose a novel framework,…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: ICML 2024 Spotlight

  9. arXiv:2406.04325

    cs.CV

    ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

    Authors: Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

    Abstract: We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating st…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://sharegpt4video.github.io/

  10. arXiv:2405.18326

    cs.CV

    VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

    Authors: Jun Zheng, Fuwei Zhao, Youjiang Xu, Xin Dong, Xiaodan Liang

    Abstract: Video try-on stands as a promising area for its tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, while underperforming on casually captured videos. Recently, Sora revealed the scalability of Diffusion Transformer (DiT) in generating lifelike videos featuring real-world scenarios. Inspired by this,…

    Submitted 7 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Project Page: https://zhengjun-ai.github.io/viton-dit-page/

  11. arXiv:2405.17830

    cs.CL

    More Than Catastrophic Forgetting: Integrating General Capabilities For Domain-Specific LLMs

    Authors: Chengyuan Liu, Shihang Wang, Yangyang Kang, Lizhi Qing, Fubang Zhao, Changlong Sun, Kun Kuang, Fei Wu

    Abstract: The performance on general tasks decreases after Large Language Models (LLMs) are fine-tuned on domain-specific tasks, the phenomenon is known as Catastrophic Forgetting (CF). However, this paper presents a further challenge for real application of domain-specific LLMs beyond CF, called General Capabilities Integration (GCI), which necessitates the integration of both the general capabilities and…

    Submitted 28 May, 2024; originally announced May 2024.

  12. arXiv:2405.16451

    cs.CV

    From Macro to Micro: Boosting micro-expression recognition via pre-training on macro-expression videos

    Authors: Hanting Li, Hongjing Niu, Feng Zhao

    Abstract: Micro-expression recognition (MER) has drawn increasing attention in recent years due to its potential applications in intelligent medical and lie detection. However, the shortage of annotated data has been the major obstacle to further improve deep-learning based MER methods. Intuitively, utilizing sufficient macro-expression data to promote MER performance seems to be a feasible solution. Howeve…

    Submitted 4 June, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

    Comments: 18 pages

  13. arXiv:2405.14129

    cs.CL cs.AI cs.CV

    AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

    Authors: Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, Xinyu Dai

    Abstract: Multimodal Large Language Models (MLLMs) are widely regarded as crucial in the exploration of Artificial General Intelligence (AGI). The core of MLLMs lies in their capability to achieve cross-modal alignment. To attain this goal, current MLLMs typically follow a two-phase training paradigm: the pre-training phase and the instruction-tuning phase. Despite their success, there are shortcomings in t…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Code and models are available at https://aligngpt-vl.github.io/

  14. arXiv:2405.09310

    cs.RO

    GrainGrasp: Dexterous Grasp Generation with Fine-grained Contact Guidance

    Authors: Fuqiang Zhao, Dzmitry Tsetserukou, Qian Liu

    Abstract: One goal of dexterous robotic grasping is to allow robots to handle objects with the same level of flexibility and adaptability as humans. However, it remains a challenging task to generate an optimal grasping strategy for dexterous hands, especially when it comes to delicate manipulation and accurate adjustment the desired grasping poses for objects of varying shapes and sizes. In this paper, we…

    Submitted 15 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

    Comments: This paper is accepted by the ICRA2024

  15. arXiv:2404.09748

    cs.CV cs.GR

    LetsGo: Large-Scale Garage Modeling and Rendering via LiDAR-Assisted Gaussian Primitives

    Authors: Jiadi Cui, Junming Cao, Fuqiang Zhao, Zhipeng He, Yifan Chen, Yuhui Zhong, Lan Xu, Yujiao Shi, Yingliang Zhang, Jingyi Yu

    Abstract: Large garages are ubiquitous yet intricate scenes that present unique challenges due to their monotonous colors, repetitive patterns, reflective surfaces, and transparent vehicle glass. Conventional Structure from Motion (SfM) methods for camera pose estimation and 3D reconstruction often fail in these environments due to poor correspondence construction. To address these challenges, we introduce…

    Submitted 21 May, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

    Comments: Project Page: https://zhaofuq.github.io/LetsGo/

  16. arXiv:2404.01154

    cs.CV cs.AI

    Uncovering the Text Embedding in Text-to-Image Diffusion Models

    Authors: Hu Yu, Hao Luo, Fan Wang, Feng Zhao

    Abstract: The correspondence between input text and the generated image exhibits opacity, wherein minor textual modifications can induce substantial deviations in the generated image. While, text embedding, as the pivotal intermediary between text and images, remains relatively underexplored. In this paper, we address this research gap by delving into the text embedding space, unleashing its capacity for co…

    Submitted 1 April, 2024; originally announced April 2024.

  17. arXiv:2403.20330

    cs.CV

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

    Abstract: Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomeno…

    Submitted 9 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: Project page: https://mmstar-benchmark.github.io/

  18. arXiv:2403.19655

    cs.CV

    GaussianCube: A Structured and Explicit Radiance Representation for 3D Generative Modeling

    Authors: Bowen Zhang, Yiji Cheng, Jiaolong Yang, Chunyu Wang, Feng Zhao, Yansong Tang, Dong Chen, Baining Guo

    Abstract: We introduce a radiance representation that is both structured and fully explicit and thus greatly facilitates 3D generative modeling. Existing radiance representations either require an implicit feature decoder, which significantly degrades the modeling power of the representation, or are spatially unstructured, making them difficult to integrate with mainstream 3D diffusion methods. We derive Ga…

    Submitted 23 May, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: Update for digital avatar creation and text-to-3D synthesis; Project Page: https://gaussiancube.github.io/

  19. arXiv:2403.16037

    cs.IR

    Knowledge-aware Dual-side Attribute-enhanced Recommendation

    Authors: Taotian Pang, Xingyu Lou, Fei Zhao, Zhen Wu, Kuiyao Dong, Qiuying Peng, Yue Qi, Xinyu Dai

    Abstract: Knowledge-aware recommendation methods (KGR) based on graph neural networks (GNNs) and contrastive learning (CL) have achieved promising performance. However, they fall short in modeling fine-grained user preferences and further fail to leverage the preference-attribute connection to make predictions, leading to sub-optimal performance. To address the issue, we…

    Submitted 24 March, 2024; originally announced March 2024.

  20. arXiv:2403.15317

    cs.CV cs.AI

    Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection

    Authors: Hongzhi Gao, Zheng Chen, Zehui Chen, Lin Chen, Jiaming Liu, Shanghang Zhang, Feng Zhao

    Abstract: Training high-accuracy 3D detectors necessitates massive labeled 3D annotations with 7 degree-of-freedom, which is laborious and time-consuming. Therefore, the form of point annotations is proposed to offer significant prospects for practical applications in 3D detection, which is not only more accessible and less expensive but also provides strong spatial information for object localization. In t…

    Submitted 25 March, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

    Comments: Accepted by AAAI2024

  21. arXiv:2403.12881

    cs.CL

    Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

    Authors: Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, Feng Zhao

    Abstract: Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks, however, they are still far inferior to API-based models when acting as agents. How to integrate agent ability into general LLMs becomes a crucial and urgent problem. This paper first delivers three key observations: (1) the current agent training corpus is entangled with both formats following and agent re…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Technical Report

  22. arXiv:2403.10830

    cs.CV

    View-Centric Multi-Object Tracking with Homographic Matching in Moving UAV

    Authors: Deyi Ji, Siqi Gao, Lanyun Zhu, Qi Zhu, Yiru Zhao, Peng Xu, Hongtao Lu, Feng Zhao, Jieping Ye

    Abstract: In this paper, we address the challenge of multi-object tracking (MOT) in moving Unmanned Aerial Vehicle (UAV) scenarios, where irregular flight trajectories, such as hovering, turning left/right, and moving up/down, lead to significantly greater complexity compared to fixed-camera MOT. Specifically, changes in the scene background not only render traditional frame-to-frame object IOU association…

    Submitted 14 May, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

  23. arXiv:2403.09057

    cs.CL cs.AI

    A Continued Pretrained LLM Approach for Automatic Medical Note Generation

    Authors: Dong Yuan, Eti Rastogi, Gautam Naik, Sree Prasanna Rajagopal, Sagar Goyal, Fen Zhao, Bharath Chintagunta, Jeff Ward

    Abstract: LLMs are revolutionizing NLP tasks. However, the use of the most advanced LLMs, such as GPT-4, is often prohibitively expensive for most specialized fields. We introduce HEAL, the first continuously trained 13B LLaMA2-based LLM that is purpose-built for medical conversations and measured on automated scribing. Our results demonstrate that HEAL outperforms GPT-4 and PMC-LLaMA in PubMedQA, with an a…

    Submitted 3 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: Accepted to NAACL 2024

  24. arXiv:2403.06414

    cs.CL

    Evolving Knowledge Distillation with Large Language Models and Active Learning

    Authors: Chengyuan Liu, Yangyang Kang, Fubang Zhao, Kun Kuang, Zhuoren Jiang, Changlong Sun, Fei Wu

    Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various NLP tasks. However, their computational costs are prohibitively high. To address this issue, previous research has attempted to distill the knowledge of LLMs into smaller models by generating annotated data. Nonetheless, these works have mainly focused on the direct use of LLMs for text generation and labeling, w…

    Submitted 10 March, 2024; originally announced March 2024.

    Comments: Accepted by COLING 2024

  25. arXiv:2402.18784

    cs.AI q-bio.NC

    Brain-inspired and Self-based Artificial Intelligence

    Authors: Yi Zeng, Feifei Zhao, Yuxuan Zhao, Dongcheng Zhao, Enmeng Lu, Qian Zhang, Yuwei Wang, Hui Feng, Zhuoya Zhao, Jihang Wang, Qingqun Kong, Yinqian Sun, Yang Li, Guobin Shen, Bing Han, Yiting Dong, Wenxuan Pan, Xiang He, Aorigele Bao, Jin Wang

    Abstract: The question "Can machines think?" and the Turing Test to assess whether machines could achieve human-level intelligence is one of the roots of AI. With the philosophical argument "I think, therefore I am", this paper challenge the idea of a "thinking machine" supported by current AIs since there is no sense of self in them. Current artificial intelligence is only seemingly intelligent information…

    Submitted 28 February, 2024; originally announced February 2024.

  26. arXiv:2402.18021

    cs.RO

    Online Time-Optimal Trajectory Generation for Two Quadrotors with Multi-Waypoints Constraints

    Authors: Fangguo Zhao, Jiahao Mei, Jin Zhou, Jiming Chen, Shuo Li

    Abstract: The autonomous quadrotor's flying speed has kept increasing in the past 5 years, especially in the field of autonomous drone racing. However, the majority of the research mainly focuses on the aggressive flight of a single quadrotor. In this letter, we propose a novel method called Pairwise Model Predictive Control (PMPC) that can guide two quadrotors online to fly through the waypoints with minim…

    Submitted 27 February, 2024; originally announced February 2024.

  27. arXiv:2402.17411   

    cs.CL

    Consistency Matters: Explore LLMs Consistency From a Black-Box Perspective

    Authors: Fufangchen Zhao, Guoqiang Jin, Jiaheng Huang, Rui Zhao, Fei Tan

    Abstract: Nowadays both commercial and open-source academic LLM have become the mainstream models of NLP. However, there is still a lack of research on LLM consistency, meaning that throughout the various stages of LLM research and deployment, its internal parameters and capabilities should remain unchanged. This issue exists in both the industrial and academic sectors. The solution to this problem is often…

    Submitted 2 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

    Comments: This paper is not ready

  28. arXiv:2402.17319

    cs.CV

    A Vanilla Multi-Task Framework for Dense Visual Prediction Solution to 1st VCL Challenge -- Multi-Task Robustness Track

    Authors: Zehui Chen, Qiuchen Wang, Zhenyu Li, Jiaming Liu, Shanghang Zhang, Feng Zhao

    Abstract: In this report, we present our solution to the multi-task robustness track of the 1st Visual Continual Learning (VCL) Challenge at ICCV 2023 Workshop. We propose a vanilla framework named UniNet that seamlessly combines various visual perception algorithms into a multi-task model. Specifically, we choose DETR3D, Mask2Former, and BinsFormer for 3D object detection, instance segmentation, and depth…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: Technical Report

  29. arXiv:2402.16980

    cs.CV

    Saliency-Aware Automatic Buddhas Statue Recognition

    Authors: Yong Qi, Fanghan Zhao

    Abstract: Buddha statues, as a symbol of many religions, have significant cultural implications that are crucial for understanding the culture and history of different regions, and the recognition of Buddha statues is therefore the pivotal link in the field of Buddha study. However, the Buddha statue recognition requires extensive time and effort from knowledgeable professionals, making it a costly task to…

    Submitted 26 February, 2024; originally announced February 2024.

  30. arXiv:2402.11572

    cs.CL

    Cobra Effect in Reference-Free Image Captioning Metrics

    Authors: Zheng Ma, Changxin Wang, Yawen Ouyang, Fei Zhao, Jianbing Zhang, Shujian Huang, Jiajun Chen

    Abstract: Evaluating the compatibility between textual descriptions and corresponding images represents a core endeavor within multi-modal research. In recent years, a proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged. Empirical evidence has substantiated that these innovative approaches exhibit a higher correlation with human judgment, marking a sign…

    Submitted 18 February, 2024; originally announced February 2024.

    Comments: pre-print version

  31. arXiv:2402.11570

    cs.RO

    Imitation Learning-Based Online Time-Optimal Control with Multiple-Waypoint Constraints for Quadrotors

    Authors: Jin Zhou, Jiahao Mei, Fangguo Zhao, Jiming Chen, Shuo Li

    Abstract: Over the past decade, there has been a remarkable surge in utilizing quadrotors for various purposes due to their simple structure and aggressive maneuverability, such as search and rescue, delivery and autonomous drone racing, etc. One of the key challenges preventing quadrotors from being widely used in these scenarios is online waypoint-constrained time-optimal trajectory generation and control…

    Submitted 18 February, 2024; originally announced February 2024.

  32. arXiv:2402.09801

    cs.CL cs.CV

    EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models

    Authors: Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, Xinyu Dai

    Abstract: Multimodal large language models (MLLMs) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. To eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations, and then employ various alignment algor…

    Submitted 23 June, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

  33. arXiv:2402.06326

    cs.AI cs.LG cs.SI

    Prompt Learning on Temporal Interaction Graphs

    Authors: Xi Chen, Siwei Zhang, Yun Xiong, Xixi Wu, Jiawei Zhang, Xiangguo Sun, Yao Zhang, Feng Zhao, Yulin Kang

    Abstract: Temporal Interaction Graphs (TIGs) are widely utilized to represent real-world systems. To facilitate representation learning on TIGs, researchers have proposed a series of TIG models. However, these models are still facing two tough gaps between the pre-training and downstream predictions in their "pre-train, predict" training paradigm. First, the temporal discrepancy between the pre-training a…

    Submitted 6 March, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

    Comments: 11 pages, 8 figures

  34. arXiv:2402.03082

    cs.CV cs.LG

    Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual Text Processing

    Authors: Yan Shu, Weichao Zeng, Zhenhang Li, Fangmin Zhao, Yu Zhou

    Abstract: Visual text, a pivotal element in both document and scene images, speaks volumes and attracts significant attention in the computer vision domain. Beyond visual text detection and recognition, the field of visual text processing has experienced a surge in research, driven by the advent of fundamental generative models. However, challenges persist due to the unique properties and features that dist…

    Submitted 5 February, 2024; originally announced February 2024.

  35. arXiv:2401.11880

    cs.CL cs.AI cs.CR cs.MA

    PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety

    Authors: Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao

    Abstract: Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence. However, the potential misuse of this intelligence for malicious purposes presents significant risks. To date, comprehensive research on the safety issues associated with multi-agent systems remains limited. In this paper, we explore these concerns through the innovative…

    Submitted 17 February, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

  36. arXiv:2401.09112

    cs.CV

    Stream Query Denoising for Vectorized HD Map Construction

    Authors: Shuo Wang, Fan Jia, Yingfei Liu, Yucheng Zhao, Zehui Chen, Tiancai Wang, Chi Zhang, Xiangyu Zhang, Feng Zhao

    Abstract: To enhance perception performance in complex and extensive scenarios within the realm of autonomous driving, there has been a noteworthy focus on temporal modeling, with a particular emphasis on streaming methods. The prevailing trend in streaming models involves the utilization of stream queries for the propagation of temporal information. Despite the prevalence of this approach, the direct appli…

    Submitted 17 January, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

  37. arXiv:2312.17428

    cs.CV

    ChangeNet: Multi-Temporal Asymmetric Change Detection Dataset

    Authors: Deyi Ji, Siqi Gao, Mingyuan Tao, Hongtao Lu, Feng Zhao

    Abstract: Change Detection (CD) has been attracting extensive interests with the availability of bi-temporal datasets. However, due to the huge cost of multi-temporal images acquisition and labeling, existing change detection datasets are small in quantity, short in temporal, and low in practicability. Therefore, a large-scale practical-oriented dataset covering wide temporal phases is urgently needed to fa…

    Submitted 11 April, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024 Oral/Lecture

  38. arXiv:2312.16391

    cs.RO

    Toward Spatial Temporal Consistency of Joint Visual Tactile Perception in VR Applications

    Authors: Fuqiang Zhao, Kehan Zhang, Qian Liu, Zhuoyi Lyu

    Abstract: With the development of VR technology, especially the emergence of the metaverse concept, the integration of visual and tactile perception has become an expected experience in human-machine interaction. Therefore, achieving spatial-temporal consistency of visual and tactile information in VR applications has become a necessary factor for realizing this experience. The state-of-the-art vibrotactile…

    Submitted 28 December, 2023; v1 submitted 26 December, 2023; originally announced December 2023.

    Comments: This paper is accepted by the IEEE Haptic Symposium 2024

  39. arXiv:2312.14862

    cs.CL cs.AI

    YAYI 2: Multilingual Open-Source Large Language Models

    Authors: Yin Luo, Qingchao Kong, Nan Xu, Jia Cao, Bao Hao, Baoyu Qu, Bo Chen, Chao Zhu, Chenyang Zhao, Donglei Zhang, Fan Feng, Feifei Zhao, Hailong Sun, Hanxuan Yang, Haojun Pan, Hongyu Liu, Jianbin Guo, Jiangtao Du, Jingyi Wang, Junfeng Li, Lei Sun, Liduo Liu, Lifeng Dong, Lili Liu, Lin Wang, et al. (28 additional authors not shown)

    Abstract: As the latest advancements in natural language processing, large language models (LLMs) have achieved human-level language understanding and generation abilities in many real-world tasks, and even have been regarded as a potential path to the artificial general intelligence. To better facilitate research on LLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been proposed and ga…

    Submitted 22 December, 2023; originally announced December 2023.

  40. arXiv:2312.14033

    cs.CL

    T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

    Authors: Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo, Songyang Zhang, Dahua Lin, Kai Chen, Feng Zhao

    Abstract: Large language models (LLM) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet, how to evaluate and analyze the tool-utilization capability of LLMs is still under-explored. In contrast to previous works that evaluate models holistically, we comprehensively decompose the tool utilization into multiple sub-processes, including instructi…

    Submitted 14 January, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: Project: https://open-compass.github.io/T-Eval

  41. arXiv:2312.10888

    cs.NI

    Age-Threshold Slotted ALOHA for Optimizing Information Freshness in Mobile Networks

    Authors: Fangming Zhao, Nikolaos Pappas, Chuan Ma, Xinghua Sun, Tony Q. S. Quek, Howard H. Yang

    Abstract: We optimize the Age of Information (AoI) in mobile networks using the age-threshold slotted ALOHA (TSA) protocol. The network comprises multiple source-destination pairs, where each source sends a sequence of status update packets to its destination over a shared spectrum. The TSA protocol stipulates that a source node must remain silent until its AoI reaches a predefined threshold, after which th…

    Submitted 5 June, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

    Comments: 21 pages. Update version after peer review

  42. arXiv:2312.10321

    cs.DB cs.CL

    LLM-SQL-Solver: Can LLMs Determine SQL Equivalence?

    Authors: Fuheng Zhao, Lawrence Lim, Ishtiyaque Ahmad, Divyakant Agrawal, Amr El Abbadi

    Abstract: Judging the equivalence between two SQL queries is a fundamental problem with many practical applications in data management and SQL generation (i.e., evaluating the quality of generated SQL queries in text-to-SQL task). While the research community has reasoned about SQL equivalence for decades, it poses considerable difficulties and no complete solutions exist. Recently, Large Language Models (L…

    Submitted 19 June, 2024; v1 submitted 16 December, 2023; originally announced December 2023.

  43. arXiv:2311.12793  [pdf, other

    cs.CV

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin

    Abstract: In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world kno… ▽ More

    Submitted 28 November, 2023; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Project: https://ShareGPT4V.github.io

  44. arXiv:2311.06015  [pdf

    cs.RO cs.AI

    RSG: Fast Learning Adaptive Skills for Quadruped Robots by Skill Graph

    Authors: Hongyin Zhang, Diyuan Shi, Zifeng Zhuang, Han Zhao, Zhenyu Wei, Feng Zhao, Sibo Gai, Shangke Lyu, Donglin Wang

    Abstract: Developing robotic intelligent systems that can adapt quickly to unseen wild situations is one of the critical challenges in pursuing autonomous robotics. Although some impressive progress has been made in walking stability and skill learning in the field of legged robots, their ability to fast adaptation is still inferior to that of animals in nature. Animals are born with massive skills needed t… ▽ More

    Submitted 10 November, 2023; originally announced November 2023.

  45. arXiv:2311.03150  [pdf, other

    cs.MA

    A Brain-inspired Theory of Collective Mind Model for Efficient Social Cooperation

    Authors: Zhuoya Zhao, Feifei Zhao, Shiwen Wang, Yinqian Sun, Yi Zeng

    Abstract: Social intelligence manifests the capability, often referred to as the Theory of Mind (ToM), to discern others' behavioral intentions, beliefs, and other mental states. ToM is especially important in multi-agent and human-machine interaction environments because each agent needs to understand the mental states of other agents in order to better respond, interact, and collaborate. Recent research i… ▽ More

    Submitted 7 November, 2023; v1 submitted 6 November, 2023; originally announced November 2023.

  46. arXiv:2310.14605  [pdf, other

    cs.CL cs.MM

    M2DF: Multi-grained Multi-curriculum Denoising Framework for Multimodal Aspect-based Sentiment Analysis

    Authors: Fei Zhao, Chunhui Li, Zhen Wu, Yawen Ouyang, Jianbing Zhang, Xinyu Dai

    Abstract: Multimodal Aspect-based Sentiment Analysis (MABSA) is a fine-grained Sentiment Analysis task, which has attracted growing research interests recently. Existing work mainly utilizes image information to improve the performance of MABSA task. However, most of the studies overestimate the importance of images since there are many noise images unrelated to the text in the dataset, which will have a ne… ▽ More

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: Accepted by EMNLP 2023

  47. arXiv:2310.13951  [pdf, other

    cs.CV

    Fuzzy-NMS: Improving 3D Object Detection with Fuzzy Classification in NMS

    Authors: Li Wang, Xinyu Zhang, Fachuan Zhao, Chuze Wu, Yichen Wang, Ziying Song, Lei Yang, Jun Li, Huaping Liu

    Abstract: Non-maximum suppression (NMS) is an essential post-processing module used in many 3D object detection frameworks to remove overlapping candidate bounding boxes. However, an overreliance on classification scores and difficulties in determining appropriate thresholds can affect the resulting accuracy directly. To address these issues, we introduce fuzzy learning into NMS and propose a novel generali… ▽ More

    Submitted 21 October, 2023; originally announced October 2023.

  48. arXiv:2310.08864  [pdf, other

    cs.RO

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, et al. (267 additional authors not shown)

    Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method… ▽ More

    Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: Project website: https://robotics-transformer-x.github.io

  49. arXiv:2310.08442  [pdf, other

    cs.CV cs.AI

    Debias the Training of Diffusion Models

    Authors: Hu Yu, Li Shen, Jie Huang, Man Zhou, Hongsheng Li, Feng Zhao

    Abstract: Diffusion models have demonstrated compelling generation quality by optimizing the variational lower bound through a simple denoising score matching loss. In this paper, we provide theoretical evidence that the prevailing practice of using a constant loss weight strategy in diffusion models leads to biased estimation during the training phase. Simply optimizing the denoising network to predict Gau… ▽ More

    Submitted 3 November, 2023; v1 submitted 12 October, 2023; originally announced October 2023.

    Comments: University of Science and Technology of China, Alibaba Group, The Chinese University of Hong Kong

  50. arXiv:2310.05589  [pdf, other

    cs.CL cs.MM

    DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking

    Authors: Shangyu Xing, Fei Zhao, Zhen Wu, Chunhui Li, Jianbing Zhang, Xinyu Dai

    Abstract: Multimodal Entity Linking (MEL) is a task that aims to link ambiguous mentions within multimodal contexts to referential entities in a multimodal knowledge base. Recent methods for MEL adopt a common framework: they first interact and fuse the text and image to obtain representations of the mention and entity respectively, and then compute the similarity between them to predict the correct entity.… ▽ More

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: Accepted by ACM MM 2023