Showing 1–50 of 427 results for author: Zhao, B

Searching in archive cs.
  1. arXiv:2406.17005  [pdf, other]

    cs.CV

    PVUW 2024 Challenge on Complex Video Understanding: Methods and Results

    Authors: Henghui Ding, Chang Liu, Yunchao Wei, Nikhila Ravi, Shuting He, Song Bai, Philip Torr, Deshui Miao, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang, Zhensong Xu, Jiangtao Yao, Chengjing Wu, Ting Liu, Luoqi Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang, Mingqi Gao, Jingnan Luo, et al. (12 additional authors not shown)

    Abstract: Pixel-level Video Understanding in the Wild Challenge (PVUW) focuses on complex video understanding. In this CVPR 2024 workshop, we add two new tracks, Complex Video Object Segmentation Track based on MOSE dataset and Motion Expression guided Video Segmentation track based on MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: MOSE Challenge: https://henghuiding.github.io/MOSE/ChallengeCVPR2024, MeViS Challenge: https://henghuiding.github.io/MeViS/ChallengeCVPR2024

  2. arXiv:2406.16038  [pdf, other]

    cs.CV

    LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Rendering and Control

    Authors: Delin Qu, Qizhi Chen, Pingrui Zhang, Xianqiang Gao, Bin Zhao, Dong Wang, Xuelong Li

    Abstract: This paper aims to advance the progress of physical world interactive scene reconstruction by extending the interactive object reconstruction from single object level to complex scene level. To this end, we first construct one simulated and one real scene-level physical interaction dataset containing 28 scenes with multiple interactive objects per scene. Furthermore, to accurately model the intera…

    Submitted 23 June, 2024; originally announced June 2024.

  3. arXiv:2406.15836  [pdf, other]

    cs.LG cs.AI cs.MA

    Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models

    Authors: Yang Zhang, Chenjia Bai, Bin Zhao, Junchi Yan, Xiu Li, Xuelong Li

    Abstract: Learning a world model for model-free Reinforcement Learning (RL) agents can significantly improve the sample efficiency by learning policies in imagination. However, building a world model for Multi-Agent RL (MARL) can be particularly challenging due to the scalability issue in a centralized architecture arising from a large number of agents, and also the non-stationarity issue in a decentralized…

    Submitted 22 June, 2024; originally announced June 2024.

  4. arXiv:2406.15056  [pdf, ps, other]

    cs.IT eess.SP

    Continuous Aperture Array (CAPA)-Based Wireless Communications: Capacity Characterization

    Authors: Boqun Zhao, Chongjun Ouyang, Xingqi Zhang, Yuanwei Liu

    Abstract: The capacity limits of continuous-aperture array (CAPA)-based wireless communications are characterized. To this end, an analytically tractable transmission framework is established for both uplink and downlink CAPA systems. Based on this framework, closed-form expressions for the single-user channel capacity are derived. The results are further extended to a multiuser case by characterizing the c…

    Submitted 21 June, 2024; originally announced June 2024.

  5. arXiv:2406.14534  [pdf, other]

    eess.IV cs.CV

    Epicardium Prompt-guided Real-time Cardiac Ultrasound Frame-to-volume Registration

    Authors: Long Lei, Jun Zhou, Jialun Pei, Baoliang Zhao, Yueming Jin, Yuen-Chun Jeremy Teoh, Jing Qin, Pheng-Ann Heng

    Abstract: A comprehensive guidance view for cardiac interventional surgery can be provided by the real-time fusion of the intraoperative 2D images and preoperative 3D volume based on the ultrasound frame-to-volume registration. However, cardiac ultrasound images are characterized by a low signal-to-noise ratio and small differences between adjacent frames, coupled with significant dimension variations betwe…

    Submitted 27 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: This paper has been accepted by MICCAI 2024

  6. arXiv:2406.13939  [pdf, other]

    cs.CV

    2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation

    Authors: Bin Cao, Yisi Zhang, Xuanxu Lin, Xingjian He, Bo Zhao, Jing Liu

    Abstract: Motion Expression guided Video Segmentation is a challenging task that aims at segmenting objects in the video based on natural language expressions with motion descriptions. Unlike the previous referring video object segmentation (RVOS), this task focuses more on the motion in video content for language-guided video object segmentation, requiring an enhanced ability to model longer temporal, moti…

    Submitted 19 June, 2024; originally announced June 2024.

  7. arXiv:2406.13642  [pdf, other]

    cs.CV

    SpatialBot: Precise Spatial Understanding with Vision Language Models

    Authors: Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, Bo Zhao

    Abstract: Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding, however they are still struggling with spatial understanding which is the foundation of Embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding both RGB and depth images. Additionally, we have constructed the SpatialQA dataset, which involves multi-level depth-related…

    Submitted 27 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

  8. arXiv:2406.12742  [pdf, other]

    cs.CV cs.AI cs.CL

    Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

    Authors: Bingchen Zhao, Yongshuo Zong, Letian Zhang, Timothy Hospedales

    Abstract: The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: First three authors contributed equally. Dataset: https://huggingface.co/datasets/VLLMs/MIRB

  9. arXiv:2406.10638  [pdf, other]

    cs.CV

    Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions

    Authors: Yexin Liu, Zhengyang Liang, Yueze Wang, Muyang He, Jian Li, Bo Zhao

    Abstract: Multimodal Large Language Models (MLLMs) have exhibited impressive capabilities in visual understanding and reasoning, providing sightly reasonable answers, such as image descriptions. This has spurred extensive research on the evaluation of MLLMs. Most evaluation benchmarks assume that incorrect answers indicate a lack of understanding of the visual content. However, our findings reveal that, in…

    Submitted 15 June, 2024; originally announced June 2024.

  10. arXiv:2406.08478  [pdf, other]

    cs.CV cs.CL

    What If We Recaption Billions of Web Images with LLaMA-3?

    Authors: Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie

    Abstract: Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community eff…

    Submitted 18 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: First five authors contributed equally

  11. arXiv:2406.06829  [pdf, other]

    cs.LG stat.ML

    Personalized Binomial DAGs Learning with Network Structured Covariates

    Authors: Boxin Zhao, Weishi Wang, Dingyuan Zhu, Ziqi Liu, Dong Wang, Zhiqiang Zhang, Jun Zhou, Mladen Kolar

    Abstract: The causal dependence in data is often characterized by Directed Acyclic Graphical (DAG) models, widely used in many areas. Causal discovery aims to recover the DAG structure using observational data. This paper focuses on causal discovery with multi-variate count data. We are motivated by real-world web visit data, recording individual user visits to multiple websites. Building a causal diagram c…

    Submitted 10 June, 2024; originally announced June 2024.

  12. arXiv:2406.05000  [pdf, other]

    cs.CV

    AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation

    Authors: Lianyu Pang, Jian Yin, Baoquan Zhao, Feize Wu, Fu Lee Wang, Qing Li, Xudong Mao

    Abstract: Recent advances in text-to-image models have enabled high-quality personalized image synthesis of user-provided concepts with flexible textual control. In this work, we analyze the limitations of two primary techniques in text-to-image personalization: Textual Inversion and DreamBooth. When integrating the learned concept into new prompts, Textual Inversion tends to overfit the concept, while Drea…

    Submitted 7 June, 2024; originally announced June 2024.

  13. arXiv:2406.04898  [pdf, other]

    cs.CV

    Labeled Data Selection for Category Discovery

    Authors: Bingchen Zhao, Nico Lang, Serge Belongie, Oisin Mac Aodha

    Abstract: Category discovery methods aim to find novel categories in unlabeled visual data. At training time, a set of labeled and unlabeled images are provided, where the labels correspond to the categories present in the images. The labeled data provides guidance during training by indicating what types of visual properties and features are relevant for performing discovery in the unlabeled data. As a res…

    Submitted 7 June, 2024; originally announced June 2024.

  14. arXiv:2406.04316  [pdf, other]

    cs.CV

    Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking

    Authors: Jiyao Zhang, Weiyao Huang, Bo Peng, Mingdong Wu, Fei Hu, Zijian Chen, Bo Zhao, Hao Dong

    Abstract: 6D Object Pose Estimation is a crucial yet challenging task in computer vision, suffering from a significant lack of large-scale datasets. This scarcity impedes comprehensive evaluation of model performance, limiting research advancements. Furthermore, the restricted number of available instances or categories curtails its applications. To address these issues, this paper introduces Omni6DPose, a…

    Submitted 6 June, 2024; originally announced June 2024.

  15. arXiv:2406.04292  [pdf, other]

    cs.IR cs.CL cs.CV

    VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

    Authors: Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao

    Abstract: Multi-modal retrieval becomes increasingly popular in practice. However, the existing retrievers are mostly text-oriented, which lack the capability to process visual information. Despite the presence of vision-language models like CLIP, the current methods are severely limited in representing the text-only and image-only data. In this work, we present a new embedding model VISTA for universal mul…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 main conference

  16. arXiv:2406.04264  [pdf, other]

    cs.CV cs.AI cs.CL

    MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

    Authors: Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Bo Zhang, Tiejun Huang, Zheng Liu

    Abstract: The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To addres…

    Submitted 19 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

  17. arXiv:2406.02801  [pdf, other]

    cs.SI

    SenTopX: Benchmark for User Sentiment on Various Topics

    Authors: Hina Qayyum, Muhammad Ikram, Benjamin Zhao, Ian Wood, Mohamad Ali Kaafar, Nicolas Kourtellis

    Abstract: Toxic sentiment analysis on Twitter (X) often focuses on specific topics and events such as politics and elections. Datasets of toxic users in such research are typically gathered through lexicon-based techniques, providing only a cross-sectional view. This approach has a tight confine for studying toxic user behavior and effective platform moderation. To identify users consistently spreading toxic…

    Submitted 4 June, 2024; originally announced June 2024.

  18. arXiv:2406.00644  [pdf, other]

    cs.CV

    Ultrasound Report Generation with Cross-Modality Feature Alignment via Unsupervised Guidance

    Authors: Jun Li, Tongkun Su, Baoliang Zhao, Faqin Lv, Qiong Wang, Nassir Navab, Ying Hu, Zhongliang Jiang

    Abstract: Automatic report generation has arisen as a significant research area in computer-aided diagnosis, aiming to alleviate the burden on clinicians by generating reports automatically based on medical images. In this work, we propose a novel framework for automatic ultrasound report generation, leveraging a combination of unsupervised and supervised learning methods to aid the report generation proces…

    Submitted 2 June, 2024; originally announced June 2024.

  19. arXiv:2406.00439  [pdf, other]

    cs.RO cs.CV

    Learning Manipulation by Predicting Interaction

    Authors: Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li

    Abstract: Representation learning approaches for robotic manipulation have boomed in recent years. Due to the scarcity of in-domain robot data, prevailing methodologies tend to leverage large-scale human video datasets to extract generalizable features for visuomotor policy learning. Despite the progress achieved, prior endeavors disregard the interactive dynamics that capture behavior patterns and physical…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: Accepted to RSS 2024. Project page: https://github.com/OpenDriveLab/MPI

  20. arXiv:2405.21070  [pdf, other]

    cs.CV cs.CL cs.LG

    Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights

    Authors: Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi

    Abstract: Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. With an aim to investigate the reasons behind this finding, we conduct controlled experiments to stud…

    Submitted 14 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

  21. GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis

    Authors: Boming Zhao, Yuan Li, Ziyu Sun, Lin Zeng, Yujun Shen, Rui Ma, Yinda Zhang, Hujun Bao, Zhaopeng Cui

    Abstract: Forecasting future scenarios in dynamic environments is essential for intelligent decision-making and navigation, a challenge yet to be fully realized in computer vision and robotics. Traditional approaches like video prediction and novel-view synthesis either lack the ability to forecast from arbitrary viewpoints or to predict temporal dynamics. In this paper, we introduce GaussianPrediction, a n…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Accepted to SIGGRAPH 2024 Conference. Project Page: https://zju3dv.github.io/gaussian-prediction/

  22. arXiv:2405.19586  [pdf, other]

    cs.CV cs.LG cs.RO

    SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

    Authors: Junjie Zhang, Chenjia Bai, Haoran He, Wenke Xia, Zhigang Wang, Bin Zhao, Xiu Li, Xuelong Li

    Abstract: Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: ICML 2024. Project page: https://sam-embodied.github.io

  23. arXiv:2405.17188  [pdf, other]

    cs.CV

    The SkatingVerse Workshop & Challenge: Methods and Results

    Authors: Jian Zhao, Lei Jin, Jianshu Li, Zheng Zhu, Yinglei Teng, Jiaojiao Zhao, Sadaf Gulshad, Zheng Wang, Bo Zhao, Xiangbo Shu, Yunchao Wei, Xuecheng Nie, Xiaojie Jin, Xiaodan Liang, Shin'ichi Satoh, Yandong Guo, Cewu Lu, Junliang Xing, Jane Shen Shengmei

    Abstract: The SkatingVerse Workshop & Challenge aims to encourage research in developing novel and accurate methods for human action understanding. The SkatingVerse dataset used for the SkatingVerse Challenge has been publicly released. There are two subsets in the dataset, i.e., the training subset and testing subset. The training subset consists of 19,993 RGB video sequences, and the testing subset cons…

    Submitted 27 May, 2024; originally announced May 2024.

  24. arXiv:2405.13382  [pdf, other]

    cs.CV

    VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

    Authors: Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Xi Chen, Bo Zhao

    Abstract: Video Temporal Grounding (VTG) focuses on accurately identifying event timestamps within a particular video based on a linguistic query, playing a vital role in downstream tasks such as video browsing and editing. While Video Large Language Models (video LLMs) have made significant progress in understanding video content, they often face challenges in accurately pinpointing timestamps within video…

    Submitted 1 July, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  25. arXiv:2405.10739  [pdf, other]

    cs.CV cs.AI

    Efficient Multimodal Large Language Models: A Survey

    Authors: Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma

    Abstract: In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, e…

    Submitted 17 May, 2024; originally announced May 2024.

  26. arXiv:2405.10547  [pdf, other]

    cs.SI

    GPTs Window Shopping: An analysis of the Landscape of Custom ChatGPT Models

    Authors: Benjamin Zi Hao Zhao, Muhammad Ikram, Mohamed Ali Kaafar

    Abstract: OpenAI's ChatGPT initiated a wave of technical iterations in the space of Large Language Models (LLMs) by demonstrating the capability and disruptive power of LLMs. OpenAI has prompted large organizations to respond with their own advancements and models to push the LLM performance envelope. …

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: 9 pages

  27. arXiv:2405.06865  [pdf, other]

    cs.CV cs.CR

    Disrupting Style Mimicry Attacks on Video Imagery

    Authors: Josephine Passananti, Stanley Wu, Shawn Shan, Haitao Zheng, Ben Y. Zhao

    Abstract: Generative AI models are often used to perform mimicry attacks, where a pretrained model is fine-tuned on a small sample of images to learn to mimic a specific artist of interest. While researchers have introduced multiple anti-mimicry protection tools (Mist, Glaze, Anti-Dreambooth), recent evidence points to a growing trend of mimicry models using videos as sources of training data. This paper pr…

    Submitted 10 May, 2024; originally announced May 2024.

  28. arXiv:2405.05387  [pdf, ps, other]

    cs.IT

    Channel Capacity of Near-Field Multiuser Communications

    Authors: Boqun Zhao, Chongjun Ouyang, Xingqi Zhang, Yuanwei Liu

    Abstract: The channel capacity of near-field (NF) communications is characterized by considering three types of multiuser channels: i) multiple access channel (MAC), ii) broadcast channel (BC), and iii) multicast channel (MC). For NF MAC and BC, closed-form expressions are derived for the sum-rate capacity as well as the capacity region under a two-user scenario. These results are further extended to scenar…

    Submitted 8 May, 2024; originally announced May 2024.

  29. arXiv:2405.02561  [pdf, other]

    cs.LG math.NA

    Understanding the Difficulty of Solving Cauchy Problems with PINNs

    Authors: Tao Wang, Bo Zhao, Sicun Gao, Rose Yu

    Abstract: Physics-Informed Neural Networks (PINNs) have gained popularity in scientific computing in recent years. However, they often fail to achieve the same level of accuracy as classical methods in solving differential equations. In this paper, we identify two sources of this issue in the case of Cauchy problems: the use of $L^2$ residuals as objective functions and the approximation gap of neural netwo…

    Submitted 18 June, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

    Comments: 13 pages and 18 figures

  30. arXiv:2405.00354  [pdf, other]

    cs.CV

    CrossMatch: Enhance Semi-Supervised Medical Image Segmentation with Perturbation Strategies and Knowledge Distillation

    Authors: Bin Zhao, Chunshi Wang, Shuxue Ding

    Abstract: Semi-supervised learning for medical image segmentation presents a unique challenge of efficiently using limited labeled data while leveraging abundant unlabeled data. Despite advancements, existing methods often do not fully exploit the potential of the unlabeled data for enhancing model robustness and accuracy. In this paper, we introduce CrossMatch, a novel framework that integrates knowledge d…

    Submitted 1 May, 2024; originally announced May 2024.

  31. Pessimistic Value Iteration for Multi-Task Data Sharing in Offline Reinforcement Learning

    Authors: Chenjia Bai, Lingxiao Wang, Jianye Hao, Zhuoran Yang, Bin Zhao, Zhen Wang, Xuelong Li

    Abstract: Offline Reinforcement Learning (RL) has shown promising results in learning a task-specific policy from a fixed dataset. However, successful offline RL often relies heavily on the coverage and quality of the given dataset. In scenarios where the dataset for a specific task is limited, a natural approach is to improve offline RL with datasets from other tasks, namely, to conduct Multi-Task Data Sha…

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: Accepted by Artificial Intelligence (AIJ)

  32. arXiv:2404.18620  [pdf, other]

    cs.CV

    FlexiFilm: Long Video Generation with Flexible Conditions

    Authors: Yichen Ouyang, Jianhao Yuan, Hao Zhao, Gaoang Wang, Bo Zhao

    Abstract: Generating long and consistent videos has emerged as a significant yet challenging problem. While most existing diffusion-based video generation models, derived from image generation models, demonstrate promising performance in generating short videos, their simple conditioning mechanism and sampling strategy, originally designed for image generation, cause severe performance degradation when adapte…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: 9 pages, 9 figures

  33. arXiv:2404.18284  [pdf, other]

    cs.CV

    S3-SLAM: Sparse Tri-plane Encoding for Neural Implicit SLAM

    Authors: Zhiyao Zhang, Yunzhou Zhang, Yanmin Wu, Bin Zhao, Xingshuo Wang, Rui Tian

    Abstract: With the emergence of Neural Radiance Fields (NeRF), neural implicit representations have gained widespread applications across various domains, including simultaneous localization and mapping. However, current neural implicit SLAM faces a challenging trade-off problem between performance and the number of parameters. To address this problem, we propose sparse tri-plane encoding, which efficiently…

    Submitted 28 April, 2024; originally announced April 2024.

  34. arXiv:2404.17984  [pdf, other]

    cs.CR cs.AI

    Privacy-Preserving, Dropout-Resilient Aggregation in Decentralized Learning

    Authors: Ali Reza Ghavamipour, Benjamin Zi Hao Zhao, Fatih Turkmen

    Abstract: Decentralized learning (DL) offers a novel paradigm in machine learning by distributing training across clients without central aggregation, enhancing scalability and efficiency. However, DL's peer-to-peer model raises challenges in protecting against inference attacks and privacy leaks. By forgoing central bottlenecks, DL demands privacy-preserving aggregation methods to protect data from 'honest…

    Submitted 27 April, 2024; originally announced April 2024.

  35. arXiv:2404.17970  [pdf, other]

    cs.CR cs.AI

    Privacy-Preserving Aggregation for Decentralized Learning with Byzantine-Robustness

    Authors: Ali Reza Ghavamipour, Benjamin Zi Hao Zhao, Oguzhan Ersoy, Fatih Turkmen

    Abstract: Decentralized machine learning (DL) has been receiving an increasing interest recently due to the elimination of a single point of failure, present in Federated learning setting. Yet, it is threatened by the looming threat of Byzantine clients who intentionally disrupt the learning process by broadcasting arbitrary model updates to other clients, seeking to degrade the performance of the global mo…

    Submitted 27 April, 2024; originally announced April 2024.

  36. arXiv:2404.16645  [pdf, other]

    cs.CL cs.AI

    Tele-FLM Technical Report

    Authors: Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang, Shuangyong Song, Yongxiang Li, Zheng Zhang, Bo Zhao, Aixin Sun, Yequan Wang, Zhongjiang He, Zhongyuan Wang, Xuelong Li, Tiejun Huang

    Abstract: Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a…

    Submitted 25 April, 2024; originally announced April 2024.

  37. arXiv:2404.15381  [pdf, other]

    cs.LG cs.AI

    Advances and Open Challenges in Federated Learning with Foundation Models

    Authors: Chao Ren, Han Yu, Hongyi Peng, Xiaoli Tang, Anran Li, Yulan Gao, Alysa Ziying Tan, Bo Zhao, Xiaoxiao Li, Zengxiang Li, Qiang Yang

    Abstract: The integration of Foundation Models (FMs) with Federated Learning (FL) presents a transformative paradigm in Artificial Intelligence (AI), offering enhanced capabilities while addressing concerns of privacy, data decentralization, and computational efficiency. This paper provides a comprehensive survey of the emerging field of Federated Foundation Models (FedFM), elucidating their synergistic rel…

    Submitted 29 April, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: Survey of Federated Foundation Models (FedFM)

  38. arXiv:2404.13892  [pdf, other]

    cs.SD cs.AI eess.AS

    Retrieval-Augmented Audio Deepfake Detection

    Authors: Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

    Abstract: With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired…

    Submitted 23 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 International Conference on Multimedia Retrieval (ICMR 2024)

  39. arXiv:2404.11897  [pdf, other]

    cs.CV

    AG-NeRF: Attention-guided Neural Radiance Fields for Multi-height Large-scale Outdoor Scene Rendering

    Authors: Jingfeng Guo, Xiaohan Zhang, Baozhu Zhao, Qi Liu

    Abstract: Existing neural radiance fields (NeRF)-based novel view synthesis methods for large-scale outdoor scenes are mainly built on a single altitude. Moreover, they often require a priori camera shooting height and scene scope, leading to inefficient and impractical applications when camera altitude changes. In this work, we propose an end-to-end framework, termed AG-NeRF, and seek to reduce the trainin…

    Submitted 18 April, 2024; originally announced April 2024.

  40. arXiv:2404.11614  [pdf, other]

    cs.CV

    Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

    Authors: Zichen Liu, Yihao Meng, Hao Ouyang, Yue Yu, Bolin Zhao, Daniel Cohen-Or, Huamin Qu

    Abstract: Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting animations that are semantically aware poses significant challenges, demanding expertise in graphic design and animation. We present an automated text animation scheme, termed "Dy…

    Submitted 18 April, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

    Comments: Our demo page is available at: https://animate-your-word.github.io/demo/

  41. arXiv:2404.09990  [pdf, other]

    cs.CV cs.AI

    HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing

    Authors: Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, Cihang Xie

    Abstract: This study introduces HQ-Edit, a high-quality instruction-based image editing dataset with around 200,000 edits. Unlike prior approaches relying on attribute guidance or human feedback on building datasets, we devise a scalable data collection pipeline leveraging advanced foundation models, namely GPT-4V and DALL-E 3. To ensure its high quality, diverse examples are first collected online, expande…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: Project Page: https://thefllood.github.io/HQEdit_web

  42. arXiv:2404.08995  [pdf, other]

    cs.LG cs.AI cs.CV

    Beyond Known Clusters: Probe New Prototypes for Efficient Generalized Class Discovery

    Authors: Ye Wang, Yaxiong Wang, Yujiao Wu, Bingchen Zhao, Xueming Qian

    Abstract: Generalized Class Discovery (GCD) aims to dynamically assign labels to unlabelled data partially based on knowledge learned from labelled data, where the unlabelled data may come from known or novel classes. The prevailing approach generally involves clustering across all data and learning conceptions by prototypical contrastive learning. However, existing methods largely hinge on the performance…

    Submitted 30 April, 2024; v1 submitted 13 April, 2024; originally announced April 2024.

    Comments: 9 pages, 7 figures

  43. arXiv:2404.08343  [pdf, ps, other]

    cs.IT eess.SP

    On the Impact of Reactive Region on the Near-Field Channel Gain

    Authors: Chongjun Ouyang, Zhaolin Wang, Boqun Zhao, Xingqi Zhang, Yuanwei Liu

    Abstract: The near-field channel gain is analyzed by considering both radiating and reactive components of the electromagnetic field. Novel expressions are derived for the channel gains of spatially-discrete (SPD) and continuous-aperture (CAP) arrays, which are more accurate than conventional results that neglect the reactive region. To gain further insights, asymptotic analyses are carried out in the large…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: 7 figures

  44. arXiv:2404.07989  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.SD eess.AS

    Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding

    Authors: Yiwen Tang, Ray Zhang, Jiaming Liu, Zoey Guo, Dong Wang, Zhigang Wang, Bin Zhao, Shanghang Zhang, Peng Gao, Hongsheng Li, Xuelong Li

    Abstract: Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance in widespread scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches are still limited, due to the potential loss of spatial geometries and high computation cost. More importantl…

    Submitted 30 May, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

    Comments: Code and models are released at https://github.com/Ivan-Tang-3D/Any2Point

  45. arXiv:2404.05892  [pdf, other]

    cs.CL cs.AI

    Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

    Authors: Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, Kranthi Kiran GV, Jan Kocoń, Bartłomiej Koptyra, Satyapriya Krishna, Ronald McClelland Jr., Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Stanisław Woźniak, Ruichong Zhang, Bingchen Zhao, Qihang Zhao, et al. (3 additional authors not shown)

    Abstract: We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokeni…

    Submitted 10 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

  46. arXiv:2404.00578  [pdf, other]

    cs.CV

    M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

    Authors: Fan Bai, Yuxin Du, Tiejun Huang, Max Q.-H. Meng, Bo Zhao

    Abstract: Medical image analysis is essential to clinical diagnosis and treatment, which is increasingly supported by multi-modal large language models (MLLMs). However, previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information. This paper aims to advance 3D medical image analysis with MLLMs. To this end, we present a large-scale…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: MLLM, 3D medical image analysis

  47. arXiv:2404.00226  [pdf, other]

    cs.CV cs.CL

    Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training

    Authors: Tongkun Su, Jun Li, Xi Zhang, Haibo Jin, Hao Chen, Qiong Wang, Faqin Lv, Baoliang Zhao, Yin Hu

    Abstract: Multimodal pre-training demonstrates its potential in the medical domain, which learns medical visual representations from paired medical reports. However, many pre-training tasks require extra annotations from clinicians, and most of them fail to explicitly guide the model to learn the desired features of different pathologies. To the best of our knowledge, we are the first to utilize Visual Ques…

    Submitted 8 April, 2024; v1 submitted 29 March, 2024; originally announced April 2024.

  48. arXiv:2403.19907  [pdf, ps, other]

    cs.LG cs.AI

    Beyond the Known: Novel Class Discovery for Open-world Graph Learning

    Authors: Yucheng Jin, Yun Xiong, Juncheng Fang, Xixi Wu, Dongxiao He, Xing Jia, Bingchen Zhao, Philip Yu

    Abstract: Node classification on graphs is of great importance in many applications. Due to the limited labeling capability and evolution in real-world open scenarios, novel classes can emerge on unlabeled testing nodes. However, little attention has been paid to novel class discovery on graphs. Discovering novel classes is challenging as novel and known class nodes are correlated by edges, which makes thei…

    Submitted 28 March, 2024; originally announced March 2024.

  49. arXiv:2403.19079  [pdf, other]

    cs.CV

    A Real-Time Framework for Domain-Adaptive Underwater Object Detection with Image Enhancement

    Authors: Junjie Wen, Jinqiang Cui, Benyun Zhao, Bingxin Han, Xuchen Liu, Zhi Gao, Ben M. Chen

    Abstract: In recent years, significant progress has been made in the field of underwater image enhancement (UIE). However, its practical utility for high-level vision tasks, such as underwater object detection (UOD) in Autonomous Underwater Vehicles (AUVs), remains relatively unexplored. It may be attributed to several factors: (1) Existing methods typically employ UIE as a pre-processing step, which inevit…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: Accepted by ICRA 2024

  50. arXiv:2403.16788  [pdf, other]

    cs.CV

    HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation

    Authors: Linglin Jing, Yiming Ding, Yunpeng Gao, Zhigang Wang, Xu Yan, Dong Wang, Gerald Schaefer, Hui Fang, Bin Zhao, Xuelong Li

    Abstract: Event-based semantic segmentation has gained popularity due to its capability to deal with scenarios under high-speed motion and extreme lighting conditions, which cannot be addressed by conventional RGB cameras. Since it is hard to annotate event data, previous approaches rely on event-to-image reconstruction to obtain pseudo labels for training. However, this will inevitably introduce noise, and…

    Submitted 25 March, 2024; originally announced March 2024.