Showing 1–50 of 278 results for author: Fan, H

Searching in archive cs.
  1. arXiv:2407.01007  [pdf, other]

    cs.CV

    GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking

    Authors: Huijie Fan, Tinghui Zhao, Qiang Wang, Baojie Fan, Yandong Tang, LianQing Liu

    Abstract: In the task of multi-target multi-camera (MTMC) tracking of pedestrians, the data association problem is a key issue and main challenge, especially with complications arising from camera movements, lighting variations, and obstructions. However, most MTMC models adopt two-step approaches, thus heavily depending on the results of the first-step tracking in practical applications. Moreover, the same…

    Submitted 1 July, 2024; originally announced July 2024.

  2. arXiv:2406.16198  [pdf, other]

    cs.LG cs.AR

    Hardware-Aware Neural Dropout Search for Reliable Uncertainty Prediction on FPGA

    Authors: Zehuan Zhang, Hongxiang Fan, Hao Mark Chen, Lukasz Dudziak, Wayne Luk

    Abstract: The increasing deployment of artificial intelligence (AI) for critical decision-making amplifies the necessity for trustworthy AI, where uncertainty estimation plays a pivotal role in ensuring trustworthiness. Dropout-based Bayesian Neural Networks (BayesNNs) are prominent in this field, offering reliable uncertainty estimates. Despite their effectiveness, existing dropout-based BayesNNs typically…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: Design Automation Conference (DAC) 2024

  3. arXiv:2406.14593  [pdf, other]

    cs.LG

    Enhancing Dropout-based Bayesian Neural Networks with Multi-Exit on FPGA

    Authors: Hao Mark Chen, Liam Castelli, Martin Ferianc, Hongyu Zhou, Shuanglong Liu, Wayne Luk, Hongxiang Fan

    Abstract: Reliable uncertainty estimation plays a crucial role in various safety-critical applications such as medical diagnosis and autonomous driving. In recent years, Bayesian neural networks (BayesNNs) have gained substantial research and industrial interests due to their capability to make accurate predictions with reliable uncertainty estimation. However, the algorithmic complexity and the resulting h…

    Submitted 24 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2308.06849

  4. arXiv:2406.08324  [pdf, other]

    cs.CV

    LaMOT: Language-Guided Multi-Object Tracking

    Authors: Yunhao Li, Xiaoqiong Liu, Luke Liu, Heng Fan, Libo Zhang

    Abstract: Vision-Language MOT is a crucial tracking problem and has drawn increasing attention recently. It aims to track objects based on human language commands, replacing the traditional use of templates or pre-set information from training sets in conventional tracking tasks. Despite various efforts, a key challenge lies in the lack of a clear understanding of why language is used for tracking, which hi…

    Submitted 12 June, 2024; originally announced June 2024.

  5. arXiv:2406.04999  [pdf, other]

    cs.CV

    ProMotion: Prototypes As Motion Learners

    Authors: Yawen Lu, Dongfang Liu, Qifan Wang, Cheng Han, Yiming Cui, Zhiwen Cao, Xueling Zhang, Yingjie Victor Chen, Heng Fan

    Abstract: In this work, we introduce ProMotion, a unified prototypical framework engineered to model fundamental motion tasks. ProMotion offers a range of compelling attributes that set it apart from current task-specific paradigms. We adopt a prototypical perspective, establishing a unified paradigm that harmonizes disparate motion learning approaches. This novel paradigm streamlines the architectural desi…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 11 pages

  6. arXiv:2405.20851  [pdf, other]

    cs.CV

    MegActor: Harness the Power of Raw Video for Vivid Portrait Animation

    Authors: Shurong Yang, Huadong Li, Juhao Wu, Minhao Jing, Linze Li, Renhe Ji, Jiajun Liang, Haoqiang Fan

    Abstract: Despite raw driving videos contain richer information on facial expressions than intermediate representations such as landmarks in the field of portrait animation, they are seldom the subject of research. This is due to two challenges inherent in portrait animation driven with raw videos: 1) significant identity leakage; 2) Irrelevant background and facial details such as wrinkles degrade performa…

    Submitted 18 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

  7. arXiv:2405.19914  [pdf, other]

    cs.CV

    Towards RGB-NIR Cross-modality Image Registration and Beyond

    Authors: Huadong Li, Shichao Dong, Jin Wang, Rong Fu, Minhao Jing, Jiajun Liang, Haoqiang Fan, Renhe Ji

    Abstract: This paper focuses on the area of RGB(visible)-NIR(near-infrared) cross-modality image registration, which is crucial for many downstream vision tasks to fully leverage the complementary information present in visible and infrared images. In this field, researchers face two primary challenges - the absence of a correctly-annotated benchmark with viewpoint variations for evaluating RGB-NIR cross-mo…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: 18 pages, 7 figures

  8. arXiv:2405.18628  [pdf, other]

    cs.LG cs.CL

    Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

    Authors: Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan

    Abstract: The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these efforts have primarily focused on improving processing speed such as throughput. Crucially, they often neglect other metrics essential for real-life deployments,…

    Submitted 2 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: The code for this implementation is available at https://github.com/hmarkc/parallel-prompt-decoding

  9. arXiv:2405.17660  [pdf, other]

    cs.CV

    LoReTrack: Efficient and Accurate Low-Resolution Transformer Tracking

    Authors: Shaohua Dong, Yunhe Feng, Qing Yang, Yuewei Lin, Heng Fan

    Abstract: High-performance Transformer trackers have shown excellent results, yet they often bear a heavy computational load. Observing that a smaller input can immediately and conveniently reduce computations without changing the model, an easy solution is to adopt the low-resolution input for efficient Transformer tracking. Albeit faster, this hurts tracking accuracy much due to information loss in low re…

    Submitted 27 May, 2024; originally announced May 2024.

  10. arXiv:2405.15684  [pdf, other]

    cs.CV cs.AI

    Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models

    Authors: Yue Zhang, Hehe Fan, Yi Yang

    Abstract: To bridge the gap between vision and language modalities, Multimodal Large Language Models (MLLMs) usually learn an adapter that converts visual inputs to understandable tokens for Large Language Models (LLMs). However, most adapters generate consistent visual tokens, regardless of the specific objects of interest mentioned in the prompt. Since these adapters distribute equal attention to every de…

    Submitted 24 May, 2024; originally announced May 2024.

  11. arXiv:2405.14506  [pdf, other]

    cs.CV cs.AI

    SIAVC: Semi-Supervised Framework for Industrial Accident Video Classification

    Authors: Zuoyong Li, Qinghua Lin, Haoyi Fan, Tiesong Zhao, David Zhang

    Abstract: Semi-supervised learning suffers from the imbalance of labeled and unlabeled training data in the video surveillance scenario. In this paper, we propose a new semi-supervised learning method called SIAVC for industrial accident video classification. Specifically, we design a video augmentation module called the Super Augmentation Block (SAB). SAB adds Gaussian noise and randomly masks video frames…

    Submitted 23 May, 2024; originally announced May 2024.

  12. arXiv:2405.13911  [pdf, other]

    cs.CV cs.AI cs.CL

    TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment

    Authors: Wei Li, Hehe Fan, Yongkang Wong, Mohan Kankanhalli, Yi Yang

    Abstract: Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduc…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 32 pages, 12 figures, 11 tables

  13. arXiv:2405.00181  [pdf, other]

    cs.CV cs.AI

    Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly

    Authors: Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jianhang Chen, Qimei Cui, Xiaofeng Tao

    Abstract: Video anomaly understanding (VAU) aims to automatically comprehend unusual occurrences in videos, thereby enabling various applications such as traffic surveillance and industrial manufacturing. While existing VAU benchmarks primarily concentrate on anomaly detection and localization, our focus is on more practicality, prompting us to raise the following crucial questions: "what anomaly occurred?"…

    Submitted 6 May, 2024; v1 submitted 30 April, 2024; originally announced May 2024.

    Comments: Accepted in CVPR2024, Codebase: https://github.com/fesvhtr/CUVA

  14. arXiv:2404.16687  [pdf, other]

    cs.CV

    NTIRE 2024 Quality Assessment of AI-Generated Content Challenge

    Authors: Xiaohong Liu, Xiongkuo Min, Guangtao Zhai, Chunyi Li, Tengchuan Kou, Wei Sun, Haoning Wu, Yixuan Gao, Yuqin Cao, Zicheng Zhang, Xiele Wu, Radu Timofte, Fei Peng, Huiyuan Fu, Anlong Ming, Chuanming Wang, Huadong Ma, Shuai He, Zifei Dou, Shu Chen, Huacong Zhang, Haiyi Xie, Chengwei Wang, Baoying Chen, Jishen Zeng, et al. (89 additional authors not shown)

    Abstract: This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge is to address a major challenge in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Conte…

    Submitted 7 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

  15. arXiv:2404.15267  [pdf, other]

    cs.CV

    From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation

    Authors: Zehuan Huang, Hongxing Fan, Lipeng Wang, Lu Sheng

    Abstract: Recent advancements in controllable human image generation have led to zero-shot generation using structural signals (e.g., pose, depth) or facial appearance. Yet, generating human images conditioned on multiple parts of human appearance remains challenging. Addressing this, we introduce Parts2Whole, a novel framework designed for generating customized portraits from multiple reference images, inc…

    Submitted 23 April, 2024; originally announced April 2024.

  16. arXiv:2404.13192  [pdf, other]

    cs.CL cs.AI

    Heterogeneous Subgraph Transformer for Fake News Detection

    Authors: Yuchen Zhang, Xiaoxiao Ma, Jia Wu, Jian Yang, Hao Fan

    Abstract: Fake news is pervasive on social media, inflicting substantial harm on public discourse and societal well-being. We investigate the explicit structural information and textual features of news pieces by constructing a heterogeneous graph concerning the relations among news topics, entities, and content. Through our study, we reveal that fake news can be effectively detected in terms of the atypica…

    Submitted 19 April, 2024; originally announced April 2024.

  17. arXiv:2404.11313  [pdf, other]

    eess.IV cs.AI

    NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

    Authors: Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai, Jianhui Sun, Tianyi Wang, Lei Li, Han Kong, Wenxuan Wang, Bing Li, Cheng Luo, et al. (43 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The…

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR2024 Workshop. The challenge report for CVPR NTIRE2024 Short-form UGC Video Quality Assessment Challenge

  18. arXiv:2404.10630  [pdf, other]

    cs.CL cs.LG

    HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

    Authors: Haozheng Fan, Hao Zhou, Guangtai Huang, Parameswaran Raman, Xinwei Fu, Gaurav Gupta, Dhananjay Ram, Yida Wang, Jun Huan

    Abstract: Getting large language models (LLMs) to perform well on the downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training. The growing number of applications leveraging AI/ML had led to a scarcity of the expensive conventional accelerators (su…

    Submitted 16 April, 2024; originally announced April 2024.

  19. arXiv:2404.00254  [pdf, other]

    cs.LG cs.CE q-bio.BM q-bio.QM

    Clustering for Protein Representation Learning

    Authors: Ruijie Quan, Wenguan Wang, Fan Ma, Hehe Fan, Yi Yang

    Abstract: Protein representation learning is a challenging task that aims to capture the structure and function of proteins from their amino acid sequences. Previous methods largely ignored the fact that not all amino acids are equally important for protein folding and activity. In this article, we propose a neural clustering framework that can automatically discover the critical components of a protein by…

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: Accepted to CVPR2024

  20. arXiv:2403.17830  [pdf, other]

    cs.CV

    Assessment of Multimodal Large Language Models in Alignment with Human Values

    Authors: Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

    Abstract: Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). However, in terms of Multimodal Large Language Models (MLLMs), despite their commendable performance in perception and reasoning tasks, their alignment with human values remains largely unexplored, given the complexity of defining h…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: arXiv admin note: text overlap with arXiv:2311.02692

  21. arXiv:2403.16111  [pdf, other]

    cs.CV

    EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing

    Authors: Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang

    Abstract: Current diffusion-based video editing primarily focuses on local editing (e.g., object/background editing) or global style editing by utilizing various dense correspondences. However, these methods often fail to accurately edit the foreground and background simultaneously while preserving the original layout. We find that the crux of the issue stems from the imprecise distribution of atte…

    Submitted 24 March, 2024; originally announced March 2024.

    Comments: Project page: https://knightyxp.github.io/EVA

  22. arXiv:2403.16003  [pdf, other]

    cs.CV cs.AI

    Diverse Representation Embedding for Lifelong Person Re-Identification

    Authors: Shiben Liu, Huijie Fan, Qiang Wang, Xiai Chen, Zhi Han, Yandong Tang

    Abstract: Lifelong Person Re-Identification (LReID) aims to continuously learn from successive data streams, matching individuals across multiple cameras. The key challenge for LReID is how to effectively preserve old knowledge while incrementally learning new information, which is caused by task-level domain gaps and limited old task datasets. Existing methods based on CNN backbone are insufficient to expl…

    Submitted 2 April, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

    Comments: 11 pages, 7 tables, 3 figures

  23. arXiv:2403.13640  [pdf, other]

    cs.RO

    LaCE-LHMP: Airflow Modelling-Inspired Long-Term Human Motion Prediction By Enhancing Laminar Characteristics in Human Flow

    Authors: Yufei Zhu, Han Fan, Andrey Rudenko, Martin Magnusson, Erik Schaffernicht, Achim J. Lilienthal

    Abstract: Long-term human motion prediction (LHMP) is essential for safely operating autonomous robots and vehicles in populated environments. It is fundamental for various applications, including motion planning, tracking, human-robot interaction and safety monitoring. However, accurate prediction of human trajectories is challenging due to complex factors, including, for example, social norms and environm…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted to the 2024 IEEE International Conference on Robotics and Automation (ICRA)

  24. arXiv:2403.11424  [pdf, other]

    cs.CV

    Benchmarking the Robustness of UAV Tracking Against Common Corruptions

    Authors: Xiaoqiong Liu, Yunhe Feng, Shu Hu, Xiaohui Yuan, Heng Fan

    Abstract: The robustness of unmanned aerial vehicle (UAV) tracking is crucial in many tasks like surveillance and robotics. Despite its importance, little attention is paid to the performance of UAV trackers under common corruptions due to lack of a dedicated platform. Addressing this, we propose UAV-C, a large-scale benchmark for assessing robustness of UAV trackers under common corruptions. Specifically,…

    Submitted 17 March, 2024; originally announced March 2024.

  25. arXiv:2403.10588  [pdf, other]

    cs.SE cs.AI

    S3LLM: Large-Scale Scientific Software Understanding with LLMs using Source, Metadata, and Document

    Authors: Kareem Shaik, Dali Wang, Weijian Zheng, Qinglei Cao, Heng Fan, Peter Schwartz, Yunhe Feng

    Abstract: The understanding of large-scale scientific software poses significant challenges due to its diverse codebase, extensive code length, and target computing architectures. The emergence of generative AI, specifically large language models (LLMs), provides novel pathways for understanding such complex scientific codes. This paper presents S3LLM, an LLM-based framework designed to enable the examinati…

    Submitted 15 March, 2024; originally announced March 2024.

  26. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  27. arXiv:2403.05231  [pdf, other]

    cs.CV

    Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance

    Authors: Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, Haibin Ling

    Abstract: Motivated by the Parameter-Efficient Fine-Tuning (PEFT) in large language models, we propose LoRAT, a method that unveils the power of larger Vision Transformers (ViT) for tracking within laboratory-level resources. The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency, to the domain of visual tracking. Howeve…

    Submitted 8 March, 2024; originally announced March 2024.

  28. arXiv:2403.05021  [pdf, other]

    cs.CV

    Beyond MOT: Semantic Multi-Object Tracking

    Authors: Yunhao Li, Hao Wang, Xue Ma, Jiali Yao, Shaohua Dong, Heng Fan, Libo Zhang

    Abstract: Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e.,"where") in videos. Yet, knowing merely "where" is insufficient in many crucial applications. In comparison, semantic understanding such as fine-grained behaviors, interactions, and overall summarized captions (i.e., "what") from videos, associated with "where", is highly-desired for comprehensive video analysis. Thu…

    Submitted 10 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

  29. arXiv:2403.03493  [pdf, other]

    cs.CV

    VastTrack: Vast Category Visual Object Tracking

    Authors: Liang Peng, Junyuan Gao, Xinran Liu, Weihong Li, Shaohua Dong, Zhipeng Zhang, Heng Fan, Libo Zhang

    Abstract: In this paper, we introduce a novel benchmark, dubbed VastTrack, towards facilitating the development of more general visual tracking via encompassing abundant classes and videos. VastTrack possesses several attractive properties: (1) Vast Object Category. In particular, it covers target objects from 2,115 classes, largely surpassing object categories of existing popular benchmarks (e.g., GOT-10k…

    Submitted 6 March, 2024; originally announced March 2024.

    Comments: Tech. report

  30. arXiv:2403.00228  [pdf, other]

    cs.RO cs.CV

    DISORF: A Distributed Online NeRF Training and Rendering Framework for Mobile Robots

    Authors: Chunlin Li, Ruofan Liang, Hanrui Fan, Zhengen Zhang, Sankeerth Durvasula, Nandita Vijaykumar

    Abstract: We present a framework, DISORF, to enable online 3D reconstruction and visualization of scenes captured by resource-constrained mobile robots and edge devices. To address the limited compute capabilities of edge devices and potentially limited network availability, we design a framework that efficiently distributes computation between the edge device and remote server. We leverage on-device SLAM s…

    Submitted 29 February, 2024; originally announced March 2024.

  31. arXiv:2402.17797  [pdf, other]

    eess.IV cs.CV

    Neural Radiance Fields in Medical Imaging: Challenges and Next Steps

    Authors: Xin Wang, Shu Hu, Heng Fan, Hongtu Zhu, Xin Li

    Abstract: Neural Radiance Fields (NeRF), as a pioneering technique in computer vision, offer great potential to revolutionize medical imaging by synthesizing three-dimensional representations from the projected two-dimensional image data. However, they face unique challenges when applied to medical applications. This paper presents a comprehensive examination of applications of NeRFs in medical imaging, hig…

    Submitted 21 March, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  32. arXiv:2402.09649  [pdf, other]

    cs.CE cs.AI q-bio.BM

    ProtChatGPT: Towards Understanding Proteins with Large Language Models

    Authors: Chao Wang, Hehe Fan, Ruijie Quan, Yi Yang

    Abstract: Protein research is crucial in various fundamental disciplines, but understanding their intricate structure-function relationships remains challenging. Recent Large Language Models (LLMs) have made significant strides in comprehending task-specific knowledge, suggesting the potential for ChatGPT-like systems specialized in protein to facilitate basic research. In this work, we introduce ProtChatGP…

    Submitted 14 February, 2024; originally announced February 2024.

  33. arXiv:2402.06580  [pdf, other]

    cs.LG

    SAE: Single Architecture Ensemble Neural Networks

    Authors: Martin Ferianc, Hongxiang Fan, Miguel Rodrigues

    Abstract: Ensembles of separate neural networks (NNs) have shown superior accuracy and confidence calibration over single NN across tasks. Recent methods compress ensembles within a single network via early exits or multi-input multi-output frameworks. However, the landscape of these methods is fragmented thus far, making it difficult to choose the right approach for a given task. Furthermore, the algorithm…

    Submitted 9 February, 2024; originally announced February 2024.

    Comments: 32 pages

  34. arXiv:2402.06149  [pdf, other]

    cs.CV

    HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting

    Authors: Zhenglin Zhou, Fan Ma, Hehe Fan, Yi Yang

    Abstract: Creating digital avatars from textual prompts has long been a desirable yet challenging task. Despite the promising outcomes obtained through 2D diffusion priors in recent works, current methods face challenges in achieving high-quality and animated avatars effectively. In this paper, we present HeadStudio, a novel framework that utilizes 3D Gaussian splatting to generate realistic and…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: 9 pages, 8 figures

  35. arXiv:2402.03719  [pdf, other]

    cs.CL cs.AI

    Empowering Language Models with Active Inquiry for Deeper Understanding

    Authors: Jing-Cheng Pang, Heng-Bo Fan, Pengyuan Wang, Jia-Hao Xiao, Nan Tang, Si-Hang Yang, Chengxing Jia, Sheng-Jun Huang, Yang Yu

    Abstract: The rise of large language models (LLMs) has revolutionized the way that we interact with artificial intelligence systems through natural language. However, LLMs often misinterpret user queries because of their uncertain intention, leading to less helpful responses. In natural human interactions, clarification is sought through targeted questioning to uncover obscure information. Thus, in this pap…

    Submitted 6 February, 2024; originally announced February 2024.

  36. arXiv:2401.15987  [pdf, other]

    cs.CV

    Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling

    Authors: Yuze Hao, Jianrong Zhang, Tao Zhuo, Fuan Wen, Hehe Fan

    Abstract: Hands are the main medium when people interact with the world. Generating proper 3D motion for hand-object interaction is vital for applications such as virtual reality and robotics. Although grasp tracking or object manipulation synthesis can produce coarse hand motion, this kind of motion is inevitably noisy and full of jitter. To address this problem, we propose a data-driven method for coarse…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted to AAAI 2024

  37. arXiv:2401.15071  [pdf, other]

    cs.CV

    From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

    Authors: Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, Limin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, et al. (11 additional authors not shown)

    Abstract: Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectation of the broad public, even though the most powerful OpenAI's GPT-4 and Google's Gemini have been deployed. This paper strives to enhance unde…

    Submitted 29 January, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

  38. arXiv:2401.11805  [pdf, other]

    cs.IT

    Simultaneous Blind Demixing and Super-resolution via Vectorized Hankel Lift

    Authors: Haifeng Wang, Jinchi Chen, Hulei Fan, Yuxiang Zhao, Li Yu

    Abstract: In this work, we investigate the problem of simultaneous blind demixing and super-resolution. Leveraging the subspace assumption regarding unknown point spread functions, this problem can be reformulated as a low-rank matrix demixing problem. We propose a convex recovery approach that utilizes the low-rank structure of each vectorized Hankel matrix associated with the target matrix. Our analysis r…

    Submitted 22 January, 2024; originally announced January 2024.

  39. arXiv:2401.11715  [pdf, other]

    cs.RO eess.SY

    Integrating 3D Slicer with a Dynamic Simulator for Situational Aware Robotic Interventions

    Authors: Manish Sahu, Hisashi Ishida, Laura Connolly, Hongyi Fan, Anton Deguet, Peter Kazanzides, Francis X. Creighton, Russell H. Taylor, Adnan Munawar

    Abstract: Image-guided robotic interventions represent a transformative frontier in surgery, blending advanced imaging and robotics for improved precision and outcomes. This paper addresses the critical need for integrating open-source platforms to enhance situational awareness in image-guided robotic research. We present an open-source toolset that seamlessly combines a physics-based constraint formulation…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: *These authors contributed equally

  40. arXiv:2401.01578  [pdf, other]

    cs.CV

    Context-Guided Spatio-Temporal Video Grounding

    Authors: Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, Libo Zhang

    Abstract: Spatio-temporal video grounding (or STVG) task aims at locating a spatio-temporal tube for a specific instance given a text query. Despite advancements, current methods easily suffer the distractors or heavy object appearance variations in videos due to insufficient object information from the text, leading to degradation. Addressing this, we propose a novel framework, context-guided STVG (CG-STVG…

    Submitted 3 January, 2024; originally announced January 2024.

  41. arXiv:2312.16023  [pdf, other]

    cs.CL cs.MM

    DocMSU: A Comprehensive Benchmark for Document-level Multimodal Sarcasm Understanding

    Authors: Hang Du, Guoshun Nan, Sicheng Zhang, Binzhu Xie, Junrui Xu, Hehe Fan, Qimei Cui, Xiaofeng Tao, Xudong Jiang

    Abstract: Multimodal Sarcasm Understanding (MSU) has a wide range of applications in the news field such as public opinion analysis and forgery detection. However, existing MSU benchmarks and approaches usually focus on sentence-level MSU. In document-level news, sarcasm clues are sparse or small and are often concealed in long text. Moreover, compared to sentence-level comments like tweets, which mainly fo…

    Submitted 26 December, 2023; originally announced December 2023.

  42. arXiv:2312.06581  [pdf, other]

    cs.LG cs.AI math.RT

    Grokking Group Multiplication with Cosets

    Authors: Dashiell Stander, Qinan Yu, Honglu Fan, Stella Biderman

    Abstract: The complex and unpredictable nature of deep neural networks prevents their safe use in many high-stakes applications. There have been many techniques developed to interpret deep neural networks, but all have substantial limitations. Algorithmic tasks have proven to be a fruitful test ground for interpreting a neural network end-to-end. Building on previous work, we completely reverse engineer ful…

    Submitted 17 June, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

  43. arXiv:2312.06049  [pdf, other]

    cs.CV

    SSPNet: Scale and Spatial Priors Guided Generalizable and Interpretable Pedestrian Attribute Recognition

    Authors: Jifeng Shen, Teng Guo, Xin Zuo, Heng Fan, Wankou Yang

    Abstract: Global feature based Pedestrian Attribute Recognition (PAR) models are often poorly localized when using Grad-CAM for attribute response analysis, which has a significant impact on the interpretability, generalizability and performance. Previous researches have attempted to improve generalization and interpretation through meticulous model design, yet they often have neglected or underutilized eff…

    Submitted 10 December, 2023; originally announced December 2023.

    Comments: 39 pages, 11 figures, Accepted by Pattern Recognition

  44. arXiv:2312.05104   

    cs.RO

    An Autonomous Driving Model Integrated with BEV-V2X Perception, Fusion Prediction of Motion and Occupancy, and Driving Planning, in Complex Traffic Intersections

    Authors: Fukang Li, Wenlin Ou, Kunpeng Gao, Yuwen Pang, Yifei Li, Henry Fan

    Abstract: The comprehensiveness of vehicle-to-everything (V2X) recognition enriches and holistically shapes the global Birds-Eye-View (BEV) perception, incorporating rich semantics and integrating driving scene information, thereby serving features of vehicle state prediction, decision-making and driving planning. Utilizing V2X message sets to form BEV map proves to be an effective perception method for con…

    Submitted 22 April, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

    Comments: The content of the paper has not received unanimous consent from all the members and requires further evaluation prior to submission

  45. arXiv:2312.04822  [pdf, other]

    cs.CV

    SiCP: Simultaneous Individual and Cooperative Perception for 3D Object Detection in Connected and Automated Vehicles

    Authors: Deyuan Qu, Qi Chen, Tianyu Bai, Andy Qin, Hongsheng Lu, Heng Fan, Song Fu, Qing Yang

    Abstract: Cooperative perception for connected and automated vehicles is traditionally achieved through the fusion of feature maps from two or more vehicles. However, the absence of feature maps shared from other vehicles can lead to a significant decline in object detection performance for cooperative perception models compared to standalone 3D detection models. This drawback impedes the adoption of cooper…

    Submitted 7 December, 2023; originally announced December 2023.

  46. arXiv:2312.04813  [pdf, other]

    cs.CV

    DARNet: Bridging Domain Gaps in Cross-Domain Few-Shot Segmentation with Dynamic Adaptation

    Authors: Haoran Fan, Qi Fan, Maurice Pagnucco, Yang Song

    Abstract: Few-shot segmentation (FSS) aims to segment novel classes in a query image by using only a small number of supporting images from base classes. However, in cross-domain few-shot segmentation (CD-FSS), leveraging features from label-rich domains for resource-constrained domains poses challenges due to domain discrepancies. This work presents a Dynamically Adaptive Refine (DARNet) method that aims t…

    Submitted 7 December, 2023; originally announced December 2023.

  47. arXiv:2312.03327  [pdf, other]

    cs.CV

    Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation

    Authors: Xiaobo Hu, Youfang Lin, HeHe Fan, Shuo Wang, Zhihao Wu, Kai Lv

    Abstract: Given an object of interest, visual navigation aims to reach the object's location based on a sequence of partial observations. To this end, an agent needs to 1) learn a piece of certain knowledge about the relations of object categories in the world during training and 2) look for the target object based on the pre-learned object category relations and its moving trajectory in the current unseen…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: 18 pages; 7 figures

  48. arXiv:2312.01915  [pdf, other]

    cs.CV

    A Reliable Representation with Bidirectional Transition Model for Visual Reinforcement Learning Generalization

    Authors: Xiaobo Hu, Youfang Lin, Yue Liu, Jinwen Wang, Shuo Wang, Hehe Fan, Kai Lv

    Abstract: Visual reinforcement learning has proven effective in solving control tasks with high-dimensional observations. However, extracting reliable and generalizable representations from vision-based observations remains a central challenge. Inspired by the human thought process, when the representation extracted from the observation can predict the future and trace history, the representation is reliabl…

    Submitted 4 December, 2023; originally announced December 2023.

  49. arXiv:2312.00844  [pdf, other]

    cs.CV cs.AI

    Sparse Beats Dense: Rethinking Supervision in Radar-Camera Depth Completion

    Authors: Huadong Li, Minhao Jing, Jiajun Liang, Haoqiang Fan, Renhe Ji

    Abstract: It is widely believed that the dense supervision is better than the sparse supervision in the field of depth completion, but the underlying reasons for this are rarely discussed. In this paper, we find that the challenge of using sparse supervision for training Radar-Camera depth prediction models is the Projection Transformation Collapse (PTC). The PTC implies that sparse supervision leads the mo…

    Submitted 8 December, 2023; v1 submitted 1 December, 2023; originally announced December 2023.

  50. arXiv:2312.00360  [pdf, other]

    cs.CV

    Efficient Multimodal Semantic Segmentation via Dual-Prompt Learning

    Authors: Shaohua Dong, Yunhe Feng, Qing Yang, Yan Huang, Dongfang Liu, Heng Fan

    Abstract: Multimodal (e.g., RGB-Depth/RGB-Thermal) fusion has shown great potential for improving semantic segmentation in complex scenes (e.g., indoor/low-light conditions). Existing approaches often fully fine-tune a dual-branch encoder-decoder framework with a complicated feature fusion strategy for achieving multimodal semantic segmentation, which is training-costly due to the massive parameter updates…

    Submitted 3 December, 2023; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: 11 pages, 4 figures, 9 tables