Showing 1–50 of 151 results for author: Ge, Z

Searching in archive cs.
  1. arXiv:2406.16855  [pdf, other]

    cs.CV

    DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

    Authors: Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, Shu-Tao Xia

    Abstract: Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive function in creatively generating personalized content. However, current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advan… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Project page: https://dreambenchplus.github.io/

  2. arXiv:2406.15764  [pdf, other]

    cs.CV

    TP-DRSeg: Improving Diabetic Retinopathy Lesion Segmentation with Explicit Text-Prompts Assisted SAM

    Authors: Wenxue Li, Xinyu Xiong, Peng Xia, Lie Ju, Zongyuan Ge

    Abstract: Recent advances in large foundation models, such as the Segment Anything Model (SAM), have demonstrated considerable promise across various tasks. Despite their progress, these models still encounter challenges in specialized medical image analysis, especially in recognizing subtle inter-class differences in Diabetic Retinopathy (DR) lesion segmentation. In this paper, we propose a novel framework… ▽ More

    Submitted 22 June, 2024; originally announced June 2024.

  3. IG2: Integrated Gradient on Iterative Gradient Path for Feature Attribution

    Authors: Yue Zhuo, Zhiqiang Ge

    Abstract: Feature attribution explains Artificial Intelligence (AI) at the instance level by providing importance scores of input features' contributions to model prediction. Integrated Gradients (IG) is a prominent path attribution method for deep neural networks, involving the integration of gradients along a path from the explained input (explicand) to a counterfactual instance (baseline). Current IG var… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  4. arXiv:2406.07471  [pdf, other]

    cs.CV

    OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding

    Authors: Ming Hu, Peng Xia, Lin Wang, Siyuan Yan, Feilong Tang, Zhongxing Xu, Yimin Luo, Kaimin Song, Jurgen Leitner, Xuelian Cheng, Jun Cheng, Chi Liu, Kaijing Zhou, Zongyuan Ge

    Abstract: Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets for surgical workflow analysis, which typically face challenges such as small s…

    Submitted 13 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Version 1

  5. arXiv:2406.06384  [pdf, other]

    cs.CV

    Generalizing to Unseen Domains in Diabetic Retinopathy with Disentangled Representations

    Authors: Peng Xia, Ming Hu, Feilong Tang, Wenxue Li, Wenhao Zheng, Lie Ju, Peibo Duan, Huaxiu Yao, Zongyuan Ge

    Abstract: Diabetic Retinopathy (DR), induced by diabetes, poses a significant risk of visual impairment. Accurate and effective grading of DR aids in the treatment of this condition. Yet existing models experience notable performance degradation on unseen domains due to domain shifts. Previous methods address this issue by simulating domain style through simple visual transformation and mitigating domain no… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Early Accepted by MICCAI 2024

  6. arXiv:2406.06007  [pdf, other]

    cs.LG cs.CL cs.CV cs.CY

    CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models

    Authors: Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, Wenhao Zheng, Zhaoyang Wang, Xiao Wang, Xuchao Zhang, Chetan Bansal, Marc Niethammer, Junzhou Huang, Hongtu Zhu, Yun Li, Jimeng Sun, Zongyuan Ge, Gang Li, James Zou, Huaxiu Yao

    Abstract: Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment. In this paper, we introduce CARES and aim to comprehen… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

  7. arXiv:2405.14295  [pdf, other]

    cs.CV

    Focus Anywhere for Fine-grained Multi-page Document Understanding

    Authors: Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang

    Abstract: Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR/translation/caption for regions of interest to the user, tasks that require the context of the entire page, or even multiple pages. Accordingly, this paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents. We introduce a… ▽ More

    Submitted 23 May, 2024; originally announced May 2024.

  8. arXiv:2405.11289  [pdf, other]

    eess.IV cs.CV

    Diffusion Model Driven Test-Time Image Adaptation for Robust Skin Lesion Classification

    Authors: Ming Hu, Siyuan Yan, Peng Xia, Feilong Tang, Wenxue Li, Peibo Duan, Lin Zhang, Zongyuan Ge

    Abstract: Deep learning-based diagnostic systems have demonstrated potential in skin disease diagnosis. However, their performance can easily degrade on test domains due to distribution shifts caused by input-level corruptions, such as imaging equipment variability, brightness changes, and image blur. This will reduce the reliability of model deployment in real-world scenarios. Most existing solutions focus… ▽ More

    Submitted 18 May, 2024; originally announced May 2024.

  9. arXiv:2405.02586  [pdf, other]

    cs.CV

    Generalizing CLIP to Unseen Domain via Text-Guided Diverse Novel Feature Synthesis

    Authors: Siyuan Yan, Cheng Luo, Zhen Yu, Zongyuan Ge

    Abstract: Vision-language foundation models like CLIP have shown impressive zero-shot generalization, but finetuning on downstream datasets can cause overfitting and loss of its generalization ability on unseen domains. Although collecting additional data from new domains of interest is possible, this method is often impractical due to the challenges in obtaining annotated data. To address this, we propose… ▽ More

    Submitted 4 May, 2024; originally announced May 2024.

    Comments: 24 pages

  10. arXiv:2404.18202  [pdf, other]

    cs.AI cs.MM

    WorldGPT: Empowering LLM as Multimodal World Model

    Authors: Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, Yueting Zhuang

    Abstract: World models are progressively being employed across diverse fields, extending from basic environment simulation to complex scenario construction. However, existing models are mainly trained on domain-specific states and actions, and confined to single-modality state representations. In this paper, we introduce WorldGPT, a generalist world model built upon Multimodal Large Language Model (MLLM). W…

    Submitted 28 April, 2024; originally announced April 2024.

  11. arXiv:2404.14019  [pdf]

    cs.CV eess.SP stat.AP

    A Multimodal Feature Distillation with CNN-Transformer Network for Brain Tumor Segmentation with Incomplete Modalities

    Authors: Ming Kang, Fung Fung Ting, Raphaël C. -W. Phan, Zongyuan Ge, Chee-Ming Ting

    Abstract: Existing brain tumor segmentation methods usually utilize multiple Magnetic Resonance Imaging (MRI) modalities in brain tumor images for segmentation, which can achieve better segmentation performance. However, in clinical applications, some modalities are missing due to resource constraints, leading to severe degradation in the performance of methods applying complete modality segmentation. In th… ▽ More

    Submitted 22 April, 2024; originally announced April 2024.

    MSC Class: 68U10 (Primary) 68T10; 68T07; 62P10 (Secondary) ACM Class: I.4.6; I.5.1; J.3

  12. arXiv:2404.10501  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Self-Supervised Visual Preference Alignment

    Authors: Ke Zhu, Liang Zhao, Zheng Ge, Xiangyu Zhang

    Abstract: This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization. It is based on a core idea: properly designed augmentation to the image input will induce VLM to generate false but hard n… ▽ More

    Submitted 16 April, 2024; originally announced April 2024.

  13. arXiv:2404.09987  [pdf, other]

    cs.CV

    OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

    Authors: Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang

    Abstract: Chart parsing poses a significant challenge due to the diversity of styles, values, texts, and so forth. Even advanced large vision-language models (LVLMs) with billions of parameters struggle to handle such tasks satisfactorily. To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information. Similar to popular LVLMs, OneChart incorpo… ▽ More

    Submitted 25 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

    Comments: 14 pages, 9 figures and 6 tables

  14. arXiv:2404.00947  [pdf, other]

    cs.IR

    Towards an In-Depth Comprehension of Case Relevance for Better Legal Retrieval

    Authors: Haitao Li, You Chen, Zhekai Ge, Qingyao Ai, Yiqun Liu, Quan Zhou, Shuai Huo

    Abstract: Legal retrieval techniques play an important role in preserving the fairness and equality of the judicial system. As a well-known annual international competition, COLIEE aims to advance the development of state-of-the-art retrieval models for legal texts. This paper elaborates on the methodology employed by the TQM team in COLIEE2024. Specifically, we explored various lexical matching and seman…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: 16 pages

  15. arXiv:2403.13417  [pdf, other]

    cs.CV

    Diversified and Personalized Multi-rater Medical Image Segmentation

    Authors: Yicheng Wu, Xiangde Luo, Zhe Xu, Xiaoqing Guo, Lie Ju, Zongyuan Ge, Wenjun Liao, Jianfei Cai

    Abstract: Annotation ambiguity due to inherent data uncertainties such as blurred boundaries in medical scans and different observer expertise and preferences has become a major obstacle for training deep-learning based medical image segmentation models. To address it, the common practice is to gather multiple annotations from different experts, leading to the setting of multi-rater medical image segmentati… ▽ More

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  16. arXiv:2403.09274  [pdf, other]

    cs.CV

    EventRPG: Event Data Augmentation with Relevance Propagation Guidance

    Authors: Mingyuan Sun, Donghao Zhang, Zongyuan Ge, Jiaxu Wang, Jia Li, Zheng Fang, Renjing Xu

    Abstract: Event camera, a novel bio-inspired vision sensor, has drawn a lot of attention for its low latency, low power consumption, and high dynamic range. Currently, overfitting remains a critical problem in event-based classification tasks for Spiking Neural Network (SNN) due to its relatively weak spatial representation capability. Data augmentation is a simple but efficient method to alleviate overfitt… ▽ More

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: Accepted by ICLR 2024

  17. arXiv:2403.07630  [pdf, other]

    cs.CV cs.AI

    Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation

    Authors: Feilong Tang, Zhongxing Xu, Zhaojun Qu, Wei Feng, Xingjian Jiang, Zongyuan Ge

    Abstract: Recent weakly supervised semantic segmentation (WSSS) methods strive to incorporate contextual knowledge to improve the completeness of class activation maps (CAM). In this work, we argue that the knowledge bias between instances and contexts affects the capability of the prototype to sufficiently understand instance semantics. Inspired by prototype learning theory, we propose leveraging prototype… ▽ More

    Submitted 12 March, 2024; originally announced March 2024.

  18. arXiv:2402.17766  [pdf, other]

    cs.CV

    ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

    Authors: Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, He Wang, Li Yi, Kaisheng Ma

    Abstract: This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++ that benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point clo… ▽ More

    Submitted 6 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

    Comments: Project page: https://qizekun.github.io/shapellm/

  19. arXiv:2402.02498  [pdf, other]

    eess.IV cs.AI cs.CV

    Fully Differentiable Correlation-driven 2D/3D Registration for X-ray to CT Image Fusion

    Authors: Minheng Chen, Zhirun Zhang, Shuheng Gu, Zhangyang Ge, Youyong Kong

    Abstract: Image-based rigid 2D/3D registration is a critical technique for fluoroscopic guided surgical interventions. In recent years, some learning-based fully differentiable methods have produced beneficial outcomes while the process of feature extraction and gradient flow transmission still lack controllability and interpretability. To alleviate these problems, in this work, we propose a novel fully dif… ▽ More

    Submitted 15 March, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

    Comments: ISBI 2024

  20. arXiv:2401.12503  [pdf, other]

    cs.CV

    Small Language Model Meets with Reinforced Vision Vocabulary

    Authors: Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang

    Abstract: Playing Large Vision Language Models (LVLMs) in 2023 is trendy among the AI community. However, the relatively large number of parameters (more than 7B) of popular LVLMs makes it difficult to train and deploy on consumer GPUs, discouraging many researchers with limited resources. Imagine how cool it would be to experience all the features of current LVLMs on an old GTX1080ti (our only game card).… ▽ More

    Submitted 23 January, 2024; originally announced January 2024.

  21. arXiv:2401.03002  [pdf, other]

    eess.IV cs.CV

    Prompt-driven Latent Domain Generalization for Medical Image Classification

    Authors: Siyuan Yan, Chi Liu, Zhen Yu, Lie Ju, Dwarikanath Mahapatra, Brigid Betz-Stablein, Victoria Mar, Monika Janda, Peter Soyer, Zongyuan Ge

    Abstract: Deep learning models for medical image analysis easily suffer from distribution shifts caused by dataset artifacts bias, camera variations, differences in the imaging station, etc., leading to unreliable diagnoses in real-world clinical settings. Domain generalization (DG) methods, which aim to train models on multiple domains to perform well on unseen domains, offer a promising direction to solve… ▽ More

    Submitted 5 January, 2024; originally announced January 2024.

    Comments: 10 pages

  22. arXiv:2312.14481  [pdf, other]

    cs.CV cs.AI cs.RO

    SurgicalPart-SAM: Part-to-Whole Collaborative Prompting for Surgical Instrument Segmentation

    Authors: Wenxi Yue, Jing Zhang, Kun Hu, Qiuxia Wu, Zongyuan Ge, Yong Xia, Jiebo Luo, Zhiyong Wang

    Abstract: The Segment Anything Model (SAM) exhibits promise in generic object segmentation and offers potential for various applications. Existing methods have applied SAM to surgical instrument segmentation (SIS) by tuning SAM-based frameworks with surgical data. However, they fall short in two crucial aspects: (1) Straightforward model tuning with instrument masks treats each instrument as a single entity… ▽ More

    Submitted 22 March, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: Technical Report. The source code will be released at https://github.com/wenxi-yue/SurgicalPart-SAM

  23. arXiv:2312.06109  [pdf, other]

    cs.CV

    Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

    Authors: Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang

    Abstract: Modern Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary -- CLIP, which can cover most common vision tasks. However, for some special vision task that needs dense and fine-grained vision perception, e.g., document-level OCR or chart understanding, especially in non-English scenarios, the CLIP-style vocabulary may encounter low efficiency in tokenizing the vision knowledge and e… ▽ More

    Submitted 10 December, 2023; originally announced December 2023.

  24. arXiv:2312.01943  [pdf, other]

    cs.CV cs.GR

    Instance-guided Cartoon Editing with a Large-scale Dataset

    Authors: Jian Lin, Chengze Li, Xueting Liu, Zhongping Ge

    Abstract: Cartoon editing, appreciated by both professional illustrators and hobbyists, allows extensive creative freedom and the development of original narratives within the cartoon domain. However, the existing literature on cartoon editing is complex and leans heavily on manual operations, owing to the challenge of automatic identification of individual character instances. Therefore, an automated segme… ▽ More

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: Project page: https://cartoonsegmentation.github.io/ 10 pages, 10 figures

    ACM Class: I.4.6; I.3.3; I.3.8

  25. arXiv:2312.00589  [pdf, other]

    cs.CV

    Merlin: Empowering Multimodal LLMs with Foresight Minds

    Authors: En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao

    Abstract: Humans possess the remarkable ability to foresee the future to a certain extent based on present observations, a skill we term as foresight minds. However, this capability remains largely under explored within existing Multimodal Large Language Models (MLLMs), hindering their capacity to learn the fundamental principles of how things operate and the intentions behind the observed subjects. To addr… ▽ More

    Submitted 30 November, 2023; originally announced December 2023.

  26. arXiv:2311.14411  [pdf, other]

    cs.RO

    Receding Horizon Optimization with PPUM: An Approach for Autonomous Robot Path Planning in Uncertain Environments

    Authors: Zijian Ge, Jingjing Jiang, Matthew Coombes, Liang Sun

    Abstract: The ability to understand spatial-temporal patterns for crowds of people is crucial for achieving long-term autonomy of mobile robots deployed in human environments. However, traditional historical data-driven memory models are inadequate for handling anomalies, resulting in poor reasoning by robot in estimating the crowd spatial distribution. In this article, a Receding Horizon Optimization (RHO)… ▽ More

    Submitted 24 November, 2023; originally announced November 2023.

  27. arXiv:2311.14064  [pdf, other]

    cs.CV

    HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding

    Authors: Peng Xia, Xingtong Yu, Ming Hu, Lie Ju, Zhiyong Wang, Peibo Duan, Zongyuan Ge

    Abstract: Object categories are typically organized into a multi-granularity taxonomic hierarchy. When classifying categories at different hierarchy levels, traditional uni-modal approaches focus primarily on image features, revealing limitations in complex scenarios. Recent studies integrating Vision-Language Models (VLMs) with class hierarchies have shown promise, yet they fall short of fully exploiting t… ▽ More

    Submitted 14 March, 2024; v1 submitted 23 November, 2023; originally announced November 2023.

  28. arXiv:2311.05316  [pdf, other]

    cs.LG cs.AI

    ABIGX: A Unified Framework for eXplainable Fault Detection and Classification

    Authors: Yue Zhuo, Jinchuan Qian, Zhihuan Song, Zhiqiang Ge

    Abstract: For explainable fault detection and classification (FDC), this paper proposes a unified framework, ABIGX (Adversarial fault reconstruction-Based Integrated Gradient eXplanation). ABIGX is derived from the essentials of previous successful fault diagnosis methods, contribution plots (CP) and reconstruction-based contribution (RBC). It is the first explanation framework that provides variable contri… ▽ More

    Submitted 9 November, 2023; originally announced November 2023.

  29. arXiv:2311.01009  [pdf, other]

    cs.CV cs.AI

    Revamping AI Models in Dermatology: Overcoming Critical Challenges for Enhanced Skin Lesion Diagnosis

    Authors: Deval Mehta, Brigid Betz-Stablein, Toan D Nguyen, Yaniv Gal, Adrian Bowling, Martin Haskett, Maithili Sashindranath, Paul Bonnington, Victoria Mar, H Peter Soyer, Zongyuan Ge

    Abstract: The surge in developing deep learning models for diagnosing skin lesions through image analysis is notable, yet their clinical translation faces challenges. Current dermatology AI models have limitations: limited number of possible diagnostic outputs, lack of real-world testing on uncommon skin lesions, inability to detect out-of-distribution images, and over-reliance on dermoscopic images. To address t…

    Submitted 2 November, 2023; originally announced November 2023.

  30. arXiv:2310.13347  [pdf, other]

    cs.CV cs.AI

    NurViD: A Large Expert-Level Video Database for Nursing Procedure Activity Understanding

    Authors: Ming Hu, Lin Wang, Siyuan Yan, Don Ma, Qingli Ren, Peng Xia, Wei Feng, Peibo Duan, Lie Ju, Zongyuan Ge

    Abstract: The application of deep learning to nursing procedure activity understanding has the potential to greatly enhance the quality and safety of nurse-patient interactions. By utilizing the technique, we can facilitate training and education, improve quality control, and enable operational compliance monitoring. However, the development of automatic recognition systems in this field is currently hinder… ▽ More

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: Accepted by NeurIPS 2023 Datasets and Benchmarks Track

  31. arXiv:2309.16451  [pdf, other]

    cs.CV

    Towards Novel Class Discovery: A Study in Novel Skin Lesions Clustering

    Authors: Wei Feng, Lie Ju, Lin Wang, Kaimin Song, Zongyuan Ge

    Abstract: Existing deep learning models have achieved promising performance in recognizing skin diseases from dermoscopic images. However, these models can only recognize samples from predefined categories, when they are deployed in the clinic, data from new unknown categories are constantly emerging. Therefore, it is crucial to automatically discover and identify new semantic categories from new data. In t… ▽ More

    Submitted 28 September, 2023; originally announced September 2023.

    Comments: 10 pages, 1 figure, accepted by MICCAI 2023

  32. arXiv:2309.11499  [pdf, other]

    cs.CV cs.CL cs.LG

    DreamLLM: Synergistic Multimodal Comprehension and Creation

    Authors: Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi

    Abstract: This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This a… ▽ More

    Submitted 15 March, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

    Comments: ICLR 2024 (Spotlight)

  33. arXiv:2309.09689  [pdf, other]

    cs.CV cs.AI

    Ugly Ducklings or Swans: A Tiered Quadruplet Network with Patient-Specific Mining for Improved Skin Lesion Classification

    Authors: Nathasha Naranpanawa, H. Peter Soyer, Adam Mothershaw, Gayan K. Kulatilleke, Zongyuan Ge, Brigid Betz-Stablein, Shekhar S. Chandra

    Abstract: An ugly duckling is an obviously different skin lesion from surrounding lesions of an individual, and the ugly duckling sign is a criterion used to aid in the diagnosis of cutaneous melanoma by differentiating between highly suspicious and benign lesions. However, the appearance of pigmented lesions, can change drastically from one patient to another, resulting in difficulties in visual separation… ▽ More

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: 12 pages, 6 figures

  34. arXiv:2309.08794  [pdf, other]

    cs.AI cs.CV

    Privacy-preserving Early Detection of Epileptic Seizures in Videos

    Authors: Deval Mehta, Shobi Sivathamboo, Hugh Simpson, Patrick Kwan, Terence O'Brien, Zongyuan Ge

    Abstract: In this work, we contribute towards the development of video-based epileptic seizure classification by introducing a novel framework (SETR-PKD), which could achieve privacy-preserved early detection of seizures in videos. Specifically, our framework has two significant components - (1) It is built upon optical flow features extracted from the video of a seizure, which encodes the seizure motion se… ▽ More

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted to MICCAI 2023

  35. arXiv:2308.11256  [pdf, other]

    cs.GT cs.AI cs.LG

    Efficient Last-iterate Convergence Algorithms in Solving Games

    Authors: Linjian Meng, Zhenxing Ge, Wenbin Li, Bo An, Yang Gao

    Abstract: No-regret algorithms are popular for learning Nash equilibrium (NE) in two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs). Many recent works consider the last-iterate convergence no-regret algorithms. Among them, the two most famous algorithms are Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weight Update (OMWU). However, OGDA has high per-itera… ▽ More

    Submitted 22 August, 2023; originally announced August 2023.

  36. arXiv:2308.10601  [pdf, other]

    cs.CV cs.CR cs.LG eess.IV

    Improving the Transferability of Adversarial Examples with Arbitrary Style Transfer

    Authors: Zhijin Ge, Fanhua Shang, Hongying Liu, Yuanyuan Liu, Liang Wan, Wei Feng, Xiaosen Wang

    Abstract: Deep neural networks are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on clean inputs. Although many attack methods can achieve high success rates in the white-box setting, they also exhibit weak transferability in the black-box setting. Recently, various methods have been proposed to improve adversarial transferability, in which the input transformation… ▽ More

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: 10 pages, 2 figures, accepted by the 31st ACM International Conference on Multimedia (MM '23)

  37. arXiv:2308.04666  [pdf, other]

    cs.SD eess.AS

    Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation

    Authors: Zirui Ge, Xinzhou Xu, Haiyan Guo, Tingting Wang, Zhen Yang

    Abstract: The emergence of self-supervised representation (i.e., wav2vec 2.0) allows speaker-recognition approaches to process spoken signals through foundation models built on speech data. Nevertheless, effective fusion on the representation requires further investigation, due to the inclusion of fixed or sub-optimal temporal pooling strategies. Despite improved strategies considering graph learning and…

    Submitted 23 February, 2024; v1 submitted 8 August, 2023; originally announced August 2023.

    Comments: 9 pages, 4 figures

  38. arXiv:2308.04152  [pdf, other]

    cs.CV

    Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

    Authors: Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, Yueting Zhuang

    Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can recognize. This is achieved by training the VPGs on millions of image-caption pairs, where the VPG-generated tokens of images are fed into a frozen LLM to generate the corresponding captions. However, this image-captioning based tr… ▽ More

    Submitted 25 May, 2024; v1 submitted 8 August, 2023; originally announced August 2023.

    Comments: Accepted by ICLR 2024 (Spotlight)

  39. arXiv:2307.16601  [pdf, other]

    cs.CV

    Sampling to Distill: Knowledge Transfer from Open-World Data

    Authors: Yuzheng Wang, Zhaoyu Chen, Jie Zhang, Dingkang Yang, Zuhao Ge, Yang Liu, Siao Liu, Yunquan Sun, Wenqiang Zhang, Lizhe Qi

    Abstract: Data-Free Knowledge Distillation (DFKD) is a novel task that aims to train high-performance student models using only the teacher network without original training data. Despite encouraging results, existing DFKD methods rely heavily on generation modules with high computational costs. Meanwhile, they ignore the fact that the generated and original data exist domain shifts due to the lack of super… ▽ More

    Submitted 31 July, 2023; originally announced July 2023.

  40. arXiv:2307.11307  [pdf, other]

    cs.CV

    EndoSurf: Neural Surface Reconstruction of Deformable Tissues with Stereo Endoscope Videos

    Authors: Ruyi Zha, Xuelian Cheng, Hongdong Li, Mehrtash Harandi, Zongyuan Ge

    Abstract: Reconstructing soft tissues from stereo endoscope videos is an essential prerequisite for many medical applications. Previous methods struggle to produce high-quality geometry and appearance due to their inadequate representations of 3D scenes. To address this issue, we propose a novel neural-field-based method, called EndoSurf, which effectively learns to represent a deforming surface from an RGB… ▽ More

    Submitted 3 September, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: MICCAI 2023 (Oral, Student Travel Award, Top 3%); Ruyi Zha and Xuelian Cheng made equal contributions. Corresponding author: Ruyi Zha

  41. arXiv:2307.09474  [pdf, other]

    cs.CL cs.CV

    ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning

    Authors: Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang

    Abstract: Human-AI interactivity is a critical aspect that reflects the usability of multimodal large language models (MLLMs). However, existing end-to-end MLLMs only allow users to interact with them through language instructions, leading to the limitation of the interactive accuracy and efficiency. In this study, we present precise referring instructions that utilize diverse reference representations such… ▽ More

    Submitted 18 July, 2023; originally announced July 2023.

    Comments: 15 pages, 8 figures

  42. arXiv:2307.09472  [pdf, other]

    cs.CV

    GroupLane: End-to-End 3D Lane Detection with Channel-wise Grouping

    Authors: Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Hengshuang Zhao, Xiangyu Zhang

    Abstract: Efficiency is quite important for 3D lane detection due to practical deployment demand. In this work, we propose a simple, fast, and end-to-end detector that still maintains high detection precision. Specifically, we devise a set of fully convolutional heads based on row-wise classification. In contrast to previous counterparts, ours supports recognizing both vertical and horizontal lanes. Besides…

    Submitted 18 July, 2023; originally announced July 2023.

  43. arXiv:2306.17450  [pdf, other]

    cs.CV cs.AI

    GMM: Delving into Gradient Aware and Model Perceive Depth Mining for Monocular 3D Detection

    Authors: Weixin Mao, Jinrong Yang, Zheng Ge, Lin Song, Hongyu Zhou, Tiezheng Mao, Zeming Li, Osamu Yoshie

    Abstract: Depth perception is a crucial component of monocular 3D detection tasks that typically involve ill-posed problems. In light of the success of sample mining techniques in 2D object detection, we propose a simple yet effective mining strategy for improving depth perception in 3D object detection. Concretely, we introduce a plain metric to evaluate the quality of depth predictions, which chooses the…

    Submitted 30 June, 2023; originally announced June 2023.

    Comments: 8 pages, 4 figures

  44. arXiv:2306.09590  [pdf, other]

    cs.CV

    The 1st-place Solution for CVPR 2023 OpenLane Topology in Autonomous Driving Challenge

    Authors: Dongming Wu, Fan Jia, Jiahao Chang, Zhuoling Li, Jianjian Sun, Chunrui Han, Shuailin Li, Yingfei Liu, Zheng Ge, Tiancai Wang

    Abstract: We present the 1st-place solution of OpenLane Topology in Autonomous Driving Challenge. Considering that topology reasoning is based on centerline detection and traffic element detection, we develop a multi-stage framework for high performance. Specifically, the centerline is detected by the powerful PETRv2 detector and the popular YOLOv8 is employed to detect the traffic elements. Further, we des…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Accepted by CVPR2023 Workshop (https://opendrivelab.com/AD23Challenge.html#openlane_topology)

  45. arXiv:2306.05225  [pdf, other]

    cs.CV cs.CR cs.LG

    Boosting Adversarial Transferability by Achieving Flat Local Maxima

    Authors: Zhijin Ge, Hongying Liu, Xiaosen Wang, Fanhua Shang, Yuanyuan Liu

    Abstract: Transfer-based attack adopts the adversarial examples generated on the surrogate model to attack various models, making it applicable in the physical world and attracting increasing interest. Recently, various adversarial attacks have emerged to boost adversarial transferability from different perspectives. In this work, inspired by the observation that flat local minima are correlated with good g…

    Submitted 2 November, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: Accepted by the Neural Information Processing Systems (NeurIPS 2023)

  46. arXiv:2305.09933  [pdf, other]

    cs.RO

    Impact of ROS 2 Node Composition in Robotic Systems

    Authors: Steve Macenski, Alberto Soragna, Michael Carroll, Zhenpeng Ge

    Abstract: The Robot Operating System 2 (ROS 2) is the second generation of ROS representing a step forward in the robotic framework. Several new types of nodes and executor models are integral to control where, how, and when information is processed in the computational graph. This paper explores and benchmarks one of these new node types -- the Component node -- which allows nodes to be composed manually o…

    Submitted 16 May, 2023; originally announced May 2023.

    Comments: IEEE Robotics and Automation Letters, 2023

  47. arXiv:2305.04536  [pdf, other]

    cs.CV

    LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition

    Authors: Peng Xia, Di Xu, Ming Hu, Lie Ju, Zongyuan Ge

    Abstract: Long-tailed multi-label visual recognition (LTML) task is a highly challenging task due to the label co-occurrence and imbalanced data distribution. In this work, we propose a unified framework for LTML, namely prompt tuning with class-specific embedding loss (LMPT), capturing the semantic feature interactions between categories by combining text and image modality data and improving the performan…

    Submitted 18 June, 2024; v1 submitted 8 May, 2023; originally announced May 2023.

    Comments: Accepted by 3rd Workshop on Advances in Language and Vision Research (ALVR) @ ACL 2024

  48. arXiv:2305.00696  [pdf, other]

    cs.CV

    TPMIL: Trainable Prototype Enhanced Multiple Instance Learning for Whole Slide Image Classification

    Authors: Litao Yang, Deval Mehta, Sidong Liu, Dwarikanath Mahapatra, Antonio Di Ieva, Zongyuan Ge

    Abstract: Digital pathology based on whole slide images (WSIs) plays a key role in cancer diagnosis and clinical practice. Due to the high resolution of the WSI and the unavailability of patch-level annotations, WSI classification is usually formulated as a weakly supervised problem, which relies on multiple instance learning (MIL) based on patches of a WSI. In this paper, we aim to learn an optimal patch-l…

    Submitted 1 May, 2023; originally announced May 2023.

    Comments: Accepted for MIDL 2023

  49. arXiv:2304.07527  [pdf, other]

    cs.CV

    Align-DETR: Improving DETR with Simple IoU-aware BCE loss

    Authors: Zhi Cai, Songtao Liu, Guodong Wang, Zheng Ge, Xiangyu Zhang, Di Huang

    Abstract: DETR has set up a simple end-to-end pipeline for object detection by formulating this task as a set prediction problem, showing promising potential. However, despite the significant progress in improving DETR, this paper identifies a problem of misalignment in the output distribution, which prevents the best-regressed samples from being assigned with high confidence, hindering the model's accuracy…

    Submitted 15 April, 2023; originally announced April 2023.

  50. arXiv:2304.04185  [pdf, other]

    cs.CV

    BEVStereo++: Accurate Depth Estimation in Multi-view 3D Object Detection via Dynamic Temporal Stereo

    Authors: Yinhao Li, Jinrong Yang, Jianjian Sun, Han Bao, Zheng Ge, Li Xiao

    Abstract: Bounded by the inherent ambiguity of depth perception, contemporary multi-view 3D object detection methods fall into the performance bottleneck. Intuitively, leveraging temporal multi-view stereo (MVS) technology is the natural knowledge for tackling this ambiguity. However, traditional attempts of MVS has two limitations when applying to 3D object detection scenes: 1) The affinity measurement amo…

    Submitted 9 April, 2023; originally announced April 2023.