Showing 1–50 of 320 results for author: Yuille, A

  1. arXiv:2406.20092  [pdf, other]

    cs.CV

    LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression

    Authors: Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille

    Abstract: While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in large multi-modal models (LMMs) has remained a largely overlooked area. In this work, we present a study of visual-token redundancy and efficient training within these models. Our initial experiments show…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Code is available at https://github.com/Beckschen/LLaVolta

  2. arXiv:2406.09613  [pdf, other]

    cs.CV

    ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

    Authors: Wufei Ma, Guanning Zeng, Guofeng Zhang, Qihao Liu, Letian Zhang, Adam Kortylewski, Yaoyao Liu, Alan Yuille

    Abstract: A vision model with general-purpose object-level 3D understanding should be capable of inferring both 2D (e.g., class name and bounding box) and 3D information (e.g., 3D location and 3D viewpoint) for arbitrary rigid objects in natural images. This is a challenging task, as it involves inferring 3D information from 2D signals and most importantly, generalizing to rigid objects from unseen categori… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  3. arXiv:2406.07537  [pdf, other]

    cs.CV

    Autoregressive Pretraining with Mamba in Vision

    Authors: Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie

    Abstract: The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature can well capitalize on the Mamba's unidirectional recurrent structur… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

  4. arXiv:2406.05565  [pdf, other]

    cs.CV

    Medical Vision Generalist: Unifying Medical Imaging Tasks in Context

    Authors: Sucheng Ren, Xiaoke Huang, Xianhang Li, Junfei Xiao, Jieru Mei, Zeyu Wang, Alan Yuille, Yuyin Zhou

    Abstract: This study presents Medical Vision Generalist (MVG), the first foundation model capable of handling various medical imaging tasks -- such as cross-modal synthesis, image segmentation, denoising, and inpainting -- within a unified image-to-image generation framework. Specifically, MVG employs an in-context generation strategy that standardizes the handling of inputs and outputs as images. By treati… ▽ More

    Submitted 8 June, 2024; originally announced June 2024.

  5. arXiv:2406.04322  [pdf, other]

    cs.CV

    DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

    Authors: Qihao Liu, Yi Zhang, Song Bai, Adam Kortylewski, Alan Yuille

    Abstract: We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data, limiting them to single or few-class generation, our model is directly trained on extensive noisy and unaligned `in-the-wild' 3D assets, mitigating the key challenge… ▽ More

    Submitted 6 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted to CVPR 2024. Code: https://github.com/qihao067/direct3d Project page: https://direct-3d.github.io/

  6. arXiv:2406.00622  [pdf, other]

    cs.CV cs.AI

    Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering

    Authors: Xingrui Wang, Wufei Ma, Angtian Wang, Shuo Chen, Adam Kortylewski, Alan Yuille

    Abstract: For vision-language models (VLMs), understanding the dynamic properties of objects and their interactions within 3D scenes from video is crucial for effective reasoning. In this work, we introduce a video question answering dataset SuperCLEVR-Physics that focuses on the dynamic properties of objects. We concentrate on physical concepts -- velocity, acceleration, and collisions within 4D scenes, w…

    Submitted 2 June, 2024; originally announced June 2024.

  7. arXiv:2406.00327  [pdf, other]

    cs.CV

    Quality Sentinel: Estimating Label Quality and Errors in Medical Segmentation Datasets

    Authors: Yixiong Chen, Zongwei Zhou, Alan Yuille

    Abstract: An increasing number of public datasets have shown a transformative impact on automated medical segmentation. However, these datasets often vary in label quality, ranging from manual expert annotations to AI-generated pseudo-annotations. There is no systematic, reliable, and automatic quality control (QC). To fill this gap, we introduce a regression model, Quality Sentinel, to estim…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: 13 pages, 6 figures, 3 tables

  8. arXiv:2405.18356  [pdf, other]

    eess.IV cs.CV

    Universal and Extensible Language-Vision Models for Organ Segmentation and Tumor Detection from Abdominal Computed Tomography

    Authors: Jie Liu, Yixiao Zhang, Kang Wang, Mehmet Can Yavuz, Xiaoxi Chen, Yixuan Yuan, Haoliang Li, Yang Yang, Alan Yuille, Yucheng Tang, Zongwei Zhou

    Abstract: The advancement of artificial intelligence (AI) for organ segmentation and tumor detection is propelled by the growing availability of computed tomography (CT) datasets with detailed, per-voxel annotations. However, these AI models often struggle with flexibility for partially annotated datasets and extensibility for new classes due to limitations in the one-hot encoding, architectural design, and… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Accepted to Medical Image Analysis

  9. arXiv:2405.15160  [pdf, other]

    cs.CV

    ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning

    Authors: Sucheng Ren, Hongru Zhu, Chen Wei, Yijiang Li, Alan Yuille, Cihang Xie

    Abstract: This paper presents a new self-supervised video representation learning framework, ARVideo, which autoregressively predicts the next video token in a tailored sequence order. Two key designs are included. First, we organize autoregressive video tokens into clusters that span both spatially and temporally, thereby enabling a richer aggregation of contextual information compared to the standard spat… ▽ More

    Submitted 23 May, 2024; originally announced May 2024.

  10. arXiv:2405.15125  [pdf, other]

    cs.CV

    HDR-GS: Efficient High Dynamic Range Novel View Synthesis at 1000x Speed via Gaussian Splatting

    Authors: Yuanhao Cai, Zihao Xiao, Yixun Liang, Minghan Qin, Yulun Zhang, Xiaokang Yang, Yaoyao Liu, Alan Yuille

    Abstract: High dynamic range (HDR) novel view synthesis (NVS) aims to create photorealistic images from novel viewpoints using HDR imaging techniques. The rendered HDR images capture a wider range of brightness levels containing more details of the scene than normal low dynamic range (LDR) images. Existing HDR NVS methods are mainly based on NeRF. They suffer from long training time and slow inference speed… ▽ More

    Submitted 27 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: The first 3D Gaussian Splatting-based method for HDR imaging

  11. arXiv:2405.14858  [pdf, other]

    cs.CV

    Mamba-R: Vision Mamba ALSO Needs Registers

    Authors: Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

    Abstract: Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba -- they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we… ▽ More

    Submitted 23 May, 2024; originally announced May 2024.

  12. arXiv:2404.14248  [pdf, other]

    cs.CV

    NTIRE 2024 Challenge on Low Light Image Enhancement: Methods and Results

    Authors: Xiaoning Liu, Zongwei Wu, Ao Li, Florin-Alexandru Vasluianu, Yulun Zhang, Shuhang Gu, Le Zhang, Ce Zhu, Radu Timofte, Zhi Jin, Hongjun Wu, Chenxi Wang, Haitao Ling, Yuanhao Cai, Hao Bian, Yuxin Zheng, Jing Lin, Alan Yuille, Ben Shao, ** Guo, Tianli Liu, Mohao Wu, Yixu Feng, Shuo Hou, Haotian Lin, et al. (87 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results. The aim of this challenge is to discover an effective network design or solution capable of generating brighter, clearer, and visually appealing results when dealing with a variety of conditions, including ultra-high resolution (4K and beyond), non-uniform illumination, backlig… ▽ More

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: NTIRE 2024 Challenge Report

  13. arXiv:2404.05626  [pdf, other]

    cs.CV

    Learning a Category-level Object Pose Estimator without Pose Annotations

    Authors: Fengrui Tian, Yaoyao Liu, Adam Kortylewski, Yueqi Duan, Shaoyi Du, Alan Yuille, Angtian Wang

    Abstract: 3D object pose estimation is a challenging task. Previous works always require thousands of object images with annotated poses for learning the 3D pose correspondence, which is laborious and time-consuming for labeling. In this paper, we propose to learn a category-level 3D object pose estimator without pose annotations. Instead of using manually annotated images, we leverage diffusion models (e.g… ▽ More

    Submitted 8 April, 2024; originally announced April 2024.

  14. arXiv:2404.02132  [pdf, other]

    cs.CV

    ViTamin: Designing Scalable Vision Models in the Vision-Language Era

    Authors: Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen

    Abstract: Recent breakthroughs in vision-language models (VLMs) start a new page in the vision community. The VLMs provide stronger and more generalizable feature embeddings compared to those from ImageNet-pretrained models, thanks to the training on the large-scale Internet image-text pairs. However, despite the amazing achievement from the VLMs, vanilla Vision Transformers (ViTs) remain the default choice… ▽ More

    Submitted 3 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Comments: CVPR 2024; https://github.com/Beckschen/ViTamin

  15. arXiv:2403.08689  [pdf, other]

    eess.IV cs.CV

    Exploiting Structural Consistency of Chest Anatomy for Unsupervised Anomaly Detection in Radiography Images

    Authors: Tiange Xiang, Yixiao Zhang, Yongyi Lu, Alan Yuille, Chaoyi Zhang, Weidong Cai, Zongwei Zhou

    Abstract: Radiography imaging protocols focus on particular body regions, therefore producing images of great similarity and yielding recurrent anatomical structures across patients. Exploiting this structured information could potentially ease the detection of anomalies from radiography images. To this end, we propose a Simple Space-Aware Memory Matrix for In-painting and Detecting anomalies from radiograp… ▽ More

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). arXiv admin note: substantial text overlap with arXiv:2111.13495

  16. arXiv:2403.07277  [pdf, other]

    cs.CV cs.AI

    A Bayesian Approach to OOD Robustness in Image Classification

    Authors: Prakhar Kaushik, Adam Kortylewski, Alan Yuille

    Abstract: An important and unsolved problem in computer vision is to ensure that the algorithms are robust to changes in image domains. We address this problem in the scenario where we have access to images from the target domains but no annotations. Motivated by the challenges of the OOD-CV benchmark where we encounter real world Out-of-Domain (OOD) nuisances and occlusion, we introduce a novel Bayesian ap… ▽ More

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  17. arXiv:2403.06459  [pdf, other]

    eess.IV cs.CV

    From Pixel to Cancer: Cellular Automata in Computed Tomography

    Authors: Yuxiang Lai, Xiaoxi Chen, Angtian Wang, Alan Yuille, Zongwei Zhou

    Abstract: AI for cancer detection encounters the bottleneck of data scarcity, annotation difficulty, and low prevalence of early tumors. Tumor synthesis seeks to create artificial tumors in medical images, which can greatly diversify the data and annotations for AI training. However, current tumor synthesis approaches are not applicable across different organs due to their need for specific expertise and de… ▽ More

    Submitted 11 March, 2024; originally announced March 2024.

  18. arXiv:2403.04116  [pdf, other]

    eess.IV cs.CV

    Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis

    Authors: Yuanhao Cai, Yixun Liang, Jiahao Wang, Angtian Wang, Yulun Zhang, Xiaokang Yang, Zongwei Zhou, Alan Yuille

    Abstract: X-ray is widely applied for transmission imaging due to its stronger penetration than natural light. When rendering novel view X-ray projections, existing methods mainly based on NeRF suffer from long training time and slow inference speed. In this paper, we propose a 3D Gaussian splatting-based framework, namely X-Gaussian, for X-ray novel view synthesis. Firstly, we redesign a radiative Gaussian… ▽ More

    Submitted 6 March, 2024; originally announced March 2024.

    Comments: The first 3D Gaussian Splatting-based method for X-ray 3D reconstruction

  19. arXiv:2402.19470  [pdf, other]

    eess.IV cs.CV

    Towards Generalizable Tumor Synthesis

    Authors: Qi Chen, Xiaoxi Chen, Haorui Song, Zhiwei Xiong, Alan Yuille, Chen Wei, Zongwei Zhou

    Abstract: Tumor synthesis enables the creation of artificial tumors in medical images, facilitating the training of AI models for tumor detection and segmentation. However, success in tumor synthesis hinges on creating visually realistic tumors that are generalizable across multiple organs and, furthermore, the resulting AI models being capable of detecting real tumors in images sourced from different domai… ▽ More

    Submitted 28 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR 2024)

  20. arXiv:2402.19423  [pdf, other]

    cs.CV cs.AI

    Leveraging AI Predicted and Expert Revised Annotations in Interactive Segmentation: Continual Tuning or Full Training?

    Authors: Tiezheng Zhang, Xiaoxi Chen, Chongyu Qu, Alan Yuille, Zongwei Zhou

    Abstract: Interactive segmentation, an integration of AI algorithms and human expertise, promises to improve the accuracy and efficiency of curating large-scale, detailed-annotated datasets in healthcare. Human experts revise the annotations predicted by AI, and in turn, AI improves its predictions by learning from these revised annotations. This interactive process continues to enhance the quality of annot…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: IEEE International Symposium on Biomedical Imaging (ISBI)

  21. arXiv:2402.10896  [pdf, other]

    cs.CV

    PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

    Authors: Junfei Xiao, Zheng Xu, Alan Yuille, Shen Yan, Boyu Wang

    Abstract: This paper demonstrates that a progressively aligned language model can effectively bridge frozen vision encoders and large language models (LLMs). While the fundamental architecture and pre-training methods of vision encoders and LLMs have been extensively studied, the architecture and training strategy of vision-language adapters vary significantly across recent works. Our research undertakes a… ▽ More

    Submitted 31 May, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: Technical report, 15 pages; v2 fix typos, add additional results in appendix

  22. arXiv:2401.10848  [pdf, other]

    cs.CV cs.AI

    Source-Free and Image-Only Unsupervised Domain Adaptation for Category Level Object Pose Estimation

    Authors: Prakhar Kaushik, Aayush Mishra, Adam Kortylewski, Alan Yuille

    Abstract: We consider the problem of source-free unsupervised category-level pose estimation from only RGB images to a target domain without any access to source domain data or 3D annotations during adaptation. Collecting and annotating real-world 3D data and corresponding images is a laborious, expensive, yet unavoidable process, since even 3D pose domain adaptation methods require 3D data in the target doma…

    Submitted 19 January, 2024; originally announced January 2024.

    Comments: 36 pages, 9 figures, 50 tables; ICLR 2024 (Poster)

  23. arXiv:2401.02931  [pdf, other]

    cs.CV

    SPFormer: Enhancing Vision Transformer with Superpixel Representation

    Authors: Jieru Mei, Liang-Chieh Chen, Alan Yuille, Cihang Xie

    Abstract: In this work, we introduce SPFormer, a novel Vision Transformer enhanced by superpixel representation. Addressing the limitations of traditional Vision Transformers' fixed-size, non-adaptive patch partitioning, SPFormer employs superpixels that adapt to the image's content. This approach divides the image into irregular, semantically coherent regions, effectively capturing intricate details and ap… ▽ More

    Submitted 5 January, 2024; originally announced January 2024.

  24. arXiv:2312.17192  [pdf, other]

    cs.CV

    HISR: Hybrid Implicit Surface Representation for Photorealistic 3D Human Reconstruction

    Authors: Angtian Wang, Yuanlu Xu, Nikolaos Sarafianos, Robert Maier, Edmond Boyer, Alan Yuille, Tony Tung

    Abstract: Neural reconstruction and rendering strategies have demonstrated state-of-the-art performances due, in part, to their ability to preserve high level shape details. Existing approaches, however, either represent objects as implicit surface functions or neural volumes and still struggle to recover shapes with heterogeneous materials, in particular human skin, hair or clothes. To this aim, we present… ▽ More

    Submitted 28 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI 2024 main track

  25. arXiv:2312.13764  [pdf, other]

    cs.CV cs.CL cs.LG

    A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties

    Authors: Junfei Xiao, Ziqi Zhou, Wenxuan Li, Shiyi Lan, Jieru Mei, Zhiding Yu, Alan Yuille, Yuyin Zhou, Cihang Xie

    Abstract: This paper introduces ProLab, a novel approach using property-level label space for creating strong interpretable segmentation models. Instead of relying solely on category-specific annotations, ProLab uses descriptive properties grounded in common sense knowledge for supervising segmentation models. It is based on two core designs. First, we employ Large Language Models (LLMs) and carefully craft… ▽ More

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Preprint. Code is available at https://github.com/lambert-x/ProLab

  26. arXiv:2312.09481  [pdf, other]

    cs.CV cs.CR cs.LG

    Continual Adversarial Defense

    Authors: Qian Wang, Yaoyao Liu, Hefei Ling, Yingwei Li, Qihao Liu, Ping Li, Jiazhong Chen, Alan Yuille, Ning Yu

    Abstract: In response to the rapidly evolving nature of adversarial attacks against visual classifiers on a monthly basis, numerous defenses have been proposed to generalize against as many known attacks as possible. However, designing a defense method that generalizes to all types of attacks is not realistic because the environment in which defense systems operate is dynamic and comprises various unique at… ▽ More

    Submitted 13 March, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

  27. arXiv:2312.06685  [pdf, other]

    cs.AI

    Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models

    Authors: Shitian Zhao, Zhuowan Li, Yadong Lu, Alan Yuille, Yan Wang

    Abstract: While Multi-modal Language Models (MLMs) demonstrate impressive multimodal ability, they still struggle to provide factual and precise responses for tasks like visual question answering (VQA). In this paper, we address this challenge from the perspective of contextual information. We propose Causal Context Generation, Causal-CoG, which is a prompting strategy that engages contextual information…

    Submitted 9 December, 2023; originally announced December 2023.

  28. arXiv:2312.02147  [pdf, other]

    cs.CV

    Rejuvenating image-GPT as Strong Visual Representation Learners

    Authors: Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie

    Abstract: This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instru… ▽ More

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: Larger models are coming

  29. arXiv:2312.01597  [pdf, other]

    cs.CV

    SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

    Authors: Feng Wang, Jieru Mei, Alan Yuille

    Abstract: Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual representations with target text embeddings at the image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to give accurate pixel-level predictions, which prevents it from functioning…

    Submitted 2 January, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

  30. arXiv:2312.00785  [pdf, other]

    cs.CV

    Sequential Modeling Enables Scalable Learning for Large Vision Models

    Authors: Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros

    Abstract: We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once… ▽ More

    Submitted 1 December, 2023; originally announced December 2023.

    Comments: Website: https://yutongbai.com/lvm.html

  31. arXiv:2311.18661  [pdf, other]

    cs.CV

    Learning Part Segmentation from Synthetic Animals

    Authors: Jiawei Peng, Ju He, Prakhar Kaushik, Zihao Xiao, Jiteng Mu, Alan Yuille

    Abstract: Semantic part segmentation provides an intricate and interpretable understanding of an object, thereby benefiting numerous downstream tasks. However, the need for exhaustive annotations impedes its usage across diverse object types. This paper focuses on learning part segmentation from synthetic animals, leveraging the Skinned Multi-Animal Linear (SMAL) models to scale up existing synthetic data g… ▽ More

    Submitted 30 November, 2023; originally announced November 2023.

  32. arXiv:2311.18537  [pdf, other]

    cs.CV

    A Simple Video Segmenter by Tracking Objects Along Axial Trajectories

    Authors: Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Alan Yuille, Xiaohui Shen, Liang-Chieh Chen

    Abstract: Video segmentation requires consistently segmenting and tracking objects over time. Due to the quadratic dependency on input size, directly applying self-attention to video segmentation with high-resolution input features poses significant challenges, often leading to insufficient GPU memory capacity. Consequently, modern video segmenters either extend an image segmenter without incorporating any… ▽ More

    Submitted 12 June, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: The paper and model names have been updated to better reflect the methodological contributions

  33. arXiv:2311.18266  [pdf, other]

    cs.CV

    Prompt-Based Exemplar Super-Compression and Regeneration for Class-Incremental Learning

    Authors: Ruxiao Duan, Yaoyao Liu, Jieneng Chen, Adam Kortylewski, Alan Yuille

    Abstract: Replay-based methods in class-incremental learning (CIL) have attained remarkable success, as replaying the exemplars of old classes can significantly mitigate catastrophic forgetting. Despite their effectiveness, the inherent memory restrictions of CIL result in saving a limited number of exemplars with poor diversity, leading to data imbalance and overfitting issues. In this paper, we introduce… ▽ More

    Submitted 30 November, 2023; originally announced November 2023.

    Comments: Code: https://github.com/KerryDRX/ESCORT

  34. arXiv:2311.17072  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers

    Authors: Chenglin Yang, Siyuan Qiao, Yuan Cao, Yu Zhang, Tao Zhu, Alan Yuille, Jiahui Yu

    Abstract: Generative training has been demonstrated to be powerful for building visual-language models. However, on zero-shot discriminative benchmarks, there is still a performance gap between models trained with generative and discriminative objectives. In this paper, we aim to narrow this gap by improving the efficacy of generative training on classification tasks, without any finetuning processes or add… ▽ More

    Submitted 27 November, 2023; originally announced November 2023.

  35. arXiv:2311.15551  [pdf, other]

    cs.CV cs.AI cs.CR cs.LG eess.IV

    Instruct2Attack: Language-Guided Semantic Adversarial Attacks

    Authors: Jiang Liu, Chen Wei, Yuxiang Guo, Heng Yu, Alan Yuille, Soheil Feizi, Chun Pong Lau, Rama Chellappa

    Abstract: We propose Instruct2Attack (I2A), a language-guided semantic attack that generates semantically meaningful perturbations according to free-form language instructions. We make use of state-of-the-art latent diffusion models, where we adversarially guide the reverse diffusion process to search for an adversarial latent code conditioned on the input image and text instruction. Compared to existing no… ▽ More

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: under submission, code coming soon

  36. arXiv:2311.10959  [pdf, other]

    eess.IV cs.CV

    Structure-Aware Sparse-View X-ray 3D Reconstruction

    Authors: Yuanhao Cai, Jiahao Wang, Alan Yuille, Zongwei Zhou, Angtian Wang

    Abstract: X-ray, known for its ability to reveal internal structures of objects, is expected to provide richer information for 3D reconstruction than visible light. Yet, existing neural radiance fields (NeRF) algorithms overlook this important nature of X-ray, leading to their limitations in capturing structural contents of imaged objects. In this paper, we propose a framework, Structure-Aware X-ray Neural… ▽ More

    Submitted 23 March, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

    Comments: CVPR 2024; The first Transformer-based method for X-ray and CT 3D reconstruction

  37. arXiv:2311.00618  [pdf, other]

    cs.CV

    De-Diffusion Makes Text a Strong Cross-Modal Interface

    Authors: Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu

    Abstract: We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation, our approach represents an image as text, from which we enjoy the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is traine… ▽ More

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: Technical report. Project page: https://dediffusion.github.io

  38. arXiv:2310.17914  [pdf, other]

    cs.CV cs.CL

    3D-Aware Visual Question Answering about Parts, Poses and Occlusions

    Authors: Xingrui Wang, Wufei Ma, Zhuowan Li, Adam Kortylewski, Alan Yuille

    Abstract: Despite rapid progress in Visual question answering (VQA), existing datasets and models mainly focus on testing reasoning in 2D. However, it is important that VQA models also understand the 3D structure of visual scenes, for example to support tasks like navigation or manipulation. This includes an understanding of the 3D object pose, their parts and occlusions. In this work, we introduce the task… ▽ More

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: Accepted by NeurIPS2023

  39. arXiv:2310.16052  [pdf, other]

    cs.CV cs.AI

    Synthetic Data as Validation

    Authors: Qixin Hu, Alan Yuille, Zongwei Zhou

    Abstract: This study leverages synthetic data as a validation set to reduce overfitting and ease the selection of the best model in AI development. While synthetic data have been used for augmenting the training set, we find that synthetic data can also significantly diversify the validation set, offering marked advantages in domains like healthcare, where data are typically limited, sensitive, and from out… ▽ More

    Submitted 24 October, 2023; originally announced October 2023.

  40. Acquiring Weak Annotations for Tumor Localization in Temporal and Volumetric Data

    Authors: Yu-Cheng Chou, Bowen Li, Deng-Ping Fan, Alan Yuille, Zongwei Zhou

    Abstract: Creating large-scale and well-annotated datasets to train AI algorithms is crucial for automated tumor detection and localization. However, with limited resources, it is challenging to determine the best type of annotations when annotating massive amounts of unlabeled data. To address this issue, we focus on polyps in colonoscopy videos and pancreatic tumors in abdominal CT scans; both application… ▽ More

    Submitted 20 February, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Published in Machine Intelligence Research

    Journal ref: Mach. Intell. Res. (2024)

  41. arXiv:2310.07781  [pdf, other]

    cs.CV

    3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers

    Authors: Jieneng Chen, Jieru Mei, Xianhang Li, Yongyi Lu, Qihang Yu, Qingyue Wei, Xiangde Luo, Yutong Xie, Ehsan Adeli, Yan Wang, Matthew Lungren, Lei Xing, Le Lu, Alan Yuille, Yuyin Zhou

    Abstract: Medical image segmentation plays a crucial role in advancing healthcare systems for disease diagnosis and treatment planning. The u-shaped architecture, popularly known as U-Net, has proven highly successful for various medical image segmentation tasks. However, U-Net's convolution-based operations inherently limit its ability to model long-range dependencies effectively. To address these limitati… ▽ More

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: Code and models are available at https://github.com/Beckschen/3D-TransUNet

  42. arXiv:2310.04412  [pdf, other]

    cs.CV

    FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning

    Authors: Peiran Xu, Zeyu Wang, Jieru Mei, Liangqiong Qu, Alan Yuille, Cihang Xie, Yuyin Zhou

    Abstract: Federated learning (FL) is an emerging paradigm in machine learning, where a shared model is collaboratively learned using data from multiple devices to mitigate the risk of data leakage. While recent studies posit that Vision Transformer (ViT) outperforms Convolutional Neural Networks (CNNs) in addressing data heterogeneity in FL, the specific architectural components that underpin this advantage…

    Submitted 6 October, 2023; originally announced October 2023.

    Comments: 9 pages, 6 figures. Equal contribution by P. Xu and Z. Wang

  43. arXiv:2310.02906  [pdf, other]

    cs.CV cs.AI

    Boosting Dermatoscopic Lesion Segmentation via Diffusion Models with Visual and Textual Prompts

    Authors: Shiyi Du, Xiaosong Wang, Yongyi Lu, Yuyin Zhou, Shaoting Zhang, Alan Yuille, Kang Li, Zongwei Zhou

    Abstract: Image synthesis approaches, e.g., generative adversarial networks, have been popular as a form of data augmentation in medical image analysis tasks. It is primarily beneficial to overcome the shortage of publicly accessible data and associated quality annotations. However, the current techniques often lack control over the detailed contents in generated images, e.g., the type of disease patterns,…

    Submitted 4 October, 2023; originally announced October 2023.

    Comments: 10 pages, 4 figures

  44. arXiv:2310.02718  [pdf, other]

    cs.LG cs.CV

    Understanding Pan-Sharpening via Generalized Inverse

    Authors: Shiqi Liu, Yutong Bai, Xinyang Han, Alan Yuille

    Abstract: A pan-sharpening algorithm uses a panchromatic image and a multispectral image to obtain an image with high spatial and high spectral resolution. However, the optimizations of the algorithms are designed with different standards. We adopt a simple matrix equation to describe the pan-sharpening problem. The solution existence condition and the acquirement of spectral and spatial resolution are discussed. A down-sam…

    Submitted 4 October, 2023; originally announced October 2023.

  45. arXiv:2308.16139  [pdf, other]

    cs.CV cs.DB cs.LG

    MedShapeNet -- A Large-Scale Dataset of 3D Medical Shapes for Computer Vision

    Authors: Jianning Li, Zongwei Zhou, Jiancheng Yang, Antonio Pepe, Christina Gsaxner, Gijs Luijten, Chongyu Qu, Tiezheng Zhang, Xiaoxi Chen, Wenxuan Li, Marek Wodzinski, Paul Friedrich, Kangxian Xie, Yuan Jin, Narmada Ambigapathy, Enrico Nasca, Naida Solak, Gian Marco Melito, Viet Duc Vu, Afaque R. Memon, Christopher Schlachta, Sandrine De Ribaupierre, Rajnikant Patel, Roy Eagleson, Xiaojun Chen, et al. (132 additional authors not shown)

    Abstract: Prior to the deep learning era, shape was commonly used to describe the objects. Nowadays, state-of-the-art (SOTA) algorithms in medical imaging are predominantly diverging from computer vision, where voxel grids, meshes, point clouds, and implicit surface models are used. This is seen from numerous shape-related publications in premier vision conferences as well as the growing popularity of Shape…

    Submitted 12 December, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

    Comments: 16 pages

    MSC Class: 68T01

  46. arXiv:2308.11737  [pdf, other]

    cs.CV cs.LG

    Animal3D: A Comprehensive Dataset of 3D Animal Pose and Shape

    Authors: Jiacong Xu, Yi Zhang, Jiawei Peng, Wufei Ma, Artur Jesslen, Pengliang Ji, Qixin Hu, Jiehua Zhang, Qihao Liu, Jiahao Wang, Wei Ji, Chen Wang, Xiaoding Yuan, Prakhar Kaushik, Guofeng Zhang, Jie Liu, Yushan Xie, Yawen Cui, Alan Yuille, Adam Kortylewski

    Abstract: Accurately estimating the 3D pose and shape is an essential step towards understanding animal behavior, and can potentially benefit many downstream applications, such as wildlife conservation. However, research in this area is held back by the lack of a comprehensive and diverse dataset with high-quality 3D pose and shape annotations. In this paper, we propose Animal3D, the first comprehensive dat…

    Submitted 20 January, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: 11 pages, 5 figures, link to the dataset: https://xujiacong.github.io/Animal3D/

  47. arXiv:2308.10123  [pdf, other]

    cs.CV cs.AI

    3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation

    Authors: Yi Zhang, Pengliang Ji, Angtian Wang, Jieru Mei, Adam Kortylewski, Alan Yuille

    Abstract: Regression-based methods for 3D human pose estimation directly predict the 3D pose parameters from a 2D image using deep networks. While achieving state-of-the-art performance on standard benchmarks, their performance degrades under occlusion. In contrast, optimization-based methods fit a parametric body model to 2D features in an iterative manner. The localized reconstruction loss can potentially…

    Submitted 19 August, 2023; originally announced August 2023.

    Comments: ICCV 2023, project page: https://3dnbf.github.io/

  48. arXiv:2308.03008  [pdf, other]

    eess.IV cs.CV cs.LG

    Early Detection and Localization of Pancreatic Cancer by Label-Free Tumor Synthesis

    Authors: Bowen Li, Yu-Cheng Chou, Shuwen Sun, Hualin Qiao, Alan Yuille, Zongwei Zhou

    Abstract: Early detection and localization of pancreatic cancer can increase the 5-year survival rate for patients from 8.5% to 20%. Artificial intelligence (AI) can potentially assist radiologists in detecting pancreatic tumors at an early stage. Training AI models requires a vast number of annotated examples, but the availability of CT scans obtaining early-stage tumors is constrained. This is because earl…

    Submitted 5 August, 2023; originally announced August 2023.

    Comments: Big Task Small Data, 1001-AI, MICCAI Workshop, 2023

  49. arXiv:2307.12591  [pdf, other]

    cs.CV

    SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image Segmentation

    Authors: Yiqing Wang, Zihan Li, Jieru Mei, Zihao Wei, Li Liu, Chen Wang, Shengtian Sang, Alan Yuille, Cihang Xie, Yuyin Zhou

    Abstract: Recent advancements in large-scale Vision Transformers have made significant strides in improving pre-trained models for medical image segmentation. However, these methods face a notable challenge in acquiring a substantial amount of pre-training data, particularly within the medical field. To address this limitation, we present Masked Multi-view with Swin Transformers (SwinMM), a novel multi-view…

    Submitted 24 July, 2023; originally announced July 2023.

    Comments: MICCAI 2023; project page: https://github.com/UCSC-VLAA/SwinMM/

  50. arXiv:2306.08103  [pdf, other]

    cs.CV

    Generating Images with 3D Annotations Using Diffusion Models

    Authors: Wufei Ma, Qihao Liu, Jiahao Wang, Angtian Wang, Xiaoding Yuan, Yi Zhang, Zihao Xiao, Guofeng Zhang, Beijia Lu, Ruxiao Duan, Yongrui Qi, Adam Kortylewski, Yaoyao Liu, Alan Yuille

    Abstract: Diffusion models have emerged as a powerful generative method, capable of producing stunning photo-realistic images from natural language descriptions. However, these models lack explicit control over the 3D structure in the generated images. Consequently, this hinders our ability to obtain detailed 3D annotations for the generated images or to craft instances with specific poses and distances. In…

    Submitted 3 April, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: ICLR 2024 Spotlight. Code: https://ccvl.jhu.edu/3D-DST/