
Showing 1–50 of 320 results for author: Torr, P

  1. arXiv:2407.01511  [pdf, other]

    cs.AI

    CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

    Authors: Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li

    Abstract: The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the compl…

    Submitted 1 July, 2024; originally announced July 2024.

  2. arXiv:2406.17005  [pdf, other]

    cs.CV

    PVUW 2024 Challenge on Complex Video Understanding: Methods and Results

    Authors: Henghui Ding, Chang Liu, Yunchao Wei, Nikhila Ravi, Shuting He, Song Bai, Philip Torr, Deshui Miao, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang, Zhensong Xu, Jiangtao Yao, Chengjing Wu, Ting Liu, Luoqi Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang, Mingqi Gao, Jingnan Luo, et al. (12 additional authors not shown)

    Abstract: Pixel-level Video Understanding in the Wild Challenge (PVUW) focus on complex video understanding. In this CVPR 2024 workshop, we add two new tracks, Complex Video Object Segmentation Track based on MOSE dataset and Motion Expression guided Video Segmentation track based on MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: MOSE Challenge: https://henghuiding.github.io/MOSE/ChallengeCVPR2024, MeViS Challenge: https://henghuiding.github.io/MeViS/ChallengeCVPR2024

  3. arXiv:2406.14563  [pdf, other]

    cs.CL cs.AI cs.LG

    Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

    Authors: Hasan Abed Al Kader Hammoud, Umberto Michieli, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem, Mete Ozay

    Abstract: Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popu…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Under review

  4. arXiv:2406.10288  [pdf, other]

    cs.CL cs.LG

    Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

    Authors: Francisco Eiras, Aleksandar Petrov, Phillip H. S. Torr, M. Pawan Kumar, Adel Bibi

    Abstract: Fine-tuning large language models on small, high-quality datasets can enhance their performance on specific downstream tasks. Recent research shows that fine-tuning on benign, instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. Although critical, understanding and mitigating safety risks in well-defined ta…

    Submitted 1 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  5. arXiv:2406.10079  [pdf, other]

    cs.CV cs.AI

    Localizing Events in Videos with Multimodal Queries

    Authors: Gengyuan Zhang, Mang Ling Ada Fok, Yan Xia, Yansong Tang, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

    Abstract: Video understanding is a pivotal task in the digital era, yet the dynamic and multievent nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current resea…

    Submitted 22 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: 9 pages; fix some typos

  6. arXiv:2406.05222  [pdf, other]

    cs.LG cs.NE

    Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

    Authors: Yibo Yang, Xiaojie Li, Motasem Alfarra, Hasan Hammoud, Adel Bibi, Philip Torr, Bernard Ghanem

    Abstract: Relieving the reliance of neural network training on a global back-propagation (BP) has emerged as a notable research topic due to the biological implausibility and huge memory consumption caused by BP. Among the existing solutions, local learning optimizes gradient-isolated modules of a neural network with local errors and has been proved to be effective even on large-scale datasets. However, the…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: ICML 2024

  7. arXiv:2406.03428  [pdf, other]

    cs.LG

    HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits

    Authors: Tim Franzmeyer, Aleksandar Shtedritski, Samuel Albanie, Philip Torr, João F. Henriques, Jakob N. Foerster

    Abstract: Benchmarks have been essential for driving progress in machine learning. A better understanding of LLM capabilities on real world tasks is vital for safe development. Designing adequate LLM benchmarks is challenging: Data from real-world tasks is hard to collect, public availability of static evaluation data results in test data contamination and benchmark overfitting, and periodically generating…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: ACL 2024 Findings

  8. arXiv:2406.03303  [pdf, other]

    cs.CV

    Learning Visual Prompts for Guiding the Attention of Vision Transformers

    Authors: Razieh Rezaei, Masoud Jalili Sabet, Jindong Gu, Daniel Rueckert, Philip Torr, Ashkan Khakzar

    Abstract: Visual prompting infuses visual information into the input image to adapt models toward specific predictions and tasks. Recently, manually crafted markers such as red circles are shown to guide the model to attend to a target region on the image. However, these markers only work on models trained with data containing those markers. Moreover, finding these prompts requires guesswork or prior knowle…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Short version (4-pages) accepted as a spotlight paper at T4V workshop, CVPR 2024

  9. arXiv:2406.01424  [pdf, other]

    cs.LG cs.AI cs.CL

    Universal In-Context Approximation By Prompting Fully Recurrent Models

    Authors: Aleksandar Petrov, Tom A. Lamb, Alasdair Paren, Philip H. S. Torr, Adel Bibi

    Abstract: Zero-shot and in-context learning enable solving tasks without model fine-tuning, making them essential for developing generative model solutions. Therefore, it is crucial to understand whether a pretrained model can be prompted to approximate any function, i.e., whether it is a universal in-context approximator. While it was recently shown that transformer models do possess this property, these r…

    Submitted 3 June, 2024; originally announced June 2024.

  10. arXiv:2405.14832  [pdf, other]

    cs.CV

    Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

    Authors: Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, Yao Yao

    Abstract: Generating high-quality 3D assets from text and images has long been challenging, primarily due to the absence of scalable 3D representations capable of capturing intricate geometry distributions. In this work, we introduce Direct3D, a native 3D generative model scalable to in-the-wild input images, without requiring a multiview diffusion model or SDS optimization. Our approach comprises two prima…

    Submitted 1 June, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  11. arXiv:2405.13922  [pdf, other]

    cs.LG stat.ML

    Towards Certification of Uncertainty Calibration under Adversarial Attacks

    Authors: Cornelius Emde, Francesco Pinto, Thomas Lukasiewicz, Philip H. S. Torr, Adel Bibi

    Abstract: Since neural classifiers are known to be sensitive to adversarial perturbations that alter their accuracy, certification methods have been developed to provide provable guarantees on the insensitivity of their predictions to such perturbations. Furthermore, in safety-critical applications, the frequentist interpretation of the confidence of a classifier (also known as model calibration) c…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 11 pages main paper, appendix included

  12. arXiv:2405.10255  [pdf, other]

    cs.CV cs.RO

    When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

    Authors: Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu

    Abstract: As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context lear…

    Submitted 16 May, 2024; originally announced May 2024.

  13. arXiv:2405.08597  [pdf, other]

    cs.LG

    Risks and Opportunities of Open-Source Generative AI

    Authors: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Aaron Purewal, Csaba Botos, Fabro Steibel, Fazel Keshtkar, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Imperial, Juan Arturo Nolazco, Lori Landay, Matthew Jackson, Phillip H. S. Torr, Trevor Darrell, Yong Lee, Jakob Foerster

    Abstract: Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks of the technology, and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This reg…

    Submitted 29 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

    Comments: Extension of arXiv:2404.17047

  14. arXiv:2405.03735  [pdf, other]

    cs.LG cs.AI cs.MA

    Select to Perfect: Imitating desired behavior from large multi-agent data

    Authors: Tim Franzmeyer, Edith Elkind, Philip Torr, Jakob Foerster, Joao Henriques

    Abstract: AI agents are commonly trained with large datasets of demonstrations of human behavior. However, not all behaviors are equally safe or desirable. Desired characteristics for an AI agent can be expressed by assigning desirability scores, which we assume are not assigned to individual behaviors but to collective trajectories. For example, in a dataset of vehicle interactions, these scores might rela…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: ICLR 2024

  15. arXiv:2404.17047  [pdf, other]

    cs.LG

    Near to Mid-term Risks and Opportunities of Open-Source Generative AI

    Authors: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Botos Csaba, Fabro Steibel, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Marvin Imperial, Juan A. Nolazco-Flores, Lori Landay, Matthew Jackson, Paul Röttger, Philip H. S. Torr, Trevor Darrell, Yong Suk Lee, Jakob Foerster

    Abstract: In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation i…

    Submitted 24 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted to ICML'24 as a position paper

  16. arXiv:2404.16557  [pdf, other]

    cs.CV cs.AI

    Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples

    Authors: Kuofeng Gao, Jindong Gu, Yang Bai, Shu-Tao Xia, Philip Torr, Wei Liu, Zhifeng Li

    Abstract: Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. Once malicious users induce high energy consumption and latency time (energy-latency cost), it will exhaust computational resources and harm availability of service. In this paper, we investigate this vulnerability for MLLMs, particularly image-based and…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2401.11170

  17. arXiv:2404.15518  [pdf, other]

    cs.LG cs.AI

    An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models

    Authors: Yangchen Pan, Junfeng Wen, Chenjun Xiao, Philip Torr

    Abstract: In traditional statistical learning, data points are usually assumed to be independently and identically distributed (i.i.d.) following an unknown probability distribution. This paper presents a contrasting viewpoint, perceiving data points as interconnected and employing a Markov reward process (MRP) for data modeling. We reformulate the typical supervised learning as an on-policy policy evaluati…

    Submitted 9 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

  18. arXiv:2404.12766  [pdf, other]

    cs.LG cs.CV

    Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation

    Authors: Wenxuan Zhang, Youssef Mohamed, Bernard Ghanem, Philip H. S. Torr, Adel Bibi, Mohamed Elhoseiny

    Abstract: We propose and study a realistic Continual Learning (CL) setting where learning algorithms are granted a restricted computational budget per time step while training. We apply this setting to large-scale semi-supervised Continual Learning scenarios with sparse label rates. Previous proficient CL methods perform very poorly in this challenging setting. Overfitting to the sparse labeled data and ins…

    Submitted 8 June, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

  19. arXiv:2404.09447  [pdf, other]

    cs.CV cs.LG

    kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies

    Authors: Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, Philip Torr

    Abstract: Rapid advancements in continual segmentation have yet to bridge the gap of scaling to large continually expanding vocabularies under compute-constrained scenarios. We discover that traditional continual training leads to catastrophic forgetting under compute constraints, unable to outperform zero-shot segmentation methods. We introduce a novel strategy for semantic and panoptic segmentation with z…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: 10 pages, 3 figures

  20. arXiv:2404.08031  [pdf, other]

    cs.CV cs.AI cs.LG

    Latent Guard: a Safety Framework for Text-to-image Generation

    Authors: Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati

    Abstract: With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification, requiring large datasets for training and offering low flexibility. Hence, we propose Latent Guard, a framework designed…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: under review

  21. arXiv:2404.04946  [pdf, other]

    cs.CV

    AnimateZoo: Zero-shot Video Generation of Cross-Species Animation via Subject Alignment

    Authors: Yuanfeng Xu, Yuhao Chen, Zhongzhan Huang, Zijian He, Guangrun Wang, Philip Torr, Liang Lin

    Abstract: Recent video editing advancements rely on accurate pose sequences to animate subjects. However, these efforts are not suitable for cross-species animation due to pose misalignment between species (for example, the poses of a cat differs greatly from that of a pig due to differences in body structure). In this paper, we present AnimateZoo, a zero-shot diffusion-based video generator to address this…

    Submitted 7 April, 2024; originally announced April 2024.

    Comments: Technical report, 15 pages

  22. arXiv:2404.04125  [pdf, other]

    cs.CV cs.CL cs.LG

    No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

    Authors: Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H. S. Torr, Adel Bibi, Samuel Albanie, Matthias Bethge

    Abstract: Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream conce…

    Submitted 8 April, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: Extended version of the short paper accepted at DPFM, ICLR'24

  23. arXiv:2404.03411  [pdf, ps, other]

    cs.LG cs.CL cs.CR

    Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?

    Authors: Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu

    Abstract: Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs) and revealed the vulnerable safeguards of LLMs. Besides, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates the performance reproductio…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: technical report

  24. arXiv:2404.02697  [pdf, other]

    cs.CV

    Model-agnostic Origin Attribution of Generated Images with Few-shot Examples

    Authors: Fengyuan Liu, Haochen Luo, Yiming Li, Philip Torr, Jindong Gu

    Abstract: Recent progress in visual generative models enables the generation of high-quality images. To prevent the misuse of generated images, it is important to identify the origin model that generates them. In this work, we study the origin attribution of generated images in a practical setting where only a few images generated by a source model are available and the source model cannot be accessed. The…

    Submitted 3 April, 2024; originally announced April 2024.

  25. arXiv:2403.17237  [pdf, other]

    cs.CV cs.AI cs.GR

    DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric Diffusion

    Authors: Yuanze Lin, Ronald Clark, Philip Torr

    Abstract: We present DreamPolisher, a novel Gaussian Splatting based method with geometric guidance, tailored to learn cross-view consistency and intricate detail from textual descriptions. While recent progress on text-to-3D generation methods have been promising, prevailing methods often fail to ensure view-consistency and textural richness. This problem becomes particularly noticeable for methods that wo…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Project webpage: https://yuanze-lin.me/DreamPolisher_page/

  26. arXiv:2403.14442  [pdf, other]

    cs.CV

    RoDLA: Benchmarking the Robustness of Document Layout Analysis Models

    Authors: Yufan Chen, Jiaming Zhang, Kunyu Peng, Junwei Zheng, Ruiping Liu, Philip Torr, Rainer Stiefelhagen

    Abstract: Before developing a Document Layout Analysis (DLA) model in real-world applications, conducting comprehensive robustness testing is essential. However, the robustness of DLA models remains underexplored in the literature. To address this, we are the first to introduce a robustness benchmark for DLA models, which includes 450K document images of three datasets. To cover realistic corruptions, we pr…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024. Project page: https://yufanchen96.github.io/projects/RoDLA

  27. arXiv:2403.13808  [pdf, other]

    cs.CV cs.AI cs.LG

    On Pretraining Data Diversity for Self-Supervised Learning

    Authors: Hasan Abed Al Kader Hammoud, Tuhin Das, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem

    Abstract: We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even wit…

    Submitted 5 April, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: Under review

  28. arXiv:2403.12693  [pdf, other]

    cs.CV

    As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?

    Authors: Anjun Hu, Jindong Gu, Francesco Pinto, Konstantinos Kamnitsas, Philip Torr

    Abstract: Foundation models pre-trained on web-scale vision-language data, such as CLIP, are widely used as cornerstones of powerful machine learning systems. While pre-training offers clear advantages for downstream learning, it also endows downstream models with shared adversarial vulnerabilities that can be easily identified through the open-sourced foundation model. In this work, we expose such vulnerab…

    Submitted 19 March, 2024; originally announced March 2024.

  29. arXiv:2403.12488  [pdf, other]

    cs.CV cs.AI

    DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

    Authors: Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenhao Wu, Tong He, Wanli Ouyang, Jian Wu, Philip Torr

    Abstract: We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new Chain-of-Thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM…

    Submitted 7 April, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

  30. arXiv:2403.12034  [pdf, other]

    cs.CV cs.GR cs.LG

    VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

    Authors: Junlin Han, Filippos Kokkinos, Philip Torr

    Abstract: This paper presents a novel paradigm for building scalable 3D generative models utilizing pre-trained video diffusion models. The primary obstacle in developing foundation 3D generative models is the limited availability of 3D data. Unlike images, texts, or videos, 3D data are not readily accessible and are difficult to acquire. This results in a significant disparity in scale compared to the vast…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Project page: https://junlinhan.github.io/projects/vfusion3d.html

  31. arXiv:2403.11062  [pdf, other]

    cs.LG math.OC

    A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization

    Authors: Yudong Luo, Yangchen Pan, Han Wang, Philip Torr, Pascal Poupart

    Abstract: Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two main facts: a focus on tail-end performance that overlooks many sampled trajectories, and the potential of gradient vanishing when the lower tail of the return di…

    Submitted 28 June, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

    Comments: RLC 2024

  32. arXiv:2403.09766  [pdf, other]

    cs.CV

    An Image Is Worth 1000 Lies: Adversarial Transferability across Prompts on Vision-Language Models

    Authors: Haochen Luo, Jindong Gu, Fengyuan Liu, Philip Torr

    Abstract: Different from traditional task-specific vision models, recent large VLMs can readily adapt to different vision tasks by simply using different textual instructions, i.e., prompts. However, a well-known concern about traditional task-specific vision models is that they can be misled by imperceptible adversarial perturbations. Furthermore, the concern is exacerbated by the phenomenon that the same…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: Accepted to ICLR 2024

  33. arXiv:2403.08733  [pdf, other]

    cs.CV

    GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing

    Authors: Jing Wu, Jia-Wang Bian, Xinghui Li, Guangrun Wang, Ian Reid, Philip Torr, Victor Adrian Prisacariu

    Abstract: We propose GaussCtrl, a text-driven method to edit a 3D scene reconstructed by the 3D Gaussian Splatting (3DGS). Our method first renders a collection of images by using the 3DGS and edits them by using a pre-trained 2D diffusion model (ControlNet) based on the input prompt, which is then used to optimise the 3D model. Our key contribution is multi-view consistent editing, which enables editin…

    Submitted 25 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: Our Project Website: https://gaussctrl.active.vision/

  34. arXiv:2403.04640  [pdf, other]

    cs.CV

    CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

    Authors: Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, Xiaochun Cao

    Abstract: This paper focuses on the challenge of answering questions in scenarios that are composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, these responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce the CAT, which enhances MLLM…

    Submitted 7 March, 2024; originally announced March 2024.

  35. arXiv:2403.01325  [pdf, other]

    cs.CV

    NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning

    Authors: Linsheng Chen, Guangrun Wang, Liuchun Yuan, Keze Wang, Ken Deng, Philip H. S. Torr

    Abstract: Neural Radiance Fields (NeRF) have garnered remarkable success in novel view synthesis. Nonetheless, the task of generating high-quality images for novel views persists as a critical challenge. While the existing efforts have exhibited commendable progress, capturing intricate details, enhancing textures, and achieving superior Peak Signal-to-Noise Ratio (PSNR) metrics warrant further focused atte…

    Submitted 2 March, 2024; originally announced March 2024.

    Comments: AAAI 2024

  36. arXiv:2402.19472  [pdf, other]

    cs.LG cs.CV

    Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress

    Authors: Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel Bibi, Samuel Albanie

    Abstract: Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. As exemplars of our approach, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing (for no…

    Submitted 29 February, 2024; originally announced February 2024.

  37. arXiv:2402.16392  [pdf, other]

    cs.CV

    Placing Objects in Context via Inpainting for Out-of-distribution Segmentation

    Authors: Pau de Jorge, Riccardo Volpi, Puneet K. Dokania, Philip H. S. Torr, Gregory Rogez

    Abstract: When deploying a semantic segmentation model into the real world, it will inevitably be confronted with semantic classes unseen during training. Thus, to safely deploy such systems, it is crucial to accurately evaluate and improve their anomaly segmentation capabilities. However, acquiring and labelling semantic segmentation data is expensive and unanticipated conditions are long-tail and potentia…

    Submitted 26 February, 2024; originally announced February 2024.

  38. arXiv:2402.14899  [pdf, other]

    cs.CV cs.AI cs.CR cs.LG

    Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images

    Authors: Zefeng Wang, Zhen Han, Shuo Chen, Fan Xue, Zifeng Ding, Xun Xiao, Volker Tresp, Philip Torr, Jindong Gu

    Abstract: Recently, Multimodal LLMs (MLLMs) have shown a great ability to understand images. However, like traditional vision models, they are still vulnerable to adversarial images. Meanwhile, Chain-of-Thought (CoT) reasoning has been widely explored on MLLMs, which not only improves model's performance, but also enhances model's explainability by giving intermediate reasoning steps. Nevertheless, there is…

    Submitted 18 March, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

  39. arXiv:2402.14753  [pdf, other]

    cs.LG cs.AI math.FA

    Prompting a Pretrained Transformer Can Be a Universal Approximator

    Authors: Aleksandar Petrov, Philip H. S. Torr, Adel Bibi

    Abstract: Despite the widespread adoption of prompting, prompt tuning and prefix-tuning of transformer models, our theoretical understanding of these fine-tuning methods remains limited. A key question is whether one can arbitrarily modify the behavior of pretrained model by prompting or prefix-tuning it. Formally, whether prompting and prefix-tuning a pretrained model can universally approximate sequence-t…

    Submitted 22 February, 2024; originally announced February 2024.

  40. arXiv:2402.14015  [pdf, other]

    cs.LG cs.AI cs.CR cs.CV

    Corrective Machine Unlearning

    Authors: Shashwat Goel, Ameya Prabhu, Philip Torr, Ponnurangam Kumaraguru, Amartya Sanyal

    Abstract: Machine Learning models increasingly face data integrity challenges due to the use of large-scale training datasets drawn from the internet. We study what model developers can do if they detect that some data was manipulated or incorrect. Such manipulated data can cause adverse effects like vulnerability to backdoored samples, systematic biases, and in general, reduced accuracy on certain input do…

    Submitted 21 February, 2024; originally announced February 2024.

    Comments: 17 pages, 7 figures

  41. arXiv:2402.10186  [pdf, other]

    cs.LG physics.chem-ph physics.comp-ph

    Self-consistent Validation for Machine Learning Electronic Structure

    Authors: Gengyuan Hu, Gengchen Wei, Zekun Lou, Philip H. S. Torr, Wanli Ouyang, Han-sen Zhong, Chen Lin

    Abstract: Machine learning has emerged as a significant approach to efficiently tackle electronic structure problems. Despite its potential, there is little guarantee that the model generalizes to unseen data, which hinders its application in real-world scenarios. To address this issue, a technique has been proposed to estimate the accuracy of the predictions. This method integrates machine learning with self-…

    Submitted 15 February, 2024; originally announced February 2024.

    Comments: 6 pages, 4 figures

  42. arXiv:2402.08823  [pdf, other]

    cs.CV cs.LG

    RanDumb: A Simple Approach that Questions the Efficacy of Continual Representation Learning

    Authors: Ameya Prabhu, Shiven Sinha, Ponnurangam Kumaraguru, Philip H. S. Torr, Ozan Sener, Puneet K. Dokania

    Abstract: We propose RanDumb to examine the efficacy of continual representation learning. RanDumb embeds raw pixels using a fixed random transform which approximates an RBF-Kernel, initialized before seeing any data, and learns a simple linear classifier on top. We present a surprising and consistent finding: RanDumb significantly outperforms the continually learned representations using deep networks acro…

    Submitted 13 February, 2024; originally announced February 2024.

    Comments: Tech Report

  43. arXiv:2402.08480  [pdf, other]

    cs.LG math.DG

    Revealing Decurve Flows for Generalized Graph Propagation

    Authors: Chen Lin, Liheng Ma, Yiyang Chen, Wanli Ouyang, Michael M. Bronstein, Philip H. S. Torr

    Abstract: This study addresses the limitations of the traditional analysis of message-passing, central to graph learning, by defining generalized propagation with directed and weighted graphs. The significance manifests in two ways. Firstly, we propose Generalized Propagation Neural Networks (GPNNs), a framework that unifies most propagation-based graph neural networks.…

    Submitted 13 February, 2024; originally announced February 2024.

    Comments: 15 pages, 4 figures

  44. arXiv:2402.07510  [pdf, other]

    cs.AI cs.CR

    Secret Collusion Among Generative AI Agents

    Authors: Sumeet Ramesh Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip H. S. Torr, Lewis Hammond, Christian Schroeder de Witt

    Abstract: Recent capability increases in large language models (LLMs) open up applications in which teams of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensi…

    Submitted 12 February, 2024; originally announced February 2024.

  45. arXiv:2402.04559  [pdf, other]

    cs.AI cs.CL cs.HC

    Can Large Language Model Agents Simulate Human Trust Behaviors?

    Authors: Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Kai Shu, Adel Bibi, Ziniu Hu, Philip Torr, Bernard Ghanem, Guohao Li

    Abstract: Large Language Model (LLM) agents have been increasingly adopted as simulation tools to model humans in applications such as social science. However, one fundamental question remains: can LLM agents really simulate human behaviors? In this paper, we focus on one of the most critical behaviors in human interactions, trust, and aim to investigate whether or not LLM agents can simulate human trust be…

    Submitted 10 March, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: The first two authors contributed equally. Project website: https://www.camel-ai.org/research/agent-trust

  46. arXiv:2402.01832  [pdf, other]

    cs.CV cs.AI cs.LG

    SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?

    Authors: Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem

    Abstract: We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, significantly departing from previous methods relying on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLM), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With trainin…

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: Under review

  47. arXiv:2401.11170  [pdf, other]

    cs.CV cs.CR

    Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images

    Authors: Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu

    Abstract: Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, the deployment of VLMs necessitates substantial energy consumption and computational resources. If attackers maliciously induce high energy consumption and latency (energy-latency cost) during VLM inference, they can exhaust computational resources. In this p…

    Submitted 22 March, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

    Comments: Accepted by ICLR 2024

  48. arXiv:2312.15241  [pdf, ps, other]

    cs.AI cs.IR

    Measuring Value Alignment

    Authors: Fazl Barez, Philip Torr

    Abstract: As artificial intelligence (AI) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical. This paper introduces a novel formalism to quantify the alignment between AI systems and human values, using Markov Decision Processes (MDPs) as the foundational model. We delve into the concept of values as desirable goals tied to actions and no…

    Submitted 23 December, 2023; originally announced December 2023.

    Comments: arXiv admin note: text overlap with arXiv:2110.09240 by other authors

    Journal ref: NeurIPS 2023 MP2 Workshop

  49. arXiv:2312.12419  [pdf, other]

    cs.CV

    Scene-Conditional 3D Object Stylization and Composition

    Authors: Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht

    Abstract: Recently, 3D generative models have made impressive progress, enabling the generation of almost arbitrary 3D assets from text or image inputs. However, these approaches generate objects in isolation without any consideration for the scene where they will eventually be placed. In this paper, we propose a framework that allows for the stylization of an existing 3D asset to fit into a given 2D scene,…

    Submitted 19 December, 2023; originally announced December 2023.

  50. arXiv:2312.07661  [pdf, other]

    cs.CV cs.CL cs.LG cs.MM

    CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

    Authors: Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li

    Abstract: Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to…

    Submitted 7 May, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: To appear in CVPR 2024. Project page: https://torrvision.com/clip_as_rnn/