Showing 1–50 of 214 results for author: Ghanem, B

Searching in archive cs.
  1. arXiv:2407.01511  [pdf, other]

    cs.AI

    CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

    Authors: Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li

    Abstract: The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the compl… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  2. arXiv:2407.01265  [pdf, other]

    cs.CV

    OSL-ActionSpotting: A Unified Library for Action Spotting in Sports Videos

    Authors: Yassine Benzakour, Bruno Cabado, Silvio Giancola, Anthony Cioppa, Bernard Ghanem, Marc Van Droogenbroeck

    Abstract: Action spotting is crucial in sports analytics as it enables the precise identification and categorization of pivotal moments in sports matches, providing insights that are essential for performance analysis and tactical decision-making. The fragmentation of existing methodologies, however, impedes the progression of sports analytics, necessitating a unified codebase to support the development and… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  3. arXiv:2406.14563  [pdf, other]

    cs.CL cs.AI cs.LG

    Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

    Authors: Hasan Abed Al Kader Hammoud, Umberto Michieli, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem, Mete Ozay

    Abstract: Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popu… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Under review

  4. arXiv:2406.08659  [pdf, other]

    cs.CV

    Vivid-ZOO: Multi-View Video Generation with Diffusion Model

    Authors: Bing Li, Cheng Zheng, Wenxuan Zhu, Jinjie Mai, Biao Zhang, Peter Wonka, Bernard Ghanem

    Abstract: While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline tha… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Our project page is at https://hi-zhengcheng.github.io/vividzoo/

  5. arXiv:2406.05223  [pdf, other]

    cs.LG cs.AI

    CorDA: Context-Oriented Decomposition Adaptation of Large Language Models

    Authors: Yibo Yang, Xiaojie Li, Zhongzhu Zhou, Shuaiwen Leon Song, Jianlong Wu, Liqiang Nie, Bernard Ghanem

    Abstract: Current parameter-efficient fine-tuning (PEFT) methods build adapters without considering the context of downstream task to learn, or the context of important knowledge to maintain. As a result, there is often a performance gap compared to full-parameter finetuning, and meanwhile the finetuned model suffers from catastrophic forgetting of the pre-trained world knowledge. In this paper, we propose… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

  6. arXiv:2406.05222  [pdf, other]

    cs.LG cs.NE

    Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

    Authors: Yibo Yang, Xiaojie Li, Motasem Alfarra, Hasan Hammoud, Adel Bibi, Philip Torr, Bernard Ghanem

    Abstract: Relieving the reliance of neural network training on a global back-propagation (BP) has emerged as a notable research topic due to the biological implausibility and huge memory consumption caused by BP. Among the existing solutions, local learning optimizes gradient-isolated modules of a neural network with local errors and has been proved to be effective even on large-scale datasets. However, the… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: ICML 2024

  7. arXiv:2405.17146  [pdf, other]

    cs.CV

    Compressed-Language Models for Understanding Compressed File Formats: a JPEG Exploration

    Authors: Juan C. Pérez, Alejandro Pardo, Mattia Soldan, Hani Itani, Juan Leon-Alcazar, Bernard Ghanem

    Abstract: This study investigates whether Compressed-Language Models (CLMs), i.e. language models operating on raw byte streams from Compressed File Formats (CFFs), can understand files compressed by CFFs. We focus on the JPEG format as a representative CFF, given its commonality and its representativeness of key concepts in compression, such as entropy coding and run-length encoding. We test if CLMs unders… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

  8. arXiv:2405.00466  [pdf, other]

    cs.CV cs.CR

    Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable

    Authors: Haozhe Liu, Wentian Zhang, Bing Li, Bernard Ghanem, Jürgen Schmidhuber

    Abstract: Foundational generative models should be traceable to protect their owners and facilitate safety regulation. To achieve this, traditional approaches embed identifiers based on supervisory trigger-response signals, which are commonly known as backdoor watermarks. They are prone to failure when the model is fine-tuned with nontrigger data. Our experiments show that this vulnerability is due to energ… ▽ More

    Submitted 1 May, 2024; originally announced May 2024.

  9. arXiv:2404.17930  [pdf, other]

    cs.CV cs.AI eess.IV

    Multi-Stream Cellular Test-Time Adaptation of Real-Time Models Evolving in Dynamic Environments

    Authors: Benoît Gérin, Anaïs Halin, Anthony Cioppa, Maxim Henry, Bernard Ghanem, Benoît Macq, Christophe De Vleeschouwer, Marc Van Droogenbroeck

    Abstract: In the era of the Internet of Things (IoT), objects connect through a dynamic network, empowered by technologies like 5G, enabling real-time data sharing. However, smart objects, notably autonomous vehicles, face challenges in critical local computations due to limited resources. Lightweight AI models offer a solution but struggle with diverse data distributions. To address this limitation, we pro… ▽ More

    Submitted 27 April, 2024; originally announced April 2024.

  10. arXiv:2404.15161  [pdf, other]

    cs.CV

    Combating Missing Modalities in Egocentric Videos at Test Time

    Authors: Merey Ramazanova, Alejandro Pardo, Bernard Ghanem, Motasem Alfarra

    Abstract: Understanding videos that contain multiple modalities is crucial, especially in egocentric videos, where combining various sensory inputs significantly improves tasks like action recognition and moment localization. However, real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. Current methods, while effective, often… ▽ More

    Submitted 23 April, 2024; originally announced April 2024.

  11. arXiv:2404.12766  [pdf, other]

    cs.LG cs.CV

    Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation

    Authors: Wenxuan Zhang, Youssef Mohamed, Bernard Ghanem, Philip H. S. Torr, Adel Bibi, Mohamed Elhoseiny

    Abstract: We propose and study a realistic Continual Learning (CL) setting where learning algorithms are granted a restricted computational budget per time step while training. We apply this setting to large-scale semi-supervised Continual Learning scenarios with sparse label rates. Previous proficient CL methods perform very poorly in this challenging setting. Overfitting to the sparse labeled data and ins… ▽ More

    Submitted 8 June, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

  12. arXiv:2404.11335  [pdf, other]

    cs.CV cs.AI cs.LG

    SoccerNet Game State Reconstruction: End-to-End Athlete Tracking and Identification on a Minimap

    Authors: Vladimir Somers, Victor Joos, Anthony Cioppa, Silvio Giancola, Seyed Abolfazl Ghasemzadeh, Floriane Magera, Baptiste Standaert, Amir Mohammad Mansourian, Xin Zhou, Shohreh Kasaei, Bernard Ghanem, Alexandre Alahi, Marc Van Droogenbroeck, Christophe De Vleeschouwer

    Abstract: Tracking and identifying athletes on the pitch holds a central role in collecting essential insights from the game, such as estimating the total distance covered by players or understanding team tactics. This tracking and identification process is crucial for reconstructing the game state, defined by the athletes' positions and identities on a 2D top-view of the pitch, (i.e. a minimap). However, r… ▽ More

    Submitted 17 April, 2024; originally announced April 2024.

  13. arXiv:2404.06332  [pdf, other]

    cs.CV

    X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model

    Authors: Jan Held, Hani Itani, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck

    Abstract: The rapid advancement of artificial intelligence has led to significant improvements in automated decision-making. However, the increased performance of models often comes at the cost of explainability and transparency of their decision-making processes. In this paper, we investigate the capabilities of large language models to explain decisions, using football refereeing as a testing ground, give… ▽ More

    Submitted 7 April, 2024; originally announced April 2024.

  14. arXiv:2404.04526  [pdf, other]

    cs.CV

    DATENeRF: Depth-Aware Text-based Editing of NeRFs

    Authors: Sara Rojas, Julien Philip, Kai Zhang, Sai Bi, Fujun Luan, Bernard Ghanem, Kalyan Sunkavalli

    Abstract: Recent advancements in diffusion models have shown remarkable proficiency in editing 2D images based on text prompts. However, extending these techniques to edit scenes in Neural Radiance Fields (NeRF) is complex, as editing individual 2D frames can result in inconsistencies across multiple views. Our crucial insight is that a NeRF scene's geometry can serve as a bridge to integrate these 2D edits… ▽ More

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: 14 pages, Conference paper, 3D Scene Editing, Neural Rendering, Diffusion Models

  15. arXiv:2404.03477  [pdf, other]

    cs.CV

    Towards Automated Movie Trailer Generation

    Authors: Dawit Mureja Argaw, Mattia Soldan, Alejandro Pardo, Chen Zhao, Fabian Caba Heilbron, Joon Son Chung, Bernard Ghanem

    Abstract: Movie trailers are an essential tool for promoting films and attracting audiences. However, the process of creating trailers can be time-consuming and expensive. To streamline this process, we propose an automatic trailer generation framework that generates plausible trailers from a full movie by automating shot selection and composition. Our approach draws inspiration from machine translation tec… ▽ More

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024

  16. arXiv:2404.00777  [pdf, other]

    cs.CV cs.AI cs.CR cs.LG eess.IV

    Privacy-preserving Optics for Enhancing Protection in Face De-identification

    Authors: Jhon Lopez, Carlos Hinojosa, Henry Arguello, Bernard Ghanem

    Abstract: The modern surge in camera usage alongside widespread computer vision technology applications poses significant privacy and security concerns. Current artificial intelligence (AI) technologies aid in recognizing relevant events and assisting in daily tasks in homes, offices, hospitals, etc. The need to access or process personal information for these purposes raises privacy concerns. While softwar… ▽ More

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024. Project Website and Code coming soon

  17. arXiv:2403.17823  [pdf, other]

    cs.CV

    Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

    Authors: Alexandre Eymaël, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck

    Abstract: Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95… ▽ More

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: 19 pages, 6 figures, 3 tables, 1 page of supplementary material

    ACM Class: I.2.6; I.2.10

  18. arXiv:2403.13808  [pdf, other]

    cs.CV cs.AI cs.LG

    On Pretraining Data Diversity for Self-Supervised Learning

    Authors: Hasan Abed Al Kader Hammoud, Tuhin Das, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem

    Abstract: We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even wit… ▽ More

    Submitted 5 April, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: Under review

  19. arXiv:2403.12003  [pdf, other]

    cs.CV

    GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning

    Authors: Xiaojie Li, Yibo Yang, Xiangtai Li, Jianlong Wu, Yue Yu, Bernard Ghanem, Min Zhang

    Abstract: Self-supervised learning has achieved remarkable success in acquiring high-quality representations from unlabeled data. The widely adopted contrastive learning framework aims to learn invariant representations by minimizing the distance between positive views originating from the same image. However, existing techniques to construct positive views highly rely on manual transformations, resulting i… ▽ More

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Code: https://github.com/xiaojieli0903/genview

  20. arXiv:2402.10128  [pdf, other]

    cs.CV cs.GR cs.LG

    GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering

    Authors: Abdullah Hamdi, Luke Melas-Kyriazi, Jinjie Mai, Guocheng Qian, Ruoshi Liu, Carl Vondrick, Bernard Ghanem, Andrea Vedaldi

    Abstract: Advancements in 3D Gaussian Splatting have significantly accelerated 3D reconstruction and generation. However, it may require a large number of Gaussians, which creates a substantial memory footprint. This paper introduces GES (Generalized Exponential Splatting), a novel representation that employs Generalized Exponential Function (GEF) to model 3D scenes, requiring far fewer particles to represe… ▽ More

    Submitted 24 May, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

    Comments: CVPR 2024 paper. project website https://abdullahamdi.com/ges

  21. arXiv:2402.05235  [pdf, other]

    cs.CV

    SPAD: Spatially Aware Multiview Diffusers

    Authors: Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, Aliaksandr Siarohin

    Abstract: We present SPAD, a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation, we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view interactions, and fine-tune it on a high quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g. MVD… ▽ More

    Submitted 7 February, 2024; originally announced February 2024.

    Comments: Webpage: https://yashkant.github.io/spad

  22. arXiv:2402.04559  [pdf, other]

    cs.AI cs.CL cs.HC

    Can Large Language Model Agents Simulate Human Trust Behaviors?

    Authors: Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Kai Shu, Adel Bibi, Ziniu Hu, Philip Torr, Bernard Ghanem, Guohao Li

    Abstract: Large Language Model (LLM) agents have been increasingly adopted as simulation tools to model humans in applications such as social science. However, one fundamental question remains: can LLM agents really simulate human behaviors? In this paper, we focus on one of the most critical behaviors in human interactions, trust, and aim to investigate whether or not LLM agents can simulate human trust be… ▽ More

    Submitted 10 March, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: The first two authors contributed equally. Project website: https://www.camel-ai.org/research/agent-trust

  23. arXiv:2402.01832  [pdf, other]

    cs.CV cs.AI cs.LG

    SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?

    Authors: Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem

    Abstract: We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, significantly departing from previous methods relying on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLM), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With trainin… ▽ More

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: Under review

  24. arXiv:2402.00867  [pdf, other]

    cs.CV

    AToM: Amortized Text-to-Mesh using 2D Diffusion

    Authors: Guocheng Qian, Junli Cao, Aliaksandr Siarohin, Yash Kant, Chaoyang Wang, Michael Vasilkovsky, Hsin-Ying Lee, Yuwei Fang, Ivan Skorokhodov, Peiye Zhuang, Igor Gilitschenski, Jian Ren, Bernard Ghanem, Kfir Aberman, Sergey Tulyakov

    Abstract: We introduce Amortized Text-to-Mesh (AToM), a feed-forward text-to-mesh framework optimized across multiple text prompts simultaneously. In contrast to existing text-to-3D methods that often entail time-consuming per-prompt optimization and commonly output representations other than polygonal meshes, AToM directly generates high-quality textured meshes in less than 1 second with around 10 times re… ▽ More

    Submitted 1 February, 2024; originally announced February 2024.

    Comments: 19 pages with appendix and references. Webpage: https://snap-research.github.io/AToM/

  25. arXiv:2401.11470  [pdf, other]

    cs.CV

    Exploring Missing Modality in Multimodal Egocentric Datasets

    Authors: Merey Ramazanova, Alejandro Pardo, Humam Alwassel, Bernard Ghanem

    Abstract: Multimodal video understanding is crucial for analyzing egocentric videos, where integrating multiple sensory signals significantly enhances action recognition and moment localization. However, practical applications often grapple with incomplete modalities due to factors like privacy concerns, efficiency demands, or hardware malfunctions. Addressing this, our study delves into the impact of missi… ▽ More

    Submitted 17 April, 2024; v1 submitted 21 January, 2024; originally announced January 2024.

  26. arXiv:2401.10228  [pdf, other]

    cs.CV

    RAP-SAM: Towards Real-Time All-Purpose Segment Anything

    Authors: Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang

    Abstract: Advanced by transformer architecture, vision foundation models (VFMs) achieve remarkable progress in performance and generalization ability. Segment Anything Model (SAM) is one remarkable model that can achieve generalized segmentation. However, most VFMs cannot run in realtime, which makes it difficult to transfer them into several products. On the other hand, current real-time segmentation mainl… ▽ More

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Project Page: https://xushilin1.github.io/rap_sam/

  27. arXiv:2401.04105  [pdf, other]

    cs.CV cs.AI

    Dr$^2$Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

    Authors: Chen Zhao, Shuming Liu, Karttikeya Mangalam, Guocheng Qian, Fatimah Zohra, Abdulmohsen Alghannam, Jitendra Malik, Bernard Ghanem

    Abstract: Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning, which is highly memory-intensive for tasks with high-resolution data, e.g., video understanding, small object detection, and point cloud analysis. In this paper, we propose Dynamic Reversible Dual-Residual Networks, or Dr$^2$Net, a novel fa… ▽ More

    Submitted 30 March, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

    Journal ref: the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

  28. arXiv:2312.12487  [pdf, other]

    cs.LG cs.AI

    Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models

    Authors: Angela Castillo, Jonas Kohler, Juan C. Pérez, Juan Pablo Pérez, Albert Pumarola, Bernard Ghanem, Pablo Arbeláez, Ali Thabet

    Abstract: This paper presents a comprehensive study on the role of Classifier-Free Guidance (CFG) in text-conditioned diffusion models from the perspective of inference efficiency. In particular, we relax the default choice of applying CFG in all diffusion steps and instead search for efficient guidance policies. We formulate the discovery of such policies in the differentiable Neural Architecture Search fr… ▽ More

    Submitted 19 December, 2023; originally announced December 2023.

  29. arXiv:2312.10639  [pdf, other]

    cs.CV cs.AI physics.optics

    Artificial intelligence optical hardware empowers high-resolution hyperspectral video understanding at 1.2 Tb/s

    Authors: Maksim Makarenko, Qizhou Wang, Arturo Burguete-Lopez, Silvio Giancola, Bernard Ghanem, Luca Passone, Andrea Fratalocchi

    Abstract: Foundation models, exemplified by GPT technology, are discovering new horizons in artificial intelligence by executing tasks beyond their designers' expectations. While the present generation provides fundamental advances in understanding language and images, the next frontier is video comprehension. Progress in this area must overcome the 1 Tb/s data rate demanded to grasp real-time multidimensio… ▽ More

    Submitted 17 December, 2023; originally announced December 2023.

  30. arXiv:2312.02219  [pdf, other]

    cs.CV cs.CL

    Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models

    Authors: Andrés Villa, Juan Carlos León Alcázar, Alvaro Soto, Bernard Ghanem

    Abstract: Large Vision and Language Models have enabled significant advances in fully supervised and zero-shot visual tasks. These large architectures serve as the baseline to what is currently known as Instruction Tuning Large Vision and Language models (IT-LVLMs). IT-LVLMs are general-purpose multi-modal assistants whose responses are modulated by natural language instructions and visual data. Despite thi… ▽ More

    Submitted 12 June, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

    Comments: 16 pages, 7 figures, 6 tables

  31. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from… ▽ More

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  32. arXiv:2311.17241  [pdf, other]

    cs.CV

    End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

    Authors: Shuming Liu, Chen-Lin Zhang, Chen Zhao, Bernard Ghanem

    Abstract: Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited scales and limited data volumes can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption for end-to-end training, and manage to scale up the TAD backbone to… ▽ More

    Submitted 20 April, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: Accepted to CVPR 2024. Camera-Ready Version

  33. arXiv:2311.16671  [pdf, other]

    cs.CV cs.AI cs.GR

    SplitNeRF: Split Sum Approximation Neural Field for Joint Geometry, Illumination, and Material Estimation

    Authors: Jesus Zarzar, Bernard Ghanem

    Abstract: We present a novel approach for digitizing real-world objects by estimating their geometry, material properties, and environmental lighting from a set of posed images with fixed lighting. Our method incorporates into Neural Radiance Field (NeRF) pipelines the split sum approximation used with image-based lighting for real-time physical-based rendering. We propose modeling the scene's lighting with… ▽ More

    Submitted 28 November, 2023; originally announced November 2023.

  34. arXiv:2311.11293  [pdf, other]

    cs.LG

    From Categories to Classifier: Name-Only Continual Learning by Exploring the Web

    Authors: Ameya Prabhu, Hasan Abed Al Kader Hammoud, Ser-Nam Lim, Bernard Ghanem, Philip H. S. Torr, Adel Bibi

    Abstract: Continual Learning (CL) often relies on the availability of extensive annotated datasets, an assumption that is unrealistically time-consuming and costly in practice. We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation. In this scenario, learners adapt to new category shifts using only category names without the luxury of annot… ▽ More

    Submitted 19 November, 2023; originally announced November 2023.

  35. arXiv:2310.08358  [pdf, other]

    cs.LG

    Towards Demystifying the Generalization Behaviors When Neural Collapse Emerges

    Authors: Peifeng Gao, Qianqian Xu, Yibo Yang, Peisong Wen, Huiyang Shao, Zhiyong Yang, Bernard Ghanem, Qingming Huang

    Abstract: Neural Collapse (NC) is a well-known phenomenon of deep neural networks in the terminal phase of training (TPT). It is characterized by the collapse of features and classifier into a symmetrical structure, known as simplex equiangular tight frame (ETF). While there have been extensive studies on optimization characteristics showing the global optimality of neural collapse, little research has been… ▽ More

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: 20 pages, 6 figures. arXiv admin note: substantial text overlap with arXiv:2304.08914

  36. arXiv:2309.14207  [pdf, other]

    cs.CV

    Automatic Animation of Hair Blowing in Still Portrait Photos

    Authors: Wenpeng Xiao, Wentao Liu, Yitong Wang, Bernard Ghanem, Bing Li

    Abstract: We propose a novel approach to animate human hair in a still portrait photo. Existing work has largely studied the animation of fluid elements such as water and fire. However, hair animation for a real image remains underexplored, which is a challenging problem, due to the high complexity of hair structure and dynamics. Considering the complexity of hair structure, we innovatively treat hair wisp… ▽ More

    Submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted to ICCV 2023

  37. arXiv:2309.06006  [pdf, ps, other]

    cs.CV cs.AI

    SoccerNet 2023 Challenges Results

    Authors: Anthony Cioppa, Silvio Giancola, Vladimir Somers, Floriane Magera, Xin Zhou, Hassan Mkhallati, Adrien Deliège, Jan Held, Carlos Hinojosa, Amir M. Mansourian, Pierre Miralles, Olivier Barnich, Christophe De Vleeschouwer, Alexandre Alahi, Bernard Ghanem, Marc Van Droogenbroeck, Abdullah Kamal, Adrien Maglo, Albert Clapés, Amr Abdelaziz, Artur Xarles, Astrid Orcesi, Atom Scott, Bin Liu, Byoungkwon Lim, et al. (77 additional authors not shown)

    Abstract: The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, fo… ▽ More

    Submitted 12 September, 2023; originally announced September 2023.

  38. arXiv:2309.05490  [pdf, other]

    cs.CV cs.AI cs.LG

    Learning Semantic Segmentation with Query Points Supervision on Aerial Images

    Authors: Santiago Rivier, Carlos Hinojosa, Silvio Giancola, Bernard Ghanem

    Abstract: Semantic segmentation is crucial in remote sensing, where high-resolution satellite images are segmented into meaningful regions. Recent advancements in deep learning have significantly improved satellite image segmentation. However, most of these methods are typically trained in fully supervised settings that require high-quality pixel-level annotations, which are expensive and time-consuming to… ▽ More

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: Paper presented at the LXCV workshop at ICCV 2023

  39. arXiv:2308.14583  [pdf, other]

    cs.CV

    Learning to Read Analog Gauges from Synthetic Data

    Authors: Juan Leon-Alcazar, Yazeed Alnumay, Cheng Zheng, Hassane Trigui, Sahejad Patel, Bernard Ghanem

    Abstract: Manually reading and logging gauge data is time inefficient, and the effort increases according to the number of gauges available. We present a computer vision pipeline that automates the reading of analog gauges. We propose a two-stage CNN pipeline that identifies the key structural components of an analog gauge and outputs an angular reading. To facilitate the training of our approach, a synthet… ▽ More

    Submitted 28 August, 2023; originally announced August 2023.

    Journal ref: Winter Conference on Applications of Computer Vision 2024

  40. arXiv:2308.11290  [pdf, other]

    quant-ph cs.AI cs.LG

    ShadowNet for Data-Centric Quantum System Learning

    Authors: Yuxuan Du, Yibo Yang, Tongliang Liu, Zhouchen Lin, Bernard Ghanem, Dacheng Tao

    Abstract: Understanding the dynamics of large quantum systems is hindered by the curse of dimensionality. Statistical learning offers new possibilities in this regime by neural-network protocols and classical shadows, while both methods have limitations: the former is plagued by the predictive uncertainty and the latter lacks the generalization ability. Here we propose a data-centric learning paradigm combi… ▽ More

    Submitted 22 August, 2023; originally announced August 2023.

  41. arXiv:2308.07795

    cs.CV cs.AI

    Learning to Identify Critical States for Reinforcement Learning from Videos

    Authors: Haozhe Liu, Mingchen Zhuge, Bing Li, Yuhui Wang, Francesco Faccio, Bernard Ghanem, Jürgen Schmidhuber

    Abstract: Recent work on deep reinforcement learning (DRL) has pointed out that algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences, but a DRL machine that wants to profit from watching such videos must first lear…

    Submitted 15 August, 2023; originally announced August 2023.

    Comments: This paper was accepted to ICCV23

  42. arXiv:2308.05721

    cs.CV

    Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction

    Authors: Yangyang Xu, Yibo Yang, Bernard Ghanem, Lefei Zhang, Du Bo, Dacheng Tao

    Abstract: CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL). Most of the current studies on MTL solely rely on CNN or Transformer. In this work, we present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction. This combination may o…

    Submitted 21 September, 2023; v1 submitted 10 August, 2023; originally announced August 2023.

    Comments: submitted to IJCV; an extension to our previous AAAI 2023 paper arXiv:2301.03461

  43. arXiv:2308.01746

    cs.LG cs.CV

    Neural Collapse Terminus: A Unified Solution for Class Incremental Learning and Its Variants

    Authors: Yibo Yang, Haobo Yuan, Xiangtai Li, Jianlong Wu, Lefei Zhang, Zhouchen Lin, Philip Torr, Dacheng Tao, Bernard Ghanem

    Abstract: How to enable learnability for new classes while keeping the capability well on old classes has been a crucial challenge for class incremental learning. Beyond the normal case, long-tail class incremental learning and few-shot class incremental learning are also proposed to consider the data imbalance and data scarcity, respectively, which are common in real-world implementations and further exace…

    Submitted 3 August, 2023; originally announced August 2023.

    Comments: An extension of our ICLR 2023 paper https://openreview.net/pdf?id=y5W8tpojhtJ. arXiv admin note: text overlap with arXiv:2302.03004

  44. arXiv:2307.05646

    cs.CL

    Better Handling Coreference Resolution in Aspect Level Sentiment Classification by Fine-Tuning Language Models

    Authors: Dhruv Mullick, Bilal Ghanem, Alona Fyshe

    Abstract: Customer feedback is invaluable to companies as they refine their products. Monitoring customer feedback can be automated with Aspect Level Sentiment Classification (ALSC) which allows us to analyse specific aspects of the products in reviews. Large Language Models (LLMs) are the heart of many state-of-the-art ALSC solutions, but they perform poorly in some scenarios requiring Coreference Resoluti…

    Submitted 11 July, 2023; originally announced July 2023.

    Comments: Work done up till December 2022

  45. arXiv:2306.17843

    cs.CV

    Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

    Authors: Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, Bernard Ghanem

    Abstract: We present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D meshes generation from a single unposed image in the wild using both 2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing…

    Submitted 23 July, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

    Comments: webpage: https://guochengqian.github.io/project/magic123/

  46. arXiv:2306.15880

    cs.CV cs.AI

    Towards Open Vocabulary Learning: A Survey

    Authors: Jianzong Wu, Xiangtai Li, Shilin Xu, Haobo Yuan, Henghui Ding, Yibo Yang, Xia Li, Jiangning Zhang, Yunhai Tong, Xudong Jiang, Bernard Ghanem, Dacheng Tao

    Abstract: In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection. However, most approaches operate on the close-set assumption, meaning that the model can only identify pre-defined categories that are present in the training set. Recently, open vocabulary settings were proposed due to the rapid progr…

    Submitted 1 February, 2024; v1 submitted 27 June, 2023; originally announced June 2023.

    Comments: Accepted by IEEE T-PAMI. Project page: https://github.com/jianzongwu/Awesome-Open-Vocabulary

  47. arXiv:2306.08904

    cs.CV

    Enhancing Neural Rendering Methods with Image Augmentations

    Authors: Juan C. Pérez, Sara Rojas, Jesus Zarzar, Bernard Ghanem

    Abstract: Faithfully reconstructing 3D geometry and generating novel views of scenes are critical tasks in 3D computer vision. Despite the widespread use of image augmentations across computer vision applications, their potential remains underexplored when learning neural rendering methods (NRMs) for 3D scenes. This paper presents a comprehensive analysis of the use of image augmentations in NRMs, where we…

    Submitted 15 June, 2023; originally announced June 2023.

  48. arXiv:2306.07716

    cs.CV

    Dynamically Masked Discriminator for Generative Adversarial Networks

    Authors: Wentian Zhang, Haozhe Liu, Bing Li, Jinheng Xie, Yawen Huang, Yuexiang Li, Yefeng Zheng, Bernard Ghanem

    Abstract: Training Generative Adversarial Networks (GANs) remains a challenging problem. The discriminator trains the generator by learning the distribution of real/generated data. However, the distribution of generated data changes throughout the training process, which is difficult for the discriminator to learn. In this paper, we propose a novel method for GANs from the viewpoint of online continual lear…

    Submitted 4 January, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: Updated v2 -- NeurIPS 2023 camera ready version

  49. arXiv:2306.00450

    cs.CV

    Exploring Open-Vocabulary Semantic Segmentation without Human Labels

    Authors: Jun Chen, Deyao Zhu, Guocheng Qian, Bernard Ghanem, Zhicheng Yan, Chenchen Zhu, Fanyi Xiao, Mohamed Elhoseiny, Sean Chang Culatana

    Abstract: Semantic segmentation is a crucial task in computer vision that involves segmenting images into semantically meaningful regions at the pixel level. However, existing approaches often rely on expensive human annotations as supervision for model training, limiting their scalability to large, unlabeled datasets. To address this challenge, we present ZeroSeg, a novel method that leverages the existing…

    Submitted 1 June, 2023; originally announced June 2023.

  50. arXiv:2305.18418

    cs.CV cs.AI cs.LG

    Just a Glimpse: Rethinking Temporal Information for Video Continual Learning

    Authors: Lama Alssum, Juan Leon Alcazar, Merey Ramazanova, Chen Zhao, Bernard Ghanem

    Abstract: Class-incremental learning is one of the most important settings for the study of Continual Learning, as it closely resembles real-world application scenarios. With constrained memory sizes, catastrophic forgetting arises as the number of classes/tasks increases. Studying continual learning in the video domain poses even more challenges, as video data contains a large number of frames, which place…

    Submitted 28 June, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted at CLVision Workshop - CVPR23 (Best Paper Award)