Showing 1–22 of 22 results for author: Shukor, M

  1. arXiv:2406.08074  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    A Concept-Based Explainability Framework for Large Multimodal Models

    Authors: Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord

    Abstract: Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied…

    Submitted 12 June, 2024; originally announced June 2024.

  2. arXiv:2406.02842  [pdf, other]

    cs.CV

    Zero-Shot Image Segmentation via Recursive Normalized Cut on Diffusion Features

    Authors: Paul Couairon, Mustafa Shukor, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome

    Abstract: Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method that solely harn…

    Submitted 4 June, 2024; originally announced June 2024.

  3. arXiv:2404.15736  [pdf, other]

    cs.CV cs.AI

    What Makes Multimodal In-Context Learning Work?

    Authors: Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski

    Abstract: Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal mo…

    Submitted 25 April, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

    Comments: 20 pages, 16 figures. Accepted to CVPR 2024 Workshop on Prompting in Vision. Project page: https://folbaeni.gitlab.io/multimodal-icl

  4. arXiv:2403.20105  [pdf, other]

    cs.CV

    FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

    Authors: Barbara Toniella Corradini, Mustafa Shukor, Paul Couairon, Guillaume Couairon, Franco Scarselli, Matthieu Cord

    Abstract: Foundation models have exhibited unprecedented capabilities in tackling many domains and tasks. Models such as CLIP are currently widely used to bridge cross-modal representations, and text-to-image diffusion models are arguably the leading models in terms of realistic image generation. Image generative models are trained on massive datasets that provide them with powerful internal spatial represe…

    Submitted 29 March, 2024; originally announced March 2024.

  5. arXiv:2403.13499  [pdf, other]

    cs.CV

    Improved Baselines for Data-efficient Perceptual Augmentation of LLMs

    Authors: Théophane Vallaeys, Mustafa Shukor, Matthieu Cord, Jakob Verbeek

    Abstract: The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas. In computer vision, LLMs can be used to prime vision-language tasks such as image captioning and visual question answering when coupled with pre-trained vision backbones. While different approaches have been explored to interface LLMs with ``…

    Submitted 20 March, 2024; originally announced March 2024.

  6. arXiv:2401.02096  [pdf]

    q-bio.BM

    Isolation and Characterisation of Polypropylene Microplastic-Utilising Bacterium from the Antarctic Soil

    Authors: Nur Ain Shuhada Ab Razak, Syahir Habib, Mohd Yunus Abd Shukor, Siti Aisyah Alias, Jerzy Smykla, Nur Adeela Yasid

    Abstract: Despite its remoteness from other continents, the Antarctic region cannot escape the aftermath of human activities as it is highly influenced by anthropogenic impacts that occur both in the regional and global context. Contamination by microplastics, mostly caused by the improper disposal of plastic waste, is widely recognised as a serious environmental threat due to its ubiquity. In recent years,…

    Submitted 4 January, 2024; originally announced January 2024.

  7. arXiv:2310.01845  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Zero-Shot Refinement of Buildings' Segmentation Models using SAM

    Authors: Ali Mayladan, Hasan Nasrallah, Hasan Moughnieh, Mustafa Shukor, Ali J. Ghandour

    Abstract: Foundation models have excelled in various tasks but are often evaluated on general benchmarks. The adaptation of these models for specific domains, such as remote sensing imagery, remains an underexplored area. In remote sensing, precise building instance segmentation is vital for applications like urban planning. While Convolutional Neural Networks (CNNs) perform well, their generalization can b…

    Submitted 11 February, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

  8. arXiv:2310.01837  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Extending CAM-based XAI methods for Remote Sensing Imagery Segmentation

    Authors: Abdul Karim Gizzini, Mustafa Shukor, Ali J. Ghandour

    Abstract: Current AI-based methods do not provide comprehensible physical interpretations of the utilized data, extracted features, and predictions/inference operations. As a result, deep learning models trained using high-resolution satellite imagery lack transparency and explainability and can be merely seen as a black box, which limits their wide-level adoption. Experts need help understanding the comple…

    Submitted 28 November, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

  9. arXiv:2310.01825  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Empirical Study of PEFT techniques for Winter Wheat Segmentation

    Authors: Mohamad Hasan Zahweh, Hasan Nasrallah, Mustafa Shukor, Ghaleb Faour, Ali J. Ghandour

    Abstract: Parameter Efficient Fine Tuning (PEFT) techniques have recently experienced significant growth and have been extensively employed to adapt large vision and language models to various domains, enabling satisfactory model performance with minimal computational needs. Despite these advances, more research has yet to delve into potential PEFT applications in real-life scenarios, particularly in the cr…

    Submitted 27 November, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

  10. arXiv:2310.00647  [pdf, other]

    cs.CV cs.MM

    Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

    Authors: Mustafa Shukor, Alexandre Rame, Corentin Dancette, Matthieu Cord

    Abstract: Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performances (e.g., VQA accuracy) alone do not…

    Submitted 22 January, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: ICLR 2024. Project Page: https://evalign-icl.github.io/

  11. arXiv:2307.16184  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

    Authors: Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord

    Abstract: Large Language Models (LLMs) have made the ambitious quest for generalist agents significantly far from being a fantasy. A key hurdle for building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification, allowing the support of a myriad of tasks and modalities within one unified framework. While few large models (e.g., Flamingo (Alayrac e…

    Submitted 22 December, 2023; v1 submitted 30 July, 2023; originally announced July 2023.

    Comments: Accepted at TMLR 2023. 40 pages. Project page: https://unival-model.github.io/

  12. arXiv:2306.04488  [pdf, other]

    cs.LG cs.AI cs.CV

    Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

    Authors: Alexandre Ramé, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, Matthieu Cord

    Abstract: Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further align the network with the intended usage. Yet the imperfections in the proxy reward may hinder the training and lead to suboptimal results; the diversity of objectives in real-world tasks and human opinions exacerbate th…

    Submitted 16 October, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

  13. arXiv:2303.11403  [pdf, other]

    cs.CV cs.CL cs.LG

    eP-ALM: Efficient Perceptual Augmentation of Language Models

    Authors: Mustafa Shukor, Corentin Dancette, Matthieu Cord

    Abstract: Large Language Models (LLMs) have so far impressed the world, with unprecedented capabilities that emerge in models at large scales. On the vision side, transformer models (i.e., ViT) are following the same trend, achieving the best performance on challenging benchmarks. With the abundance of such unimodal models, a natural question arises; do we need also to follow this trend to tackle multimodal…

    Submitted 27 October, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

    Comments: Accepted at ICCV 2023. Project page: https://mshukor.github.io/eP-ALM.github.io/

  14. arXiv:2212.04267  [pdf, other]

    cs.CV cs.LG

    Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

    Authors: Mustafa Shukor, Nicolas Thome, Matthieu Cord

    Abstract: Vision-Language Pretraining (VLP) and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks, such as cooking applications, with more structured input data, is still little investigated. In this work, we propose to leverage these techniques for structured-text based comp…

    Submitted 15 March, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

    Comments: Code: https://github.com/mshukor/VLPCook

  15. arXiv:2208.13628  [pdf, other]

    cs.CV cs.LG

    Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment

    Authors: Mustafa Shukor, Guillaume Couairon, Matthieu Cord

    Abstract: Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move towards ever larger models and pretraining datasets. This computational headlong rush does not seem reasonable in the long term to move toward sustainable solutions, and de facto excludes academic laboratories with limited resources. In this work, we propose a ne…

    Submitted 5 October, 2022; v1 submitted 29 August, 2022; originally announced August 2022.

    Comments: BMVC 2022

  16. arXiv:2207.04324  [pdf, other]

    eess.IV cs.CV stat.ML

    Video Coding Using Learned Latent GAN Compression

    Authors: Mustafa Shukor, Bharath Bhushan Damodaran, Xu Yao, Pierre Hellier

    Abstract: We propose in this paper a new paradigm for facial video compression. We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video, including intra and inter compression. Each frame is inverted in the latent space of StyleGAN, from which the optimal compression is learned. To do so, a diffeomorphic latent representation is learned using a normalizing flows model,…

    Submitted 12 July, 2022; v1 submitted 9 July, 2022; originally announced July 2022.

    Comments: Accepted at ACM Multimedia 2022

  17. arXiv:2206.14892  [pdf, other]

    cs.CV cs.LG

    Semantic Unfolding of StyleGAN Latent Space

    Authors: Mustafa Shukor, Xu Yao, Bharath Bhushan Damodaran, Pierre Hellier

    Abstract: Generative adversarial networks (GANs) have proven to be surprisingly efficient for image editing by inverting and manipulating the latent code corresponding to an input real image. This editing property emerges from the disentangled nature of the latent space. In this paper, we identify that the facial attribute disentanglement is not optimal, thus facial editing relying on linear attribute separ…

    Submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted at ICIP22

  18. arXiv:2204.09730  [pdf, other]

    cs.CV

    Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval

    Authors: Mustafa Shukor, Guillaume Couairon, Asya Grechka, Matthieu Cord

    Abstract: Cross-modal image-recipe retrieval has gained significant attention in recent years. Most work focuses on improving cross-modal embeddings using unimodal encoders, that allow for efficient retrieval in large-scale databases, leaving aside cross-attention between modalities which is more computationally expensive. We propose a new retrieval framework, T-Food (Transformer Decoders with MultiModal Re…

    Submitted 20 April, 2022; originally announced April 2022.

    Comments: Accepted at CVPR 2022, MULA Workshop. Code is available at https://github.com/mshukor/TFood

  19. arXiv:2111.14650  [pdf, other]

    cs.CV

    Buildings Classification using Very High Resolution Satellite Imagery

    Authors: Mohammad Dimassi, Abed Ellatif Samhat, Mohammad Zaraket, Jamal Haidar, Mustafa Shukor, Ali J. Ghandour

    Abstract: Buildings classification using satellite images is becoming more important for several applications such as damage assessment, resource allocation, and population estimation. We focus, in this work, on buildings damage assessment (BDA) and buildings type classification (BTC) of residential and non-residential buildings. We propose to rely solely on RGB satellite images and follow a 2-stage deep le…

    Submitted 29 November, 2021; originally announced November 2021.

  20. arXiv:2111.06812  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Sci-Net: Scale Invariant Model for Buildings Segmentation from Aerial Imagery

    Authors: Hasan Nasrallah, Mustafa Shukor, Ali J. Ghandour

    Abstract: Buildings' segmentation is a fundamental task in the field of earth observation and aerial imagery analysis. Most existing deep learning-based methods in the literature can be applied to a fixed or narrow-range spatial resolution imagery. In practical scenarios, users deal with a broad spectrum of image resolutions. Thus, a given aerial image often needs to be re-sampled to match the spatial resol…

    Submitted 1 February, 2023; v1 submitted 12 November, 2021; originally announced November 2021.

  21. arXiv:2107.04481  [pdf, other]

    cs.CV

    Semantic and Geometric Unfolding of StyleGAN Latent Space

    Authors: Mustafa Shukor, Xu Yao, Bharath Bhushan Damodaran, Pierre Hellier

    Abstract: Generative adversarial networks (GANs) have proven to be surprisingly efficient for image editing by inverting and manipulating the latent code corresponding to a natural image. This property emerges from the disentangled nature of the latent space. In this paper, we identify two geometric limitations of such latent space: (a) euclidean distances differ from image perceptual distance, and (b) dise…

    Submitted 9 July, 2021; originally announced July 2021.

    Comments: 16 pages

  22. arXiv:2104.02980  [pdf, other]

    cs.CV cs.AI cs.LG

    Synthetic training data generation for deep learning based quality inspection

    Authors: Pierre Gutierrez, Maria Luschkova, Antoine Cordier, Mustafa Shukor, Mona Schappert, Tim Dahmen

    Abstract: Deep learning is now the gold standard in computer vision-based quality inspection systems. In order to detect defects, supervised learning is often utilized, but necessitates a large amount of annotated images, which can be costly: collecting, cleaning, and annotating the data is tedious and limits the speed at which a system can be deployed as everything the system must detect needs to be observ…

    Submitted 7 April, 2021; originally announced April 2021.

    Comments: 8 pages, 4 figures, to be published in QCAV 2021 conference, proceedings will be published by SPIE

    ACM Class: I.2.10; I.3.3; I.4.1; I.4.6; I.4.8; I.4.9; I.5