Showing 1–17 of 17 results for author: Maaz, M

Searching in archive cs.
  1. arXiv:2406.09418  [pdf, other]

    cs.CV

    VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

    Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Khan

    Abstract: Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While the current video LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequenc…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Technical Report

  2. arXiv:2402.14818  [pdf, other]

    cs.CL cs.CV

    PALO: A Polyglot Large Multimodal Model for 5B People

    Authors: Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan

    Abstract: In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that span a total of ~5B people (65% of the world population). Our approach involves a semi-automated tr…

    Submitted 5 March, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: Technical Report of PALO

  3. arXiv:2311.13435  [pdf, other]

    cs.CV cs.AI

    PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

    Authors: Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan

    Abstract: Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. The recent approaches extending image-based LMMs to videos either lack the grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize the audio-signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video…

    Submitted 13 December, 2023; v1 submitted 22 November, 2023; originally announced November 2023.

    Comments: Technical Report

  4. arXiv:2311.03356  [pdf, other]

    cs.CV cs.AI

    GLaMM: Pixel Grounding Large Multimodal Model

    Authors: Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan

    Abstract: Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dens…

    Submitted 1 June, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

    Comments: CVPR 2024

  5. arXiv:2306.10160  [pdf, other]

    cs.LG

    On Orderings of Probability Vectors and Unsupervised Performance Estimation

    Authors: Muhammad Maaz, Rui Qiao, Yiheng Zhou, Renxian Zhang

    Abstract: Unsupervised performance estimation, or evaluating how well models perform on unlabeled data, is a difficult task. Recently, a method was proposed by Garg et al. [2022] which performs much better than previous methods. Their method relies on having a score function, satisfying certain properties, to map probability vectors outputted by the classifier to the reals, but it is an open problem which sc…

    Submitted 16 June, 2023; originally announced June 2023.

    Comments: IJCAI 2023 Workshop on Generalizing from Limited Resources in the Open World

  6. arXiv:2306.05424  [pdf, other]

    cs.CV

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan

    Abstract: Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts for image-based conversation models, this work addresses the under-explored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model…

    Submitted 9 June, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: ACL 2024 (Main)

  7. arXiv:2303.15446  [pdf, other]

    cs.CV

    SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

    Authors: Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

    Abstract: Self-attention has become a de facto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention…

    Submitted 25 July, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

    Comments: Accepted at ICCV 2023

  8. arXiv:2212.04497  [pdf, other]

    cs.CV

    UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation

    Authors: Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

    Abstract: Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within the transformer models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies. However, the self-attention operation has quadratic complexity which proves to be a computational bottleneck, especially in volumetric medi…

    Submitted 4 May, 2024; v1 submitted 8 December, 2022; originally announced December 2022.

    Comments: Accepted at IEEE TMI-2024

  9. arXiv:2212.03640  [pdf, other]

    cs.CV cs.AI

    Fine-tuned CLIP Models are Efficient Video Learners

    Authors: Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan

    Abstract: Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships which require meticulous design efforts…

    Submitted 26 March, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

    Comments: Accepted at CVPR 2023

  10. arXiv:2210.03117  [pdf, other]

    cs.CV

    MaPLe: Multi-modal Prompt Learning

    Authors: Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan

    Abstract: Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP…

    Submitted 1 April, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: Accepted at CVPR2023

  11. arXiv:2207.03482  [pdf, other]

    cs.CV cs.AI

    Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

    Authors: Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan

    Abstract: Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak-supervision used in open-vocabulary detection (OVD) include pretrained CLIP model and image-level supervision. We note that both these modes of supervision are not optimally aligned for t…

    Submitted 29 November, 2022; v1 submitted 7 July, 2022; originally announced July 2022.

    Comments: Accepted at NeurIPS 2022

  12. arXiv:2206.10589  [pdf, other]

    cs.CV

    EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications

    Authors: Muhammad Maaz, Abdelrahman Shaker, Hisham Cholakkal, Salman Khan, Syed Waqas Zamir, Rao Muhammad Anwer, Fahad Shahbaz Khan

    Abstract: In the pursuit of achieving ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths…

    Submitted 22 October, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

    Comments: Accepted at ECCVW 2022 (Oral, CADL: Computational Aspects of Deep Learning)

    Report number: 197

  13. arXiv:2111.11430  [pdf, other]

    cs.CV

    Class-agnostic Object Detection with Multi-modal Transformer

    Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang

    Abstract: What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable sem…

    Submitted 19 July, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

    Comments: Accepted at ECCV 2022

  14. arXiv:2105.08788  [pdf, other]

    cs.CV

    Self-Supervised Learning for Fine-Grained Visual Categorization

    Authors: Muhammad Maaz, Hanoona Abdul Rasheed, Dhanalaxmi Gaddam

    Abstract: Recent research in self-supervised learning (SSL) has shown its capability in learning useful semantic representations from images for classification tasks. Through our work, we study the usefulness of SSL for Fine-Grained Visual Categorization (FGVC). FGVC aims to distinguish objects of visually similar sub categories within a general category. The small inter-class, but large intra-class variati…

    Submitted 18 May, 2021; originally announced May 2021.

    Comments: 10 pages, 6 figures

  15. arXiv:2011.06046  [pdf, ps, other]

    cs.GT econ.TH math.CO

    Saturating stable matchings

    Authors: Muhammad Maaz

    Abstract: I relate bipartite graph matchings to stable matchings. I prove a necessary and sufficient condition for the existence of a saturating stable matching, where every agent on one side is matched, for all possible preferences. I extend my analysis to perfect stable matchings, where every agent on both sides is matched.

    Submitted 28 March, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

    Comments: 10 pages, 2 figures. Version 2: removed simulation and discussion, added section 2.1 "equivalent statements", shortened proofs

    Journal ref: Operations Research Letters 2021;49(4):597-601

  16. arXiv:2007.04639  [pdf, other]

    cs.CV eess.IV

    Attention Neural Network for Trash Detection on Water Channels

    Authors: Mohbat Tharani, Abdul Wahab Amin, Mohammad Maaz, Murtaza Taj

    Abstract: Rivers and canals flowing through cities are often used illegally for dumping the trash. This contaminates freshwater channels as well as causes blockage in sewerage resulting in urban flooding. When this contaminated water reaches agricultural fields, it results in degradation of soil and poses critical environmental as well as economic threats. The dumped trash is often found floating on the wat…

    Submitted 9 July, 2020; originally announced July 2020.

    Comments: Object Detection, Trash Detection, Water Quality

  17. arXiv:1908.08610  [pdf]

    cs.LG stat.ML

    Viability of machine learning to reduce workload in systematic review screenings in the health sciences: a working paper

    Authors: Muhammad Maaz

    Abstract: Systematic reviews, which summarize and synthesize all the current research in a specific topic, are a crucial component to academia. They are especially important in the biomedical and health sciences, where they synthesize the state of medical evidence and conclude the best course of action for various diseases, pathologies, and treatments. Due to the immense amount of literature that exists, as…

    Submitted 22 August, 2019; originally announced August 2019.

    Comments: 10 pages, 2 figures, 6 tables