Showing 1–50 of 84 results for author: Elhoseiny, M

  1. arXiv:2407.01851  [pdf, other]

    cs.CV cs.AI cs.LG eess.AS

    Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

    Authors: Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

    Abstract: Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained un…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted at ECCV 2024

  2. arXiv:2406.19875  [pdf, other]

    cs.CV

    InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding

    Authors: Kirolos Ataallah, Chenhui Gou, Eslam Abdelrahman, Khushbu Pahwa, Jian Ding, Mohamed Elhoseiny

    Abstract: Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding which presents 1) The longest video duration, averagin…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: 16 pages, 17 figures

  3. arXiv:2406.12384  [pdf, other]

    cs.CV

    VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

    Authors: Xiang Li, Jian Ding, Mohamed Elhoseiny

    Abstract: We introduce a new benchmark designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Although several vision-language datasets in remote sensing have been proposed to pursue this goal, existing datasets are typically tailored to single tasks, lack detailed object information, or suffer from inadequate quality control. Exploring these im…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Submitted for consideration at a conference

  4. arXiv:2406.06211  [pdf, other]

    cs.CV

    iMotion-LLM: Motion Prediction Instruction Tuning

    Authors: Abdulwahab Felemban, Eslam Mohamed Bakr, Xiaoqian Shen, Jian Ding, Abduallah Mohamed, Mohamed Elhoseiny

    Abstract: We introduce iMotion-LLM: a Multimodal Large Language Model (LLM) with trajectory prediction, tailored to guide interactive multi-agent scenarios. Different from conventional motion prediction approaches, iMotion-LLM capitalizes on textual instructions as key inputs for generating contextually relevant trajectories. By enriching the real-world driving scenarios in the Waymo Open Dataset with tex…

    Submitted 11 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

  5. arXiv:2405.18937  [pdf, other]

    cs.CV cs.CL

    Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding

    Authors: Junjie Fei, Mahmoud Ahmed, Jian Ding, Eslam Mohamed Bakr, Mohamed Elhoseiny

    Abstract: While 3D MLLMs have achieved significant progress, they are restricted to object and scene understanding and struggle to understand 3D spatial structures at the part level. In this paper, we introduce Kestrel, representing a novel approach that empowers 3D MLLMs with part-aware understanding, enabling better interpretation and segmentation grounding of 3D objects at the part level. Despite its sig…

    Submitted 29 May, 2024; originally announced May 2024.

  6. arXiv:2404.12766  [pdf, other]

    cs.LG cs.CV

    Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation

    Authors: Wenxuan Zhang, Youssef Mohamed, Bernard Ghanem, Philip H. S. Torr, Adel Bibi, Mohamed Elhoseiny

    Abstract: We propose and study a realistic Continual Learning (CL) setting where learning algorithms are granted a restricted computational budget per time step while training. We apply this setting to large-scale semi-supervised Continual Learning scenarios with sparse label rates. Previous proficient CL methods perform very poorly in this challenging setting. Overfitting to the sparse labeled data and ins…

    Submitted 8 June, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

  7. arXiv:2404.03413  [pdf, other]

    cs.CV

    MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

    Authors: Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny

    Abstract: This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the complexities of videos. Building upon the success of MiniGPT-v2, which excelled in translating visual features into the LLM space for single images and achieved imp…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: 6 pages, 8 figures

  8. arXiv:2402.02453  [pdf, other]

    cs.CV

    AI Art Neural Constellation: Revealing the Collective and Contrastive State of AI-Generated and Human Art

    Authors: Faizan Farooq Khan, Diana Kim, Divyansh Jha, Youssef Mohamed, Hanna H Chang, Ahmed Elgammal, Luba Elliott, Mohamed Elhoseiny

    Abstract: Discovering the creative potentials of a random signal to various artistic expressions in aesthetic and conceptual richness is a ground for the recent success of generative machine learning as a way of art creation. To understand the new artistic medium better, we conduct a comprehensive analysis to position AI-generated art within the context of human art heritage. Our comparative analysis is bas…

    Submitted 4 February, 2024; originally announced February 2024.

  9. arXiv:2312.03026  [pdf, other]

    cs.CV

    Uni3DL: Unified Model for 3D and Language Understanding

    Authors: Xiang Li, Jian Ding, Zhaoyang Chen, Mohamed Elhoseiny

    Abstract: In this work, we present Uni3DL, a unified model for 3D and Language understanding. Distinct from existing unified vision-language models in 3D which are limited in task variety and predominantly dependent on projected multi-view images, Uni3DL operates directly on point clouds. This approach significantly expands the range of supported tasks in 3D, encompassing both vision and vision-language tas…

    Submitted 5 December, 2023; originally announced December 2023.

  10. arXiv:2312.02252  [pdf, other]

    cs.CV

    StoryGPT-V: Large Language Models as Consistent Story Visualizers

    Authors: Xiaoqian Shen, Mohamed Elhoseiny

    Abstract: Recent generative models have demonstrated impressive capabilities in generating realistic and visually pleasing images grounded on textual prompts. Nevertheless, a significant challenge remains in applying these models for the more intricate task of story visualization. Since it requires resolving pronouns (he, she, they) in the frame descriptions, i.e., anaphora resolution, and ensuring consiste…

    Submitted 13 December, 2023; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: Project page: https://xiaoqian-shen.github.io/StoryGPT-V

  11. arXiv:2312.00923  [pdf, other]

    cs.LG cs.CV

    Label Delay in Online Continual Learning

    Authors: Botos Csaba, Wenxuan Zhang, Matthias Müller, Ser-Nam Lim, Mohamed Elhoseiny, Philip Torr, Adel Bibi

    Abstract: Online continual learning, the process of training models on streaming data, has gained increasing attention in recent years. However, a critical aspect often overlooked is the label delay, where new data may not be labeled due to slow and costly annotation processes. We introduce a new continual learning framework with explicit modeling of the label delay between data and label streams over time…

    Submitted 25 April, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: 17 pages, 12 figures

    ACM Class: I.4.0; I.4.10

  12. arXiv:2311.14542  [pdf, other]

    cs.CV

    ToddlerDiffusion: Flash Interpretable Controllable Diffusion Model

    Authors: Eslam Mohamed Bakr, Liangbing Zhao, Vincent Tao Hu, Matthieu Cord, Patrick Perez, Mohamed Elhoseiny

    Abstract: Diffusion-based generative models excel in perceptually impressive synthesis but face challenges in interpretability. This paper introduces ToddlerDiffusion, an interpretable 2D diffusion image-synthesis framework inspired by the human generation system. Unlike traditional diffusion models with opaque denoising steps, our approach decomposes the generation process into simpler, interpretable stage…

    Submitted 24 November, 2023; originally announced November 2023.

  13. arXiv:2310.18511  [pdf, other]

    cs.CV cs.AI

    3DCoMPaT$^{++}$: An improved Large-scale 3D Vision Dataset for Compositional Recognition

    Authors: Habib Slim, Xiang Li, Yuchen Li, Mahmoud Ahmed, Mohamed Ayman, Ujjwal Upadhyay, Ahmed Abdelreheem, Arpit Prajapati, Suhail Pothigara, Peter Wonka, Mohamed Elhoseiny

    Abstract: In this work, we present 3DCoMPaT$^{++}$, a multimodal 2D/3D dataset with 160 million rendered views of more than 10 million stylized 3D shapes carefully annotated at the part-instance level, alongside matching RGB point clouds, 3D textured meshes, depth maps, and segmentation masks. 3DCoMPaT$^{++}$ covers 41 shape categories, 275 fine-grained part categories, and 293 fine-grained material classes…

    Submitted 12 March, 2024; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: https://3dcompat-dataset.org/v2/

  14. arXiv:2310.17493  [pdf, other]

    cs.CV

    A Hybrid Graph Network for Complex Activity Detection in Video

    Authors: Salman Khan, Izzeddin Teeti, Andrew Bradley, Mohamed Elhoseiny, Fabio Cuzzolin

    Abstract: Interpretation and understanding of video presents a challenging computer vision task in numerous fields - e.g. autonomous driving and sports analytics. Existing approaches to interpreting the actions taking place within a video clip are based upon Temporal Action Localisation (TAL), which typically identifies short-term actions. The emerging field of Complex Activity Detection (CompAD) extends th…

    Submitted 30 October, 2023; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: Accepted at WACV 2024

  15. arXiv:2310.09478  [pdf, other]

    cs.CV

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Authors: Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny

    Abstract: Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language t…

    Submitted 7 November, 2023; v1 submitted 13 October, 2023; originally announced October 2023.

  16. arXiv:2310.06214  [pdf, other]

    cs.CV

    CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

    Authors: Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny

    Abstract: 3D visual grounding is the ability to localize objects in 3D scenes conditioned by utterances. Most existing methods devote the referring head to localize the referred object directly, causing failure in complex scenarios. In addition, it does not illustrate how and why the network reaches the final decision. In this paper, we address this question: Can we design an interpretable 3D visual groundin…

    Submitted 20 April, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  17. arXiv:2308.16349  [pdf, other]

    cs.CL

    Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations

    Authors: Kilichbek Haydarov, Xiaoqian Shen, Avinash Madasu, Mahmoud Salem, Li-Jia Li, Gamaleldin Elsayed, Mohamed Elhoseiny

    Abstract: We introduce Affective Visual Dialog, an emotion explanation and reasoning task as a testbed for research on understanding the formation of emotions in visually grounded conversations. The task involves three skills: (1) Dialog-based Question Answering (2) Dialog-based Emotion Prediction and (3) Affective emotion explanation generation based on the dialog. Our key contribution is the collection of…

    Submitted 12 September, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

  18. arXiv:2308.12462  [pdf, other]

    cs.CV

    Overcoming Generic Knowledge Loss with Selective Parameter Update

    Authors: Wenxuan Zhang, Paul Janson, Rahaf Aljundi, Mohamed Elhoseiny

    Abstract: Foundation models encompass an extensive knowledge base and offer remarkable transferability. However, this knowledge becomes outdated or insufficient over time. The challenge lies in continuously updating foundation models to accommodate novel information while retaining their original capabilities. Leveraging the fact that foundation models have initial knowledge on various tasks and domains, we…

    Submitted 19 April, 2024; v1 submitted 23 August, 2023; originally announced August 2023.

  19. arXiv:2308.12366  [pdf, other]

    cs.CV

    Continual Zero-Shot Learning through Semantically Guided Generative Random Walks

    Authors: Wenxuan Zhang, Paul Janson, Kai Yi, Ivan Skorokhodov, Mohamed Elhoseiny

    Abstract: Learning novel concepts, remembering previous knowledge, and adapting it to future tasks occur simultaneously throughout a human's lifetime. To model such comprehensive abilities, continual zero-shot learning (CZSL) has recently been introduced. However, most existing methods overused unseen semantic information that may not be continually accessible in realistic settings. In this paper, we addres…

    Submitted 23 August, 2023; originally announced August 2023.

    Comments: Accepted to ICCV 2023

  20. arXiv:2307.11636  [pdf, other]

    cs.CV cs.CL

    OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?

    Authors: Runjia Li, Shuyang Sun, Mohamed Elhoseiny, Philip Torr

    Abstract: This paper presents OxfordTVG-HIC (Humorous Image Captions), a large-scale dataset for humour generation and understanding. Humour is an abstract, subjective, and context-dependent cognitive construct involving several cognitive factors, making it a challenging task to generate and interpret. Hence, humour generation and understanding can serve as a new task for evaluating the ability of deep-lear…

    Submitted 21 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV 2023

  21. arXiv:2306.00576  [pdf, other]

    cs.CV

    MammalNet: A Large-scale Video Benchmark for Mammal Recognition and Behavior Understanding

    Authors: Jun Chen, Ming Hu, Darren J. Coker, Michael L. Berumen, Blair Costelloe, Sara Beery, Anna Rohrbach, Mohamed Elhoseiny

    Abstract: Monitoring animal behavior can facilitate conservation efforts by providing key insights into wildlife health, population status, and ecosystem function. Automatic recognition of animals and their behaviors is critical for capitalizing on the large unlabeled datasets generated by modern video devices and for accelerating monitoring efforts at scale. However, the development of automated recognitio…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: CVPR 2023 proceedings

  22. arXiv:2306.00450  [pdf, other]

    cs.CV

    Exploring Open-Vocabulary Semantic Segmentation without Human Labels

    Authors: Jun Chen, Deyao Zhu, Guocheng Qian, Bernard Ghanem, Zhicheng Yan, Chenchen Zhu, Fanyi Xiao, Mohamed Elhoseiny, Sean Chang Culatana

    Abstract: Semantic segmentation is a crucial task in computer vision that involves segmenting images into semantically meaningful regions at the pixel level. However, existing approaches often rely on expensive human annotations as supervision for model training, limiting their scalability to large, unlabeled datasets. To address this challenge, we present ZeroSeg, a novel method that leverages the existing…

    Submitted 1 June, 2023; originally announced June 2023.

  23. arXiv:2304.10592  [pdf, other]

    cs.CV

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Authors: Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny

    Abstract: The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4…

    Submitted 2 October, 2023; v1 submitted 20 April, 2023; originally announced April 2023.

    Comments: Project Website: https://minigpt-4.github.io/; Code, Pretrained Model, and Dataset: https://github.com/Vision-CAIR/MiniGPT-4; Deyao Zhu and Jun Chen contributed equally to this work

  24. arXiv:2304.09349   

    cs.AI cs.CL cs.RO

    LLM as A Robotic Brain: Unifying Egocentric Memory and Control

    Authors: Jinjie Mai, Jun Chen, Bing Li, Guocheng Qian, Mohamed Elhoseiny, Bernard Ghanem

    Abstract: Embodied AI focuses on the study and development of intelligent systems that possess a physical or virtual embodiment (i.e. robots) and are able to dynamically interact with their environment. Memory and control are the two essential parts of an embodied system and usually require separate frameworks to model each of them. In this paper, we propose a novel and generalizable framework called LLM-Br…

    Submitted 12 June, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

    Comments: This early project is now integrated into: Mindstorms in Natural Language-Based Societies of Mind, arXiv:2305.17066

  25. arXiv:2304.05390  [pdf, other]

    cs.CV cs.AI cs.LG

    HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models

    Authors: Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, Mohamed Elhoseiny

    Abstract: In recent years, Text-to-Image (T2I) models have been extensively studied, especially with the emergence of diffusion models that achieve state-of-the-art results on T2I synthesis tasks. However, existing benchmarks heavily rely on subjective human evaluation, limiting their ability to holistically assess the model's capabilities. Furthermore, there is a significant gap between efforts in developi…

    Submitted 23 November, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

    Comments: ICCV 2023

  26. arXiv:2304.04874  [pdf, other]

    cs.CV cs.AI cs.LG

    ImageCaptioner$^2$: Image Captioner for Image Captioning Bias Amplification Assessment

    Authors: Eslam Mohamed Bakr, Pengzhan Sun, Li Erran Li, Mohamed Elhoseiny

    Abstract: Most pre-trained learning systems are known to suffer from bias, which typically emerges from the data, the model, or both. Measuring and quantifying bias and its sources is a challenging task and has been extensively studied in image captioning. Despite the significant effort in this direction, we observed that existing metrics lack consistency in the inclusion of the visual signal. In this paper…

    Submitted 5 June, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

  27. arXiv:2304.04227  [pdf, other]

    cs.CV cs.AI

    Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

    Authors: Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, Mohamed Elhoseiny

    Abstract: Video captioning aims to convey dynamic scenes from videos using natural language, facilitating the understanding of spatiotemporal information within our environment. Although there have been recent advances, generating detailed and enriched video descriptions continues to be a substantial challenge. In this work, we introduce Video ChatCaptioner, an innovative approach for creating more comprehe…

    Submitted 24 May, 2023; v1 submitted 9 April, 2023; originally announced April 2023.

  28. arXiv:2304.02777  [pdf, other]

    cs.CV

    MoStGAN-V: Video Generation with Temporal Motion Styles

    Authors: Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny

    Abstract: Video generation remains a challenging task due to spatiotemporal complexity and the requirement of synthesizing diverse motions with temporal consistency. Previous works attempt to generate videos in arbitrary lengths either in an autoregressive manner or regarding time as a continuous signal. However, they struggle to synthesize detailed and diverse motions with temporal coherence and tend to ge…

    Submitted 5 April, 2023; originally announced April 2023.

  29. arXiv:2303.06594  [pdf, other]

    cs.CV cs.AI cs.LG

    ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions

    Authors: Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, Mohamed Elhoseiny

    Abstract: Asking insightful questions is crucial for acquiring knowledge and expanding our understanding of the world. However, the importance of questioning has been largely overlooked in AI research, where models have been primarily developed to answer questions. With the recent advancements of large language models (LLMs) like ChatGPT, we discover their capability to ask high-quality questions when provi…

    Submitted 12 March, 2023; originally announced March 2023.

  30. arXiv:2303.04654  [pdf, other]

    cs.CV eess.IV physics.optics

    Aberration-Aware Depth-from-Focus

    Authors: Xinge Yang, Qiang Fu, Mohammed Elhoseiny, Wolfgang Heidrich

    Abstract: Computer vision methods for depth estimation usually use simple camera models with idealized optics. For modern machine learning approaches, this creates an issue when attempting to train deep networks with simulated data, especially for focus-sensitive tasks like Depth-from-Focus. In this work, we investigate the domain gap caused by off-axis aberrations that will affect the decision of the best-…

    Submitted 17 July, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

    Comments: [ICCP & TPAMI 2023] Considering optical aberrations during network training can improve the generalizability

  31. arXiv:2301.12876  [pdf, other]

    cs.LG cs.AI

    Guiding Online Reinforcement Learning with Action-Free Offline Pretraining

    Authors: Deyao Zhu, Yuhui Wang, Jürgen Schmidhuber, Mohamed Elhoseiny

    Abstract: Offline RL methods have been shown to reduce the need for environment interaction by training agents using offline collected episodes. However, these methods typically require action information to be logged during data collection, which can be difficult or even impossible in some practical cases. In this paper, we investigate the potential of using action-free offline datasets to improve online r…

    Submitted 22 March, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

  32. arXiv:2211.14241  [pdf, other]

    cs.CV

    Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding

    Authors: Eslam Mohamed Bakr, Yasmeen Alsaedy, Mohamed Elhoseiny

    Abstract: The 3D visual grounding task has been explored with visual and language streams comprehending referential language to identify target objects in 3D scenes. However, most existing methods devote the visual stream to capturing the 3D visual clues using off-the-shelf point clouds encoders. The main question we address in this paper is "can we consolidate the 3D visual stream by 2D clues synthesized f…

    Submitted 25 November, 2022; originally announced November 2022.

    Journal ref: NeurIPS 2022

  33. arXiv:2211.10780  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG

    ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture

    Authors: Youssef Mohamed, Mohamed Abdelfattah, Shyma Alhuwaider, Feifan Li, Xiangliang Zhang, Kenneth Ward Church, Mohamed Elhoseiny

    Abstract: This paper introduces ArtELingo, a new benchmark and dataset, designed to encourage work on diversity across languages and cultures. Following ArtEmis, a collection of 80k artworks from WikiArt with 0.45M emotion labels and English-only captions, ArtELingo adds another 0.79M annotations in Arabic and Chinese, plus 4.8K in Spanish to evaluate "cultural-transfer" performance. More than 51K artworks…

    Submitted 19 November, 2022; originally announced November 2022.

    Comments: 9 pages, Accepted at EMNLP 22, for more details see https://www.artelingo.org/

  34. arXiv:2210.04428  [pdf, ps, other]

    cs.CV cs.LG

    A Simple Baseline that Questions the Use of Pretrained-Models in Continual Learning

    Authors: Paul Janson, Wenxuan Zhang, Rahaf Aljundi, Mohamed Elhoseiny

    Abstract: With the success of pretraining techniques in representation learning, a number of continual learning methods based on pretrained models have been proposed. Some of these methods design continual learning mechanisms on the pre-trained representations and only allow minimum updates or even no updates of the backbone models during the training of continual learning. In this paper, we question whethe…

    Submitted 29 March, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: 6 pages, Workshop on Distribution Shifts 2022, Code available at https://github.com/Pauljanson002/pretrained-cl.git

  35. arXiv:2206.04670  [pdf, other]

    cs.CV cs.AI

    PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies

    Authors: Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Abed Al Kader Hammoud, Mohamed Elhoseiny, Bernard Ghanem

    Abstract: PointNet++ is one of the most influential neural architectures for point cloud understanding. Although the accuracy of PointNet++ has been largely surpassed by recent networks such as PointMLP and Point Transformer, we find that a large portion of the performance gain is due to improved training strategies, i.e. data augmentation and optimization techniques, and increased model sizes rather than a…

    Submitted 12 October, 2022; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: Accepted by NeurIPS'22. Code and models are available at https://github.com/guochengqian/pointnext

  36. arXiv:2206.04384  [pdf, other]

    cs.LG cs.AI

    Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning

    Authors: Deyao Zhu, Li Erran Li, Mohamed Elhoseiny

    Abstract: Reinforcement Learning (RL) methods are typically applied directly in environments to learn policies. In some complex environments with continuous state-action spaces, sparse rewards, and/or long temporal horizons, learning a good policy in the original environments can be difficult. Focusing on the offline RL setting, we aim to build a simple and discrete world model that abstracts the original e…

    Submitted 2 May, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

  37. arXiv:2206.00790  [pdf, other]

    cs.CV

    Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction

    Authors: Jun Chen, Ming Hu, Boyang Li, Mohamed Elhoseiny

    Abstract: Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among these, generative self-supervised vision learning approaches such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To ad…

    Submitted 20 June, 2022; v1 submitted 1 June, 2022; originally announced June 2022.

    Comments: Add code

  38. arXiv:2204.07660  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    It is Okay to Not Be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection

    Authors: Youssef Mohamed, Faizan Farooq Khan, Kilichbek Haydarov, Mohamed Elhoseiny

    Abstract: Datasets that capture the connection between vision, language, and affection are limited, causing a lack of understanding of the emotional aspect of human intelligence. As a step in this direction, the ArtEmis dataset was recently introduced as a large-scale dataset of emotional reactions to images along with language explanations of these chosen emotions. We observed a significant emotional bias…

    Submitted 15 April, 2022; originally announced April 2022.

    Comments: 8 pages, Accepted at CVPR 22, for more details see https://www.artemisdataset-v2.org

  39. arXiv:2203.03057  [pdf, other]

    cs.CV cs.LG cs.RO

    Social-Implicit: Rethinking Trajectory Prediction Evaluation and The Effectiveness of Implicit Maximum Likelihood Estimation

    Authors: Abduallah Mohamed, Deyao Zhu, Warren Vu, Mohamed Elhoseiny, Christian Claudel

    Abstract: Best-of-N (BoN) Average Displacement Error (ADE)/ Final Displacement Error (FDE) is the most used metric for evaluating trajectory prediction models. Yet, the BoN does not quantify the whole generated samples, resulting in an incomplete view of the model's prediction quality and performance. We propose a new metric, Average Mahalanobis Distance (AMD) to tackle this issue. AMD is a metric that quan…

    Submitted 10 September, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

    Comments: Accepted at ECCV 2022

  40. arXiv:2203.01386  [pdf, other]

    cs.CV cs.AI

    Exploring Hierarchical Graph Representation for Large-Scale Zero-Shot Image Classification

    Authors: Kai Yi, Xiaoqian Shen, Yunhao Gou, Mohamed Elhoseiny

    Abstract: The main question we address in this paper is how to scale up visual recognition of unseen classes, also known as zero-shot learning, to tens of thousands of categories as in the ImageNet-21K benchmark. At this scale, especially with many fine-grained categories included in ImageNet-21K, it is critical to learn quality visual semantic representations that are discriminative enough to recognize uns…

    Submitted 19 July, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

    Comments: ECCV 2022, camera-ready version

  41. arXiv:2201.01942  [pdf, other]

    cs.LG stat.ML

    Efficiently Disentangle Causal Representations

    Authors: Yuanpeng Li, Joel Hestness, Mohamed Elhoseiny, Liang Zhao, Kenneth Church

    Abstract: This paper proposes an efficient approach to learning disentangled representations with causal mechanisms based on the difference of conditional probabilities in original and new distributions. We approximate the difference with models' generalization abilities so that it fits in the standard machine learning framework and can be efficiently computed. In contrast to the state-of-the-art approach,…

    Submitted 1 January, 2024; v1 submitted 6 January, 2022; originally announced January 2022.

    Comments: 17 pages, 7 figures

    Report number: Causal-01

  42. arXiv:2112.14683  [pdf, other

    cs.CV cs.AI cs.LG

    StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2

    Authors: Ivan Skorokhodov, Sergey Tulyakov, Mohamed Elhoseiny

    Abstract: Videos show continuous events, yet most, if not all, video synthesis frameworks treat them discretely in time. In this work, we think of videos as what they should be, time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. T…

    Submitted 31 May, 2022; v1 submitted 29 December, 2021; originally announced December 2021.

    Comments: CVPR 2022

  43. arXiv:2112.12989  [pdf, other

    cs.CV cs.LG

    Domain-Aware Continual Zero-Shot Learning

    Authors: Kai Yi, Paul Janson, Wenxuan Zhang, Mohamed Elhoseiny

    Abstract: Modern visual systems have a wide range of potential applications in vision tasks for natural science research, such as aiding in species discovery, monitoring animals in the wild, and so on. However, real-world vision tasks may experience changes in environmental conditions, leading to shifts in how captured images are presented. To address this issue, we introduce Domain-Aware Continual Zero-Sho…

    Submitted 12 March, 2024; v1 submitted 24 December, 2021; originally announced December 2021.

  44. arXiv:2104.11934  [pdf, other

    cs.CV cs.AI

    RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition

    Authors: Jun Chen, Aniket Agarwal, Sherif Abdelkarim, Deyao Zhu, Mohamed Elhoseiny

    Abstract: The visual relationship recognition (VRR) task aims at understanding the pairwise visual relationships between interacting objects in an image. These relationships typically have a long-tail distribution due to their compositional nature. This problem gets more severe when the vocabulary becomes large, rendering this task very challenging. This paper shows that modeling an effective message-passin…

    Submitted 29 March, 2022; v1 submitted 24 April, 2021; originally announced April 2021.

  45. arXiv:2104.09757  [pdf, other

    cs.CV cs.AI

    Imaginative Walks: Generative Random Walk Deviation Loss for Improved Unseen Learning Representation

    Authors: Divyansh Jha, Kai Yi, Ivan Skorokhodov, Mohamed Elhoseiny

    Abstract: We propose a novel loss for generative models, dubbed GRaWD (Generative Random Walk Deviation), to improve learning representations of unexplored visual spaces. Quality learning representation of unseen classes (or styles) is critical to facilitate novel image generation and better generative understanding of unseen visual classes, i.e., zero-shot learning (ZSL). By generating representations o…

    Submitted 24 September, 2021; v1 submitted 20 April, 2021; originally announced April 2021.

    Comments: Project homepage: https://imaginative-walks.github.io

  46. arXiv:2104.06954  [pdf, other

    cs.CV cs.AI

    Aligning Latent and Image Spaces to Connect the Unconnectable

    Authors: Ivan Skorokhodov, Grigorii Sotnikov, Mohamed Elhoseiny

    Abstract: In this work, we develop a method to generate infinite high-resolution images with diverse and complex content. It is based on a perfectly equivariant generator with synchronous interpolations in the image and latent spaces. Latent codes, when sampled, are positioned on the coordinate grid, and each pixel is computed from an interpolation of the nearby style codes. We modify the AdaIN mechanism to…

    Submitted 14 April, 2021; originally announced April 2021.

  47. arXiv:2102.10407  [pdf, other

    cs.CV cs.AI cs.CL cs.MM

    VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

    Authors: Jun Chen, Han Guo, Kai Yi, Boyang Li, Mohamed Elhoseiny

    Abstract: The ability to quickly learn from a small quantity of training data widens the range of machine learning applications. In this paper, we propose a data-efficient image captioning model, VisualGPT, which leverages the linguistic knowledge from a large pretrained language model (LM). A crucial challenge is to balance between the use of visual information in the image and prior linguistic knowledge acq…

    Submitted 30 March, 2022; v1 submitted 20 February, 2021; originally announced February 2021.

  48. arXiv:2101.07396  [pdf, other

    cs.CV cs.CL

    ArtEmis: Affective Language for Visual Art

    Authors: Panos Achlioptas, Maks Ovsjanikov, Kilichbek Haydarov, Mohamed Elhoseiny, Leonidas Guibas

    Abstract: We present a novel large-scale dataset and accompanying machine learning models aimed at providing a detailed understanding of the interplay between visual content, its emotional effect, and explanations for the latter in language. In contrast to most existing annotation datasets in computer vision, we focus on the affective experience triggered by visual artworks and ask the annotators to indicat…

    Submitted 18 January, 2021; originally announced January 2021.

    Comments: https://artemisdataset.org

  49. arXiv:2101.00173  [pdf, other

    cs.CV cs.AI cs.CL

    CIZSL++: Creativity Inspired Generative Zero-Shot Learning

    Authors: Mohamed Elhoseiny, Kai Yi, Mohamed Elfeki

    Abstract: Zero-shot learning (ZSL) aims at understanding unseen categories with no training examples from class-level descriptions. To improve the discriminative power of ZSL, we model the visual learning process of unseen categories with inspiration from the psychology of human creativity for producing novel art. First, we propose CIZSL-v1 as a creativity inspired model for generative ZSL. We relate ZSL to…

    Submitted 17 February, 2021; v1 submitted 1 January, 2021; originally announced January 2021.

    Comments: This paper is an extended version of a paper published at the International Conference on Computer Vision (ICCV), held in Seoul, Republic of Korea, October 27 to November 2, 2019. CIZSL-v2 code is available at https://github.com/Vision-CAIR/CIZSLv2. arXiv admin note: substantial text overlap with arXiv:1904.01109

    Journal ref: https://openaccess.thecvf.com/content_ICCV_2019/papers/Elhoseiny_Creativity_Inspired_Zero-Shot_Learning_ICCV_2019_paper.pdf

  50. arXiv:2011.12026  [pdf, other

    cs.CV cs.AI cs.LG

    Adversarial Generation of Continuous Images

    Authors: Ivan Skorokhodov, Savva Ignatyev, Mohamed Elhoseiny

    Abstract: In most existing learning systems, images are typically viewed as 2D pixel arrays. However, in another paradigm gaining popularity, a 2D image is represented as an implicit neural representation (INR) - an MLP that predicts an RGB pixel value given its (x,y) coordinate. In this paper, we propose two novel architectural techniques for building INR-based image decoders: factorized multiplicative mod…

    Submitted 28 June, 2021; v1 submitted 24 November, 2020; originally announced November 2020.

    Comments: 19 pages, 17 figures