-
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Authors:
Shraman Pramanick,
Guangxing Han,
Rui Hou,
Sayan Nag,
Ser-Nam Lim,
Nicolas Ballas,
Qifan Wang,
Rama Chellappa,
Amjad Almahairi
Abstract:
The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a…
▽ More
The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.
△ Less
Submitted 19 June, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Authors:
Kristen Grauman,
Andrew Westbury,
Lorenzo Torresani,
Kris Kitani,
Jitendra Malik,
Triantafyllos Afouras,
Kumar Ashutosh,
Vijay Baiyya,
Siddhant Bansal,
Bikram Boote,
Eugene Byrne,
Zach Chavis,
Joya Chen,
Feng Cheng,
Fu-Jen Chu,
Sean Crane,
Avijit Dasgupta,
**g Dong,
Maria Escobar,
Cristhian Forigua,
Abrham Gebreselasie,
Sanjay Haresh,
**g Huang,
Md Mohaiminul Islam,
Suyog Jain
, et al. (76 additional authors not shown)
Abstract:
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…
▽ More
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/
△ Less
Submitted 29 April, 2024; v1 submitted 30 November, 2023;
originally announced November 2023.
-
Uniform Partitioning of a Bounded Region using Opaque ASYNC Luminous Mobile Robots
Authors:
Subhajit Pramanick,
Saswata Jana,
Adri Bhattacharya,
Partha Sarathi Mandal
Abstract:
We are given $N$ autonomous mobile robots inside a bounded region. The robots are opaque which means that three collinear robots are unable to see each other as one of the robots acts as an obstruction for the other two. They operate in classical \emph{Look-Compute-Move} (LCM) activation cycles. Moreover, the robots are oblivious except for a persistent light (which is why they are called \emph{Lu…
▽ More
We are given $N$ autonomous mobile robots inside a bounded region. The robots are opaque which means that three collinear robots are unable to see each other as one of the robots acts as an obstruction for the other two. They operate in classical \emph{Look-Compute-Move} (LCM) activation cycles. Moreover, the robots are oblivious except for a persistent light (which is why they are called \emph{Luminous robots}) that can determine a color from a fixed color set. Obliviousness does not allow the robots to remember any information from past activation cycles. The Uniform Partitioning problem requires the robots to partition the whole region into sub-regions of equal area, each of which contains exactly one robot. Due to application-oriented motivation, we, in this paper consider the region to be well-known geometric shapes such as rectangle, square and circle. We investigate the problem in \emph{asynchronous} setting where there is no notion of common time and any robot gets activated at any time with a fair assumption that every robot needs to get activated infinitely often. To the best of our knowledge, this is the first attempt to study the Uniform Partitioning problem using oblivious opaque robots working under asynchronous settings. We propose three algorithms considering three different regions: rectangle, square and circle. The algorithms proposed for rectangular and square regions run in $O(N)$ epochs whereas the algorithm for circular regions runs in $O(N^2)$ epochs, where an epoch is the smallest unit of time in which all robots are activated at least once and execute their LCM cycles. The algorithms for the rectangular, square and circular regions require $2$ (which is optimal), $5$ and $8$ colors, respectively.
△ Less
Submitted 1 May, 2024; v1 submitted 8 November, 2023;
originally announced November 2023.
-
UniVTG: Towards Unified Video-Language Temporal Grounding
Authors:
Kevin Qinghong Lin,
Pengchuan Zhang,
Joya Chen,
Shraman Pramanick,
Difei Gao,
Alex **peng Wang,
Rui Yan,
Mike Zheng Shou
Abstract:
Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop taskspecific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detect…
▽ More
Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop taskspecific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions: Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Lastly, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The codes are available at https://github.com/showlab/UniVTG.
△ Less
Submitted 18 August, 2023; v1 submitted 31 July, 2023;
originally announced July 2023.
-
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Authors:
Shraman Pramanick,
Yale Song,
Sayan Nag,
Kevin Qinghong Lin,
Hardik Shah,
Mike Zheng Shou,
Rama Chellappa,
Pengchuan Zhang
Abstract:
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of e…
▽ More
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement from the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representation during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion in the backbone strategy is more lightweight and compute-efficient than stacking additional fusion-specific layers. Extensive experiments on a wide range of VL tasks demonstrate the effectiveness of EgoVLPv2 by achieving consistent state-of-the-art performance over strong baselines across all downstream. Our project page can be found at https://shramanpramanick.github.io/EgoVLPv2/.
△ Less
Submitted 18 August, 2023; v1 submitted 11 July, 2023;
originally announced July 2023.
-
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Authors:
Shraman Pramanick,
Li **g,
Sayan Nag,
Jiachen Zhu,
Hardik Shah,
Yann LeCun,
Rama Chellappa
Abstract:
Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accu…
▽ More
Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.
△ Less
Submitted 29 October, 2023; v1 submitted 8 October, 2022;
originally announced October 2022.
-
Where in the World is this Image? Transformer-based Geo-localization in the Wild
Authors:
Shraman Pramanick,
Ewa M. Nowara,
Joshua Gleason,
Carlos D. Castillo,
Rama Chellappa
Abstract:
Predicting the geographic location (geo-localization) from a single ground-level RGB image taken anywhere in the world is a very challenging problem. The challenges include huge diversity of images due to different environmental scenarios, drastic variation in the appearance of the same location depending on the time of the day, weather, season, and more importantly, the prediction is made from a…
▽ More
Predicting the geographic location (geo-localization) from a single ground-level RGB image taken anywhere in the world is a very challenging problem. The challenges include huge diversity of images due to different environmental scenarios, drastic variation in the appearance of the same location depending on the time of the day, weather, season, and more importantly, the prediction is made from a single image possibly having only a few geo-locating cues. For these reasons, most existing works are restricted to specific cities, imagery, or worldwide landmarks. In this work, we focus on develo** an efficient solution to planet-scale single-image geo-localization. To this end, we propose TransLocator, a unified dual-branch transformer network that attends to tiny details over the entire image and produces robust feature representation under extreme appearance variations. TransLocator takes an RGB image and its semantic segmentation map as inputs, interacts between its two parallel branches after each transformer layer, and simultaneously performs geo-localization and scene recognition in a multi-task fashion. We evaluate TransLocator on four benchmark datasets - Im2GPS, Im2GPS3k, YFCC4k, YFCC26k and obtain 5.5%, 14.1%, 4.9%, 9.9% continent-level accuracy improvement over the state-of-the-art. TransLocator is also validated on real-world test images and found to be more effective than previous methods.
△ Less
Submitted 25 July, 2022; v1 submitted 28 April, 2022;
originally announced April 2022.
-
Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection
Authors:
Shraman Pramanick,
Aniket Roy,
Vishal M. Patel
Abstract:
Multimodal learning is an emerging yet challenging research area. In this paper, we deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs. Being a fleeting action, which is reflected across the modalities, sarcasm detection is challenging since large datasets are not available for this task in the literature. Therefore, we primarily focus on resource-cons…
▽ More
Multimodal learning is an emerging yet challenging research area. In this paper, we deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs. Being a fleeting action, which is reflected across the modalities, sarcasm detection is challenging since large datasets are not available for this task in the literature. Therefore, we primarily focus on resource-constrained training, where the number of training samples is limited. To this end, we propose a novel multimodal learning system, MuLOT (Multimodal Learning using Optimal Transport), which utilizes self-attention to exploit intra-modal correspondence and optimal transport for cross-modal correspondence. Finally, the modalities are combined with multimodal attention fusion to capture the inter-dependencies across modalities. We test our approach for multimodal sarcasm and humor detection on three benchmark datasets - MUStARD (video, audio, text), UR-FUNNY (video, audio, text), MST (image, text) and obtain 2.1%, 1.54%, and 2.34% accuracy improvements over state-of-the-art.
△ Less
Submitted 21 October, 2021;
originally announced October 2021.
-
Detecting Harmful Memes and Their Targets
Authors:
Shraman Pramanick,
Dimitar Dimitrov,
Rituparna Mukherjee,
Shivam Sharma,
Md. Shad Akhtar,
Preslav Nakov,
Tanmoy Chakraborty
Abstract:
Among the various modes of communication in social media, the use of Internet memes has emerged as a powerful means to convey political, psychological, and socio-cultural opinions. Although memes are typically humorous in nature, recent days have witnessed a proliferation of harmful memes targeted to abuse various social entities. As most harmful memes are highly satirical and abstruse without app…
▽ More
Among the various modes of communication in social media, the use of Internet memes has emerged as a powerful means to convey political, psychological, and socio-cultural opinions. Although memes are typically humorous in nature, recent days have witnessed a proliferation of harmful memes targeted to abuse various social entities. As most harmful memes are highly satirical and abstruse without appropriate contexts, off-the-shelf multimodal models may not be adequate to understand their underlying semantics. In this work, we propose two novel problem formulations: detecting harmful memes and the social entities that these harmful memes target. To this end, we present HarMeme, the first benchmark dataset, containing 3,544 memes related to COVID-19. Each meme went through a rigorous two-stage annotation process. In the first stage, we labeled a meme as very harmful, partially harmful, or harmless; in the second stage, we further annotated the type of target(s) that each harmful meme points to: individual, organization, community, or society/general public/other. The evaluation results using ten unimodal and multimodal models highlight the importance of using multimodal signals for both tasks. We further discuss the limitations of these models and we argue that more research is needed to address these problems.
△ Less
Submitted 24 September, 2021;
originally announced October 2021.
-
MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets
Authors:
Shraman Pramanick,
Shivam Sharma,
Dimitar Dimitrov,
Md Shad Akhtar,
Preslav Nakov,
Tanmoy Chakraborty
Abstract:
Internet memes have become powerful means to transmit political, psychological, and socio-cultural ideas. Although memes are typically humorous, recent days have witnessed an escalation of harmful memes used for trolling, cyberbullying, and abuse. Detecting such memes is challenging as they can be highly satirical and cryptic. Moreover, while previous work has focused on specific aspects of memes…
▽ More
Internet memes have become powerful means to transmit political, psychological, and socio-cultural ideas. Although memes are typically humorous, recent days have witnessed an escalation of harmful memes used for trolling, cyberbullying, and abuse. Detecting such memes is challenging as they can be highly satirical and cryptic. Moreover, while previous work has focused on specific aspects of memes such as hate speech and propaganda, there has been little work on harm in general. Here, we aim to bridge this gap. We focus on two tasks: (i)detecting harmful memes, and (ii)identifying the social entities they target. We further extend a recently released HarMeme dataset, which covered COVID-19, with additional memes and a new topic: US politics. To solve these tasks, we propose MOMENTA (MultimOdal framework for detecting harmful MemEs aNd Their tArgets), a novel multimodal deep neural network that uses global and local perspectives to detect harmful memes. MOMENTA systematically analyzes the local and the global perspective of the input meme (in both modalities) and relates it to the background context. MOMENTA is interpretable and generalizable, and our experiments show that it outperforms several strong rivaling approaches.
△ Less
Submitted 22 September, 2021; v1 submitted 11 September, 2021;
originally announced September 2021.
-
Filling MIS Vertices by Myopic Luminous Robots
Authors:
Subhajit Pramanick,
Sai Vamshi Samala,
Debasish Pattanayak,
Partha Sarathi Mandal
Abstract:
We present the problem of finding a maximal independent set (MIS) (named as \emph{MIS Filling problem}) of an arbitrary connected graph having $n$ vertices with luminous myopic mobile robots. The robots enter the graph one after another from a particular vertex called the \emph{Door} and disperse along the edges of the graph without collision to occupy vertices such that the set of vertices occupi…
▽ More
We present the problem of finding a maximal independent set (MIS) (named as \emph{MIS Filling problem}) of an arbitrary connected graph having $n$ vertices with luminous myopic mobile robots. The robots enter the graph one after another from a particular vertex called the \emph{Door} and disperse along the edges of the graph without collision to occupy vertices such that the set of vertices occupied by the robots is a maximal independent set. We assume the robots have knowledge only about the maximum degree of the graph, denoted by $Δ$.
In this paper, we explore two versions of the problem: the solution to the first version, named as \emph{MIS Filling with Single Door}, works under an asynchronous scheduler using robots with 3 hops of visibility range, $Δ+ 6$ number of colors and $O(\log Δ)$ bits of persistent storage. The time complexity is measured in terms of epochs and it can be solved in $O(n^2)$ epochs. An epoch is the smallest time interval in which each participating robot gets activated and executes the algorithm at least once. For the second version with $k~ ( > 1)$ \textit{Doors}, named as \emph{MIS Filling with Multiple Doors}, the solution works under a semi-synchronous scheduler using robots with 5 hops of visibility range, $Δ+ k + 6$ number of colors and $O(\log (Δ+ k))$ bits of persistent storage. The problem with multiple Doors can be solved in $O(n^2)$ epochs.
△ Less
Submitted 15 October, 2022; v1 submitted 10 July, 2021;
originally announced July 2021.
-
See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization
Authors:
Yash Kumar Atri,
Shraman Pramanick,
Vikram Goyal,
Tanmoy Chakraborty
Abstract:
In recent years, abstractive text summarization with multimodal inputs has started drawing attention due to its ability to accumulate information from different source modalities and generate a fluent textual summary. However, existing methods use short videos as the visual modality and short summary as the ground-truth, therefore, perform poorly on lengthy videos and long ground-truth summary. Ad…
▽ More
In recent years, abstractive text summarization with multimodal inputs has started drawing attention due to its ability to accumulate information from different source modalities and generate a fluent textual summary. However, existing methods use short videos as the visual modality and short summary as the ground-truth, therefore, perform poorly on lengthy videos and long ground-truth summary. Additionally, there exists no benchmark dataset to generalize this task on videos of varying lengths. In this paper, we introduce AVIATE, the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc. We use the abstract of corresponding research papers as the reference summaries, which ensure adequate quality and uniformity of the ground-truth. We then propose FLORAL, a factorized multi-modal Transformer based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task. FLORAL utilizes an increasing number of self-attentions to capture multimodality and performs significantly better than traditional encoder-decoder based networks. Extensive experiments illustrate that FLORAL achieves significant improvement over the baselines in both qualitative and quantitative evaluations on the existing How2 dataset for short videos and newly introduced AVIATE dataset for videos with diverse duration, beating the best baseline on the two datasets by $1.39$ and $2.74$ ROUGE-L points respectively.
△ Less
Submitted 15 September, 2021; v1 submitted 20 May, 2021;
originally announced May 2021.
-
Exercise? I thought you said 'Extra Fries': Leveraging Sentence Demarcations and Multi-hop Attention for Meme Affect Analysis
Authors:
Shraman Pramanick,
Md Shad Akhtar,
Tanmoy Chakraborty
Abstract:
Today's Internet is awash in memes as they are humorous, satirical, or ironic which make people laugh. According to a survey, 33% of social media users in age bracket [13-35] send memes every day, whereas more than 50% send every week. Some of these memes spread rapidly within a very short time-frame, and their virality depends on the novelty of their (textual and visual) content. A few of them co…
▽ More
Today's Internet is awash in memes as they are humorous, satirical, or ironic which make people laugh. According to a survey, 33% of social media users in age bracket [13-35] send memes every day, whereas more than 50% send every week. Some of these memes spread rapidly within a very short time-frame, and their virality depends on the novelty of their (textual and visual) content. A few of them convey positive messages, such as funny or motivational quotes; while others are meant to mock/hurt someone's feelings through sarcastic or offensive messages. Despite the appealing nature of memes and their rapid emergence on social media, effective analysis of memes has not been adequately attempted to the extent it deserves.
In this paper, we attempt to solve the same set of tasks suggested in the SemEval'20-Memotion Analysis competition. We propose a multi-hop attention-based deep neural network framework, called MHA-MEME, whose prime objective is to leverage the spatial-domain correspondence between the visual modality (an image) and various textual segments to extract fine-grained feature representations for classification. We evaluate MHA-MEME on the 'Memotion Analysis' dataset for all three sub-tasks - sentiment classification, affect classification, and affect class quantification. Our comparative study shows sota performances of MHA-MEME for all three tasks compared to the top systems that participated in the competition. Unlike all the baselines which perform inconsistently across all three tasks, MHA-MEME outperforms baselines in all the tasks on average. Moreover, we validate the generalization of MHA-MEME on another set of manually annotated test samples and observe it to be consistent. Finally, we establish the interpretability of MHA-MEME.
△ Less
Submitted 23 March, 2021;
originally announced March 2021.