Showing 1–13 of 13 results for author: Pramanick, S

  1. arXiv:2312.12423  [pdf, other]

    cs.CV cs.AI

    Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

    Authors: Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi

    Abstract: The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a…

    Submitted 19 June, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 Highlight

  2. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  3. arXiv:2311.04536  [pdf, other]

    cs.DC

    Uniform Partitioning of a Bounded Region using Opaque ASYNC Luminous Mobile Robots

    Authors: Subhajit Pramanick, Saswata Jana, Adri Bhattacharya, Partha Sarathi Mandal

    Abstract: We are given $N$ autonomous mobile robots inside a bounded region. The robots are opaque which means that three collinear robots are unable to see each other as one of the robots acts as an obstruction for the other two. They operate in classical \emph{Look-Compute-Move} (LCM) activation cycles. Moreover, the robots are oblivious except for a persistent light (which is why they are called \emph{Lu…

    Submitted 1 May, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: This paper recently got accepted in ICDCN 2024

  4. arXiv:2307.16715  [pdf, other]

    cs.CV

    UniVTG: Towards Unified Video-Language Temporal Grounding

    Authors: Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

    Abstract: Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detect…

    Submitted 18 August, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV 2023. 16 pages, 10 figures, 13 tables. Code: https://github.com/showlab/UniVTG

  5. arXiv:2307.05463  [pdf, other]

    cs.CV

    EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

    Authors: Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang

    Abstract: Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of e…

    Submitted 18 August, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

    Comments: Published in ICCV 2023

  6. arXiv:2210.04135  [pdf, other]

    cs.CV cs.LG cs.MM

    VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

    Authors: Shraman Pramanick, Li Jing, Sayan Nag, Jiachen Zhu, Hardik Shah, Yann LeCun, Rama Chellappa

    Abstract: Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accu…

    Submitted 29 October, 2023; v1 submitted 8 October, 2022; originally announced October 2022.

    Comments: Published in TMLR 2023

  7. arXiv:2204.13861  [pdf, other]

    cs.CV

    Where in the World is this Image? Transformer-based Geo-localization in the Wild

    Authors: Shraman Pramanick, Ewa M. Nowara, Joshua Gleason, Carlos D. Castillo, Rama Chellappa

    Abstract: Predicting the geographic location (geo-localization) from a single ground-level RGB image taken anywhere in the world is a very challenging problem. The challenges include huge diversity of images due to different environmental scenarios, drastic variation in the appearance of the same location depending on the time of the day, weather, season, and more importantly, the prediction is made from a…

    Submitted 25 July, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

    Comments: Accepted in ECCV 2022

  8. arXiv:2110.10949  [pdf, other]

    cs.CV

    Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection

    Authors: Shraman Pramanick, Aniket Roy, Vishal M. Patel

    Abstract: Multimodal learning is an emerging yet challenging research area. In this paper, we deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs. Being a fleeting action, which is reflected across the modalities, sarcasm detection is challenging since large datasets are not available for this task in the literature. Therefore, we primarily focus on resource-cons…

    Submitted 21 October, 2021; originally announced October 2021.

    Comments: Accepted to WACV 2022

  9. arXiv:2110.00413  [pdf, other]

    cs.CL cs.LG cs.MM cs.SI

    Detecting Harmful Memes and Their Targets

    Authors: Shraman Pramanick, Dimitar Dimitrov, Rituparna Mukherjee, Shivam Sharma, Md. Shad Akhtar, Preslav Nakov, Tanmoy Chakraborty

    Abstract: Among the various modes of communication in social media, the use of Internet memes has emerged as a powerful means to convey political, psychological, and socio-cultural opinions. Although memes are typically humorous in nature, recent days have witnessed a proliferation of harmful memes targeted to abuse various social entities. As most harmful memes are highly satirical and abstruse without app…

    Submitted 24 September, 2021; originally announced October 2021.

    Comments: harmful memes, multimodality, social media

    MSC Class: 68T50 ACM Class: F.2.2; I.2.7

    Journal ref: ACL-2021 (Findings)

  10. arXiv:2109.05184  [pdf, other]

    cs.MM cs.CL

    MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets

    Authors: Shraman Pramanick, Shivam Sharma, Dimitar Dimitrov, Md Shad Akhtar, Preslav Nakov, Tanmoy Chakraborty

    Abstract: Internet memes have become powerful means to transmit political, psychological, and socio-cultural ideas. Although memes are typically humorous, recent days have witnessed an escalation of harmful memes used for trolling, cyberbullying, and abuse. Detecting such memes is challenging as they can be highly satirical and cryptic. Moreover, while previous work has focused on specific aspects of memes…

    Submitted 22 September, 2021; v1 submitted 11 September, 2021; originally announced September 2021.

    Comments: The paper has been accepted in the Findings of Empirical Methods in Natural Language Processing (EMNLP), 2021

  11. arXiv:2107.04885  [pdf, other]

    cs.DC

    Filling MIS Vertices by Myopic Luminous Robots

    Authors: Subhajit Pramanick, Sai Vamshi Samala, Debasish Pattanayak, Partha Sarathi Mandal

    Abstract: We present the problem of finding a maximal independent set (MIS) (named as \emph{MIS Filling problem}) of an arbitrary connected graph having $n$ vertices with luminous myopic mobile robots. The robots enter the graph one after another from a particular vertex called the \emph{Door} and disperse along the edges of the graph without collision to occupy vertices such that the set of vertices occupi…

    Submitted 15 October, 2022; v1 submitted 10 July, 2021; originally announced July 2021.

    Comments: A version of this paper appears in the Proceedings of ICDCIT'23

  12. arXiv:2105.09601  [pdf, other]

    cs.LG cs.CL

    See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization

    Authors: Yash Kumar Atri, Shraman Pramanick, Vikram Goyal, Tanmoy Chakraborty

    Abstract: In recent years, abstractive text summarization with multimodal inputs has started drawing attention due to its ability to accumulate information from different source modalities and generate a fluent textual summary. However, existing methods use short videos as the visual modality and short summary as the ground-truth, therefore, perform poorly on lengthy videos and long ground-truth summary. Ad…

    Submitted 15 September, 2021; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: Journal paper accepted in Knowledge Based Systems

  13. arXiv:2103.12377  [pdf, other]

    cs.CL

    Exercise? I thought you said 'Extra Fries': Leveraging Sentence Demarcations and Multi-hop Attention for Meme Affect Analysis

    Authors: Shraman Pramanick, Md Shad Akhtar, Tanmoy Chakraborty

    Abstract: Today's Internet is awash in memes as they are humorous, satirical, or ironic which make people laugh. According to a survey, 33% of social media users in age bracket [13-35] send memes every day, whereas more than 50% send every week. Some of these memes spread rapidly within a very short time-frame, and their virality depends on the novelty of their (textual and visual) content. A few of them co…

    Submitted 23 March, 2021; originally announced March 2021.

    Comments: Accepted for publication in ICWSM-2021