Showing 1–22 of 22 results for author: Patro, B N

  1. arXiv:2404.16112  [pdf, other]

    cs.LG cs.AI cs.CV cs.MM eess.IV

    Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges

    Authors: Badri Narayana Patro, Vijay Srinivas Agneeswaran

    Abstract: Sequence modeling is a crucial area across various domains, including Natural Language Processing (NLP), speech recognition, time series forecasting, music generation, and bioinformatics. Recurrent Neural Networks (RNNs) and Long Short Term Memory Networks (LSTMs) have historically dominated sequence modeling tasks like Machine Translation, Named Entity Recognition (NER), etc. However, the advance…

    Submitted 24 April, 2024; originally announced April 2024.

  2. arXiv:2403.18063  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis

    Authors: Badri N. Patro, Suhas Ranganath, Vinay P. Namboodiri, Vijay S. Agneeswaran

    Abstract: Transformers have revolutionized image modeling tasks with adaptations like DeIT, Swin, SVT, Biformer, STVit, and FDVIT. However, these models often face challenges with inductive bias and high quadratic complexity, making them less efficient for high-resolution images. State space models (SSMs) such as Mamba, V-Mamba, ViM, and SiMBA offer an alternative to handle high resolution images in compute…

    Submitted 3 June, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

  3. arXiv:2403.15360  [pdf, other]

    cs.CV cs.LG eess.IV eess.SY

    SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series

    Authors: Badri N. Patro, Vijay S. Agneeswaran

    Abstract: Transformers have widely adopted attention networks for sequence mixing and MLPs for channel mixing, playing a pivotal role in achieving breakthroughs across domains. However, recent literature highlights issues with attention networks, including low inductive bias and quadratic complexity concerning input sequence length. State Space Models (SSMs) like S4 and others (Hippo, Global Convolutions, l…

    Submitted 24 April, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

  4. arXiv:2311.01310  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV eess.SP

    Scattering Vision Transformer: Spectral Mixing Matters

    Authors: Badri N. Patro, Vijay Srinivas Agneeswaran

    Abstract: Vision transformers have gained significant attention and achieved state-of-the-art performance in various computer vision tasks, including image classification, instance segmentation, and object detection. However, challenges remain in addressing attention complexity and effectively capturing fine-grained information within images. Existing solutions often resort to down-sampling operations, such…

    Submitted 20 November, 2023; v1 submitted 2 November, 2023; originally announced November 2023.

    Comments: Accepted @NeurIPS 2023

  5. arXiv:2304.06446  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SpectFormer: Frequency and Attention is what you need in a Vision Transformer

    Authors: Badri N. Patro, Vinay P. Namboodiri, Vijay Srinivas Agneeswaran

    Abstract: Vision transformers have been applied successfully for image recognition tasks. There have been either multi-headed self-attention based (ViT, DeIT) similar to the original work in textual models or more recently based on spectral layers (Fnet, GFNet, AFNO). We hypothesize that b…

    Submitted 14 April, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: The project page is available at https://badripatro.github.io/SpectFormers/

  6. arXiv:2302.08374  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Efficiency 360: Efficient Vision Transformers

    Authors: Badri N. Patro, Vijay Srinivas Agneeswaran

    Abstract: Transformers are widely used for solving tasks in natural language processing, computer vision, speech, and music domains. In this paper, we talk about the efficiency of transformers in terms of memory (the number of parameters), computation cost (number of floating-point operations), and performance of models, including accuracy, the robustness of the model, and fair & bias-free features. We ma…

    Submitted 23 February, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

  7. arXiv:2203.03727  [pdf, other]

    cs.CV

    Barlow constrained optimization for Visual Question Answering

    Authors: Abhishek Jha, Badri N. Patro, Luc Van Gool, Tinne Tuytelaars

    Abstract: Visual question answering is a vision-and-language multimodal task that aims at predicting answers given samples from the question and image modalities. Most recent methods focus on learning a good joint embedding space of images and questions, either by improving the interaction between these two modalities, or by making it a more discriminant space. However, how informative this joint space is,…

    Submitted 7 March, 2022; originally announced March 2022.

  8. arXiv:2109.12135  [pdf, other]

    cs.CV

    Attentive Contractive Flow with Lipschitz-constrained Self-Attention

    Authors: Avideep Mukherjee, Badri Narayan Patro, Vinay P. Namboodiri

    Abstract: Normalizing flows provide an elegant method for obtaining tractable density estimates from distributions by using invertible transformations. The main challenge is to improve the expressivity of the models while keeping the invertibility constraints intact. We propose to do so via the incorporation of localized self-attention. However, conventional self-attention mechanisms don't satisfy the requi…

    Submitted 6 September, 2023; v1 submitted 24 September, 2021; originally announced September 2021.

    Comments: 10 pages, to be published at BMVC 2023

  9. arXiv:2104.02656  [pdf, other]

    cs.CV cs.AI cs.GR cs.MM cs.SD eess.AS eess.IV

    Collaborative Learning to Generate Audio-Video Jointly

    Authors: Vinod K Kurmi, Vipul Bajaj, Badri N Patro, K S Venkatesh, Vinay P Namboodiri, Preethi Jyothi

    Abstract: There have been a number of techniques that have demonstrated the generation of multimedia data for one modality at a time using GANs, such as the ability to generate images, videos, and audio. However, so far, the task of multi-modal generation of data, specifically for audio and videos both, has not been sufficiently well-explored. Towards this, we propose a method that demonstrates that we are…

    Submitted 31 March, 2021; originally announced April 2021.

    Comments: ICASSP 2021 (Accepted)

  10. arXiv:2102.01906  [pdf, other]

    cs.LG cs.CV

    Do Not Forget to Attend to Uncertainty while Mitigating Catastrophic Forgetting

    Authors: Vinod K Kurmi, Badri N. Patro, Venkatesh K. Subramanian, Vinay P. Namboodiri

    Abstract: One of the major limitations of deep learning models is that they face catastrophic forgetting in an incremental learning scenario. There have been several approaches proposed to tackle the problem of incremental learning. Most of these methods are based on knowledge distillation and do not adequately utilize the information provided by older task models, such as uncertainty estimation in predicti…

    Submitted 3 February, 2021; originally announced February 2021.

    Comments: Accepted WACV 2021

    Journal ref: WACV 2021

  11. arXiv:2002.10309  [pdf, other]

    cs.CV cs.CL cs.LG

    Uncertainty based Class Activation Maps for Visual Question Answering

    Authors: Badri N. Patro, Mayank Lunayach, Vinay P. Namboodiri

    Abstract: Understanding and explaining deep learning models is an imperative task. Towards this, we propose a method that obtains gradient-based certainty estimates that also provide visual attention maps. In particular, we solve the visual question answering task. We incorporate modern probabilistic deep learning methods that we further improve by using the gradients for these estimates. These have two-fold…

    Submitted 23 January, 2020; originally announced February 2020.

    Comments: This work is an extension of our ICCV-2019 work. arXiv admin note: text overlap with arXiv:1908.06306

  12. arXiv:2001.08779  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Deep Bayesian Network for Visual Question Generation

    Authors: Badri N. Patro, Vinod K. Kurmi, Sandeep Kumar, Vinay P. Namboodiri

    Abstract: Generating natural questions from an image is a semantic task that requires using vision and language modalities to learn multimodal representations. Images can have multiple visual and language cues such as places, captions, and tags. In this paper, we propose a principled deep Bayesian learning framework that combines these cues to produce natural questions. We observe that with the addition of…

    Submitted 23 January, 2020; originally announced January 2020.

    Comments: WACV-2020 (Accepted)

  13. arXiv:2001.08730  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Robust Explanations for Visual Question Answering

    Authors: Badri N. Patro, Shivansh Patel, Vinay P. Namboodiri

    Abstract: In this paper, we propose a method to obtain robust explanations for visual question answering (VQA) that correlate well with the answers. Our model explains the answers obtained through a VQA model by providing visual and textual explanations. The main challenges that we address are i) Answers and textual explanations obtained by current methods are not well correlated and ii) Current methods for…

    Submitted 23 January, 2020; originally announced January 2020.

    Comments: WACV-2020 (Accepted)

  14. arXiv:1912.13149  [pdf, other]

    cs.CL cs.LG

    Revisiting Paraphrase Question Generator using Pairwise Discriminator

    Authors: Badri N. Patro, Dev Chauhan, Vinod K. Kurmi, Vinay P. Namboodiri

    Abstract: In this paper, we propose a method for obtaining sentence-level embeddings. While the problem of securing word-level embeddings is very well studied, we propose a novel method for obtaining sentence-level embeddings. This is obtained by a simple method in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrases, we would like the…

    Submitted 4 January, 2020; v1 submitted 30 December, 2019; originally announced December 2019.

    Comments: This work is an extension of our COLING-2018 paper arXiv:1806.00807

  15. arXiv:1912.09551  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Deep Exemplar Networks for VQA and VQG

    Authors: Badri N. Patro, Vinay P. Namboodiri

    Abstract: In this paper, we consider the problem of solving semantic tasks such as 'Visual Question Answering' (VQA), where one aims to answer questions related to an image, and 'Visual Question Generation' (VQG), where one aims to generate a natural question pertaining to an image. Solutions for VQA and VQG tasks have been proposed using variants of encoder-decoder deep learning based frameworks that have shown impr…

    Submitted 19 December, 2019; originally announced December 2019.

    Comments: This work is an extension of CVPR-2018 accepted paper arXiv:1804.00298 and EMNLP-2018 accepted paper arXiv:1808.03986

  16. arXiv:1911.08618  [pdf, other]

    cs.CV cs.LG cs.MM eess.IV

    Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA

    Authors: Badri N. Patro, Anupriy, Vinay P. Namboodiri

    Abstract: In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations as obtained through class activation mappings (specifically Grad-CAM) that are meant to explain the performance of various networks could form a means of supervision. However, as the distributions o…

    Submitted 19 November, 2019; originally announced November 2019.

    Comments: AAAI-2020 (Accepted)

  17. arXiv:1910.06315  [pdf, other]

    cs.CV cs.CL cs.LG

    Dynamic Attention Networks for Task Oriented Grounding

    Authors: Soumik Dasgupta, Badri N. Patro, Vinay P. Namboodiri

    Abstract: In order to successfully perform tasks specified by natural language instructions, an artificial agent operating in a visual world needs to map words, concepts, and actions from the instruction to visual elements in its environment. This association is termed as Task-Oriented Grounding. In this work, we propose a novel Dynamic Attention Network architecture for the efficient multi-modal fusion of…

    Submitted 14 October, 2019; originally announced October 2019.

    Comments: Accepted ICCV 2019 Workshop

  18. arXiv:1910.05728  [pdf, other]

    cs.CV cs.CL cs.LG

    Granular Multimodal Attention Networks for Visual Dialog

    Authors: Badri N. Patro, Shivansh Patel, Vinay P. Namboodiri

    Abstract: Vision and language tasks have benefited from attention. There have been a number of different attention models proposed. However, the scale at which attention needs to be applied has not been well examined. Particularly, in this work, we propose a new method Granular Multi-modal Attention, where we aim to particularly address the question of the right granularity at which one needs to attend whil…

    Submitted 13 October, 2019; originally announced October 2019.

    Comments: ICCV Workshop

  19. arXiv:1909.04800  [pdf, other]

    cs.CV cs.CL cs.LG cs.MM

    Probabilistic framework for solving Visual Dialog

    Authors: Badri N. Patro, Anupriy, Vinay P. Namboodiri

    Abstract: In this paper, we propose a probabilistic framework for solving the task of 'Visual Dialog'. Solving this task requires reasoning and understanding of visual modality, language modality, and common sense knowledge to answer. Various architectures have been proposed to solve this task by variants of multi-modal deep learning techniques that combine visual and language representations. However, we b…

    Submitted 17 October, 2019; v1 submitted 10 September, 2019; originally announced September 2019.

  20. arXiv:1908.06306  [pdf, other]

    cs.CV cs.CL cs.LG eess.IV

    U-CAM: Visual Explanation using Uncertainty based Class Activation Maps

    Authors: Badri N. Patro, Mayank Lunayach, Shivansh Patel, Vinay P. Namboodiri

    Abstract: Understanding and explaining deep learning models is an imperative task. Towards this, we propose a method that obtains gradient-based certainty estimates that also provide visual attention maps. In particular, we solve the visual question answering task. We incorporate modern probabilistic deep learning methods that we further improve by using the gradients for these estimates. These have two-fold…

    Submitted 17 October, 2019; v1 submitted 17 August, 2019; originally announced August 2019.

    Comments: ICCV 2019 (accepted)

  21. arXiv:1808.03986  [pdf, other]

    cs.CL cs.AI cs.CV

    Multimodal Differential Network for Visual Question Generation

    Authors: Badri N. Patro, Sandeep Kumar, Vinod K. Kurmi, Vinay P. Namboodiri

    Abstract: Generating natural questions from an image is a semantic task that requires using visual and language modalities to learn multimodal representations. Images can have multiple visual and language contexts that are relevant for generating questions, namely places, captions, and tags. In this paper, we propose the use of exemplars for obtaining the relevant context. We obtain this by using a Multimodal…

    Submitted 17 October, 2019; v1 submitted 12 August, 2018; originally announced August 2018.

    Comments: EMNLP 2018 (accepted)

  22. arXiv:1806.00807  [pdf, other]

    cs.CL cs.AI

    Learning Semantic Sentence Embeddings using Sequential Pair-wise Discriminator

    Authors: Badri N. Patro, Vinod K. Kurmi, Sandeep Kumar, Vinay P. Namboodiri

    Abstract: In this paper, we propose a method for obtaining sentence-level embeddings. While the problem of securing word-level embeddings is very well studied, we propose a novel method for obtaining sentence-level embeddings. This is obtained by a simple method in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrases, we would like the…

    Submitted 14 March, 2019; v1 submitted 3 June, 2018; originally announced June 2018.

    Comments: COLING 2018 (accepted)