Showing 1–19 of 19 results for author: Ranasinghe, K

  1. arXiv:2407.01596  [pdf, other]

    cs.LG cs.AI cs.RO eess.IV

    Maze Discovery using Multiple Robots via Federated Learning

    Authors: Kalpana Ranasinghe, H. P. Madushanka, Rafaela Scaciota, Sumudu Samarakoon, Mehdi Bennis

    Abstract: This work presents a use case of federated learning (FL) applied to discovering a maze with robots equipped with LiDAR sensors. The goal here is to train classification models to accurately identify the shapes of grid areas within two different square mazes made up of irregularly shaped walls. Due to the use of different shapes for the walls, a classification model trained in one maze that captures its str…

    Submitted 25 June, 2024; originally announced July 2024.

    Comments: Accepted in ISCC 2024 conference

  2. arXiv:2406.20095  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

    Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo

    Abstract: Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with au…

    Submitted 28 June, 2024; originally announced June 2024.

  3. arXiv:2406.09396  [pdf, other]

    cs.CV

    Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA

    Authors: Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryoo, Donghyun Kim, Michael S. Ryoo

    Abstract: Long-form videos that span across wide temporal intervals are highly information redundant and contain multiple distinct events or entities that are often loosely-related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explores the use of large lan…

    Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  4. arXiv:2404.07449  [pdf, other]

    cs.CV

    Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

    Authors: Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

    Abstract: Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these…

    Submitted 10 April, 2024; originally announced April 2024.

  5. arXiv:2403.16998  [pdf, other]

    cs.CV

    Understanding Long Videos in One Multimodal Language Model Pass

    Authors: Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo

    Abstract: Large Language Models (LLMs), known to contain a strong awareness of world knowledge, have allowed recent approaches to achieve excellent performance on Long-Video Understanding benchmarks, but at high inference costs. In this work, we first propose Likelihood Selection, a simple technique that unlocks faster inference in autoregressive LLMs for multiple-choice tasks common in long-video benchmark…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: 24 pages

  6. arXiv:2403.14622  [pdf, other]

    cs.CV

    Language Repository for Long Video Understanding

    Authors: Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S. Ryoo

    Abstract: Language has become a prominent modality in computer vision with the rise of multi-modal LLMs. Despite supporting long context-lengths, their effectiveness in handling long-term information gradually declines with input length. This becomes critical, especially in applications such as long-form video understanding. In this paper, we introduce a Language Repository (LangRepo) for LLMs, that maintai…

    Submitted 21 March, 2024; originally announced March 2024.

  7. arXiv:2403.14616  [pdf, other]

    cs.CV

    Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning

    Authors: Hasindri Watawana, Kanchana Ranasinghe, Tariq Mahmood, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Self-supervised representation learning has been highly promising for histopathology image analysis with numerous approaches leveraging their patient-slide-patch hierarchy to learn better representations. In this paper, we explore how the combination of domain specific natural language information with such hierarchical visual representations can benefit rich representation learning for medical im…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: 13 pages and 5 figures

  8. arXiv:2312.03817  [pdf, other]

    cs.CV

    Diffusion Illusions: Hiding Images in Plain Sight

    Authors: Ryan Burgert, Xiang Li, Abe Leite, Kanchana Ranasinghe, Michael S. Ryoo

    Abstract: We explore the problem of computationally generating special `prime' images that produce optical illusions when physically arranged and viewed in a certain way. First, we propose a formal definition for this problem. Next, we introduce Diffusion Illusions, the first comprehensive pipeline designed to automatically generate a wide range of these illusions. Specifically, we both adapt the existing `…

    Submitted 6 December, 2023; originally announced December 2023.

  9. arXiv:2307.10922  [pdf, other]

    cs.CV cs.LG

    Language-based Action Concept Spaces Improve Video Self-Supervised Learning

    Authors: Kanchana Ranasinghe, Michael Ryoo

    Abstract: Recent contrastive language image pre-training has led to learning highly transferable and robust image representations. However, adapting these models to video domains with minimal supervision remains an open problem. We explore a simple step in that direction, using language tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling…

    Submitted 26 October, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: Presented at NeurIPS 2023

  10. arXiv:2211.13224  [pdf, other]

    cs.CV cs.CL cs.LG

    Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

    Authors: Ryan Burgert, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo

    Abstract: Recently, text-to-image diffusion models have shown remarkable capabilities in creating realistic images from natural language prompts. However, few works have explored using these models for semantic localization or grounding. In this work, we explore how an off-the-shelf text-to-image diffusion model, trained without exposure to localization information, can ground various semantic phrases witho…

    Submitted 21 June, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: 19 pages; contains appendix

  11. arXiv:2210.09996  [pdf, other]

    cs.CV cs.LG

    Perceptual Grouping in Contrastive Vision-Language Models

    Authors: Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, Jonathon Shlens

    Abstract: Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how…

    Submitted 21 August, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted and presented at ICCV 2023

  12. arXiv:2112.01514  [pdf, other]

    cs.CV

    Self-supervised Video Transformer

    Authors: Kanchana Ranasinghe, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Michael Ryoo

    Abstract: In this paper, we propose self-supervised training for video transformers using unlabeled video data. From a given video, we create local and global spatiotemporal views with varying spatial sizes and frame rates. Our self-supervised objective seeks to match the features of these different views representing the same video, to be invariant to spatiotemporal variations in actions. To the best of ou…

    Submitted 19 March, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: Accepted to CVPR '22

  13. arXiv:2106.04169  [pdf, other]

    cs.CV cs.AI cs.LG

    On Improving Adversarial Transferability of Vision Transformers

    Authors: Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Fahad Shahbaz Khan, Fatih Porikli

    Abstract: Vision transformers (ViTs) process input images as sequences of patches via self-attention; a radically different architecture than convolutional neural networks (CNNs). This makes it interesting to study the adversarial feature space of ViT models and their transferability. In particular, we observe that adversarial patterns found via conventional adversarial attacks show very low black-bo…

    Submitted 3 March, 2022; v1 submitted 8 June, 2021; originally announced June 2021.

    Comments: ICLR'22 (Spotlight), the first two authors contributed equally. Code: https://t.ly/hBbW

  14. arXiv:2105.10497  [pdf, other]

    cs.CV cs.AI cs.LG

    Intriguing Properties of Vision Transformers

    Authors: Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang

    Abstract: Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility in attending image-wide context conditioned on a given patch can facilitate handling nuisances in nat…

    Submitted 25 November, 2021; v1 submitted 21 May, 2021; originally announced May 2021.

    Comments: NeurIPS'21 (Spotlight), Code: https://git.io/Js15X

  15. arXiv:2103.14021  [pdf, other]

    cs.CV

    Orthogonal Projection Loss

    Authors: Kanchana Ranasinghe, Muzammal Naseer, Munawar Hayat, Salman Khan, Fahad Shahbaz Khan

    Abstract: Deep neural networks have achieved remarkable performance on a range of classification tasks, with softmax cross-entropy (CE) loss emerging as the de-facto objective function. The CE loss encourages features of a class to have a higher projection score on the true class-vector compared to the negative classes. However, this is a relative constraint and does not explicitly force different class fea…

    Submitted 25 March, 2021; originally announced March 2021.

  16. arXiv:2010.03132  [pdf, other]

    cs.LG cs.CV

    Conditional Generative Modeling via Learning the Latent Space

    Authors: Sameera Ramasinghe, Kanchana Ranasinghe, Salman Khan, Nick Barnes, Stephen Gould

    Abstract: Although deep learning has achieved appealing results on several machine learning tasks, most of the models are deterministic at inference, limiting their application to single-modal settings. We propose a novel general-purpose framework for conditional generation in multimodal spaces, that uses latent variables to model generalizable learning patterns while minimizing a family of regression cost…

    Submitted 8 October, 2020; v1 submitted 6 October, 2020; originally announced October 2020.

  17. arXiv:1912.11651  [pdf, other]

    cs.CV cs.RO

    Extending Multi-Object Tracking systems to better exploit appearance and 3D information

    Authors: Kanchana Ranasinghe, Sahan Liyanaarachchi, Harsha Ranasinghe, Mayuka Jayawardhana

    Abstract: Tracking multiple objects in real time is essential for a variety of real-world applications, with the self-driving industry at the forefront. This work involves exploiting temporally varying appearance and motion information for tracking. Siamese networks have recently become highly successful at appearance based single object tracking and Recurrent Neural Networks have started dominating both m…

    Submitted 25 December, 2019; originally announced December 2019.

    Comments: 7 pages

  18. arXiv:1912.05307  [pdf, other]

    cs.CV eess.IV

    Bipartite Conditional Random Fields for Panoptic Segmentation

    Authors: Sadeep Jayasumana, Kanchana Ranasinghe, Mayuka Jayawardhana, Sahan Liyanaarachchi, Harsha Ranasinghe

    Abstract: We tackle the panoptic segmentation problem with a conditional random field (CRF) model. Panoptic segmentation involves assigning a semantic label and an instance label to each pixel of a given image. At each pixel, the semantic label and the instance label should be compatible. Furthermore, a good panoptic segmentation should have a number of other desirable properties such as the spatial and col…

    Submitted 21 August, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

  19. Combined Static and Motion Features for Deep-Networks Based Activity Recognition in Videos

    Authors: Sameera Ramasinghe, Jathushan Rajasegaran, Vinoj Jayasundara, Kanchana Ranasinghe, Ranga Rodrigo, Ajith A. Pasqual

    Abstract: Activity recognition in videos in a deep-learning setting, or otherwise, uses both static and pre-computed motion components. The method of combining the two components, whilst keeping the burden on the deep network low, still remains uninvestigated. Moreover, it is not clear what the level of contribution of individual components is, and how to control the contribution. In this work, we use a…

    Submitted 16 October, 2018; originally announced October 2018.

    Journal ref: IEEE Transactions on Circuits and Systems for Video Technology (2017)