Showing 1–18 of 18 results for author: Kahatapitiya, K

  1. arXiv:2406.20095  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

    Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo

    Abstract: Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with au…

    Submitted 28 June, 2024; originally announced June 2024.

  2. arXiv:2406.09396  [pdf, other]

    cs.CV

    Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA

    Authors: Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryoo, Donghyun Kim, Michael S. Ryoo

    Abstract: Long-form videos that span across wide temporal intervals are highly information redundant and contain multiple distinct events or entities that are often loosely-related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explore the use of large lan…

    Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  3. arXiv:2403.16998  [pdf, other]

    cs.CV

    Understanding Long Videos in One Multimodal Language Model Pass

    Authors: Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo

    Abstract: Large Language Models (LLMs), known to contain a strong awareness of world knowledge, have allowed recent approaches to achieve excellent performance on Long-Video Understanding benchmarks, but at high inference costs. In this work, we first propose Likelihood Selection, a simple technique that unlocks faster inference in autoregressive LLMs for multiple-choice tasks common in long-video benchmark…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: 24 pages

  4. arXiv:2403.14622  [pdf, other]

    cs.CV

    Language Repository for Long Video Understanding

    Authors: Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S. Ryoo

    Abstract: Language has become a prominent modality in computer vision with the rise of multi-modal LLMs. Despite supporting long context-lengths, their effectiveness in handling long-term information gradually declines with input length. This becomes critical, especially in applications such as long-form video understanding. In this paper, we introduce a Language Repository (LangRepo) for LLMs, that maintai…

    Submitted 21 March, 2024; originally announced March 2024.

  5. arXiv:2401.05735  [pdf, other]

    cs.CV cs.LG

    Object-Centric Diffusion for Efficient Video Editing

    Authors: Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian

    Abstract: Diffusion-based video editing have reached impressive quality and can transform either the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we c…

    Submitted 11 January, 2024; originally announced January 2024.

  6. arXiv:2304.02560  [pdf, other]

    cs.CV

    VicTR: Video-conditioned Text Representations for Activity Recognition

    Authors: Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

    Abstract: Vision-Language models (VLMs) have excelled in the image-domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video-domain, instead of training from scratch. All such recipes rely…

    Submitted 29 March, 2024; v1 submitted 5 April, 2023; originally announced April 2023.

    Comments: To appear at CVPR 2024

  7. arXiv:2211.09119  [pdf, other]

    cs.LG cs.CV cs.RO

    Token Turing Machines

    Authors: Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab

    Abstract: We propose Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory for real-world sequential visual understanding. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history (i.e., frames). This memory is efficiently addressed, read and written using a Transformer as the p…

    Submitted 13 April, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

    Comments: CVPR 2023 camera-ready copy

    Journal ref: CVPR 2023

  8. arXiv:2210.15943  [pdf, other]

    cs.CV

    Grafting Vision Transformers

    Authors: Jongwoo Park, Kumara Kahatapitiya, Donghyun Kim, Shivchander Sudalairaj, Quanfu Fan, Michael S. Ryoo

    Abstract: Vision Transformers (ViTs) have recently become the state-of-the-art across many computer vision tasks. In contrast to convolutional networks (CNNs), ViTs enable global information sharing even within shallow layers of a network, i.e., among high-resolution features. However, this perk was later overlooked with the success of pyramid architectures such as Swin Transformer, which show better perfor…

    Submitted 3 April, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

  9. arXiv:2112.03902  [pdf, other]

    cs.CV

    MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

    Authors: Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, Francois Bremond

    Abstract: Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos. The temporal relation is complex in those datasets, including challenges like composite action, and co-occurring action. For detecting actions in those complex videos, efficiently capturing both short-term and long-term temporal information in the video is critical. To this end, we…

    Submitted 29 March, 2022; v1 submitted 7 December, 2021; originally announced December 2021.

    Comments: Accepted in CVPR 2022

  10. arXiv:2111.13677  [pdf, other]

    cs.CV

    SWAT: Spatial Structure Within and Among Tokens

    Authors: Kumara Kahatapitiya, Michael S. Ryoo

    Abstract: Modeling visual data as tokens (i.e., image patches) using attention mechanisms, feed-forward networks or convolutions has been highly effective in recent years. Such methods usually have a common pipeline: a tokenization method, followed by a set of layers/blocks for information mixing, both within and among tokens. When image patches are converted into tokens, they are often flattened, discardin…

    Submitted 20 November, 2023; v1 submitted 26 November, 2021; originally announced November 2021.

    Comments: Accepted to be published at IJCAI23

  11. arXiv:2111.13675  [pdf, other]

    cs.CV

    Weakly-guided Self-supervised Pretraining for Temporal Activity Detection

    Authors: Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Michael S. Ryoo, Gang Hua

    Abstract: Temporal Activity Detection aims to predict activity classes per frame, in contrast to video-level predictions in Activity Classification (i.e., Activity Recognition). Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited. Thus, commonly, previous work on temporal activity detection resorts to fine-tuning a classification model pretrained o…

    Submitted 4 February, 2023; v1 submitted 26 November, 2021; originally announced November 2021.

    Comments: Published as a conference paper at AAAI 2023

  12. StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning

    Authors: Jinghuan Shang, Kumara Kahatapitiya, Xiang Li, Michael S. Ryoo

    Abstract: Reinforcement Learning (RL) can be considered as a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of next actions. In this work, we propose State-Action-Reward Transformer (StARformer) for visual RL, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like…

    Submitted 3 January, 2023; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: Accepted to ECCV 2022. Our code is available at https://github.com/elicassion/StARformer

  13. arXiv:2103.01302  [pdf, other]

    cs.CV

    Coarse-Fine Networks for Temporal Activity Detection in Videos

    Authors: Kumara Kahatapitiya, Michael S. Ryoo

    Abstract: In this paper, we introduce Coarse-Fine Networks, a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion. Traditional Video models process inputs at one (or few) fixed temporal resolution without any dynamic frame selection. However, we argue that, processing multiple temporal resolutions of the input a…

    Submitted 1 April, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

    Comments: To appear at CVPR 2021

  14. arXiv:2006.13904  [pdf, other]

    cs.CV cs.LG eess.IV

    Feature-Dependent Cross-Connections in Multi-Path Neural Networks

    Authors: Dumindu Tissera, Kasun Vithanage, Rukshan Wijesinghe, Kumara Kahatapitiya, Subha Fernando, Ranga Rodrigo

    Abstract: Learning a particular task from a dataset, samples in which originate from diverse contexts, is challenging, and usually addressed by deepening or widening standard neural networks. As opposed to conventional network widening, multi-path architectures restrict the quadratic increment of complexity to a linear scale. However, existing multi-column/path networks or model ensembling methods do not co…

    Submitted 1 January, 2021; v1 submitted 24 June, 2020; originally announced June 2020.

    Comments: International Conference on Pattern Recognition (ICPR) 2020

  15. arXiv:1907.11519  [pdf, other]

    cs.CV cs.LG

    Context-Aware Multipath Networks

    Authors: Dumindu Tissera, Kumara Kahatapitiya, Rukshan Wijesinghe, Subha Fernando, Ranga Rodrigo

    Abstract: Making a single network effectively address diverse contexts---learning the variations within a dataset or multiple datasets---is an intriguing step towards achieving generalized intelligence. Existing approaches of deepening, widening, and assembling networks are not cost effective in general. In view of this, networks which can allocate resources according to the context of the input and regulat…

    Submitted 26 July, 2019; originally announced July 2019.

  16. arXiv:1907.11432  [pdf, other]

    cs.CV

    Exploiting the Redundancy in Convolutional Filters for Parameter Reduction

    Authors: Kumara Kahatapitiya, Ranga Rodrigo

    Abstract: Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in many computer vision tasks over the years. However, this comes at the cost of heavy computation and memory intensive network designs, suggesting potential improvements in efficiency. Convolutional layers of CNNs partly account for such an inefficiency, as they are known to learn redundant features. In this work, we…

    Submitted 10 August, 2020; v1 submitted 26 July, 2019; originally announced July 2019.

    Comments: Accepted to be published in Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV) 2021

  17. arXiv:1905.02710  [pdf, other]

    cs.CV

    Context-Aware Automatic Occlusion Removal

    Authors: Kumara Kahatapitiya, Dumindu Tissera, Ranga Rodrigo

    Abstract: Occlusion removal is an interesting application of image enhancement, for which, existing work suggests manually-annotated or domain-specific occlusion removal. No work tries to address automatic occlusion detection and removal as a context-aware generic problem. In this paper, we present a novel methodology to identify objects that do not relate to the image context as occlusions and remove them,…

    Submitted 7 May, 2019; originally announced May 2019.

    Comments: Accepted to be published in Proceedings of IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, September 2019

  18. arXiv:1811.06712  [pdf, ps, other]

    eess.SP cs.IT

    Outage Analysis of $2\times 2$ MIMO-MRC in Correlated Rician Fading

    Authors: Prathapasinghe Dharmawansa, Kumara Kahatapitiya, Saman Atapattu, Chintha Tellambura

    Abstract: This paper addresses one of the classical problems in random matrix theory -- finding the distribution of the maximum eigenvalue of the correlated Wishart unitary ensemble. In particular, we derive a new exact expression for the cumulative distribution function (c.d.f.) of the maximum eigenvalue of a $2\times 2$ correlated non-central Wishart matrix with rank-$1$ mean. By using this new result, we…

    Submitted 16 November, 2018; originally announced November 2018.

    Comments: 7 pages

    MSC Class: 62H10; 15B52;