Skip to main content

Showing 1–14 of 14 results for author: Pătrăucean, V

Searching in archive cs. Search in all archives.
.
  1. arXiv:2403.05530  [pdf, other

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love , et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February… ▽ More

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  2. arXiv:2312.13090  [pdf, other

    cs.CV

    Perception Test 2023: A Summary of the First Challenge And Outcome

    Authors: Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean

    Abstract: The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark. The challenge had six tracks covering low-level and high-level tasks, with both a language and non-language interface, across video, audio,… ▽ More

    Submitted 20 December, 2023; originally announced December 2023.

  3. arXiv:2312.07395  [pdf, other

    cs.CV cs.CL

    A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

    Authors: Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe Heyward, Viorica Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman, Aida Nematzdeh

    Abstract: Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion. However, we expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in stan… ▽ More

    Submitted 12 December, 2023; originally announced December 2023.

  4. arXiv:2312.00598  [pdf, other

    cs.CV cs.AI

    Learning from One Continuous Video Stream

    Authors: João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen, Andrew Zisserman

    Abstract: We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of str… ▽ More

    Submitted 28 March, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: CVPR camera ready version

  5. arXiv:2305.13786  [pdf, other

    cs.CV cs.AI cs.LG

    Perception Test: A Diagnostic Benchmark for Multimodal Video Models

    Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira

    Abstract: We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning… ▽ More

    Submitted 30 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

  6. arXiv:2211.05039  [pdf, other

    cs.LG cs.AI cs.CV stat.ML

    Active Acquisition for Multimodal Temporal Data: A Challenging Decision-Making Task

    Authors: Jannik Kossen, Cătălina Cangea, Eszter Vértes, Andrew Jaegle, Viorica Patraucean, Ira Ktena, Nenad Tomasev, Danielle Belgrave

    Abstract: We introduce a challenging decision-making task that we call active acquisition for multimodal temporal data (A2MT). In many real-world scenarios, input features are not readily available at test time and must instead be acquired at significant cost. With A2MT, we aim to learn agents that actively select which modalities of an input to acquire, trading off acquisition cost and predictive performan… ▽ More

    Submitted 3 July, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

    Comments: Published in Transactions on Machine Learning Research. Previous version accepted to Foundation Models for Decision Making Workshop at NeurIPS 2022

  7. arXiv:2106.08318  [pdf, other

    cs.CV cs.DC cs.LG eess.IV

    Gradient Forward-Propagation for Large-Scale Temporal Video Modelling

    Authors: Mateusz Malinowski, Dimitrios Vytiniotis, Grzegorz Swirszcz, Viorica Patraucean, Joao Carreira

    Abstract: How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and incr… ▽ More

    Submitted 12 July, 2021; v1 submitted 15 June, 2021; originally announced June 2021.

    Comments: Accepted to CVPR 2021. arXiv admin note: text overlap with arXiv:2001.06232

  8. arXiv:2103.16559  [pdf, other

    cs.CV

    Broaden Your Views for Self-Supervised Video Learning

    Authors: Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Ross Hemsley, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman

    Abstract: Most successful self-supervised learning methods are trained to align the representations of two independent views from the data. State-of-the-art methods in video are inspired by image techniques, where these two views are similarly extracted by crop** and augmenting the resulting crop. However, these methods miss a crucial element in the video domain: time. We introduce BraVe, a self-supervise… ▽ More

    Submitted 19 October, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: This paper is an extended version of our ICCV-21 paper. It includes more results as well as a minor architectural variation which improves results

  9. arXiv:2001.06232  [pdf, other

    cs.LG cs.CV stat.ML

    Sideways: Depth-Parallel Training of Video Models

    Authors: Mateusz Malinowski, Grzegorz Swirszcz, Joao Carreira, Viorica Patraucean

    Abstract: We propose Sideways, an approximate backpropagation scheme for training video models. In standard backpropagation, the gradients and activations at every computation step through the model are temporally synchronized. The forward activations need to be stored until the backward pass is executed, preventing inter-layer (depth) parallelization. However, can we leverage smooth, redundant input stream… ▽ More

    Submitted 30 March, 2020; v1 submitted 17 January, 2020; originally announced January 2020.

    Comments: Accepted at CVPR'20

  10. arXiv:1806.03863  [pdf, other

    cs.CV

    Massively Parallel Video Networks

    Authors: Joao Carreira, Viorica Patraucean, Laurent Mazare, Andrew Zisserman, Simon Osindero

    Abstract: We introduce a class of causal video understanding models that aims to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles. Leveraging operation pipelining and multi-rate clocks, these models perform a minimal amount of computation (e.g. as few as four convolutional layers) for each frame per timestep to produce an output. Th… ▽ More

    Submitted 5 September, 2018; v1 submitted 11 June, 2018; originally announced June 2018.

    Comments: Fixed typos in densenet model definition in appendix

  11. arXiv:1607.07405  [pdf, other

    cs.CV cs.LG

    gvnn: Neural Network Library for Geometric Computer Vision

    Authors: Ankur Handa, Michael Bloesch, Viorica Patraucean, Simon Stent, John McCormac, Andrew Davison

    Abstract: We introduce gvnn, a neural network library in Torch aimed towards bridging the gap between classic geometric computer vision and deep learning. Inspired by the recent success of Spatial Transformer Networks, we propose several new layers which are often used as parametric transformations on the data in geometric computer vision. These layers can be inserted within a neural network much in the spi… ▽ More

    Submitted 12 August, 2016; v1 submitted 25 July, 2016; originally announced July 2016.

    Comments: Submitted to ECCV Workshop on Deep Geometry

  12. arXiv:1511.07041  [pdf, other

    cs.CV

    SceneNet: Understanding Real World Indoor Scenes With Synthetic Data

    Authors: Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent, Roberto Cipolla

    Abstract: Scene understanding is a prerequisite to many high level tasks for any automated intelligent machine operating in real world environments. Recent attempts with supervised learning have shown promise in this direction but also highlighted the need for enormous quantity of supervised data --- performance increases in proportion to the amount of data used. However, this quickly becomes prohibitive wh… ▽ More

    Submitted 26 November, 2015; v1 submitted 22 November, 2015; originally announced November 2015.

  13. arXiv:1511.06309  [pdf, other

    cs.LG cs.CV

    Spatio-temporal video autoencoder with differentiable memory

    Authors: Viorica Patraucean, Ankur Handa, Roberto Cipolla

    Abstract: We describe a new spatio-temporal video autoencoder, based on a classic spatial image autoencoder and a novel nested temporal autoencoder. The temporal encoder is represented by a differentiable visual memory composed of convolutional long short-term memory (LSTM) cells that integrate changes over time. Here we target motion changes and use as temporal decoder a robust optical flow prediction modu… ▽ More

    Submitted 1 September, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

    Comments: The experiments section has been extended and a direct application to weakly-supervised video segmentation through label propagation has been included

  14. arXiv:1505.00171  [pdf, other

    cs.CV

    SynthCam3D: Semantic Understanding With Synthetic Indoor Scenes

    Authors: Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent, Roberto Cipolla

    Abstract: We are interested in automatic scene understanding from geometric cues. To this end, we aim to bring semantic segmentation in the loop of real-time reconstruction. Our semantic segmentation is built on a deep autoencoder stack trained exclusively on synthetic depth data generated from our novel 3D scene library, SynthCam3D. Importantly, our network is able to segment real world scenes without any… ▽ More

    Submitted 1 May, 2015; originally announced May 2015.