Showing 1–14 of 14 results for author: Caelles, S

Searching in archive cs.
  1. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  2. arXiv:2401.01974  [pdf, other]

    cs.CV cs.AI cs.LG

    Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

    Authors: Aleksandar Stanić, Sergi Caelles, Michael Tschannen

    Abstract: Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the t…

    Submitted 14 May, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

  3. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  4. arXiv:2206.07307  [pdf, other]

    cs.CV cs.LG eess.IV

    VCT: A Video Compression Transformer

    Authors: Fabian Mentzer, George Toderici, David Minnen, Sung-Jin Hwang, Sergi Caelles, Mario Lucic, Eirikur Agustsson

    Abstract: We show how transformers can be used to vastly simplify neural video compression. Previous methods have been relying on an increasing number of architectural biases and priors, including motion prediction and warping operations, resulting in complex models. Instead, we independently map input frames to representations and use a transformer to model their dependencies, letting it predict the distri…

    Submitted 12 October, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

    Comments: NeurIPS'22 Camera Ready Version. Code: https://goo.gle/vct-paper

  5. arXiv:1905.00737  [pdf, other]

    cs.CV

    The 2019 DAVIS Challenge on VOS: Unsupervised Multi-Object Segmentation

    Authors: Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, Luc Van Gool

    Abstract: We present the 2019 DAVIS Challenge on Video Object Segmentation, the third edition of the DAVIS Challenge series, a public competition designed for the task of Video Object Segmentation (VOS). In addition to the original semi-supervised track and the interactive track introduced in the previous edition, a new unsupervised multi-object track will be featured this year. In the newly introduced trac…

    Submitted 2 May, 2019; originally announced May 2019.

    Comments: CVPR 2019 Workshop/Challenge

  6. arXiv:1903.12161  [pdf, other]

    cs.CV

    Fast video object segmentation with Spatio-Temporal GANs

    Authors: Sergi Caelles, Albert Pumarola, Francesc Moreno-Noguer, Alberto Sanfeliu, Luc Van Gool

    Abstract: Learning descriptive spatio-temporal object models from data is paramount for the task of semi-supervised video object segmentation. Most existing approaches mainly rely on models that estimate the segmentation mask based on a reference mask at the first frame (aided sometimes by optical flow or the previous mask). These models, however, are prone to fail under rapid appearance changes or occlusio…

    Submitted 28 March, 2019; originally announced March 2019.

  7. arXiv:1808.09814  [pdf, other]

    cs.CV

    Iterative Deep Learning for Road Topology Extraction

    Authors: Carles Ventura, Jordi Pont-Tuset, Sergi Caelles, Kevis-Kokitsi Maninis, Luc Van Gool

    Abstract: This paper tackles the task of estimating the topology of road networks from aerial images. Building on top of a global model that performs a dense semantical classification of the pixels of the image, we design a Convolutional Neural Network (CNN) that predicts the local connectivity among the central pixel of an input patch and its border points. By iterating this local connectivity we sweep the…

    Submitted 28 August, 2018; originally announced August 2018.

    Comments: BMVC 2018 camera ready. Code: https://github.com/carlesventura/iterative-deep-learning. arXiv admin note: substantial text overlap with arXiv:1712.01217

  8. arXiv:1803.00557  [pdf, other]

    cs.CV

    The 2018 DAVIS Challenge on Video Object Segmentation

    Authors: Sergi Caelles, Alberto Montes, Kevis-Kokitsi Maninis, Yuhua Chen, Luc Van Gool, Federico Perazzi, Jordi Pont-Tuset

    Abstract: We present the 2018 DAVIS Challenge on Video Object Segmentation, a public competition specifically designed for the task of video object segmentation. It builds upon the DAVIS 2017 dataset, which was presented in the previous edition of the DAVIS Challenge, and added 100 videos with multiple objects per sequence to the original DAVIS 2016 dataset. Motivated by the analysis of the results of the 2…

    Submitted 27 March, 2018; v1 submitted 1 March, 2018; originally announced March 2018.

    Comments: Challenge website: http://davischallenge.org/

  9. arXiv:1712.01217  [pdf, other]

    cs.CV

    Iterative Deep Learning for Network Topology Extraction

    Authors: Carles Ventura, Jordi Pont-Tuset, Sergi Caelles, Kevis-Kokitsi Maninis, Luc Van Gool

    Abstract: This paper tackles the task of estimating the topology of filamentary networks such as retinal vessels and road networks. Building on top of a global model that performs a dense semantical classification of the pixels of the image, we design a Convolutional Neural Network (CNN) that predicts the local connectivity between the central pixel of an input patch and its border points. By iterating this…

    Submitted 4 December, 2017; originally announced December 2017.

  10. arXiv:1711.09081  [pdf, other]

    cs.CV

    Deep Extreme Cut: From Extreme Points to Object Segmentation

    Authors: Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, Luc Van Gool

    Abstract: This paper explores the use of extreme points in an object (left-most, right-most, top, bottom pixels) as input to obtain precise object segmentation for images and videos. We do so by adding an extra channel to the image in the input of a convolutional neural network (CNN), which contains a Gaussian centered in each of the extreme points. The CNN learns to transform this information into a segmen…

    Submitted 27 March, 2018; v1 submitted 24 November, 2017; originally announced November 2017.

    Comments: CVPR 2018 camera ready. Project webpage and code: http://www.vision.ee.ethz.ch/~cvlsegmentation/dextr/

  11. arXiv:1709.06031  [pdf, other]

    cs.CV

    Video Object Segmentation Without Temporal Information

    Authors: Kevis-Kokitsi Maninis, Sergi Caelles, Yuhua Chen, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, Luc Van Gool

    Abstract: Video Object Segmentation, and video processing in general, has been historically dominated by methods that rely on the temporal consistency and redundancy in consecutive video frames. When the temporal smoothness is suddenly broken, such as when an object is occluded, or some frames are missing in a sequence, the result of these methods can deteriorate significantly or they may not even produce a…

    Submitted 16 May, 2018; v1 submitted 18 September, 2017; originally announced September 2017.

    Comments: Accepted to T-PAMI. Extended version of "One-Shot Video Object Segmentation", CVPR 2017 (arXiv:1611.05198). Project page: http://www.vision.ee.ethz.ch/~cvlsegmentation/osvos/

  12. arXiv:1704.01926  [pdf, other]

    cs.CV

    Semantically-Guided Video Object Segmentation

    Authors: Sergi Caelles, Yuhua Chen, Jordi Pont-Tuset, Luc Van Gool

    Abstract: This paper tackles the problem of semi-supervised video object segmentation, that is, segmenting an object in a sequence given its mask in the first frame. One of the main challenges in this scenario is the change of appearance of the objects of interest. Their semantics, on the other hand, do not vary. This paper investigates how to take advantage of such invariance via the introduction of a sema…

    Submitted 17 July, 2018; v1 submitted 6 April, 2017; originally announced April 2017.

    Comments: This paper has been incorporated in the following T-PAMI publication: arXiv:1709.06031

  13. arXiv:1704.00675  [pdf, other]

    cs.CV

    The 2017 DAVIS Challenge on Video Object Segmentation

    Authors: Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, Luc Van Gool

    Abstract: We present the 2017 DAVIS Challenge on Video Object Segmentation, a public dataset, benchmark, and competition specifically designed for the task of video object segmentation. Following the footsteps of other successful initiatives, such as ILSVRC and PASCAL VOC, which established the avenue of research in the fields of scene classification and semantic segmentation, the DAVIS Challenge comprises…

    Submitted 1 March, 2018; v1 submitted 3 April, 2017; originally announced April 2017.

    Comments: Challenge website: http://davischallenge.org

  14. arXiv:1611.05198  [pdf, other]

    cs.CV

    One-Shot Video Object Segmentation

    Authors: Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, Luc Van Gool

    Abstract: This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foregro…

    Submitted 13 April, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

    Comments: CVPR 2017 camera ready. Code: http://www.vision.ee.ethz.ch/~cvlsegmentation/osvos/