Showing 1–23 of 23 results for author: Voigtlaender, P

  1. arXiv:2405.13951

    cs.CV

    Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

    Authors: Divya Kothandaraman, Kihyuk Sohn, Ruben Villegas, Paul Voigtlaender, Dinesh Manocha, Mohammad Babaeizadeh

    Abstract: We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Paper accepted to AI4CC Workshop at CVPR 2024

  2. arXiv:2403.05530

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  3. arXiv:2402.05917

    cs.CV

    Point-VOS: Pointing Up Video Object Segmentation

    Authors: Idil Esen Zulfikar, Sabarinath Mahadevan, Paul Voigtlaender, Bastian Leibe

    Abstract: Current state-of-the-art Video Object Segmentation (VOS) methods rely on dense per-object mask annotations both during training and testing. This requires time-consuming and costly video annotation mechanisms. We propose a novel Point-VOS task with a spatio-temporally sparse point-wise annotation scheme that substantially reduces the annotation effort. We apply our annotation scheme to two large-s…

    Submitted 10 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to CVPR2024!

  4. arXiv:2310.09199

    cs.CV

    PaLI-3 Vision Language Models: Smaller, Faster, Stronger

    Authors: Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut

    Abstract: This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classific…

    Submitted 17 October, 2023; v1 submitted 13 October, 2023; originally announced October 2023.

  5. arXiv:2308.11606

    cs.CV cs.CL

    StoryBench: A Multifaceted Benchmark for Continuous Story Visualization

    Authors: Emanuele Bugliarello, Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, Paul Voigtlaender

    Abstract: Generating video stories from text prompts is a complex task. In addition to having high visual quality, videos need to realistically adhere to a sequence of text prompts whilst being consistent throughout the frames. Creating a benchmark for video generation requires data annotated over time, which contrasts with the single caption used often in video datasets. To fill this gap, we collect compre…

    Submitted 12 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: NeurIPS D&B 2023

  6. arXiv:2302.11217

    cs.CV

    Connecting Vision and Language with Video Localized Narratives

    Authors: Paul Voigtlaender, Soravit Changpinyo, Jordi Pont-Tuset, Radu Soricut, Vittorio Ferrari

    Abstract: We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language. In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment. However, this is challenging on a video. Our new protocol empowers annotators to tell the story of a video with Localized Narrati…

    Submitted 15 March, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: Accepted at CVPR 2023

  7. arXiv:2209.12118

    cs.CV

    BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video

    Authors: Ali Athar, Jonathon Luiten, Paul Voigtlaender, Tarasha Khurana, Achal Dave, Bastian Leibe, Deva Ramanan

    Abstract: Multiple existing benchmarks involve tracking and segmenting objects in video e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction between them due to the use of disparate benchmark datasets and metrics (e.g. J&F, mAP, sMOTSA). As a result, published works usually target a particular benchmark, and are not easily comparable to eac…

    Submitted 22 November, 2022; v1 submitted 24 September, 2022; originally announced September 2022.

  8. arXiv:2102.11859

    cs.CV

    STEP: Segmenting and Tracking Every Pixel

    Authors: Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender, Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers, Aljoša Ošep, Laura Leal-Taixé, Liang-Chieh Chen

    Abstract: The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation. Our work is the first that targets this task in a real-world setting requiring dense interpretation in both spatial and temporal domains. As the ground-truth for this task is difficult and expensive to obtain, existing datasets are either constructed synthetically or only sp…

    Submitted 7 December, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

    Comments: Accepted to NeurIPS 2021 Track on Datasets and Benchmarks. Code: https://github.com/google-research/deeplab2

  9. arXiv:2011.01142

    cs.CV

    Reducing the Annotation Effort for Video Object Segmentation Datasets

    Authors: Paul Voigtlaender, Lishu Luo, Chun Yuan, Yong Jiang, Bastian Leibe

    Abstract: For further progress in video object segmentation (VOS), larger, more diverse, and more challenging datasets will be necessary. However, densely labeling every frame with pixel masks does not scale to large datasets. We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations and investigate how far such pseudo-labels can ca…

    Submitted 2 November, 2020; originally announced November 2020.

    Comments: Accepted at WACV 2021

  10. arXiv:1911.12836

    cs.CV

    Siam R-CNN: Visual Tracking by Re-Detection

    Authors: Paul Voigtlaender, Jonathon Luiten, Philip H. S. Torr, Bastian Leibe

    Abstract: We present Siam R-CNN, a Siamese re-detection architecture which unleashes the full power of two-stage object detection approaches for visual object tracking. We combine this with a novel tracklet-based dynamic programming algorithm, which takes advantage of re-detections of both the first-frame template and previous-frame predictions, to model the full history of both the object to be tracked and…

    Submitted 2 April, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

    Comments: CVPR 2020 camera-ready version

  11. arXiv:1904.04552

    cs.CV

    BoLTVOS: Box-Level Tracking for Video Object Segmentation

    Authors: Paul Voigtlaender, Jonathon Luiten, Bastian Leibe

    Abstract: We approach video object segmentation (VOS) by splitting the task into two sub-tasks: bounding box level tracking, followed by bounding box segmentation. Following this paradigm, we present BoLTVOS (Box-Level Tracking for VOS), which consists of an R-CNN detector conditioned on the first-frame bounding box to detect the object of interest, a temporal consistency rescoring algorithm, and a Box2Seg…

    Submitted 29 December, 2019; v1 submitted 9 April, 2019; originally announced April 2019.

  12. arXiv:1903.00362

    cs.CV

    Large-Scale Object Mining for Object Discovery from Unlabeled Video

    Authors: Aljosa Osep, Paul Voigtlaender, Jonathon Luiten, Stefan Breuers, Bastian Leibe

    Abstract: This paper addresses the problem of object discovery from unlabeled driving videos captured in a realistic automotive setting. Identifying recurring object categories in such raw video streams is a very challenging problem. Not only do object candidates first have to be localized in the input images, but many interesting object categories occur relatively infrequently. Object discovery will theref…

    Submitted 29 April, 2019; v1 submitted 28 February, 2019; originally announced March 2019.

    Comments: Updated version of ICRA'19 paper (additional qualitative results); arXiv admin note: text overlap with arXiv:1712.08832

  13. arXiv:1902.09513

    cs.CV

    FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation

    Authors: Paul Voigtlaender, Yuning Chai, Florian Schroff, Hartwig Adam, Bastian Leibe, Liang-Chieh Chen

    Abstract: Many of the recent successful methods for video object segmentation (VOS) are overly complicated, heavily rely on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use. In this work, we propose FEELVOS as a simple and fast method which does not rely on fine-tuning. In order to segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding together wi…

    Submitted 8 April, 2019; v1 submitted 25 February, 2019; originally announced February 2019.

    Comments: CVPR 2019 camera-ready version

    Journal ref: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019

  14. arXiv:1902.03604

    cs.CV

    MOTS: Multi-Object Tracking and Segmentation

    Authors: Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, Bastian Leibe

    Abstract: This paper extends the popular task of multi-object tracking to multi-object tracking and segmentation (MOTS). Towards this goal, we create dense pixel-level annotations for two existing tracking datasets using a semi-automatic annotation procedure. Our new annotations comprise 65,213 pixel masks for 977 distinct objects (cars and pedestrians) in 10,870 video frames. For evaluation, we extend exis…

    Submitted 8 April, 2019; v1 submitted 10 February, 2019; originally announced February 2019.

    Comments: CVPR 2019 camera-ready version

    Journal ref: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019

  15. arXiv:1901.09260

    cs.CV cs.RO

    4D Generic Video Object Proposals

    Authors: Aljosa Osep, Paul Voigtlaender, Mark Weber, Jonathon Luiten, Bastian Leibe

    Abstract: Many high-level video understanding methods require input in the form of object proposals. Currently, such proposals are predominantly generated with the help of networks that were trained for detecting and segmenting a set of known object classes, which limits their applicability to cases where all objects of interest are represented in the training set. This is a restriction for automotive scena…

    Submitted 20 May, 2020; v1 submitted 26 January, 2019; originally announced January 2019.

    Comments: ICRA 2020

  16. arXiv:1809.07316

    cs.CV

    Towards Large-Scale Video Object Mining

    Authors: Aljosa Osep, Paul Voigtlaender, Jonathon Luiten, Stefan Breuers, Bastian Leibe

    Abstract: We propose to leverage a generic object tracker in order to perform object mining in large-scale unlabeled videos, captured in a realistic automotive setting. We present a dataset of more than 360'000 automatically mined object tracks from 10+ hours of video data (560'000 frames) and propose a method for automated novel category discovery and detector learning. In addition, we show preliminary res…

    Submitted 19 September, 2018; originally announced September 2018.

    Comments: 4 pages, 3 figures, 1 table. ECCV 2018 Workshop on Interactive and Adaptive Learning in an Open World

  17. arXiv:1807.09190

    cs.CV

    PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation

    Authors: Jonathon Luiten, Paul Voigtlaender, Bastian Leibe

    Abstract: We address semi-supervised video object segmentation, the task of automatically generating accurate and consistent pixel masks for objects in a video sequence, given the first-frame ground truth annotations. Towards this goal, we present the PReMVOS algorithm (Proposal-generation, Refinement and Merging for Video Object Segmentation). Our method separates this problem into two steps, first generat…

    Submitted 3 November, 2018; v1 submitted 24 July, 2018; originally announced July 2018.

    Comments: Accepted for publication in ACCV18

  18. arXiv:1805.04398

    cs.CV

    Iteratively Trained Interactive Segmentation

    Authors: Sabarinath Mahadevan, Paul Voigtlaender, Bastian Leibe

    Abstract: Deep learning requires large amounts of training data to be effective. For the task of object segmentation, manually labeling data is very expensive, and hence interactive methods are needed. Following recent approaches, we develop an interactive object segmentation system which uses user input in the form of clicks as the input to a convolutional network. While previous methods use heuristic clic…

    Submitted 11 May, 2018; originally announced May 2018.

  19. arXiv:1712.08832

    cs.CV

    Large-Scale Object Discovery and Detector Adaptation from Unlabeled Video

    Authors: Aljoša Ošep, Paul Voigtlaender, Jonathon Luiten, Stefan Breuers, Bastian Leibe

    Abstract: We explore object discovery and detector adaptation based on unlabeled video sequences captured from a mobile platform. We propose a fully automatic approach for object mining from video which builds upon a generic object tracking approach. By applying this method to three large video datasets from autonomous driving and mobile robotics scenarios, we demonstrate its robustness and generality. Base…

    Submitted 23 December, 2017; originally announced December 2017.

    Comments: CVPR'18 submission

  20. arXiv:1712.07920

    cs.CV

    Track, then Decide: Category-Agnostic Vision-based Multi-Object Tracking

    Authors: Aljoša Ošep, Wolfgang Mehner, Paul Voigtlaender, Bastian Leibe

    Abstract: The most common paradigm for vision-based multi-object tracking is tracking-by-detection, due to the availability of reliable detectors for several important object categories such as cars and pedestrians. However, future mobile systems will need a capability to cope with rich human-made environments, in which obtaining detectors for every possible object category would be infeasible. In this pape…

    Submitted 21 December, 2017; originally announced December 2017.

    Comments: ICRA'18 submission

  21. arXiv:1706.09364

    cs.CV

    Online Adaptation of Convolutional Neural Networks for Video Object Segmentation

    Authors: Paul Voigtlaender, Bastian Leibe

    Abstract: We tackle the task of semi-supervised video object segmentation, i.e. segmenting the pixels belonging to an object in the video using the ground truth pixel mask for the first frame. We build on the recently introduced one-shot video object segmentation (OSVOS) approach which uses a pretrained network and fine-tunes it on the first frame. While achieving impressive performance, at test time OSVOS…

    Submitted 1 August, 2017; v1 submitted 28 June, 2017; originally announced June 2017.

    Comments: Accepted at BMVC 2017. This version contains minor changes for the camera ready version

  22. arXiv:1608.00895

    cs.LG cs.CL cs.NE

    RETURNN: The RWTH Extensible Training framework for Universal Recurrent Neural Networks

    Authors: Patrick Doetsch, Albert Zeyer, Paul Voigtlaender, Ilya Kulikov, Ralf Schlüter, Hermann Ney

    Abstract: In this work we release our extensible and easily configurable neural network training software. It provides a rich set of functional layers with a particular focus on efficient training of recurrent neural network topologies on multiple GPUs. The source of the software package is public and freely available for academic research purposes and can be used as a framework or as a standalone tool whic…

    Submitted 10 January, 2017; v1 submitted 2 August, 2016; originally announced August 2016.

  23. arXiv:1606.06871

    cs.NE cs.CL cs.LG cs.SD

    A Comprehensive Study of Deep Bidirectional LSTM RNNs for Acoustic Modeling in Speech Recognition

    Authors: Albert Zeyer, Patrick Doetsch, Paul Voigtlaender, Ralf Schlüter, Hermann Ney

    Abstract: We present a comprehensive study of deep bidirectional long short-term memory (LSTM) recurrent neural network (RNN) based acoustic models for automatic speech recognition (ASR). We study the effect of size and depth and train models of up to 8 layers. We investigate the training aspect and study different variants of optimization methods, batching, truncated backpropagation, different regularizati…

    Submitted 29 March, 2017; v1 submitted 22 June, 2016; originally announced June 2016.

    Comments: Published at the ICASSP 2017 conference, New Orleans, USA