Showing 1–12 of 12 results for author: Seybold, B

  1. arXiv:2405.13195

    cs.CV cs.AI

    CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers

    Authors: Andrew Marmon, Grant Schindler, José Lezama, Dan Kondratyuk, Bryan Seybold, Irfan Essa

    Abstract: We extend multimodal transformers to include 3D camera motion as a conditioning signal for the task of video generation. Generative video models are becoming increasingly powerful, thus focusing research efforts on methods of controlling the output of such models. We propose to add virtual 3D camera controls to generative video methods by conditioning generated video on an encoding of three-dimens…

    Submitted 21 May, 2024; originally announced May 2024.

  2. arXiv:2403.05530

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  3. arXiv:2312.14125

    cs.CV cs.AI

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, et al. (6 additional authors not shown)

    Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and tas…

    Submitted 4 June, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: To appear at ICML 2024; Project page: http://sites.research.google/videopoet/

  4. arXiv:2212.10596

    cs.CV

    Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

    Authors: Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, David A. Ross

    Abstract: Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable open-vocabulary performance competitive with fully-supervised mod…

    Submitted 10 January, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

  5. arXiv:2205.06253

    cs.CV cs.CL

    What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

    Authors: David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, Bryan Seybold, John F. Canny

    Abstract: While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data leading to evaluation metric increases, but what are those patterns? In th…

    Submitted 12 January, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: The 1st Workshop on Vision Datasets Understanding, IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022

  6. arXiv:2204.00679

    cs.CV cs.MM cs.SD eess.AS

    Learning Audio-Video Modalities from Image Captions

    Authors: Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

    Abstract: A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image-captioning, where datasets are in the order of millions of samples. To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new l…

    Submitted 1 April, 2022; originally announced April 2022.

  7. arXiv:2106.09251

    cs.CV

    Optical Mouse: 3D Mouse Pose From Single-View Video

    Authors: Bo Hu, Bryan Seybold, Shan Yang, David Ross, Avneesh Sud, Graham Ruby, Yi Liu

    Abstract: We present a method to infer the 3D pose of mice, including the limbs and feet, from monocular videos. Many human clinical conditions and their corresponding animal models result in abnormal motion, and accurately measuring 3D motion at scale offers insights into health. The 3D poses improve classification of health-related attributes over 2D representations. The inferred poses are accurate enough…

    Submitted 17 June, 2021; originally announced June 2021.

  8. arXiv:1910.09588

    cs.LG stat.ML

    Collapsed Amortized Variational Inference for Switching Nonlinear Dynamical Systems

    Authors: Zhe Dong, Bryan A. Seybold, Kevin P. Murphy, Hung H. Bui

    Abstract: We propose an efficient inference method for switching nonlinear dynamical systems. The key idea is to learn an inference network which can be used as a proposal distribution for the continuous latent variables, while performing exact marginalization of the discrete latent variables. This allows us to use the reparameterization trick, and apply end-to-end training with stochastic gradient descent.…

    Submitted 10 February, 2020; v1 submitted 21 October, 2019; originally announced October 2019.

  9. arXiv:1905.07478

    cs.LG stat.ML

    Dueling Decoders: Regularizing Variational Autoencoder Latent Spaces

    Authors: Bryan Seybold, Emily Fertig, Alex Alemi, Ian Fischer

    Abstract: Variational autoencoders learn unsupervised data representations, but these models frequently converge to minima that fail to preserve meaningful semantic information. For example, variational autoencoders with autoregressive decoders often collapse into autodecoders, where they learn to ignore the encoder input. In this work, we demonstrate that adding an auxiliary decoder to regularize the laten…

    Submitted 17 May, 2019; originally announced May 2019.

    Comments: 16 pages, 9 figures, supplemental

  10. arXiv:1804.07667

    cs.CV

    Rethinking the Faster R-CNN Architecture for Temporal Action Localization

    Authors: Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng, Rahul Sukthankar

    Abstract: We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions…

    Submitted 20 April, 2018; originally announced April 2018.

    Comments: Accepted in CVPR 2018

  11. arXiv:1801.00908

    cs.CV

    Instance Embedding Transfer to Unsupervised Video Object Segmentation

    Authors: Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, C.-C. Jay Kuo

    Abstract: We propose a method for unsupervised video object segmentation by transferring the knowledge encapsulated in image-based instance embedding networks. The instance embedding network produces an embedding vector for each pixel that enables identifying all pixels belonging to the same object. Though trained on static images, the instance embeddings are stable over consecutive video frames, which allo…

    Submitted 26 February, 2018; v1 submitted 3 January, 2018; originally announced January 2018.

    Comments: To appear in CVPR 2018

  12. arXiv:1609.09430

    cs.SD cs.LG stat.ML

    CNN Architectures for Large-Scale Audio Classification

    Authors: Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson

    Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying th…

    Submitted 10 January, 2017; v1 submitted 29 September, 2016; originally announced September 2016.

    Comments: Accepted for publication at ICASSP 2017. Changes: added definitions of mAP, AUC, and d-prime; updated mAP/AUC/d-prime numbers for Audio Set based on changes in the latest Audio Set revision; changed wording to fit the 4-page limit with the new additions.