Skip to main content

Showing 1–7 of 7 results for author: Sirotenko, M

Searching in archive cs. Search in all archives.
.
  1. arXiv:2402.13217  [pdf, other

    cs.CV cs.AI

    VideoPrism: A Foundational Visual Encoder for Video Understanding

    Authors: Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong

    Abstract: We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic… ▽ More

    Submitted 15 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted to ICML 2024. v2: added retrieval results on MSRVTT (1K-A), more data analyses, and ablation studies

  2. arXiv:2312.14125  [pdf, other

    cs.CV cs.AI

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam , et al. (6 additional authors not shown)

    Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and tas… ▽ More

    Submitted 4 June, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: To appear at ICML 2024; Project page: http://sites.research.google/videopoet/

  3. arXiv:2311.05770  [pdf, other

    cs.CV

    PolyMaX: General Dense Prediction with Mask Transformer

    Authors: Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Liang-Chieh Chen

    Abstract: Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, on the recent frontier of segmentation task, the community has been… ▽ More

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: WACV 2024

  4. arXiv:2309.12172  [pdf, other

    cs.CV

    SANPO: A Scene Understanding, Accessibility, Navigation, Pathfinding, Obstacle Avoidance Dataset

    Authors: Sagar M. Waghmare, Kimberly Wilber, Dave Hawkey, Xuan Yang, Matthew Wilson, Stephanie Debats, Cattalyya Nuengsigkapian, Astuti Sharma, Lars Pandikow, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko

    Abstract: We introduce SANPO, a large-scale egocentric video dataset focused on dense prediction in outdoor environments. It contains stereo video sessions collected across diverse outdoor environments, as well as rendered synthetic video sessions. (Synthetic data was provided by Parallel Domain.) All sessions have (dense) depth and odometry labels. All synthetic sessions and a subset of real sessions have… ▽ More

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: 10 pages plus additional references. 13 figures

  5. arXiv:2307.03166  [pdf, other

    cs.CV

    VideoGLUE: Video General Understanding Evaluation of Foundation Models

    Authors: Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong

    Abstract: We evaluate existing foundation models video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoG… ▽ More

    Submitted 1 December, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

    Comments: Fixes some typos and include project open-source page: https://github.com/tensorflow/models/tree/master/official/projects/videoglue

  6. arXiv:2203.04888  [pdf, other

    cs.LG cs.CV

    Efficient Image Representation Learning with Federated Sampled Softmax

    Authors: Sagar M. Waghmare, Hang Qi, Huizhong Chen, Mikhail Sirotenko, Tomer Meron

    Abstract: Learning image representations on decentralized data can bring many benefits in cases where data cannot be aggregated across data silos. Softmax cross entropy loss is highly effective and commonly used for learning image representations. Using a large number of classes has proven to be particularly beneficial for the descriptive power of such representations in centralized learning. However, doing… ▽ More

    Submitted 9 March, 2022; originally announced March 2022.

    Comments: 15 pages, 10 figures, 4 tables and 1 algorithm

  7. arXiv:2004.12276  [pdf, other

    cs.CV cs.LG eess.IV

    Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset

    Authors: Menglin Jia, Mengyun Shi, Mikhail Sirotenko, Yin Cui, Claire Cardie, Bharath Hariharan, Hartwig Adam, Serge Belongie

    Abstract: In this work we explore the task of instance segmentation with attribute localization, which unifies instance segmentation (detect and segment each object instance) and fine-grained visual attribute categorization (recognize one or multiple attributes). The proposed task requires both localizing an object and describing its properties. To illustrate the various aspects of this task, we focus on th… ▽ More

    Submitted 18 July, 2020; v1 submitted 25 April, 2020; originally announced April 2020.

    Comments: eccv2020