Showing 1–1 of 1 results for author: Christl, D

  1. arXiv:2304.10505

    cs.CV cs.AI cs.LG

    Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained Experts

    Authors: Kastan Day, Daniel Christl, Rohan Salvi, Pranav Sriram

    Abstract: We present Video Pre-trained Transformer. VPT uses four SOTA encoder models from prior work to convert a video into a sequence of compact embeddings. Our backbone, based on a reference Flan-T5-11B architecture, learns a universal representation of the video that is a non-linear sum of the encoder models. It learns using an autoregressive causal language modeling loss by predicting the words spoken…

    Submitted 24 March, 2023; originally announced April 2023.
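
The abstract names two ideas that a small example may make more concrete: merging the outputs of several encoders into one sequence of embeddings, and training with a causal language modeling loss. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The encoder names, the simple concatenation in `build_video_prefix`, and the toy logits are all hypothetical; the listing does not say which encoders are used or how their outputs are combined.

```python
import math


def build_video_prefix(encoder_outputs):
    """Concatenate per-encoder embedding sequences into one sequence.

    `encoder_outputs` maps a (hypothetical) encoder name to a list of
    embedding vectors. Plain concatenation is an assumption made for
    illustration only.
    """
    prefix = []
    for name in sorted(encoder_outputs):  # fixed order keeps the result deterministic
        prefix.extend(encoder_outputs[name])
    return prefix


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def causal_lm_loss(step_logits, target_ids):
    """Mean negative log-likelihood of each target token.

    `step_logits[t]` holds the scores over the vocabulary that a model
    produced at step t, and `target_ids[t]` is the token it should have
    predicted at that step.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one target token per step")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


if __name__ == "__main__":
    # Hypothetical encoders, each emitting a few 2-dimensional embeddings.
    outputs = {
        "encoder_a": [[0.1, 0.2]],
        "encoder_b": [[0.3, 0.4], [0.5, 0.6]],
    }
    print(len(build_video_prefix(outputs)), "embeddings in the prefix")

    # Uniform logits over a 4-token vocabulary give a loss of log(4).
    print(causal_lm_loss([[0.0] * 4] * 3, [0, 1, 2]))
```

In a real system the logits would come from the language-model backbone conditioned on the embedding sequence; here they are supplied by hand so the loss computation can be checked in isolation.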