Showing 1–11 of 11 results for author: Weyand, T

  1. arXiv:2402.13217

    cs.CV cs.AI

    VideoPrism: A Foundational Visual Encoder for Video Understanding

    Authors: Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong

    Abstract: We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic…

    Submitted 15 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted to ICML 2024. v2: added retrieval results on MSRVTT (1K-A), more data analyses, and ablation studies

  2. arXiv:2307.03166

    cs.CV

    VideoGLUE: Video General Understanding Evaluation of Foundation Models

    Authors: Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong

    Abstract: We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoG…

    Submitted 1 December, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

    Comments: Fixes some typos and includes the project's open-source page: https://github.com/tensorflow/models/tree/master/official/projects/videoglue

  3. arXiv:2206.01326

    cs.CV cs.CY cs.LG

    Improving Fairness in Large-Scale Object Recognition by CrowdSourced Demographic Information

    Authors: Zu Kim, André Araujo, Bingyi Cao, Cam Askew, Jack Sim, Mike Green, N'Mah Fodiatu Yilla, Tobias Weyand

    Abstract: There has been increasing awareness of ethical issues in machine learning, and fairness has become an important research topic. Most fairness efforts in computer vision have been focused on human sensing applications and preventing discrimination by people's physical attributes such as race, skin color or age by increasing visual representation for particular demographic groups. We argue that ML f…

    Submitted 2 June, 2022; originally announced June 2022.

  4. arXiv:2108.08874

    cs.CV

    Towards A Fairer Landmark Recognition Dataset

    Authors: Zu Kim, André Araujo, Bingyi Cao, Cam Askew, Jack Sim, Mike Green, N'Mah Fodiatu Yilla, Tobias Weyand

    Abstract: We introduce a new landmark recognition dataset, which is created with a focus on fair worldwide representation. While previous work proposes to collect as many images as possible from web repositories, we instead argue that such approaches can lead to biased data. To create a more comprehensive and equitable dataset, we start by defining the fair relevance of a landmark to the world population. T…

    Submitted 6 June, 2022; v1 submitted 19 August, 2021; originally announced August 2021.

    Comments: Please cite the full detailed version of the paper instead: Improving Fairness in Large-Scale Object Recognition by CrowdSourced Demographic Information arXiv:2206.01326

  5. arXiv:2103.03375

    cs.CV cs.LG

    Nutrition5k: Towards Automatic Nutritional Understanding of Generic Food

    Authors: Quin Thames, Arjun Karpur, Wade Norris, Fangting Xia, Liviu Panait, Tobias Weyand, Jack Sim

    Abstract: Understanding the nutritional content of food from visual data is a challenging computer vision problem, with the potential to have a positive and widespread impact on public health. Studies in this area are limited to existing datasets in the field that lack sufficient diversity or labels required for training models with nutritional understanding capability. We introduce Nutrition5k, a novel dat…

    Submitted 22 June, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

    Comments: 8 pages, 3 of appendices. CVPR 2021

  6. arXiv:2004.01804

    cs.CV

    Google Landmarks Dataset v2 -- A Large-Scale Benchmark for Instance-Level Recognition and Retrieval

    Authors: Tobias Weyand, Andre Araujo, Bingyi Cao, Jack Sim

    Abstract: While image retrieval and instance recognition techniques are progressing rapidly, there is a need for challenging datasets to accurately measure their performance -- while posing novel challenges that are relevant for practical applications. We introduce the Google Landmarks Dataset v2 (GLDv2), a new benchmark for large-scale, fine-grained instance recognition and image retrieval in the domain of…

    Submitted 2 November, 2020; v1 submitted 3 April, 2020; originally announced April 2020.

    Comments: CVPR20 camera-ready (oral) + appendices

  7. arXiv:1808.02130

    cs.CV

    CPlaNet: Enhancing Image Geolocalization by Combinatorial Partitioning of Maps

    Authors: Paul Hongsuck Seo, Tobias Weyand, Jack Sim, Bohyung Han

    Abstract: Image geolocalization is the task of identifying the location depicted in a photo based only on its visual information. This task is inherently challenging since many photos have only few, possibly ambiguous cues to their geolocation. Recent work has cast this task as a classification problem by partitioning the earth into a set of discrete cells that correspond to geographic regions. The granular…

    Submitted 6 August, 2018; originally announced August 2018.

    Comments: ECCV 2018 accepted paper

  8. arXiv:1704.04861

    cs.CV

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Authors: Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam

    Abstract: We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choo…

    Submitted 16 April, 2017; originally announced April 2017.

  9. arXiv:1612.06321

    cs.CV

    Large-Scale Image Retrieval with Attentive Deep Local Features

    Authors: Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, Bohyung Han

    Abstract: We propose an attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELF (DEep Local Feature). The new feature is based on convolutional neural networks, which are trained only with image-level annotations on a landmark image dataset. To identify semantically useful local features for image retrieval, we also propose an attention mechanism for keypoint selecti…

    Submitted 2 February, 2018; v1 submitted 19 December, 2016; originally announced December 2016.

    Comments: ICCV 2017. Code and dataset available: https://github.com/tensorflow/models/tree/master/research/delf

  10. PlaNet - Photo Geolocation with Convolutional Neural Networks

    Authors: Tobias Weyand, Ilya Kostrikov, James Philbin

    Abstract: Is it possible to build a system to determine the location where a photo was taken using just its pixels? In general, the problem seems exceptionally difficult: it is trivial to construct situations where no location can be inferred. Yet images often contain informative cues such as landmarks, weather patterns, vegetation, road markings, and architectural details, which in combination may allow on…

    Submitted 17 February, 2016; originally announced February 2016.

  11. Visual Landmark Recognition from Internet Photo Collections: A Large-Scale Evaluation

    Authors: Tobias Weyand, Bastian Leibe

    Abstract: The task of a visual landmark recognition system is to identify photographed buildings or objects in query photos and to provide the user with relevant information on them. With their increasing coverage of the world's landmark buildings and objects, Internet photo collections are now being used as a source for building such systems in a fully automatic fashion. This process typically consists of…

    Submitted 18 September, 2014; originally announced September 2014.