Showing 1–27 of 27 results for author: Tokmakov, P

  1. arXiv:2406.16862  [pdf, other]

    cs.RO cs.CV

    Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

    Authors: Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, Carl Vondrick

    Abstract: A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations o…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Project page: https://dreamitate.cs.columbia.edu/

  2. arXiv:2405.14868  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

    Authors: Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, Carl Vondrick

    Abstract: Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups, and significantly restricting their utility in the wild as well as in terms of embodied AI applications. In this pape…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: Project webpage is available at: https://gcd.cs.columbia.edu/

  3. arXiv:2401.14398  [pdf, other]

    cs.CV cs.LG

    pix2gestalt: Amodal Segmentation by Synthesizing Wholes

    Authors: Ege Ozguroglu, Ruoshi Liu, Dídac Surís, Dian Chen, Achal Dave, Pavel Tokmakov, Carl Vondrick

    Abstract: We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task, we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases, incl…

    Submitted 25 January, 2024; originally announced January 2024.

    Comments: Website: https://gestalt.cs.columbia.edu/

  4. arXiv:2401.10831  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Understanding Video Transformers via Universal Concept Discovery

    Authors: Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G. Derpanis, Pavel Tokmakov

    Abstract: This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Comparatively, video models deal wit…

    Submitted 10 April, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

    Comments: CVPR 2024 (Highlight)

  5. arXiv:2310.06992  [pdf, other]

    cs.CV

    Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

    Authors: Wen-Hsuan Chu, Adam W. Harley, Pavel Tokmakov, Achal Dave, Leonidas Guibas, Katerina Fragkiadaki

    Abstract: Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This begs the question: can we re-purpose these large-scale pre-trained…

    Submitted 25 January, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: Project page available at https://wenhsuanchu.github.io/ovtracktor/

  6. arXiv:2305.03052  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Tracking through Containers and Occluders in the Wild

    Authors: Basile Van Hoorick, Pavel Tokmakov, Simon Stent, Jie Li, Carl Vondrick

    Abstract: Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce $\textbf{TCOW}$, a new benchmark and model for visual tracking through heavy occlusion and containment. We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the sur…

    Submitted 4 May, 2023; originally announced May 2023.

    Comments: Accepted at CVPR 2023. Project webpage is available at: https://tcow.cs.columbia.edu/

  7. arXiv:2303.15555  [pdf, other]

    cs.CV

    Object Discovery from Motion-Guided Tokens

    Authors: Zhipeng Bao, Pavel Tokmakov, Yu-Xiong Wang, Adrien Gaidon, Martial Hebert

    Abstract: Object discovery -- separating objects from the background without manual labels -- is a fundamental open challenge in computer vision. Previous methods struggle to go beyond clustering of low-level cues, whether handcrafted (e.g., color, texture) or learned (e.g., from auto-encoders). In this work, we augment the auto-encoder representation learning framework with two key components: motion-guida…

    Submitted 27 March, 2023; originally announced March 2023.

    Journal ref: CVPR 2023

  8. arXiv:2303.11328  [pdf, other]

    cs.CV cs.GR cs.RO

    Zero-1-to-3: Zero-shot One Image to 3D Object

    Authors: Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, Carl Vondrick

    Abstract: We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model uses a synthetic dataset to learn controls of the relative camera viewpoint, which al…

    Submitted 20 March, 2023; originally announced March 2023.

    Comments: Website: https://zero123.cs.columbia.edu/

  9. arXiv:2302.03802  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking

    Authors: Ziqi Pang, Jie Li, Pavel Tokmakov, Dian Chen, Sergey Zagoruyko, Yu-Xiong Wang

    Abstract: This work proposes an end-to-end multi-camera 3D multi-object tracking (MOT) framework. It emphasizes spatio-temporal continuity and integrates both past and future reasoning for tracked objects. Thus, we name it "Past-and-Future reasoning for Tracking" (PF-Track). Specifically, our method adapts the "tracking by attention" framework and represents tracked instances coherently over time with objec…

    Submitted 3 April, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

    Comments: CVPR 2023 Camera Ready, 15 pages, 8 figures

  10. arXiv:2212.06200  [pdf, other]

    cs.CV

    Breaking the "Object" in Video Object Segmentation

    Authors: Pavel Tokmakov, Jie Li, Adrien Gaidon

    Abstract: The appearance of an object can be fleeting when it transforms. As eggs are broken or paper is torn, their color, shape and texture can change dramatically, preserving virtually nothing of the original except for the identity itself. Yet, this important phenomenon is largely absent from existing video object segmentation (VOS) benchmarks. In this work, we close the gap by collecting a new dataset…

    Submitted 28 March, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

  11. arXiv:2204.01784  [pdf, other]

    cs.CV

    Object Permanence Emerges in a Random Walk along Memory

    Authors: Pavel Tokmakov, Allan Jabri, Jie Li, Adrien Gaidon

    Abstract: This paper proposes a self-supervised objective for learning representations that localize objects under occlusion - a property known as object permanence. A central question is the choice of learning signal in cases of total occlusion. Rather than directly supervising the locations of invisible objects, we propose a self-supervised objective that requires neither human annotation, nor assumptions…

    Submitted 13 June, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

  12. arXiv:2203.10159  [pdf, other]

    cs.CV

    Discovering Objects that Can Move

    Authors: Zhipeng Bao, Pavel Tokmakov, Allan Jabri, Yu-Xiong Wang, Adrien Gaidon, Martial Hebert

    Abstract: This paper studies the problem of object discovery -- separating objects from the background without manual labels. Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions. However, by relying on appearance alone, these methods fail to separate objects from the background in cluttered scenes. This is a fundamental limitation since…

    Submitted 18 March, 2022; originally announced March 2022.

    Comments: Accepted to CVPR 2022

  13. arXiv:2104.12446  [pdf, other]

    cs.CV cs.LG cs.RO

    Heterogeneous-Agent Trajectory Forecasting Incorporating Class Uncertainty

    Authors: Boris Ivanovic, Kuan-Hui Lee, Pavel Tokmakov, Blake Wulfe, Rowan McAllister, Adrien Gaidon, Marco Pavone

    Abstract: Reasoning about the future behavior of other agents is critical to safe robot navigation. The multiplicity of plausible futures is further amplified by the uncertainty inherent to agent state estimation from data, including positions, velocities, and semantic class. Forecasting methods, however, typically neglect class uncertainty, conditioning instead only on the agent's most likely class, even t…

    Submitted 2 March, 2022; v1 submitted 26 April, 2021; originally announced April 2021.

    Comments: 15 pages, 15 figures, 6 tables

  14. arXiv:2103.14258  [pdf, other]

    cs.CV

    Learning to Track with Object Permanence

    Authors: Pavel Tokmakov, Jie Li, Wolfram Burgard, Adrien Gaidon

    Abstract: Tracking by detection, the dominant approach for online multi-object tracking, alternates between localization and association steps. As a result, it strongly depends on the quality of instantaneous observations, often failing when objects are not fully visible. In contrast, tracking in humans is underlined by the notion of object permanence: once an object is recognized, we are aware of its physi…

    Submitted 30 September, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

  15. arXiv:2006.15731  [pdf, other]

    cs.CV

    Unsupervised Learning of Video Representations via Dense Trajectory Clustering

    Authors: Pavel Tokmakov, Martial Hebert, Cordelia Schmid

    Abstract: This paper addresses the task of unsupervised learning of representations for action recognition in videos. Previous works proposed to utilize future prediction, or other domain-specific objectives to train a network, but achieved only limited success. In contrast, in the relevant field of image representation learning, simpler, discrimination-based methods have recently bridged the gap to fully-s…

    Submitted 28 June, 2020; originally announced June 2020.

  16. arXiv:2005.10356  [pdf, other]

    cs.CV

    TAO: A Large-Scale Benchmark for Tracking Any Object

    Authors: Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan

    Abstract: For many years, multi-object tracking benchmarks have focused on a handful of categories. Motivated primarily by surveillance and self-driving applications, these datasets provide tracks for people, vehicles, and animals, ignoring the vast majority of objects in the world. By contrast, in the related field of object detection, the introduction of large-scale, diverse datasets (e.g., COCO) have fos…

    Submitted 20 May, 2020; originally announced May 2020.

    Comments: Project page: http://taodataset.org/

  17. arXiv:1911.12911  [pdf, other]

    cs.CV

    Unlocking the Full Potential of Small Data with Diverse Supervision

    Authors: Ziqi Pang, Zhiyuan Hu, Pavel Tokmakov, Yu-Xiong Wang, Martial Hebert

    Abstract: Virtually all of deep learning literature relies on the assumption of large amounts of available training data. Indeed, even the majority of few-shot learning methods rely on a large set of "base classes" for pretraining. This assumption, however, does not always hold. For some tasks, annotating a large number of classes can be infeasible, and even collecting the images themselves can be a challen…

    Submitted 26 April, 2021; v1 submitted 28 November, 2019; originally announced November 2019.

    Comments: Learning from Limited and Imperfect Data (L2ID) Workshop @ CVPR 2021

  18. arXiv:1910.11844  [pdf, other]

    cs.CV

    Learning to Track Any Object

    Authors: Achal Dave, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan

    Abstract: Object tracking can be formulated as "finding the right object in a video". We observe that recent approaches for class-agnostic tracking tend to focus on the "finding" part, but largely overlook the "object" part of the task, essentially doing a template matching over a frame in a sliding-window. In contrast, class-specific trackers heavily rely on object priors in the form of category-specific o…

    Submitted 25 October, 2019; originally announced October 2019.

    Comments: To be presented at the Holistic Video Understanding workshop at ICCV

  19. arXiv:1904.12993  [pdf, other]

    cs.CV

    A Study on Action Detection in the Wild

    Authors: Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

    Abstract: The recent introduction of the AVA dataset for action detection has caused a renewed interest to this problem. Several approaches have been recently proposed that improved the performance. However, all of them have ignored the main difficulty of the AVA dataset - its realistic distribution of training and test examples. This dataset was collected by exhaustive annotation of human action in uncurat…

    Submitted 9 June, 2019; v1 submitted 29 April, 2019; originally announced April 2019.

  20. arXiv:1902.03715  [pdf, other]

    cs.CV

    Towards Segmenting Anything That Moves

    Authors: Achal Dave, Pavel Tokmakov, Deva Ramanan

    Abstract: Detecting and segmenting individual objects, regardless of their category, is crucial for many applications such as action detection or robotic interaction. While this problem has been well-studied under the classic formulation of spatio-temporal grouping, state-of-the-art approaches do not make use of learning-based methods. To bridge this gap, we propose a simple learning-based approach for spat…

    Submitted 31 March, 2020; v1 submitted 10 February, 2019; originally announced February 2019.

    Comments: Website: http://www.achaldave.com/projects/anything-that-moves/. Code: https://github.com/achalddave/segment-any-moving

  21. arXiv:1812.09213  [pdf, other]

    cs.CV

    Learning Compositional Representations for Few-Shot Recognition

    Authors: Pavel Tokmakov, Yu-Xiong Wang, Martial Hebert

    Abstract: One of the key limitations of modern deep learning approaches lies in the amount of data required to train them. Humans, by contrast, can learn to recognize novel categories from just a few examples. Instrumental to this rapid learning ability is the compositional structure of concept representations in the human brain --- something that deep learning models are lacking. In this work, we make a st…

    Submitted 17 August, 2019; v1 submitted 21 December, 2018; originally announced December 2018.

  22. arXiv:1812.03544  [pdf, other]

    cs.CV

    A Structured Model For Action Detection

    Authors: Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

    Abstract: A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand. While this is an obviously attractive approach, it is not applicable in all scenarios. We claim that action detection is one such cha…

    Submitted 5 June, 2019; v1 submitted 9 December, 2018; originally announced December 2018.

  23. arXiv:1712.01127  [pdf, other]

    cs.CV

    Learning to Segment Moving Objects

    Authors: Pavel Tokmakov, Cordelia Schmid, Karteek Alahari

    Abstract: We study the problem of segmenting moving objects in unconstrained videos. Given a video, the task is to segment all the objects that exhibit independent motion in at least one frame. We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to c…

    Submitted 1 December, 2017; originally announced December 2017.

    Comments: arXiv admin note: text overlap with arXiv:1704.05737, arXiv:1612.07217

  24. arXiv:1704.05737  [pdf, other]

    cs.CV

    Learning Video Object Segmentation with Visual Memory

    Authors: Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

    Abstract: This paper addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a "visual memory" in video, i.…

    Submitted 12 July, 2017; v1 submitted 19 April, 2017; originally announced April 2017.

  25. arXiv:1612.07217  [pdf, other]

    cs.CV

    Learning Motion Patterns in Videos

    Authors: Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

    Abstract: The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved. We address this challenging task by learning motion patterns in videos. The core of our approach is a fully convolutional network, which is learned entirely from synthetic video sequences, and their ground-truth optical flow and motion segmentation. This encoder-decoder style archite…

    Submitted 10 April, 2017; v1 submitted 21 December, 2016; originally announced December 2016.

  26. arXiv:1603.07188  [pdf, other]

    cs.CV

    Weakly-Supervised Semantic Segmentation using Motion Cues

    Authors: Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

    Abstract: Fully convolutional neural networks (FCNNs) trained on a large number of images with strong pixel-level annotations have become the new state of the art for the semantic segmentation task. While there have been recent attempts to learn FCNNs from image-level weak annotations, they need additional constraints, such as the size of an object, to obtain reasonable performance. To address this issue, w…

    Submitted 21 April, 2017; v1 submitted 23 March, 2016; originally announced March 2016.

    Comments: Extended version of our ECCV 2016 paper

  27. arXiv:1410.3125  [pdf, other]

    cs.AI cs.LO cs.PL math.OC

    Relational Linear Programs

    Authors: Kristian Kersting, Martin Mladenov, Pavel Tokmakov

    Abstract: We propose relational linear programming, a simple framework for combining linear programs (LPs) and logic programs. A relational linear program (RLP) is a declarative LP template defining the objective and the constraints through the logical concepts of objects, relations, and quantified variables. This allows one to express the LP objective and constraints relationally for a varying number of indi…

    Submitted 12 October, 2014; originally announced October 2014.