
STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos

Authors: Anshul Shah, Benjamin Lundell, Harpreet Sawhney, Rama Chellappa

Abstract: We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key steps extraction. We propose a training objective, Bootstrapped Multi-Cue Contrastive (BMC2) loss to learn discriminative representations for various steps without any labels. Different from prior works, we develop techniques to train a light-weight temporal module which uses off-the-shelf features for self supervision. Our approach can seamlessly leverage information from multiple cues like optical flow, depth or gaze to learn discriminative features for key-steps, making it amenable for AR applications. We finally extract key steps via a tunable algorithm that clusters the representations and samples. We show significant improvements over prior works for the task of key step localization and phase classification. Qualitative results demonstrate that the extracted key steps are meaningful and succinctly represent various steps of the procedural tasks.

Submitted 9 September, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

Comments: Accepted at ICCV 2023

arXiv:2207.04398 [pdf, other]

Self-supervised Learning with Local Contrastive Loss for Detection and Semantic Segmentation

Authors: Ashraful Islam, Ben Lundell, Harpreet Sawhney, Sudipta Sinha, Peter Morales, Richard J. Radke

Abstract: We present a self-supervised learning (SSL) method suitable for semi-global tasks such as object detection and semantic segmentation. We enforce local consistency between self-learned features, representing corresponding image locations of transformed versions of the same image, by minimizing a pixel-level local contrastive (LC) loss during training. LC-loss can be added to existing self-supervised learning methods with minimal overhead. We evaluate our SSL approach on two downstream tasks -- object detection and semantic segmentation, using COCO, PASCAL VOC, and CityScapes datasets. Our method outperforms the existing state-of-the-art SSL approaches by 1.9% on COCO object detection, 1.4% on PASCAL VOC detection, and 0.6% on CityScapes segmentation.

Submitted 7 December, 2022; v1 submitted 10 July, 2022; originally announced July 2022.

Comments: accepted to WACV 2023

arXiv:1907.09382 [pdf, other]

doi 10.1109/TPAMI.2021.3058606

Domain-Specific Priors and Meta Learning for Few-Shot First-Person Action Recognition

Authors: Huseyin Coskun, Zeeshan Zia, Bugra Tekin, Federica Bogo, Nassir Navab, Federico Tombari, Harpreet Sawhney

Abstract: The lack of large-scale real datasets with annotations makes transfer learning a necessity for video activity understanding. We aim to develop an effective method for few-shot transfer learning for first-person action classification. We leverage independently trained local visual cues to learn representations that can be transferred from a source domain, which provides primitive action labels, to a different target domain using only a handful of examples. Visual cues we employ include object-object interactions, hand grasps and motion within regions that are a function of hand locations. We employ a framework based on meta-learning to extract the distinctive and domain invariant components of the deployed visual cues. This enables transfer of action classification models across public datasets captured with diverse scene and action configurations. We present comparative results of our transfer learning methodology and report superior results over state-of-the-art action classification approaches for both inter-class and inter-dataset transfer.

Submitted 7 December, 2021; v1 submitted 22 July, 2019; originally announced July 2019.

Comments: Paper has been accepted in Transactions on Pattern Analysis and Machine Intelligence

Journal ref: year = {5555}, volume = {}, number = {01}, issn = {1939-3539}, pages = {1-1},

arXiv:1604.03130 [pdf]

Video Analysis for Body-worn Cameras in Law Enforcement

Authors: Jason J. Corso, Alexandre Alahi, Kristen Grauman, Gregory D. Hager, Louis-Philippe Morency, Harpreet Sawhney, Yaser Sheikh

Abstract: The social conventions and expectations around the appropriate use of imaging and video have been transformed by the availability of video cameras in our pockets. The impact on law enforcement can easily be seen by watching the nightly news; more and more arrests, interventions, or even routine stops are being caught on cell phones or surveillance video, with both positive and negative consequences. This proliferation of the use of video has led law enforcement to look at the potential benefits of incorporating video capture systematically in their day-to-day operations. At the same time, recognition of the inevitability of widespread use of video for police operations has caused a rush to deploy all types of cameras, including body-worn cameras. However, the vast majority of police agencies have limited experience in utilizing video to its full advantage, and thus do not have the capability to fully realize the value of expanding their video capabilities. In this white paper, we highlight some of the technology needs and challenges of body-worn cameras, and we relate these needs to the relevant state of the art in computer vision and multimedia research. We conclude with a set of recommendations.

Submitted 7 May, 2018; v1 submitted 11 April, 2016; originally announced April 2016.

Comments: A Computing Community Consortium (CCC) white paper, 9 pages

arXiv:1512.00818 [pdf, other]

Zero-Shot Event Detection by Multimodal Distributional Semantic Embedding of Videos

Authors: Mohamed Elhoseiny, Jingen Liu, Hui Cheng, Harpreet Sawhney, Ahmed Elgammal

Abstract: We propose a new zero-shot Event Detection method by Multi-modal Distributional Semantic embedding of videos. Our model embeds object and action concepts as well as other available modalities from videos into a distributional semantic space. To our knowledge, this is the first Zero-Shot event detection model that is built on top of distributional semantics and extends it in the following directions: (a) semantic embedding of multimodal information in videos (with focus on the visual modalities), (b) automatically determining relevance of concepts/attributes to a free text query, which could be useful for other applications, and (c) retrieving videos by free text event query (e.g., "changing a vehicle tire") based on their content. We embed videos into a distributional semantic space and then measure the similarity between videos and the event query in a free text form. We validated our method on the large TRECVID MED (Multimedia Event Detection) challenge. Using only the event title as a query, our method outperformed the state-of-the-art that uses big descriptions from 12.6% to 13.5% with MAP metric and 0.73 to 0.83 with ROC-AUC metric. It is also an order of magnitude faster.

Submitted 15 December, 2015; v1 submitted 2 December, 2015; originally announced December 2015.

Comments: To appear in AAAI 2016

arXiv:1510.07317 [pdf, other]

Depth Extraction from Videos Using Geometric Context and Occlusion Boundaries

Authors: S. Hussain Raza, Omar Javed, Aveek Das, Harpreet Sawhney, Hui Cheng, Irfan Essa

Abstract: We present an algorithm to estimate depth in dynamic video scenes. We propose to learn and infer depth in videos from appearance, motion, occlusion boundaries, and geometric context of the scene. Using our method, depth can be estimated from unconstrained videos with no requirement of camera pose estimation, and with significant background/foreground motions. We start by decomposing a video into spatio-temporal regions. For each spatio-temporal region, we learn the relationship of depth to visual appearance, motion, and geometric classes. Then we infer the depth information of new scenes using piecewise planar parametrization estimated within a Markov random field (MRF) framework by combining appearance to depth learned mappings and occlusion boundary guided smoothness constraints. Subsequently, we perform temporal smoothing to obtain temporally consistent depth maps. To evaluate our depth estimation algorithm, we provide a novel dataset with ground truth depth for outdoor video scenes. We present a thorough evaluation of our algorithm on our new dataset and the publicly available Make3d static image dataset.

Submitted 25 October, 2015; originally announced October 2015.

Comments: British Machine Vision Conference (BMVC) 2014

arXiv:cs/0109043

PUC Autonomy and Policy Innovation: Local Telephone Competition in Arkansas and New York

Authors: Hokyu Lee, Harmeet Sawhney

Abstract: In the pre-divestiture era, the regulatory environment in the U.S. was fairly uniform and harmonious with the FCC setting the course and the accommodative state PUCs making corresponding changes in their own policies. The divestiture fractured this monolithic system as it forced the PUCs to respond to new forces unleashed in their own backyards. Soon there was great diversity in the overall regulatory landscape. Within this new environment, there is considerable disparity among the PUCs in terms of their ability to implement new ideas. This paper seeks to understand the structural factors that influence the latitude of regulatory action by PUCs via a comparative study of local telephone competition policy making in Arkansas and New York. The analysis suggests that the presence or absence of countervailing forces determines the relative autonomy the PUCs enjoy and thereby their ability to introduce new ideas into their states.

Submitted 21 September, 2001; originally announced September 2001.

Comments: 29th TPRC Conference, 2001

Report number: TPRC-2001-026 ACM Class: K.4.m

Showing 1–7 of 7 results for author: Sawhney, H