-
STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos
Authors:
Anshul Shah,
Benjamin Lundell,
Harpreet Sawhney,
Rama Chellappa
Abstract:
We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key step extraction. We propose a training objective, the Bootstrapped Multi-Cue Contrastive (BMC2) loss, to learn discriminative representations for various steps without any labels. Different from prior works, we develop techniques to train a light-weight temporal module which uses off-the-shelf features for self-supervision. Our approach can seamlessly leverage information from multiple cues like optical flow, depth or gaze to learn discriminative features for key steps, making it amenable for AR applications. We finally extract key steps via a tunable algorithm that clusters the representations and samples. We show significant improvements over prior works for the task of key step localization and phase classification. Qualitative results demonstrate that the extracted key steps are meaningful and succinctly represent various steps of the procedural tasks.
Submitted 9 September, 2023; v1 submitted 2 January, 2023;
originally announced January 2023.
-
Self-supervised Learning with Local Contrastive Loss for Detection and Semantic Segmentation
Authors:
Ashraful Islam,
Ben Lundell,
Harpreet Sawhney,
Sudipta Sinha,
Peter Morales,
Richard J. Radke
Abstract:
We present a self-supervised learning (SSL) method suitable for semi-global tasks such as object detection and semantic segmentation. We enforce local consistency between self-learned features, representing corresponding image locations of transformed versions of the same image, by minimizing a pixel-level local contrastive (LC) loss during training. LC-loss can be added to existing self-supervised learning methods with minimal overhead. We evaluate our SSL approach on two downstream tasks -- object detection and semantic segmentation, using COCO, PASCAL VOC, and CityScapes datasets. Our method outperforms the existing state-of-the-art SSL approaches by 1.9% on COCO object detection, 1.4% on PASCAL VOC detection, and 0.6% on CityScapes segmentation.
Submitted 7 December, 2022; v1 submitted 10 July, 2022;
originally announced July 2022.
-
Domain-Specific Priors and Meta Learning for Few-Shot First-Person Action Recognition
Authors:
Huseyin Coskun,
Zeeshan Zia,
Bugra Tekin,
Federica Bogo,
Nassir Navab,
Federico Tombari,
Harpreet Sawhney
Abstract:
The lack of large-scale real datasets with annotations makes transfer learning a necessity for video activity understanding. We aim to develop an effective method for few-shot transfer learning for first-person action classification. We leverage independently trained local visual cues to learn representations that can be transferred from a source domain, which provides primitive action labels, to a different target domain using only a handful of examples. Visual cues we employ include object-object interactions, hand grasps and motion within regions that are a function of hand locations. We employ a framework based on meta-learning to extract the distinctive and domain invariant components of the deployed visual cues. This enables transfer of action classification models across public datasets captured with diverse scene and action configurations. We present comparative results of our transfer learning methodology and report superior results over state-of-the-art action classification approaches for both inter-class and inter-dataset transfer.
Submitted 7 December, 2021; v1 submitted 22 July, 2019;
originally announced July 2019.
-
Video Analysis for Body-worn Cameras in Law Enforcement
Authors:
Jason J. Corso,
Alexandre Alahi,
Kristen Grauman,
Gregory D. Hager,
Louis-Philippe Morency,
Harpreet Sawhney,
Yaser Sheikh
Abstract:
The social conventions and expectations around the appropriate use of imaging and video have been transformed by the availability of video cameras in our pockets. The impact on law enforcement can easily be seen by watching the nightly news; more and more arrests, interventions, or even routine stops are being caught on cell phones or surveillance video, with both positive and negative consequences. This proliferation of the use of video has led law enforcement to look at the potential benefits of incorporating video capture systematically in their day-to-day operations. At the same time, recognition of the inevitability of widespread use of video for police operations has caused a rush to deploy all types of cameras, including body-worn cameras. However, the vast majority of police agencies have limited experience in utilizing video to its full advantage, and thus do not have the capability to fully realize the value of expanding their video capabilities. In this white paper, we highlight some of the technology needs and challenges of body-worn cameras, and we relate these needs to the relevant state of the art in computer vision and multimedia research. We conclude with a set of recommendations.
Submitted 7 May, 2018; v1 submitted 11 April, 2016;
originally announced April 2016.
-
Zero-Shot Event Detection by Multimodal Distributional Semantic Embedding of Videos
Authors:
Mohamed Elhoseiny,
Jingen Liu,
Hui Cheng,
Harpreet Sawhney,
Ahmed Elgammal
Abstract:
We propose a new zero-shot Event Detection method by Multi-modal Distributional Semantic embedding of videos. Our model embeds object and action concepts as well as other available modalities from videos into a distributional semantic space. To our knowledge, this is the first Zero-Shot event detection model that is built on top of distributional semantics and extends it in the following directions: (a) semantic embedding of multimodal information in videos (with focus on the visual modalities), (b) automatically determining relevance of concepts/attributes to a free text query, which could be useful for other applications, and (c) retrieving videos by free text event query (e.g., "changing a vehicle tire") based on their content. We embed videos into a distributional semantic space and then measure the similarity between videos and the event query in a free text form. We validated our method on the large TRECVID MED (Multimedia Event Detection) challenge. Using only the event title as a query, our method outperformed the state of the art, which uses big descriptions, improving MAP from 12.6% to 13.5% and ROC-AUC from 0.73 to 0.83. It is also an order of magnitude faster.
Submitted 15 December, 2015; v1 submitted 2 December, 2015;
originally announced December 2015.
-
Depth Extraction from Videos Using Geometric Context and Occlusion Boundaries
Authors:
S. Hussain Raza,
Omar Javed,
Aveek Das,
Harpreet Sawhney,
Hui Cheng,
Irfan Essa
Abstract:
We present an algorithm to estimate depth in dynamic video scenes. We propose to learn and infer depth in videos from appearance, motion, occlusion boundaries, and geometric context of the scene. Using our method, depth can be estimated from unconstrained videos with no requirement of camera pose estimation, and with significant background/foreground motions. We start by decomposing a video into spatio-temporal regions. For each spatio-temporal region, we learn the relationship of depth to visual appearance, motion, and geometric classes. Then we infer the depth information of new scenes using piecewise planar parametrization estimated within a Markov random field (MRF) framework by combining appearance to depth learned mappings and occlusion boundary guided smoothness constraints. Subsequently, we perform temporal smoothing to obtain temporally consistent depth maps. To evaluate our depth estimation algorithm, we provide a novel dataset with ground truth depth for outdoor video scenes. We present a thorough evaluation of our algorithm on our new dataset and the publicly available Make3d static image dataset.
Submitted 25 October, 2015;
originally announced October 2015.
-
PUC Autonomy and Policy Innovation: Local Telephone Competition in Arkansas and New York
Authors:
Hokyu Lee,
Harmeet Sawhney
Abstract:
In the pre-divestiture era, the regulatory environment in the U.S. was fairly uniform and harmonious with the FCC setting the course and the accommodative state PUCs making corresponding changes in their own policies. The divestiture fractured this monolithic system as it forced the PUCs to respond to new forces unleashed in their own backyards. Soon there was great diversity in the overall regulatory landscape. Within this new environment, there is considerable disparity among the PUCs in terms of their ability to implement new ideas. This paper seeks to understand the structural factors that influence the latitude of regulatory action by PUCs via a comparative study of local telephone competition policy making in Arkansas and New York. The analysis suggests that the presence or absence of countervailing forces determines the relative autonomy the PUCs enjoy and thereby their ability to introduce new ideas into their states.
Submitted 21 September, 2001;
originally announced September 2001.