-
AdaEmbed: Semi-supervised Domain Adaptation in the Embedding Space
Authors:
Ali Mottaghi,
Mohammad Abdullah Jamal,
Serena Yeung,
Omid Mohareri
Abstract:
Semi-supervised domain adaptation (SSDA) presents a critical hurdle in computer vision, especially given the frequent scarcity of labeled data in real-world settings. This scarcity often causes foundation models, trained on extensive datasets, to underperform when applied to new domains. AdaEmbed, our newly proposed methodology for SSDA, offers a promising solution to these challenges. Leveraging…
▽ More
Semi-supervised domain adaptation (SSDA) presents a critical hurdle in computer vision, especially given the frequent scarcity of labeled data in real-world settings. This scarcity often causes foundation models, trained on extensive datasets, to underperform when applied to new domains. AdaEmbed, our newly proposed methodology for SSDA, offers a promising solution to these challenges. Leveraging the potential of unlabeled data, AdaEmbed facilitates the transfer of knowledge from a labeled source domain to an unlabeled target domain by learning a shared embedding space. By generating accurate and uniform pseudo-labels based on the established embedding space, the model overcomes the limitations of conventional SSDA, thus enhancing performance significantly. Our method's effectiveness is validated through extensive experiments on benchmark datasets such as DomainNet, Office-Home, and VisDA-C, where AdaEmbed consistently outperforms all the baselines, setting a new state of the art for SSDA. With its straightforward implementation and high data efficiency, AdaEmbed stands out as a robust and pragmatic solution for real-world scenarios, where labeled data is scarce. To foster further research and application in this area, we are sharing the codebase of our unified framework for semi-supervised domain adaptation.
△ Less
Submitted 22 January, 2024;
originally announced January 2024.
-
ST(OR)2: Spatio-Temporal Object Level Reasoning for Activity Recognition in the Operating Room
Authors:
Idris Hamoud,
Muhammad Abdullah Jamal,
Vinkle Srivastav,
Didier Mutter,
Nicolas Padoy,
Omid Mohareri
Abstract:
Surgical robotics holds much promise for improving patient safety and clinician experience in the Operating Room (OR). However, it also comes with new challenges, requiring strong team coordination and effective OR management. Automatic detection of surgical activities is a key requirement for develo** AI-based intelligent tools to tackle these challenges. The current state-of-the-art surgical a…
▽ More
Surgical robotics holds much promise for improving patient safety and clinician experience in the Operating Room (OR). However, it also comes with new challenges, requiring strong team coordination and effective OR management. Automatic detection of surgical activities is a key requirement for develo** AI-based intelligent tools to tackle these challenges. The current state-of-the-art surgical activity recognition methods however operate on image-based representations and depend on large-scale labeled datasets whose collection is time-consuming and resource-expensive. This work proposes a new sample-efficient and object-based approach for surgical activity recognition in the OR. Our method focuses on the geometric arrangements between clinicians and surgical devices, thus utilizing the significant object interaction dynamics in the OR. We conduct experiments in a low-data regime study for long video activity recognition. We also benchmark our method againstother object-centric approaches on clip-level action classification and show superior performance.
△ Less
Submitted 19 December, 2023;
originally announced December 2023.
-
M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding
Authors:
Muhammad Abdullah Jamal,
Omid Mohareri
Abstract:
We present a new pre-training strategy called M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$) built based on Multi-modal masked autoencoders that can leverage 3D priors and learned cross-modal representations in RGB-D data. We integrate two major self-supervised learning frameworks; Masked Image Modeling (MIM) and contrastive learning; aiming to effectivel…
▽ More
We present a new pre-training strategy called M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$) built based on Multi-modal masked autoencoders that can leverage 3D priors and learned cross-modal representations in RGB-D data. We integrate two major self-supervised learning frameworks; Masked Image Modeling (MIM) and contrastive learning; aiming to effectively embed masked 3D priors and modality complementary features to enhance the correspondence between modalities. In contrast to recent approaches which are either focusing on specific downstream tasks or require multi-view correspondence, we show that our pre-training strategy is ubiquitous, enabling improved representation learning that can transfer into improved performance on various downstream tasks such as video action recognition, video action detection, 2D semantic segmentation and depth estimation. Experiments show that M$^{3}$3D outperforms the existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101 and OR-AR, particularly with an improvement of +1.3\% mIoU against Mask3D on ScanNet semantic segmentation. We further evaluate our method on low-data regime and demonstrate its superior data efficiency compared to current state-of-the-art approaches.
△ Less
Submitted 26 September, 2023;
originally announced September 2023.
-
SurgMAE: Masked Autoencoders for Long Surgical Video Analysis
Authors:
Muhammad Abdullah Jamal,
Omid Mohareri
Abstract:
There has been a growing interest in using deep learning models for processing long surgical videos, in order to automatically detect clinical/operational activities and extract metrics that can enable workflow efficiency tools and applications. However, training such models require vast amounts of labeled data which is costly and not scalable. Recently, self-supervised learning has been explored…
▽ More
There has been a growing interest in using deep learning models for processing long surgical videos, in order to automatically detect clinical/operational activities and extract metrics that can enable workflow efficiency tools and applications. However, training such models require vast amounts of labeled data which is costly and not scalable. Recently, self-supervised learning has been explored in computer vision community to reduce the burden of the annotation cost. Masked autoencoders (MAE) got the attention in self-supervised paradigm for Vision Transformers (ViTs) by predicting the randomly masked regions given the visible patches of an image or a video clip, and have shown superior performance on benchmark datasets. However, the application of MAE in surgical data remains unexplored. In this paper, we first investigate whether MAE can learn transferrable representations in surgical video domain. We propose SurgMAE, which is a novel architecture with a masking strategy based on sampling high spatio-temporal tokens for MAE. We provide an empirical study of SurgMAE on two large scale long surgical video datasets, and find that our method outperforms several baselines in low data regime. We conduct extensive ablation studies to show the efficacy of our approach and also demonstrate it's superior performance on UCF-101 to prove it's generalizability in non-surgical datasets as well.
△ Less
Submitted 19 May, 2023;
originally announced May 2023.
-
Lasting Diversity and Superior Runtime Guarantees for the $(μ+1)$ Genetic Algorithm
Authors:
Benjamin Doerr,
Aymen Echarghaoui,
Mohammed Jamal,
Martin S. Krejca
Abstract:
Most evolutionary algorithms (EAs) used in practice employ crossover. In contrast, only for few and mostly artificial examples a runtime advantage from crossover could be proven with mathematical means. The most convincing such result shows that the $(μ+1)$ genetic algorithm (GA) with population size $μ=O(n)$ optimizes jump functions with gap size $k \ge 3$ in time $O(n^k / μ+ n^{k-1}\log n)$, bea…
▽ More
Most evolutionary algorithms (EAs) used in practice employ crossover. In contrast, only for few and mostly artificial examples a runtime advantage from crossover could be proven with mathematical means. The most convincing such result shows that the $(μ+1)$ genetic algorithm (GA) with population size $μ=O(n)$ optimizes jump functions with gap size $k \ge 3$ in time $O(n^k / μ+ n^{k-1}\log n)$, beating the $Θ(n^k)$ runtime of many mutation-based EAs. This result builds on a proof that the GA occasionally and then for an expected number of $Ω(μ^2)$ iterations has a population that is not dominated by a single genotype.
In this work, we show that this diversity persist with high probability for a time exponential in $μ$ (instead of quadratic). From this better understanding of the population diversity, we obtain stronger runtime guarantees, among them the statement that for all $c\ln(n)\leμ\le n/\log n$, with $c$ a suitable constant, the runtime of the $(μ+1)$ GA on $\mathrm{Jump}_k$, with $k \ge 3$, is $O(n^{k-1})$. Consequently, already with logarithmic population sizes, the GA gains a speed-up of order $Ω(n)$ from crossover.
△ Less
Submitted 24 February, 2023;
originally announced February 2023.
-
Multi-Modal Unsupervised Pre-Training for Surgical Operating Room Workflow Analysis
Authors:
Muhammad Abdullah Jamal,
Omid Mohareri
Abstract:
Data-driven approaches to assist operating room (OR) workflow analysis depend on large curated datasets that are time consuming and expensive to collect. On the other hand, we see a recent paradigm shift from supervised learning to self-supervised and/or unsupervised learning approaches that can learn representations from unlabeled datasets. In this paper, we leverage the unlabeled data captured i…
▽ More
Data-driven approaches to assist operating room (OR) workflow analysis depend on large curated datasets that are time consuming and expensive to collect. On the other hand, we see a recent paradigm shift from supervised learning to self-supervised and/or unsupervised learning approaches that can learn representations from unlabeled datasets. In this paper, we leverage the unlabeled data captured in robotic surgery ORs and propose a novel way to fuse the multi-modal data for a single video frame or image. Instead of producing different augmentations (or 'views') of the same image or video frame which is a common practice in self-supervised learning, we treat the multi-modal data as different views to train the model in an unsupervised manner via clustering. We compared our method with other state of the art methods and results show the superior performance of our approach on surgical video activity recognition and semantic segmentation.
△ Less
Submitted 16 July, 2022;
originally announced July 2022.
-
An Empirical Study on Activity Recognition in Long Surgical Videos
Authors:
Zhuohong He,
Ali Mottaghi,
Aidean Sharghi,
Muhammad Abdullah Jamal,
Omid Mohareri
Abstract:
Activity recognition in surgical videos is a key research area for develo** next-generation devices and workflow monitoring systems. Since surgeries are long processes with highly-variable lengths, deep learning models used for surgical videos often consist of a two-stage setup using a backbone and temporal sequence model. In this paper, we investigate many state-of-the-art backbones and tempora…
▽ More
Activity recognition in surgical videos is a key research area for develo** next-generation devices and workflow monitoring systems. Since surgeries are long processes with highly-variable lengths, deep learning models used for surgical videos often consist of a two-stage setup using a backbone and temporal sequence model. In this paper, we investigate many state-of-the-art backbones and temporal models to find architectures that yield the strongest performance for surgical activity recognition. We first benchmark the models performance on a large-scale activity recognition dataset containing over 800 surgery videos captured in multiple clinical operating rooms. We further evaluate the models on the two smaller public datasets, the Cholec80 and Cataract-101 datasets, containing only 80 and 101 videos respectively. We empirically found that Swin-Transformer+BiGRU temporal model yielded strong performance on both datasets. Finally, we investigate the adaptability of the model to new domains by fine-tuning models to a new hospital and experimenting with a recent unsupervised domain adaptation approach.
△ Less
Submitted 6 September, 2022; v1 submitted 5 May, 2022;
originally announced May 2022.
-
Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition from a Domain Adaptation Perspective
Authors:
Muhammad Abdullah Jamal,
Matthew Brown,
Ming-Hsuan Yang,
Liqiang Wang,
Boqing Gong
Abstract:
Object frequency in the real world often follows a power law, leading to a mismatch between datasets with long-tailed class distributions seen by a machine learning model and our expectation of the model to perform well on all classes. We analyze this mismatch from a domain adaptation point of view. First of all, we connect existing class-balanced methods for long-tailed classification to target s…
▽ More
Object frequency in the real world often follows a power law, leading to a mismatch between datasets with long-tailed class distributions seen by a machine learning model and our expectation of the model to perform well on all classes. We analyze this mismatch from a domain adaptation point of view. First of all, we connect existing class-balanced methods for long-tailed classification to target shift, a well-studied scenario in domain adaptation. The connection reveals that these methods implicitly assume that the training data and test data share the same class-conditioned distribution, which does not hold in general and especially for the tail classes. While a head class could contain abundant and diverse training examples that well represent the expected data at inference time, the tail classes are often short of representative training data. To this end, we propose to augment the classic class-balanced learning by explicitly estimating the differences between the class-conditioned distributions with a meta-learning approach. We validate our approach with six benchmark datasets and three loss functions.
△ Less
Submitted 24 March, 2020;
originally announced March 2020.
-
Task-Agnostic Meta-Learning for Few-shot Learning
Authors:
Muhammad Abdullah Jamal,
Guo-Jun Qi,
Mubarak Shah
Abstract:
Meta-learning approaches have been proposed to tackle the few-shot learning problem.Typically, a meta-learner is trained on a variety of tasks in the hopes of being generalizable to new tasks. However, the generalizability on new tasks of a meta-learner could be fragile when it is over-trained on existing tasks during meta-training phase. In other words, the initial model of a meta-learner could b…
▽ More
Meta-learning approaches have been proposed to tackle the few-shot learning problem.Typically, a meta-learner is trained on a variety of tasks in the hopes of being generalizable to new tasks. However, the generalizability on new tasks of a meta-learner could be fragile when it is over-trained on existing tasks during meta-training phase. In other words, the initial model of a meta-learner could be too biased towards existing tasks to adapt to new tasks, especially when only very few examples are available to update the model. To avoid a biased meta-learner and improve its generalizability, we propose a novel paradigm of Task-Agnostic Meta-Learning (TAML) algorithms. Specifically, we present an entropy-based approach that meta-learns an unbiased initial model with the largest uncertainty over the output labels by preventing it from over-performing in classification tasks. Alternatively, a more general inequality-minimization TAML is presented for more ubiquitous scenarios by directly minimizing the inequality of initial losses beyond the classification tasks wherever a suitable loss can be defined.Experiments on benchmarked datasets demonstrate that the proposed approaches outperform compared meta-learning algorithms in both few-shot classification and reinforcement learning tasks.
△ Less
Submitted 20 May, 2018;
originally announced May 2018.