Search | arXiv e-print repository

LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization

Authors: Sheng Liu, Cong Phuoc Huynh, Cong Chen, Maxim Arap, Raffay Hamid

Abstract: We present a simple yet effective self-supervised pre-training method for image harmonization which can leverage large-scale unannotated image datasets. To achieve this goal, we first generate pre-training data online with our Label-Efficient Masked Region Transform (LEMaRT) pipeline. Given an image, LEMaRT generates a foreground mask and then applies a set of transformations to perturb various visual attributes, e.g., defocus blur, contrast, saturation, of the region specified by the generated mask. We then pre-train image harmonization models by recovering the original image from the perturbed image. Secondly, we introduce an image harmonization model, namely SwinIH, by retrofitting the Swin Transformer [27] with a combination of local and global self-attention mechanisms. Pre-training SwinIH with LEMaRT results in a new state of the art for image harmonization, while being label-efficient, i.e., consuming less annotated data for fine-tuning than existing methods. Notably, on iHarmony4 dataset [8], SwinIH outperforms the state of the art, i.e., SCS-Co [16] by a margin of 0.4 dB when it is fine-tuned on only 50% of the training data, and by 1.0 dB when it is trained on the full training dataset.

Submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted by CVPR'23, 19 pages
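The data-generation step described in the LEMaRT abstract (generate a mask, perturb visual attributes inside it, keep the original as the recovery target) can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the rectangular mask, the function name `lemart_style_pair`, the parameter ranges, and the choice of contrast and saturation as the only perturbations are all assumptions made for the example.

```python
import numpy as np

def lemart_style_pair(image, rng, min_frac=0.2, max_frac=0.6):
    """Build one (perturbed, mask, target) triple from a float image in [0, 1], shape (H, W, 3)."""
    h, w, _ = image.shape
    # Hypothetical mask generator: a single random axis-aligned rectangle.
    mh = int(rng.integers(int(min_frac * h), int(max_frac * h) + 1))
    mw = int(rng.integers(int(min_frac * w), int(max_frac * w) + 1))
    top = int(rng.integers(0, h - mh + 1))
    left = int(rng.integers(0, w - mw + 1))
    mask = np.zeros((h, w, 1), dtype=image.dtype)
    mask[top:top + mh, left:left + mw] = 1.0

    # Illustrative perturbations: contrast about the global mean, then saturation about gray.
    contrast = rng.uniform(0.6, 1.4)
    saturation = rng.uniform(0.5, 1.5)
    mean = image.mean()
    out = (image - mean) * contrast + mean
    gray = out.mean(axis=2, keepdims=True)
    out = np.clip((out - gray) * saturation + gray, 0.0, 1.0)

    # Only the masked region is perturbed; the untouched image is the training target.
    perturbed = mask * out + (1.0 - mask) * image
    return perturbed, mask, image
```

A model pre-trained this way would take `perturbed` (and possibly `mask`) as input and be asked to reproduce `image`, so no human annotation is needed.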

arXiv:2303.14526 [pdf, other]

Selective Structured State-Spaces for Long-Form Video Understanding

Authors: Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, Raffay Hamid

Abstract: Effective modeling of complex spatiotemporal dependencies in long-form videos remains an open problem. The recently proposed Structured State-Space Sequence (S4) model with its linear complexity offers a promising direction in this space. However, we demonstrate that treating all image-tokens equally as done by S4 model can adversely affect its efficiency and accuracy. To address this limitation, we present a novel Selective S4 (i.e., S5) model that employs a lightweight mask generator to adaptively select informative image tokens resulting in more efficient and accurate modeling of long-term spatiotemporal dependencies in videos. Unlike previous mask-based token reduction methods used in transformers, our S5 model avoids the dense self-attention calculation by making use of the guidance of the momentum-updated S4 model. This enables our model to efficiently discard less informative tokens and adapt to various long-form video understanding tasks more effectively. However, as is the case for most token reduction methods, the informative image tokens could be dropped incorrectly. To improve the robustness and the temporal horizon of our model, we propose a novel long-short masked contrastive learning (LSMCL) approach that enables our model to predict longer temporal context using shorter input videos. We present extensive comparative results using three challenging long-form video understanding datasets (LVU, COIN and Breakfast), demonstrating that our approach consistently outperforms the previous state-of-the-art S4 model by up to 9.6% accuracy while reducing its memory footprint by 23%.

Submitted 25 March, 2023; originally announced March 2023.

Comments: Accepted by CVPR 2023

arXiv:2206.08429 [pdf, other]

Scalable Temporal Localization of Sensitive Activities in Movies and TV Episodes

Authors: Xiang Hao, Jingxiang Chen, Shixing Chen, Ahmed Saad, Raffay Hamid

Abstract: To help customers make better-informed viewing choices, video-streaming services try to moderate their content and provide more visibility into which portions of their movies and TV episodes contain age-appropriate material (e.g., nudity, sex, violence, or drug-use). Supervised models to localize these sensitive activities require large amounts of clip-level labeled data which is hard to obtain, while weakly-supervised models to this end usually do not offer competitive accuracy. To address this challenge, we propose a novel Coarse2Fine network designed to make use of readily obtainable video-level weak labels in conjunction with sparse clip-level labels of age-appropriate activities. Our model aggregates frame-level predictions to make video-level classifications and is therefore able to leverage sparse clip-level labels along with video-level labels. Furthermore, by performing frame-level predictions in a hierarchical manner, our approach is able to overcome the label-imbalance problem caused due to the rare-occurrence nature of age-appropriate content. We present comparative results of our approach using 41,234 movies and TV episodes (~3 years of video-content) from 521 sub-genres and 250 countries making it by far the largest-scale empirical analysis of age-appropriate activity localization in long-form videos ever published. Our approach offers 107.2% relative mAP improvement (from 5.5% to 11.4%) over existing state-of-the-art activity-localization approaches.

Submitted 16 June, 2022; originally announced June 2022.
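The relative improvement quoted in the abstract above can be checked from the two mAP values it reports. Working from the rounded figures gives

```latex
\[
\frac{11.4 - 5.5}{5.5} = \frac{5.9}{5.5} \approx 1.073
\]
```

that is, roughly a 107% relative gain, which agrees with the stated 107.2%. The small difference in the last digit is presumably due to the mAP values being rounded to one decimal place; that explanation is an assumption, not something the abstract states.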

arXiv:2204.04588 [pdf, other]

Robust Cross-Modal Representation Learning with Progressive Self-Distillation

Authors: Alex Andonian, Shixing Chen, Raffay Hamid

Abstract: The learning objective of vision-language approach of CLIP does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To address this challenge, we introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data. Our model distills its own knowledge to dynamically generate soft-alignment targets for a subset of images and captions in every minibatch, which are then used to update its parameters. Extensive evaluation across 14 benchmark datasets shows that our method consistently outperforms its CLIP counterpart in multiple settings, including: (a) zero-shot classification, (b) linear probe transfer, and (c) image-text retrieval, without incurring added computational cost. Analysis using an ImageNet-based robustness test-bed reveals that our method offers better effective robustness to natural distribution shifts compared to both ImageNet-trained models and CLIP itself. Lastly, pretraining with datasets spanning two orders of magnitude in size shows that our improvements over CLIP tend to scale with number of training examples.

Submitted 9 April, 2022; originally announced April 2022.

Comments: Accepted to CVPR 2022
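The idea of replacing hard one-to-one targets with soft image-text alignments for part of each minibatch, as described in the abstract above, might look roughly like this. The sketch is only an illustration under stated assumptions: the function names, the rule that the first `alpha` fraction of the batch receives soft targets, and the use of a plain softmax over the model's own similarity logits are all hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_alignment_targets(teacher_logits, alpha):
    """Targets for an (N, N) image-to-text similarity matrix.

    Rows keep the usual one-hot (diagonal) target, except for the first
    round(alpha * N) rows, which take the model's own softmax as target.
    """
    n = teacher_logits.shape[0]
    targets = np.eye(n)
    k = int(round(alpha * n))  # hypothetical subset rule
    if k > 0:
        targets[:k] = softmax(teacher_logits[:k], axis=1)
    return targets

def cross_entropy(student_logits, targets):
    """Mean cross-entropy between row-wise softmax(student_logits) and targets."""
    z = student_logits - student_logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-(targets * log_p).sum(axis=1).mean())
```

Under this reading, "progressive" would correspond to raising `alpha` over the course of training, so that the model relies more on its own soft targets as its predictions improve.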

arXiv:2204.02509 [pdf, other]

Depth-Guided Sparse Structure-from-Motion for Movies and TV Shows

Authors: Sheng Liu, Xiaohan Nie, Raffay Hamid

Abstract: Existing approaches for Structure from Motion (SfM) produce impressive 3-D reconstruction results especially when using imagery captured with large parallax. However, to create engaging video-content in movies and TV shows, the amount by which a camera can be moved while filming a particular shot is often limited. The resulting small-motion parallax between video frames makes standard geometry-based SfM approaches not as effective for movies and TV shows. To address this challenge, we propose a simple yet effective approach that uses single-frame depth-prior obtained from a pretrained network to significantly improve geometry-based SfM for our small-parallax setting. To this end, we first use the depth-estimates of the detected keypoints to reconstruct the point cloud and camera-pose for initial two-view reconstruction. We then perform depth-regularized optimization to register new images and triangulate the new points during incremental reconstruction. To comprehensively evaluate our approach, we introduce a new dataset (StudioSfM) consisting of 130 shots with 21K frames from 15 studio-produced videos that are manually annotated by a professional CG studio. We demonstrate that our approach: (a) significantly improves the quality of 3-D reconstruction for our small-parallax setting, (b) does not cause any degradation for data with large-parallax, and (c) maintains the generalizability and scalability of geometry-based sparse SfM. Our dataset can be obtained at https://github.com/amazon-research/small-baseline-camera-tracking.

Submitted 5 April, 2022; originally announced April 2022.

arXiv:2202.10650 [pdf, other]

Movies2Scenes: Using Movie Metadata to Learn Scene Representation

Authors: Shixing Chen, Chun-Hao Liu, Xiang Hao, Xiaohan Nie, Maxim Arap, Raffay Hamid

Abstract: Understanding scenes in movies is crucial for a variety of applications such as video moderation, search, and recommendation. However, labeling individual scenes is a time-consuming process. In contrast, movie level metadata (e.g., genre, synopsis, etc.) regularly gets produced as part of the film production process, and is therefore significantly more commonly available. In this work, we propose a novel contrastive learning approach that uses movie metadata to learn a general-purpose scene representation. Specifically, we use movie metadata to define a measure of movie similarity, and use it during contrastive learning to limit our search for positive scene-pairs to only the movies that are considered similar to each other. Our learned scene representation consistently outperforms existing state-of-the-art methods on a diverse set of tasks evaluated using multiple benchmark datasets. Notably, our learned representation offers an average improvement of 7.9% on the seven classification tasks and 9.7% improvement on the two regression tasks in LVU dataset. Furthermore, using a newly collected movie dataset, we present comparative results of our scene representation on a set of video moderation tasks to demonstrate its generalizability on previously less explored tasks.

Submitted 29 March, 2023; v1 submitted 21 February, 2022; originally announced February 2022.

Comments: Accepted to CVPR 2023

arXiv:2105.08506 [pdf, other]

COVID-19 Detection in Computed Tomography Images with 2D and 3D Approaches

Authors: Sara Atito Ali Ahmed, Mehmet Can Yavuz, Mehmet Umut Sen, Fatih Gulsen, Onur Tutar, Bora Korkmazer, Cesur Samanci, Sabri Sirolu, Rauf Hamid, Ali Ergun Eryurekli, Toghrul Mammadov, Berrin Yanikoglu

Abstract: Detecting COVID-19 in computed tomography (CT) or radiography images has been proposed as a supplement to the definitive RT-PCR test. We present a deep learning ensemble for detecting COVID-19 infection, combining slice-based (2D) and volume-based (3D) approaches. The 2D system detects the infection on each CT slice independently, combining them to obtain the patient-level decision via different methods (averaging and long-short term memory networks). The 3D system takes the whole CT volume to arrive at the patient-level decision in one step. A new high resolution chest CT scan dataset, called the IST-C dataset, is also collected in this work. The proposed ensemble, called IST-CovNet, obtains 90.80% accuracy and 0.95 AUC score overall on the IST-C dataset in detecting COVID-19 among normal controls and other types of lung pathologies; and 93.69% accuracy and 0.99 AUC score on the publicly available MosMed dataset that consists of COVID-19 scans and normal controls only. The system is deployed at Istanbul University Cerrahpasa School of Medicine.

Submitted 20 May, 2021; v1 submitted 16 May, 2021; originally announced May 2021.

arXiv:2104.13537 [pdf, other]

Shot Contrastive Self-Supervised Learning for Scene Boundary Detection

Authors: Shixing Chen, Xiaohan Nie, David Fan, Dongqing Zhang, Vimal Bhat, Raffay Hamid

Abstract: Scenes play a crucial role in breaking the storyline of movies and TV episodes into semantically cohesive parts. However, given their complex temporal structure, finding scene boundaries can be a challenging task requiring large amounts of labeled training data. To address this challenge, we present a self-supervised shot contrastive learning approach (ShotCoL) to learn a shot representation that maximizes the similarity between nearby shots compared to randomly selected shots. We show how to apply our learned shot representation for the task of scene boundary detection to offer state-of-the-art performance on the MovieNet dataset while requiring only ~25% of the training labels, using 9x fewer model parameters and offering 7x faster runtime. To assess the effectiveness of ShotCoL on novel applications of scene boundary detection, we take on the problem of finding timestamps in movies and TV episodes where video-ads can be inserted while offering a minimally disruptive viewing experience. To this end, we collected a new dataset called AdCuepoints with 3,975 movies and TV episodes, 2.2 million shots and 19,119 minimally disruptive ad cue-point labels. We present a thorough empirical analysis on this dataset demonstrating the effectiveness of ShotCoL for ad cue-points detection.

Submitted 27 April, 2021; originally announced April 2021.

Comments: Accepted to CVPR 2021
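The objective described in the ShotCoL abstract, making a shot's representation more similar to a nearby shot than to randomly selected shots, is commonly expressed with an InfoNCE-style contrastive loss. The sketch below uses that standard loss as a stand-in; the function name `info_nce`, the temperature value, and the way the nearby shot is chosen are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query shot embedding.

    query:     (D,) embedding of the current shot
    positive:  (D,) embedding of a nearby shot
    negatives: (K, D) embeddings of randomly selected shots
    """
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    q, p, n = unit(query), unit(positive), unit(negatives)
    # Cosine similarity to the positive comes first, then to each negative.
    logits = np.concatenate(([q @ p], n @ q)) / temperature
    logits -= logits.max()  # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))
```

Minimizing this loss pulls `query` toward `positive` and pushes it away from `negatives`, which matches the "nearby versus random" contrast in the abstract.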

arXiv:1906.11495 [pdf, other]

Guidelines for developing optical clocks with $10^{-18}$ fractional frequency uncertainty

Authors: Moustafa Abdel-Hafiz, Piotr Ablewski, Ali Al-Masoudi, Héctor Álvarez Martínez, Petr Balling, Geoffrey Barwood, Erik Benkler, Marcin Bober, Mateusz Borkowski, William Bowden, Roman Ciuryło, Hubert Cybulski, Alexandre Didier, Miroslav Doležal, Sören Dörscher, Stephan Falke, Rachel M. Godun, Ramiz Hamid, Ian R. Hill, Richard Hobson, Nils Huntemann, Yann Le Coq, Rodolphe Le Targat, Thomas Legero, Thomas Lindvall , et al. (20 additional authors not shown)

Abstract: There has been tremendous progress in the performance of optical frequency standards since the first proposals to carry out precision spectroscopy on trapped, single ions in the 1970s. The estimated fractional frequency uncertainty of today's leading optical standards is currently in the $10^{-18}$ range, approximately two orders of magnitude better than that of the best caesium primary frequency standards. This exceptional accuracy and stability is resulting in a growing number of research groups developing optical clocks. While good review papers covering the topic already exist, more practical guidelines are needed as a complement. The purpose of this document is therefore to provide technical guidance for researchers starting in the field of optical clocks. The target audience includes national metrology institutes (NMIs) wanting to set up optical clocks (or subsystems thereof) and PhD students and postdocs entering the field. Another potential audience is academic groups with experience in atomic physics and atom or ion trapping, but with less experience of time and frequency metrology and optical clock requirements. These guidelines have arisen from the scope of the EMPIR project "Optical clocks with $1 \times 10^{-18}$ uncertainty" (OC18). Therefore, the examples are from European laboratories even though similar work is carried out all over the world. The goal of OC18 was to push the development of optical clocks by improving each of the necessary subsystems: ultrastable lasers, neutral-atom and single-ion traps, and interrogation techniques. This document shares the knowledge acquired by the OC18 project consortium and gives practical guidance on each of these aspects.

Submitted 13 August, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

Comments: 130 pages, Table 5.3 corrected in v2

arXiv:1905.01116 [pdf, other]

doi 10.1103/PhysRevApplied.10.024027

Tailored design of mode-locking dynamics for low-noise frequency comb generation

Authors: Çağrı Şenel, Ramiz Hamid, Cihangir Erdoğan, Mehmet Çelik, Fatih Őmer Ilday

Abstract: We report on a mode-locked laser design using Yb-doped fiber lasers for low-noise frequency-comb generation. The frequency comb covers the spectral range from $700$ to $1400$ nm. Although this range is more practical for many measurements than that produced by the more commonly used Er-fiber lasers, it has been addressed in only a handful of reports, mainly due to the difficulty of generating a fully coherent supercontinuum at $1\;μ$m. We overcome this difficulty by a tailored design of the mode-locking dynamics that succeeds in generating energetic 33 fs-long pulses without even using higher-order dispersion compensation, while ensuring that the laser operates with net zero cavity dispersion for low-noise supercontinuum generation. After locking to a Cs atomic clock, this frequency comb is used for absolute-frequency measurements of a Nd:YAG-I$_2$ laser to verify its accuracy by comparison with results from the International Committee for Weights and Measures. After this verification, it is further used to measure the absolute frequency of a 543-nm two-mode stabilized He-Ne laser, which is routinely used for length measurements in our institute, thus verifying its practical utility in metrology applications. The entire setup is built with readily available components for easy duplication by other researchers.

Submitted 3 May, 2019; originally announced May 2019.

Journal ref: Phys. Rev. Applied, vol. 10, no. 2, pp. 024027, 2018

arXiv:1404.5351 [pdf, other]

Fast Approximate Matching of Cell-Phone Videos for Robust Background Subtraction

Authors: Raffay Hamid, Atish Das Sarma, Dennis DeCoste, Neel Sundaresan

Abstract: We identify a novel instance of the background subtraction problem that focuses on extracting near-field foreground objects captured using handheld cameras. Given two user-generated videos of a scene, one with and the other without the foreground object(s), our goal is to efficiently generate an output video with only the foreground object(s) present in it. We cast this challenge as a spatio-temporal frame matching problem, and propose an efficient solution for it that exploits the temporal smoothness of the video sequences. We present theoretical analyses for the error bounds of our approach, and validate our findings using a detailed set of simulation experiments. Finally, we present the results of our approach tested on multiple real videos captured using handheld cameras, and compare them to several alternate foreground extraction approaches.

Submitted 21 April, 2014; originally announced April 2014.

arXiv:1404.0466 [pdf, other]

piCholesky: Polynomial Interpolation of Multiple Cholesky Factors for Efficient Approximate Cross-Validation

Authors: Da Kuang, Alex Gittens, Raffay Hamid

Abstract: The dominant cost in solving least-square problems using Newton's method is often that of factorizing the Hessian matrix over multiple values of the regularization parameter ($λ$). We propose an efficient way to interpolate the Cholesky factors of the Hessian matrix computed over a small set of $λ$ values. This approximation enables us to optimally minimize the hold-out error while incurring only a fraction of the cost compared to exact cross-validation. We provide a formal error bound for our approximation scheme and present solutions to a set of key implementation challenges that allow our approach to maximally exploit the compute power of modern architectures. We present a thorough empirical analysis over multiple datasets to show the effectiveness of our approach.

Submitted 10 June, 2015; v1 submitted 2 April, 2014; originally announced April 2014.
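The core idea in the piCholesky abstract, computing exact Cholesky factors at a few values of $λ$ and interpolating them for the rest, can be illustrated with a small NumPy sketch. The entry-wise least-squares polynomial fit, the function name `interpolate_cholesky`, and the default degree are assumptions chosen for clarity; the paper's actual scheme and its implementation optimizations are not reproduced here.

```python
import numpy as np

def interpolate_cholesky(H, lambdas_fit, lambdas_query, degree=2):
    """Approximate chol(H + lam * I) at lambdas_query from exact factors at lambdas_fit."""
    d = H.shape[0]
    eye = np.eye(d)
    # Exact lower-triangular factors at the sample lambdas, one flattened factor per row.
    factors = np.stack(
        [np.linalg.cholesky(H + lam * eye).ravel() for lam in lambdas_fit]
    )
    # Fit one polynomial in lambda per matrix entry, all entries in a single solve.
    V = np.vander(np.asarray(lambdas_fit, dtype=float), degree + 1)
    coeffs, *_ = np.linalg.lstsq(V, factors, rcond=None)
    # Evaluate the fitted polynomials at the query lambdas.
    Vq = np.vander(np.asarray(lambdas_query, dtype=float), degree + 1)
    return (Vq @ coeffs).reshape(len(lambdas_query), d, d)
```

In a cross-validation loop, the interpolated factor for each candidate $λ$ would be used in place of a fresh factorization when solving for the model weights, so the expensive `cholesky` call runs only at the few sample values.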

arXiv:1312.4626 [pdf, other]

Compact Random Feature Maps

Authors: Raffay Hamid, Ying Xiao, Alex Gittens, Dennis DeCoste

Abstract: Kernel approximation using randomized feature maps has recently gained a lot of interest. In this work, we identify that previous approaches for polynomial kernel approximation create maps that are rank deficient, and therefore do not utilize the capacity of the projected feature space effectively. To address this challenge, we propose compact random feature maps (CRAFTMaps) to approximate polynomial kernels more concisely and accurately. We prove the error bounds of CRAFTMaps demonstrating their superior kernel reconstruction performance compared to the previous approximation schemes. We show how structured random matrices can be used to efficiently generate CRAFTMaps, and present a single-pass algorithm using CRAFTMaps to learn non-linear multi-class classifiers. We present experiments on multiple standard data-sets with performance competitive with state-of-the-art results.

Submitted 16 December, 2013; originally announced December 2013.

Comments: 9 pages

arXiv:1212.3834 [pdf]

Coherent Population Trapping resonances on lower atomic levels of Doppler broadened optical lines

Authors: Ersoy Sahin, Gonul Ozen, Ramiz Hamid, Mehmet Celik, Azad Ch. Izmailov

Abstract: We have detected and analysed narrow high-contrast coherent population trapping (CPT) resonances, which are induced in absorption of the weak probe light beam by the counterpropagating two-frequency pumping radiation. Our experimental investigations have been carried out on example of nonclosed three level Lambda systems formed by spectral components of the Doppler broadened D2 line of cesium atoms. We have established that CPT resonances in transmission of the probe beam (in the cesium vapor), at definite conditions, may have not only more contrast but also much lesser width in comparison with well-known CPT resonances in transmission of the corresponding two-frequency pumping radiation. Thus CPT resonances, detected by the elaborated method, may be used in atomic frequency standards and sensitive magnetometers (based on the CPT phenomenon) and also in ultrahigh resolution spectroscopy of atoms and molecules.

Submitted 16 December, 2012; originally announced December 2012.

arXiv:1201.5250 [pdf]

doi 10.1134/S1054660X12060096

High contrast resonances of the coherent population trapping on sublevels of the ground atomic term

Authors: Ersoy Sahin, Ramiz Hamid, Cengiz Birlikseven, Gönül Özen, Azad Ch. Izmailov

Abstract: We have detected and analyzed narrow, high contrast coherent population trapping resonances, which appear in transmission of the probe monochromatic light beam under action of the counterpropagating two-frequency laser radiation, on example of the nonclosed three level Λ-system formed by spectral components of the Doppler broadened D2 line of cesium atoms (in the cell with the rarefied Cs vapor). These nontrivial resonances are determined directly by the trapped atomic population on the definite lower level of the Λ-system and may be used in atomic frequency standards, sensitive magnetometers and in ultrahigh resolution laser spectroscopy of atoms and molecules.

Submitted 25 January, 2012; originally announced January 2012.

Comments: 4

Showing 1–15 of 15 results for author: Hamid, R