-
Cavity Induced Extraordinary Optical Transmission and Active Modulation with Graphene
Authors:
Yifei Zhang,
Baoqing Zhang,
Mingming Feng,
Haotian Ling,
Xijian Zhang,
Yiming Wang,
Xiaomu Wang,
Qingpu Wang,
Aimin Song
Abstract:
Extraordinary optical transmission (EOT) is a phenomenon of exceptional light transmission through a metallic film with hole arrays, enhanced by surface plasmon (SP) resonance, which has stimulated renewed research interest in metamaterials, subwavelength optics, and plasmonics. Below the frequency of the first-order SP mode, f_pl0, the metallic film typically shows strong reflection and no EOT. Here, we report an unusual EOT phenomenon below f_pl0, i.e., beyond the long-held spectral boundary of classic EOTs. It is induced by a novel bound surface state in a Fabry-Perot (F-P) cavity comprising a holey gold film and a silicon-air interface. By tailoring the cavity length, the EOT phenomenon has been pushed deep into the sub-wavelength region by as much as 20%, and an EOT frequency comb with cavity function has been achieved. Due to the enhanced slow-wave effect as the frequency approaches f_pl0, the cavity-induced EOT gradually merges with the first-order SP EOT. Unlike in the classic EOT phenomenon, no transmission zero is found between these two EOTs, which dramatically broadens the EOT bandwidth by a factor of 10 at terahertz (THz) frequencies. Furthermore, the EOT transmittance is actively modulated with graphene, achieving a large modulation range from 0.5 to 0.25 under a sub-volt bias from -0.3 to 0.5 V at 500 GHz. To the best of the authors' knowledge, both the modulation range and the low bias are among the best for active EOT devices with graphene to date. Such a structure provides a new strategy for miniaturizing sensing devices, high-power sources, and broadband photonics as well as their active control in the THz regime.
Submitted 16 December, 2021;
originally announced December 2021.
-
Simple Coding Techniques for Many-Hop Relaying
Authors:
Yan Hao Ling,
Jonathan Scarlett
Abstract:
In this paper, we study the problem of relaying a single bit of information across a series of binary symmetric channels, and the associated trade-off between the number of hops $m$, the transmission time $n$, and the error probability. We introduce a simple, efficient, and deterministic protocol that attains positive information velocity (i.e., a non-vanishing ratio $\frac{m}{n}$ and small error probability) and is significantly simpler than existing protocols that do so. In addition, we characterize the optimal low-noise and high-noise scaling laws of the information velocity, and we adapt our 1-bit protocol to transmit $k$ bits over $m$ hops with $O(m+k)$ transmission time.
Submitted 7 December, 2022; v1 submitted 13 December, 2021;
originally announced December 2021.
-
Multi-Content Complementation Network for Salient Object Detection in Optical Remote Sensing Images
Authors:
Gongyang Li,
Zhi Liu,
Weisi Lin,
Haibin Ling
Abstract:
In the computer vision community, great progress has been achieved in salient object detection from natural scene images (NSI-SOD); by contrast, salient object detection in optical remote sensing images (RSI-SOD) remains a challenging emerging topic. The unique characteristics of optical RSIs, such as scales, illuminations and imaging orientations, bring significant differences between NSI-SOD and RSI-SOD. In this paper, we propose a novel Multi-Content Complementation Network (MCCNet) to explore the complementarity of multiple types of content for RSI-SOD. Specifically, MCCNet is based on the general encoder-decoder architecture, and contains a novel key component named Multi-Content Complementation Module (MCCM), which bridges the encoder and the decoder. In MCCM, we consider multiple types of features that are critical to RSI-SOD, including foreground features, edge features, background features, and global image-level features, and exploit the content complementarity between them to highlight salient regions over various scales in RSI features through the attention mechanism. Besides, we comprehensively introduce pixel-level, map-level and metric-aware losses in the training phase. Extensive experiments on two popular datasets demonstrate that the proposed MCCNet outperforms 23 state-of-the-art methods, including both NSI-SOD and RSI-SOD methods. The code and results of our method are available at https://github.com/MathLee/MCCNet.
Submitted 1 December, 2021;
originally announced December 2021.
-
SwinTrack: A Simple and Strong Baseline for Transformer Tracking
Authors:
Liting Lin,
Heng Fan,
Zhipeng Zhang,
Yong Xu,
Haibin Ling
Abstract:
Recently, Transformers have been widely explored in tracking and have shown state-of-the-art (SOTA) performance. However, existing efforts mainly focus on fusing and enhancing features generated by convolutional neural networks (CNNs). The potential of the Transformer in representation learning remains under-explored. In this paper, we aim to further unleash the power of the Transformer by proposing a simple yet efficient fully-attentional tracker, dubbed SwinTrack, within the classic Siamese framework. In particular, both representation learning and feature fusion in SwinTrack leverage the Transformer architecture, enabling better feature interactions for tracking than pure CNN or hybrid CNN-Transformer frameworks. Besides, to further enhance robustness, we present a novel motion token that embeds the historical target trajectory to improve tracking by providing temporal context. Our motion token is lightweight with negligible computation but brings clear gains. In our thorough experiments, SwinTrack exceeds existing approaches on multiple benchmarks. Particularly, on the challenging LaSOT, SwinTrack sets a new record with a 0.713 SUC score. It also achieves SOTA results on other benchmarks. We expect SwinTrack to serve as a solid baseline for Transformer tracking and facilitate future research. Our code and results are released at https://github.com/LitingLin/SwinTrack.
Submitted 13 October, 2022; v1 submitted 2 December, 2021;
originally announced December 2021.
-
Searching the Search Space of Vision Transformer
Authors:
Minghao Chen,
Kan Wu,
Bolin Ni,
Houwen Peng,
Bei Liu,
Jianlong Fu,
Hongyang Chao,
Haibin Ling
Abstract:
Vision Transformer has shown great visual representation power in a wide range of vision tasks such as recognition and detection, and has thus attracted fast-growing efforts on manually designing more effective architectures. In this paper, we propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space. The central idea is to gradually evolve different search dimensions guided by their E-T Error computed using a weight-sharing supernet. Moreover, we provide design guidelines for general vision transformers with extensive analysis according to the space searching process, which could promote the understanding of vision transformers. Remarkably, the searched models, named S3 (short for Searching the Search Space), from the searched space achieve superior performance to recently proposed models, such as Swin, DeiT and ViT, when evaluated on ImageNet. The effectiveness of S3 is also illustrated on object detection, semantic segmentation and visual question answering, demonstrating its generality to downstream vision and vision-language tasks. Code and models will be available at https://github.com/microsoft/Cream.
Submitted 29 November, 2021;
originally announced November 2021.
-
Sweeping Plasma Frequency of Terahertz Surface Plasmon Polaritons with Graphene
Authors:
Mingming Feng,
Baoqing Zhang,
Haotian Ling,
Zihao Zhang,
Yiming Wang,
Yilin Wang,
Xijian Zhang,
Jinrang Hua,
Qingpu Wang,
Aimin Song,
Yifei Zhang
Abstract:
Plasma frequency is the spectral boundary between low-loss propagation and evanescent decay of surface plasmon polariton (SPP) waves, which corresponds to a high cut-off phenomenon and is typically utilized for identifying SPPs. At terahertz (THz) frequencies, a metal line with periodic metallic grooves can mimic the conventional optical SPPs, which is referred to as designer SPPs. Theoretically, the plasma frequency of THz SPPs decreases as the groove depth increases. Here, by replacing the metallic grooves with graphene sheets, dynamic sweeping of the SPP plasma frequency is demonstrated for the first time. The metal-graphene hybrid structure comprises a metal line with periodic graphene grooves, a thin-layer ion gel for gating graphene, and metallic tips for making the gate field uniform. As the chemical potential changes, the average conductivity of graphene is modulated so that the effective depth of the graphene grooves changes, which consequently sweeps the plasma frequency of THz SPPs. Both simulated and experimental data demonstrate a red shift of plasma frequency from 195 to 180 GHz at a low bias from -0.5 to 0.5 V. The proposed structure reveals a novel approach to control the on/off status of SPP propagation in the THz range.
Submitted 15 November, 2021;
originally announced November 2021.
-
EditGAN: High-Precision Semantic Image Editing
Authors:
Huan Ling,
Karsten Kreis,
Daiqing Li,
Seung Wook Kim,
Antonio Torralba,
Sanja Fidler
Abstract:
Generative adversarial networks (GANs) have recently found applications in image editing. However, most GAN-based image editing methods require large-scale datasets with semantic segmentation annotations for training, only provide high-level control, or merely interpolate between different images. Here, we propose EditGAN, a novel method for high-quality, high-precision semantic image editing, allowing users to edit images by modifying their highly detailed part segmentation masks, e.g., drawing a new mask for the headlight of a car. EditGAN builds on a GAN framework that jointly models images and their semantic segmentations, requiring only a handful of labeled examples, making it a scalable tool for editing. Specifically, we embed an image into the GAN latent space and perform conditional latent code optimization according to the segmentation edit, which effectively also modifies the image. To amortize optimization, we find editing vectors in latent space that realize the edits. The framework allows us to learn an arbitrary number of editing vectors, which can then be directly applied on other images at interactive rates. We experimentally show that EditGAN can manipulate images with an unprecedented level of detail and freedom, while preserving full image quality. We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data. We demonstrate EditGAN on a wide variety of image types and quantitatively outperform several previous editing methods on standard editing benchmark tasks.
Submitted 4 November, 2021;
originally announced November 2021.
-
Osteoporosis Prescreening using Panoramic Radiographs through a Deep Convolutional Neural Network with Attention Mechanism
Authors:
Heng Fan,
Jiaxiang Ren,
Jie Yang,
Yi-Xian Qin,
Haibin Ling
Abstract:
Objectives. The aim of this study was to investigate whether a deep convolutional neural network (CNN) with an attention module can detect osteoporosis on panoramic radiographs.
Study Design. A dataset of 70 panoramic radiographs (PRs) from 70 different subjects aged between 49 and 60 was used, including 49 subjects with osteoporosis and 21 normal subjects. We utilized the leave-one-out cross-validation approach to generate 70 training and test splits. Specifically, for each split, one image was used for testing and the remaining 69 images were used for training. A deep convolutional neural network (CNN) using the Siamese architecture was implemented through a fine-tuning process to classify a PR image using patches extracted from eight representative trabecular bone areas (Figure 1). In order to automatically learn the importance of different PR patches, an attention module was integrated into the deep CNN. Three metrics, including osteoporosis accuracy (OPA), non-osteoporosis accuracy (NOPA) and overall accuracy (OA), were utilized for performance evaluation.
Results. The proposed baseline CNN approach achieved the OPA, NOPA and OA scores of 0.667, 0.878 and 0.814, respectively. With the help of the attention module, the OPA, NOPA and OA scores were further improved to 0.714, 0.939 and 0.871, respectively.
Conclusions. The proposed method obtained promising results using deep CNN with an attention module, which might be applied to osteoporosis prescreening.
Submitted 18 October, 2021;
originally announced October 2021.
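The leave-one-out protocol and the three accuracy metrics named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the metric definitions (OPA and NOPA as per-class accuracy, OA as overall accuracy, with label 1 standing for osteoporosis) are assumptions read off the metric names.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_index) pairs, one split per sample."""
    for test_idx in range(n_samples):
        train_idx = [i for i in range(n_samples) if i != test_idx]
        yield train_idx, test_idx


def per_class_accuracy(labels, predictions):
    """Return (OPA, NOPA, OA) for binary labels.

    Assumption: label 1 = osteoporosis, label 0 = non-osteoporosis.
    """
    pos = [p for y, p in zip(labels, predictions) if y == 1]
    neg = [p for y, p in zip(labels, predictions) if y == 0]
    opa = sum(p == 1 for p in pos) / len(pos)   # accuracy on osteoporosis cases
    nopa = sum(p == 0 for p in neg) / len(neg)  # accuracy on normal cases
    oa = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
    return opa, nopa, oa


# 70 images give 70 splits, each with 69 training images and 1 test image.
splits = list(leave_one_out_splits(70))
```

In such a setup, the single held-out prediction from each split would be collected into one list of 70 predictions before the metrics are computed.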
-
Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization
Authors:
Tao Sun,
Huaming Ling,
Zuoqiang Shi,
Dongsheng Li,
Bao Wang
Abstract:
Heavy ball momentum is crucial in accelerating (stochastic) gradient-based optimization algorithms for machine learning. Existing heavy ball momentum is usually weighted by a uniform hyperparameter, which relies on excessive tuning. Moreover, the calibrated fixed hyperparameter may not lead to optimal performance. In this paper, to eliminate the effort for tuning the momentum-related hyperparameter, we propose a new adaptive momentum inspired by the optimal choice of the heavy ball momentum for quadratic optimization. Our proposed adaptive heavy ball momentum can improve stochastic gradient descent (SGD) and Adam. SGD and Adam with the newly designed adaptive momentum are more robust to large learning rates, converge faster, and generalize better than the baselines. We verify the efficiency of SGD and Adam with the new adaptive momentum on extensive machine learning benchmarks, including image classification, language modeling, and machine translation. Finally, we provide convergence guarantees for SGD and Adam with the proposed adaptive momentum.
Submitted 18 October, 2021;
originally announced October 2021.
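As background for the entry above, the sketch below runs the classical heavy ball update on a diagonal quadratic, using the fixed step size and momentum that are optimal when the smallest and largest eigenvalues are known. This is the textbook (Polyak) setting the abstract cites as inspiration, not the adaptive rule the paper proposes; the function and variable names are illustrative.

```python
def heavy_ball_quadratic(eigs, x0, steps):
    """Minimize f(x) = 0.5 * sum(eigs[i] * x[i]**2) with heavy ball momentum.

    Uses the classical fixed parameters for a quadratic whose Hessian
    eigenvalues lie in [mu, L]:
        alpha = 4 / (sqrt(L) + sqrt(mu))**2
        beta  = ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))**2
    """
    mu, L = min(eigs), max(eigs)
    alpha = 4.0 / (L ** 0.5 + mu ** 0.5) ** 2
    beta = ((L ** 0.5 - mu ** 0.5) / (L ** 0.5 + mu ** 0.5)) ** 2

    x_prev = list(x0)  # start with zero initial velocity
    x = list(x0)
    for _ in range(steps):
        grad = [lam * xi for lam, xi in zip(eigs, x)]
        # x_{k+1} = x_k - alpha * grad f(x_k) + beta * (x_k - x_{k-1})
        x_next = [xi - alpha * g + beta * (xi - xp)
                  for xi, g, xp in zip(x, grad, x_prev)]
        x_prev, x = x, x_next
    return x, alpha, beta


# Condition number 100: eigenvalues 1 and 100.
x_final, alpha, beta = heavy_ball_quadratic([1.0, 100.0], [1.0, 1.0], 200)
```

The fixed `beta` here depends on the extreme eigenvalues, which are rarely known in deep learning; that dependence is the tuning burden the abstract says an adaptive momentum is meant to remove.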
-
Topological study of a Bogoliubov-de Gennes system of pseudo spin-$1/2$ bosons with conserved magnetization in a honeycomb lattice
Authors:
Hong Y. Ling,
Ben Kain
Abstract:
We consider a Bogoliubov-de Gennes (BdG) Hamiltonian, which is a non-Hermitian Hamiltonian with pseudo-Hermiticity, for a system of (pseudo) spin-$1/2$ bosons in a honeycomb lattice under the condition that the population difference between the two spin components, i.e., magnetization, is a constant. Such a system is capable of acting as a topological amplifier, under time-reversal symmetry, with stable bulk bands but unstable edge modes which can be populated at an exponentially fast rate. We quantitatively study the topological properties of this model within the framework of the 38-fold way for non-Hermitian systems. We find, through the symmetry analysis of the Bloch Hamiltonian, that this model is classified either as two copies of symmetry class AIII+$η_-$ or two copies of symmetry class A+$η$ depending on whether the (total) system is time-reversal-symmetric, where $η$ is the matrix representing pseudo-Hermiticity and $η_-$ indicates that pseudo-Hermiticity and chiral symmetry operators anticommute. We prove, within the context of non-Hermitian physics where eigenstates obey the bi-orthonormality relation, that a stable bulk is characterized by a single topological invariant, the Chern number for the Haldane model, independent of pairing interactions. We construct a convenient analytical description for the edge modes of the Haldane model in semi-infinite planes, which is expected to be useful for models built upon copies of the Haldane model across a broad array of disciplines. We adapt the theorem in our recent work [Phys. Rev. A 104, 013305 (2021)] to pseudo-Hermitian Hamiltonians that are less restrictive than BdG Hamiltonians and apply it to highlight that the vanishing of an unconventional commutator between number-conserving and number-nonconserving parts of the Hamiltonian indicates whether a system can be made to act as a topological amplifier.
Submitted 8 June, 2022; v1 submitted 6 October, 2021;
originally announced October 2021.
-
Deep Learning Approach Protecting Privacy in Camera-Based Critical Applications
Authors:
Gautham Ramajayam,
Tao Sun,
Chiu C. Tan,
Lannan Luo,
Haibin Ling
Abstract:
Many critical applications rely on cameras to capture video footage for analytical purposes. This has led to concerns about these cameras accidentally capturing more information than is necessary. In this paper, we propose a deep learning approach towards protecting privacy in camera-based systems. Instead of specifying which objects (e.g., faces) are privacy-sensitive, our technique distinguishes between salient (visually prominent) and non-salient objects, based on the intuition that the latter are unlikely to be needed by the application.
Submitted 4 October, 2021;
originally announced October 2021.
-
Joint Graph Learning and Matching for Semantic Feature Correspondence
Authors:
He Liu,
Tao Wang,
Yidong Li,
Congyan Lang,
Yi Jin,
Haibin Ling
Abstract:
In recent years, powered by the learned discriminative representation via graph neural network (GNN) models, deep graph matching methods have made great progress in the task of matching semantic features. However, these methods usually rely on heuristically generated graph patterns, which may introduce unreliable relationships that hurt the matching performance. In this paper, we propose a joint graph learning and matching network, named GLAM, to explore reliable graph structures for boosting graph matching. GLAM adopts a pure attention-based framework for both graph learning and graph matching. Specifically, it employs two types of attention mechanisms, self-attention and cross-attention, for the task. The self-attention discovers the relationships between features and further updates feature representations over the learnt structures, and the cross-attention computes cross-graph correlations between the two feature sets to be matched for feature reconstruction. Moreover, the final matching solution is directly derived from the output of the cross-attention layer, without employing a specific matching decision module. The proposed method is evaluated on three popular visual matching benchmarks (Pascal VOC, Willow Object and SPair-71k), and it outperforms previous state-of-the-art graph matching methods by significant margins on all benchmarks. Furthermore, the graph patterns learnt by our model are validated to be able to remarkably enhance previous deep graph matching methods by replacing their handcrafted graph structures with the learnt ones.
Submitted 17 November, 2021; v1 submitted 1 September, 2021;
originally announced September 2021.
-
AGKD-BML: Defense Against Adversarial Attack by Attention Guided Knowledge Distillation and Bi-directional Metric Learning
Authors:
Hong Wang,
Yuefan Deng,
Shinjae Yoo,
Haibin Ling,
Yuewei Lin
Abstract:
While deep neural networks have shown impressive performance in many tasks, they are fragile to carefully designed adversarial attacks. We propose a novel adversarial training-based model by Attention Guided Knowledge Distillation and Bi-directional Metric Learning (AGKD-BML). The attention knowledge is obtained from a weight-fixed model trained on a clean dataset, referred to as a teacher model, and transferred to a model that is under training on adversarial examples (AEs), referred to as a student model. In this way, the student model is able to focus on the correct region and to correct the intermediate features corrupted by AEs, eventually improving the model accuracy. Moreover, to efficiently regularize the representation in feature space, we propose bidirectional metric learning. Specifically, given a clean image, it is first attacked to its most confusing class to get the forward AE. A clean image in the most confusing class is then randomly picked and attacked back to the original class to get the backward AE. A triplet loss is then used to shorten the representation distance between the original image and its AE, while enlarging that between the forward and backward AEs. We conduct extensive adversarial robustness experiments on two widely used datasets with different attacks. Our proposed AGKD-BML model consistently outperforms the state-of-the-art approaches. The code of AGKD-BML will be available at: https://github.com/hongw579/AGKD-BML.
Submitted 12 August, 2021;
originally announced August 2021.
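The triplet loss mentioned in the abstract above can be illustrated with the standard margin-based form. This is a generic sketch, not the AGKD-BML implementation: the Euclidean distance, the margin value, and the suggested role assignment (forward AE as anchor, original clean image as positive, backward AE as negative) are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    Minimizing it pulls the anchor toward the positive and pushes it
    away from the negative until the gap exceeds the margin.
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


# Hypothetical 2-D features: anchor is close to the positive, far from the negative.
loss_satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: anchor is far from the positive, close to the negative.
loss_violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute gradient during training.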
-
RINDNet: Edge Detection for Discontinuity in Reflectance, Illumination, Normal and Depth
Authors:
Mengyang Pu,
Yaping Huang,
Qingji Guan,
Haibin Ling
Abstract:
As a fundamental building block in computer vision, edges can be categorised into four types according to the discontinuity in surface-Reflectance, Illumination, surface-Normal or Depth. While great progress has been made in detecting generic or individual types of edges, studying all four edge types together remains under-explored. In this paper, we propose a novel neural network solution, RINDNet, to jointly detect all four types of edges. Taking into consideration the distinct attributes of each type of edge and the relationship between them, RINDNet learns effective representations for each of them and works in three stages. In stage I, RINDNet uses a common backbone to extract features shared by all edges. Then in stage II it branches to prepare discriminative features for each edge type by the corresponding decoder. In stage III, an independent decision head for each type aggregates the features from previous stages to predict the initial results. Additionally, an attention module learns attention maps for all types to capture the underlying relations between them, and these maps are combined with initial results to generate the final edge detection results. For training and evaluation, we construct the first public benchmark, BSDS-RIND, with all four types of edges carefully annotated. In our experiments, RINDNet yields promising results in comparison with state-of-the-art methods. Additional analysis is presented in supplementary material.
Submitted 1 August, 2021;
originally announced August 2021.
-
From Monopoly to Competition: Optimal Contests Prevail
Authors:
Xiaotie Deng,
Yotam Gafni,
Ron Lavi,
Tao Lin,
Hongyi Ling
Abstract:
We study competition among contests in a general model that allows for an arbitrary and heterogeneous space of contest design, where the goal of the contest designers is to maximize the contestants' sum of efforts. Our main result shows that optimal contests in the monopolistic setting (i.e., those that maximize the sum of efforts in a model with a single contest) form an equilibrium in the model with competition among contests. Under a very natural assumption these contests are in fact dominant, and the equilibria that they form are unique. Moreover, equilibria with the optimal contests are Pareto-optimal even in cases where other equilibria emerge. In many natural cases, they also maximize the social welfare.
Submitted 28 July, 2021;
originally announced July 2021.
-
VisDrone-CC2020: The Vision Meets Drone Crowd Counting Challenge Results
Authors:
Dawei Du,
Longyin Wen,
Pengfei Zhu,
Heng Fan,
Qinghua Hu,
Haibin Ling,
Mubarak Shah,
Junwen Pan,
Ali Al-Ali,
Amr Mohamed,
Bakour Imene,
Bin Dong,
Binyu Zhang,
Bouchali Hadia Nesma,
Chenfeng Xu,
Chenzhen Duan,
Ciro Castiello,
Corrado Mencar,
Dingkang Liang,
Florian Krüger,
Gennaro Vessio,
Giovanna Castellano,
Jieru Wang,
Junyu Gao,
Khalid Abualsaud
, et al. (30 additional authors not shown)
Abstract:
Crowd counting on the drone platform is an interesting topic in computer vision, which brings new challenges such as small object inference, background clutter and wide viewpoint. However, there are few algorithms focusing on crowd counting on drone-captured data due to the lack of comprehensive datasets. To this end, we collect a large-scale dataset and organize the Vision Meets Drone Crowd Counting Challenge (VisDrone-CC2020) in conjunction with the 16th European Conference on Computer Vision (ECCV 2020) to promote developments in the related fields. The collected dataset consists of $3,360$ images, including $2,460$ images for training and $900$ images for testing. Specifically, we manually annotate persons with points in each video frame. In total, $14$ algorithms from $15$ institutes were submitted to the VisDrone-CC2020 Challenge. We provide a detailed analysis of the evaluation results and conclude the challenge. More information can be found at the website: http://www.aiskyeye.com/.
Submitted 19 July, 2021;
originally announced July 2021.
-
AutoFormer: Searching Transformers for Visual Recognition
Authors:
Minghao Chen,
Houwen Peng,
Jianlong Fu,
Haibin Ling
Abstract:
Recently, pure transformer-based models have shown great potential for vision tasks such as image classification and detection. However, the design of transformer networks is challenging. It has been observed that the depth, embedding dimension, and number of heads can largely affect the performance of vision transformers. Previous models configure these dimensions based upon manual crafting. In this work, we propose a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search. AutoFormer entangles the weights of different blocks in the same layers during supernet training. Benefiting from the strategy, the trained supernet allows thousands of subnets to be very well-trained. Specifically, the performance of these subnets with weights inherited from the supernet is comparable to that of those retrained from scratch. Besides, the searched models, which we refer to as AutoFormers, surpass recent state-of-the-art models such as ViT and DeiT. In particular, AutoFormer-tiny/small/base achieve 74.7%/81.7%/82.4% top-1 accuracy on ImageNet with 5.7M/22.9M/53.7M parameters, respectively. Lastly, we verify the transferability of AutoFormer by providing the performance on downstream benchmarks and distillation experiments. Code and models are available at https://github.com/microsoft/AutoML.
Submitted 1 July, 2021;
originally announced July 2021.
-
CBNet: A Composite Backbone Network Architecture for Object Detection
Authors:
Tingting Liang,
Xiaojie Chu,
Yudong Liu,
Yongtao Wang,
Zhi Tang,
Wei Chu,
Jingdong Chen,
Haibin Ling
Abstract:
Modern top-performing object detectors depend heavily on backbone networks, whose advances bring consistent performance gains through exploring more effective network structures. In this paper, we propose a novel and flexible backbone framework, namely CBNetV2, to construct high-performance detectors using existing open-sourced pre-trained backbones under the pre-training fine-tuning paradigm. In particular, CBNetV2 architecture groups multiple identical backbones, which are connected through composite connections. Specifically, it integrates the high- and low-level features of multiple backbone networks and gradually expands the receptive field to more efficiently perform object detection. We also propose a better training strategy with assistant supervision for CBNet-based detectors. Without additional pre-training of the composite backbone, CBNetV2 can be adapted to various backbones (CNN-based vs. Transformer-based) and head designs of most mainstream detectors (one-stage vs. two-stage, anchor-based vs. anchor-free-based). Experiments provide strong evidence that, compared with simply increasing the depth and width of the network, CBNetV2 introduces a more efficient, effective, and resource-friendly way to build high-performance backbone networks. Particularly, our Dual-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev under the single-model and single-scale testing protocol, which is significantly better than the state-of-the-art result (57.7% box AP and 50.2% mask AP) achieved by Swin-L, while the training schedule is reduced by 6$\times$. With multi-scale testing, we push the current best single model result to a new record of 60.1% box AP and 52.3% mask AP without using extra training data. Code is available at https://github.com/VDIGPKU/CBNetV2.
Submitted 18 October, 2022; v1 submitted 1 July, 2021;
originally announced July 2021.
-
DeepMMSA: A Novel Multimodal Deep Learning Method for Non-small Cell Lung Cancer Survival Analysis
Authors:
Yujiao Wu,
Jie Ma,
Xiaoshui Huang,
Sai Ho Ling,
Steven Weidong Su
Abstract:
Lung cancer is the leading cause of cancer death worldwide. The critical reason for the deaths is delayed diagnosis and poor prognosis. With its accelerated development, deep learning has been applied successfully and extensively in many real-world applications, including health sectors such as medical image interpretation and disease diagnosis. By combining more modalities in the processing of information, multimodal learning can extract better features and improve predictive ability. The conventional methods for lung cancer survival analysis normally utilize clinical data and only provide a statistical probability. To improve the survival prediction accuracy and help prognostic decision-making in clinical practice for medical experts, we for the first time propose a multimodal deep learning method for non-small cell lung cancer (NSCLC) survival analysis, named DeepMMSA. This method leverages CT images in combination with clinical data, enabling the abundant information held within medical images to be associated with lung cancer survival information. We validate our method on the data of 422 NSCLC patients from The Cancer Imaging Archive (TCIA). Experimental results support our hypothesis that there is an underlying relationship between prognostic information and radiomic images. Besides, quantitative results show that the established multimodal model can be applied to traditional methods and has the potential to break the bottleneck of existing methods and increase the percentage of concordant pairs (correctly predicted pairs) in the overall population by 4%.
Submitted 12 June, 2021;
originally announced June 2021.
-
Channel DropBlock: An Improved Regularization Method for Fine-Grained Visual Classification
Authors:
Yifeng Ding,
Shuwei Dong,
Yujun Tong,
Zhanyu Ma,
Bo Xiao,
Haibin Ling
Abstract:
Classifying the sub-categories of an object from the same super-category (e.g., bird) in a fine-grained visual classification (FGVC) task highly relies on mining multiple discriminative features. Existing approaches mainly tackle this problem by introducing attention mechanisms to locate the discriminative parts or feature encoding approaches to extract the highly parameterized features in a weakly-supervised fashion. In this work, we propose a lightweight yet effective regularization method named Channel DropBlock (CDB), in combination with two alternative correlation metrics, to address this problem. The key idea is to randomly mask out a group of correlated channels during training to destruct features from co-adaptations and thus enhance feature representations. Extensive experiments on three benchmark FGVC datasets show that CDB effectively improves the performance.
Submitted 7 June, 2021;
originally announced June 2021.
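
The key idea in the Channel DropBlock abstract above, masking a group of correlated channels during training, can be illustrated with a minimal pure-Python sketch. The abstract does not specify the two correlation metrics or the grouping rule, so everything here is an assumption for illustration: Pearson correlation as the metric, a randomly chosen seed channel, and the hypothetical names `pearson` and `channel_dropblock`. It is not the authors' implementation.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def channel_dropblock(channels, group_size, rng=random):
    """Zero out a random seed channel plus the channels most correlated with it.

    channels: list of flattened feature maps, one list of floats per channel.
    Returns (masked_channels, dropped_indices). Assumed metric: Pearson correlation.
    """
    seed = rng.randrange(len(channels))
    others = [j for j in range(len(channels)) if j != seed]
    # Rank the remaining channels by correlation with the seed, highest first.
    others.sort(key=lambda j: pearson(channels[seed], channels[j]), reverse=True)
    dropped = {seed, *others[: group_size - 1]}
    masked = [
        [0.0] * len(ch) if i in dropped else list(ch)
        for i, ch in enumerate(channels)
    ]
    return masked, dropped


# Channels 0 and 1 are perfectly correlated; channels 2 and 3 correlate with each other.
features = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 0, 1, 0]]
masked, dropped = channel_dropblock(features, group_size=2, rng=random.Random(0))
```

In a real training loop the mask would be applied only at training time and the surviving activations would typically be rescaled; both steps are omitted here.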
-
DFGC 2021: A DeepFake Game Competition
Authors:
Bo Peng,
Hongxing Fan,
Wei Wang,
Jing Dong,
Yuezun Li,
Siwei Lyu,
Qi Li,
Zhenan Sun,
Han Chen,
Baoying Chen,
Yanjie Hu,
Shenghai Luo,
Junrui Huang,
Yutong Yao,
Boyuan Liu,
Hefei Ling,
Guosheng Zhang,
Zhiliang Xu,
Changtao Miao,
Changlei Lu,
Shan He,
Xiaoyan Wu,
Wanyi Zhuang
Abstract:
This paper presents a summary of the DFGC 2021 competition. DeepFake technology is developing fast, and realistic face-swaps are increasingly deceiving and hard to detect. At the same time, DeepFake detection methods are also improving. There is a two-party game between DeepFake creators and detectors. This competition provides a common platform for benchmarking the adversarial game between current state-of-the-art DeepFake creation and detection methods. In this paper, we present the organization, results and top solutions of this competition and also share our insights obtained during this event. We also release the DFGC-21 testing dataset collected from our participants to further benefit the research community.
Submitted 2 June, 2021;
originally announced June 2021.
-
TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation
Authors:
Xinyi Li,
Haibin Ling
Abstract:
Camera pose estimation or camera relocalization is the centerpiece in numerous computer vision tasks such as visual odometry, structure from motion (SfM) and SLAM. In this paper we propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem. In contrast with prior work where the pose regression is mainly guided by photometric consistency, TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes and is trained towards the graph consistency and accuracy instead, yielding significantly higher computational efficiency. By leveraging graph transformer layers with edge features and enabling tensorized adjacency matrix, TransCamP dynamically captures the global attention and thus endows the pose graph with evolving structures to achieve improved robustness and accuracy. In addition, optional temporal transformer layers actively enhance the spatiotemporal inter-frame relation for sequential inputs. Evaluation of the proposed network on various public benchmarks demonstrates that TransCamP outperforms state-of-the-art approaches.
Submitted 28 May, 2021;
originally announced May 2021.
-
Optimal Rates of Teaching and Learning Under Uncertainty
Authors:
Yan Hao Ling,
Jonathan Scarlett
Abstract:
In this paper, we consider a recently-proposed model of teaching and learning under uncertainty, in which a teacher receives independent observations of a single bit corrupted by binary symmetric noise, and sequentially transmits to a student through another binary symmetric channel based on the bits observed so far. After a given number $n$ of transmissions, the student outputs an estimate of the unknown bit, and we are interested in the exponential decay rate of the error probability as $n$ increases. We propose a novel block-structured teaching strategy in which the teacher encodes the number of 1s received in each block, and show that the resulting error exponent is the binary relative entropy $D\big(\frac{1}{2}\|\max(p,q)\big)$, where $p$ and $q$ are the noise parameters. This matches a trivial converse result based on the data processing inequality, and settles two conjectures of [Jog and Loh, 2021] and [Huleihel, Polyanskiy, and Shayevitz, 2019]. In addition, we show that the computation time required by the teacher and student is linear in $n$. We also study a more general setting in which the binary symmetric channels are replaced by general binary-input discrete memoryless channels. We provide an achievability bound and a converse bound, and show that the two coincide in certain cases, including (i) when the two channels are identical, and (ii) when the student-teacher channel is a binary symmetric channel. More generally, we give sufficient conditions under which our learning rate is the best possible for block-structured protocols.
Submitted 7 December, 2022; v1 submitted 13 April, 2021;
originally announced April 2021.
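
The error exponent quoted in the abstract above, $D\big(\frac{1}{2}\|\max(p,q)\big)$, can be evaluated numerically with the standard binary relative entropy $D(a\|b) = a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b}$. The sketch below is only an illustration: the function names are hypothetical, natural logarithms (nats) are an assumed choice of units, and $p$ and $q$ are assumed to be crossover probabilities strictly between 0 and 1.

```python
import math


def binary_relative_entropy(a: float, b: float) -> float:
    """D(a || b) between Bernoulli(a) and Bernoulli(b), in nats (assumed units)."""
    if not 0.0 < b < 1.0:
        raise ValueError("b must lie strictly between 0 and 1")
    total = 0.0
    for x, y in ((a, b), (1.0 - a, 1.0 - b)):
        if x > 0.0:  # the term 0 * log(0 / y) is taken to be 0
            total += x * math.log(x / y)
    return total


def teaching_error_exponent(p: float, q: float) -> float:
    """Evaluate D(1/2 || max(p, q)) for noise parameters p and q."""
    return binary_relative_entropy(0.5, max(p, q))


exponent = teaching_error_exponent(0.1, 0.2)
```

With $a = \frac{1}{2}$ the expression simplifies to $-\frac{1}{2}\log\big(4b(1-b)\big)$ with $b = \max(p,q)$, so the exponent depends only on the noisier of the two channels and vanishes as $b$ approaches $\frac{1}{2}$.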
-
DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort
Authors:
Yuxuan Zhang,
Huan Ling,
Jun Gao,
Kangxue Yin,
Jean-Francois Lafleche,
Adela Barriuso,
Antonio Torralba,
Sanja Fidler
Abstract:
We introduce DatasetGAN: an automatic procedure to generate massive datasets of high-quality semantically segmented images requiring minimal human effort. Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets, which are time consuming to annotate. Our method relies on the power of recent GANs to generate realistic images. We show how the GAN latent code can be decoded to produce a semantic segmentation of the image. Training the decoder only needs a few labeled examples to generalize to the rest of the latent space, resulting in an infinite annotated dataset generator! These generated datasets can then be used for training any computer vision architecture just as real datasets are. As only a few images need to be manually segmented, it becomes possible to annotate images in extreme detail and generate datasets with rich object and part segmentations. To showcase the power of our approach, we generated datasets for 7 image segmentation tasks which include pixel-level labels for 34 human face parts, and 32 car parts. Our approach outperforms all semi-supervised baselines significantly and is on par with fully supervised methods, which in some cases require as much as 100x more annotated data as our method.
Submitted 19 April, 2021; v1 submitted 13 April, 2021;
originally announced April 2021.
-
One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking
Authors:
Minghao Chen,
Houwen Peng,
Jianlong Fu,
Haibin Ling
Abstract:
Despite remarkable progress achieved, most neural architecture search (NAS) methods focus on searching for one single accurate and robust architecture. To further build models with better generalization capability and performance, model ensemble is usually adopted and performs better than stand-alone models. Inspired by the merits of model ensemble, we propose to search for multiple diverse models simultaneously as an alternative way to find powerful models. Searching for ensembles is non-trivial and has two key challenges: enlarged search space and potentially more complexity for the searched model. In this paper, we propose a one-shot neural ensemble architecture search (NEAS) solution that addresses the two challenges. For the first challenge, we introduce a novel diversity-based metric to guide search space shrinking, considering both the potentiality and diversity of candidate operators. For the second challenge, we enable a new search dimension to learn layer sharing among different models for efficiency purposes. The experiments on ImageNet clearly demonstrate that our solution can improve the supernet's capacity of ranking ensemble architectures, and further lead to better search results. The discovered architectures achieve superior performance compared with state-of-the-arts such as MobileNetV3 and EfficientNet families under aligned settings. Moreover, we evaluate the generalization ability and robustness of our searched architecture on the COCO detection benchmark and achieve a 3.1% improvement on AP compared with MobileNetV3. Codes and models are available at https://github.com/researchmm/NEAS.
Submitted 16 July, 2021; v1 submitted 1 April, 2021;
originally announced April 2021.
-
TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking
Authors:
Peng Chu,
Jiang Wang,
Quanzeng You,
Haibin Ling,
Zicheng Liu
Abstract:
Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named TransMOT, which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. TransMOT effectively models the interactions of a large number of objects by arranging the trajectories of the tracked objects as a set of sparse weighted graphs, and constructing a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial graph transformer decoder layer based on the graphs. TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy. To further improve the tracking speed and accuracy, we propose a cascade association framework to handle low-score detections and long-term occlusions that require large computational resources to model in TransMOT. The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20, and it achieves state-of-the-art performance on all the datasets.
Submitted 3 April, 2021; v1 submitted 31 March, 2021;
originally announced April 2021.
-
Magnetic skyrmion crystal at a topological insulator surface
Authors:
Stefan Divic,
Henry Ling,
T. Pereg-Barnea,
Arun Paramekanti
Abstract:
We consider a magnetic skyrmion crystal formed at the surface of a topological insulator. Incorporating the exchange interaction between the helical Dirac surface states and the periodic Néel or Bloch skyrmion texture, we obtain the resulting electronic band structure and discuss the constraints that symmetries impose on the energies and Berry curvature. We find substantive qualitative differences between the Néel and Bloch cases, with the latter generically permitting a multiband low energy tight-binding representation whose parameters are tightly constrained by symmetries. We explicitly compute the associated Wannier orbitals, which resemble the ringlike chiral bound states of helical Dirac fermions coupled to a single skyrmion in a ferromagnetic background. We construct a two-band tight-binding model with real nearest-neighbor hoppings which captures the salient topological features of the low-energy bands. Our results are relevant to magnetic topological insulators (TIs), as well as to TI-magnetic thin film heterostructures, in which skyrmion crystals may be stabilized.
Submitted 13 February, 2022; v1 submitted 29 March, 2021;
originally announced March 2021.
-
Hands-on Guidance for Distilling Object Detectors
Authors:
Yangyang Qin,
Hefei Ling,
Zhenghai He,
Yuxuan Shi,
Lei Wu
Abstract:
Knowledge distillation can lead to deploy-friendly networks against the plagued computational complexity problem, but previous methods neglect the feature hierarchy in detectors. Motivated by this, we propose a general framework for detection distillation. Our method, called Hands-on Guidance Distillation, distills the latent knowledge of all stage features for imposing more comprehensive supervision, and focuses on the essence simultaneously for promoting more intense knowledge absorption. Specifically, a series of novel mechanisms are designed elaborately, including correspondence establishment for consistency, hands-on imitation loss measure and re-weighted optimization from both micro and macro perspectives. We conduct extensive evaluations with different distillation configurations over VOC and COCO datasets, which show better performance on accuracy and speed trade-offs. Meanwhile, feasibility experiments on different structural networks further prove the robustness of our HGD.
Submitted 12 May, 2021; v1 submitted 26 March, 2021;
originally announced March 2021.
-
Character Controllers Using Motion VAEs
Authors:
Hung Yu Ling,
Fabio Zinno,
George Cheng,
Michiel van de Panne
Abstract:
A fundamental problem in computer animation is that of realizing purposeful and realistic human movement given a sufficiently-rich set of motion capture clips. We learn data-driven generative models of human movement using autoregressive conditional variational autoencoders, or Motion VAEs. The latent variables of the learned autoencoder define the action space for the movement and thereby govern its evolution over time. Planning or control algorithms can then use this action space to generate desired motions. In particular, we use deep reinforcement learning to learn controllers that achieve goal-directed movements. We demonstrate the effectiveness of the approach on multiple tasks. We further evaluate system-design choices and describe the current limitations of Motion VAEs.
Submitted 26 March, 2021;
originally announced March 2021.
-
Light-Matter Coupling in Scalable Van der Waals Superlattices
Authors:
Pawan Kumar,
Jason Lynch,
Baokun Song,
Haonan Ling,
Francisco Barrera,
Huiqin Zhang,
Surendra B. Anantharaman,
Jagrit Digani,
Haoyue Zhu,
Tanushree H. Choudhury,
Clifford McAleese,
Xiaochen Wang,
Ben R. Conran,
Oliver Whear,
Michael J. Motala,
Michael Snure,
Christopher Muratore,
Joan M. Redwing,
Nicholas R. Glavin,
Eric A. Stach,
Artur R. Davoyan,
Deep Jariwala
Abstract:
Two-dimensional (2D) crystals have renewed opportunities in design and assembly of artificial lattices without the constraints of epitaxy. However, the lack of thickness control in exfoliated van der Waals (vdW) layers prevents realization of repeat units with high fidelity. Recent availability of uniform, wafer-scale samples permits engineering of both electronic and optical dispersions in stacks of disparate 2D layers with multiple repeating units. We present optical dispersion engineering in a superlattice structure comprised of alternating layers of 2D excitonic chalcogenides and dielectric insulators. By carefully designing the unit cell parameters, we demonstrate >90% narrowband absorption in a <4 nm active layer excitonic absorber medium at room temperature, concurrently with enhanced photoluminescence in cm² samples. These superlattices show evidence of strong light-matter coupling and exciton-polariton formation with geometry-tunable coupling constants. Our results demonstrate proof of concept structures with engineered optical properties and pave the way for a broad class of scalable, designer optical metamaterials from atomically-thin layers.
Submitted 25 March, 2021;
originally announced March 2021.
-
OPANAS: One-Shot Path Aggregation Network Architecture Search for Object Detection
Authors:
Tingting Liang,
Yongtao Wang,
Zhi Tang,
Guosheng Hu,
Haibin Ling
Abstract:
Recently, neural architecture search (NAS) has been exploited to design feature pyramid networks (FPNs) and achieved promising results for visual object detection. Encouraged by the success, we propose a novel One-Shot Path Aggregation Network Architecture Search (OPANAS) algorithm, which significantly improves both searching efficiency and detection accuracy. Specifically, we first introduce six heterogeneous information paths to build our search space, namely top-down, bottom-up, fusing-splitting, scale-equalizing, skip-connect and none. Second, we propose a novel search space of FPNs, in which each FPN candidate is represented by a densely-connected directed acyclic graph (each node is a feature pyramid and each edge is one of the six heterogeneous information paths). Third, we propose an efficient one-shot search method to find the optimal path aggregation architecture, that is, we first train a super-net and then find the optimal candidate with an evolutionary algorithm. Experimental results demonstrate the efficacy of the proposed OPANAS for object detection: (1) OPANAS is more efficient than state-of-the-art methods (e.g., NAS-FPN and Auto-FPN), at significantly smaller searching cost (e.g., only 4 GPU days on MS-COCO); (2) the optimal architecture found by OPANAS significantly improves main-stream detectors including RetinaNet, Faster R-CNN and Cascade R-CNN, by 2.3-3.2 % mAP comparing to their FPN counterparts; and (3) a new state-of-the-art accuracy-speed trade-off (52.2 % mAP at 7.6 FPS) at smaller training costs than comparable state-of-the-arts. Code will be released at https://github.com/VDIGPKU/OPANAS.
Submitted 11 March, 2021; v1 submitted 7 March, 2021;
originally announced March 2021.
-
On the Robustness of Multi-View Rotation Averaging
Authors:
Xinyi Li,
Haibin Ling
Abstract:
Rotation averaging is a synchronization process on single or multiple rotation groups, and is a fundamental problem in many computer vision tasks such as multi-view structure from motion (SfM). Specifically, rotation averaging involves the recovery of an underlying pose-graph consistency from pairwise relative camera poses. Given pairwise motion in rotation groups, especially 3-dimensional rotation groups (e.g., $\mathbb{SO}(3)$), one is interested in recovering the original signal of multiple rotations with respect to a fixed frame. In this paper, we propose a robust framework to solve the multiple rotation averaging problem, especially in cases where a significant number of noisy measurements are present. By introducing the $ε$-cycle consistency term into the solver, we enable the robust initialization scheme to be implemented into the IRLS solver. Instead of conducting the costly edge removal, we implicitly constrain the negative effect of erroneous measurements by weight reducing, such that IRLS failures caused by poor initialization can be effectively avoided. Experimental results demonstrate that our proposed approach outperforms state-of-the-art methods on various benchmarks.
Submitted 9 February, 2021;
originally announced February 2021.
-
Personal Fixations-Based Object Segmentation with Object Localization and Boundary Preservation
Authors:
Gongyang Li,
Zhi Liu,
Ran Shi,
Zheng Hu,
Weijie Wei,
Yong Wu,
Mengke Huang,
Haibin Ling
Abstract:
As a natural way for human-computer interaction, fixation provides a promising solution for interactive image segmentation. In this paper, we focus on Personal Fixations-based Object Segmentation (PFOS) to address issues in previous studies, such as the lack of an appropriate dataset and the ambiguity in fixations-based interaction. In particular, we first construct a new PFOS dataset by carefully collecting pixel-level binary annotation data over an existing fixation prediction dataset; this dataset is expected to greatly facilitate research along this line. Then, considering characteristics of personal fixations, we propose a novel network based on Object Localization and Boundary Preservation (OLBP) to segment the gazed objects. Specifically, the OLBP network utilizes an Object Localization Module (OLM) to analyze personal fixations and locates the gazed objects based on the interpretation. Then, a Boundary Preservation Module (BPM) is designed to introduce additional boundary information to guard the completeness of the gazed objects. Moreover, OLBP is organized in a mixed bottom-up and top-down manner with multiple types of deep supervision. Extensive experiments on the constructed PFOS dataset show the superiority of the proposed OLBP network over 17 state-of-the-art methods, and demonstrate the effectiveness of the proposed OLM and BPM components. The constructed PFOS dataset and the proposed OLBP network are available at https://github.com/MathLee/OLBPNet4PFOS.
Submitted 22 January, 2021;
originally announced January 2021.
-
Modeling Deep Learning Based Privacy Attacks on Physical Mail
Authors:
Bingyao Huang,
Ruyi Lian,
Dimitris Samaras,
Haibin Ling
Abstract:
Mail privacy protection aims to prevent unauthorized access to hidden content within an envelope since normal paper envelopes are not as safe as we think. In this paper, for the first time, we show that with a well designed deep learning model, the hidden content may be largely recovered without opening the envelope. We start by modeling deep learning-based privacy attacks on physical mail content as learning the mapping from the camera-captured envelope front face image to the hidden content, then we explicitly model the mapping as a combination of perspective transformation, image dehazing and denoising using a deep convolutional neural network, named Neural-STE (See-Through-Envelope). We show experimentally that hidden content details, such as texture and image structure, can be clearly recovered. Finally, our formulation and model allow us to design envelopes that can counter deep learning-based privacy attacks on physical mail.
Submitted 25 March, 2021; v1 submitted 21 December, 2020;
originally announced December 2020.
-
Political Posters Identification with Appearance-Text Fusion
Authors:
Xuan Qin,
Meizhu Liu,
Yifan Hu,
Christina Moo,
Christian M. Riblet,
Changwei Hu,
Kevin Yen,
Haibin Ling
Abstract:
In this paper, we propose a method that efficiently utilizes appearance features and text vectors to accurately classify political posters from other similar political images. The majority of this work focuses on political posters that are designed to serve as a promotion of a certain political event, and the automated identification of which can lead to the generation of detailed statistics and meets the judgment needs in a variety of areas. Starting with a comprehensive keyword list for politicians and political events, we curate for the first time an effective and practical political poster dataset containing 13K human-labeled political images, including 3K political posters that explicitly support a movement or a campaign. Second, we make a thorough case study for this dataset and analyze common patterns and outliers of political posters. Finally, we propose a model that combines the power of both appearance and text information to classify political posters with significantly high accuracy.
Submitted 19 December, 2020;
originally announced December 2020.
-
SPAA: Stealthy Projector-based Adversarial Attacks on Deep Image Classifiers
Authors:
Bingyao Huang,
Haibin Ling
Abstract:
Light-based adversarial attacks use spatial augmented reality (SAR) techniques to fool image classifiers by altering the physical light condition with a controllable light source, e.g., a projector. Compared with physical attacks that place hand-crafted adversarial objects, projector-based ones obviate modifying the physical entities, and can be performed transiently and dynamically by altering the projection pattern. However, subtle light perturbations are insufficient to fool image classifiers, due to the complex environment and project-and-capture process. Thus, existing approaches focus on projecting clearly perceptible adversarial patterns, while the more interesting yet challenging goal, stealthy projector-based attack, remains open. In this paper, for the first time, we formulate this problem as an end-to-end differentiable process and propose a Stealthy Projector-based Adversarial Attack (SPAA) solution. In SPAA, we approximate the real Project-and-Capture process using a deep neural network named PCNet, then we include PCNet in the optimization of projector-based attacks such that the generated adversarial projection is physically plausible. Finally, to generate both robust and stealthy adversarial projections, we propose an algorithm that uses minimum perturbation and adversarial confidence thresholds to alternate between the adversarial loss and stealthiness loss optimization. Our experimental evaluations show that SPAA clearly outperforms other methods by achieving higher attack success rates and meanwhile being stealthier, for both targeted and untargeted attacks.
Submitted 17 March, 2022; v1 submitted 10 December, 2020;
originally announced December 2020.
-
Selection Rule for Topological Amplifiers in Bogoliubov de Gennes Systems
Authors:
Hong Y. Ling,
Ben Kain
Abstract:
Dynamical instability is an inherent feature of bosonic systems described by the Bogoliubov de Gennes (BdG) Hamiltonian. Since it causes the BdG system to collapse, it is generally thought that it should be avoided. Recently, there has been much effort to harness this instability for the benefit of creating a topological amplifier with stable bulk bands but unstable edge modes which can be populated at an exponentially fast rate. We present a theorem for determining the stability of states with energies sufficiently away from zero, in terms of an unconventional commutator between the number conserving part and number nonconserving part of the BdG Hamiltonian. We apply the theorem to a generalization of a model from Galilo et al. [Phys. Rev. Lett. 115, 245302 (2015)] for creating a topological amplifier in an interacting spin-1 atom system in a honeycomb lattice through a quench process. We use this model to illustrate how the vanishing of the unconventional commutator selects the symmetries for a system so that its bulk states are stable against (weak) pairing interactions. We find that as long as time reversal symmetry is preserved, our system can act like a topological amplifier, even in the presence of an onsite staggered potential which breaks the inversion symmetry.
Submitted 20 August, 2021; v1 submitted 30 November, 2020;
originally announced November 2020.
-
CRACT: Cascaded Regression-Align-Classification for Robust Visual Tracking
Authors:
Heng Fan,
Haibin Ling
Abstract:
High quality object proposals are crucial in visual tracking algorithms that utilize region proposal network (RPN). Refinement of these proposals, typically by box regression and classification in parallel, has been popularly adopted to boost tracking performance. However, it still meets problems when dealing with complex and dynamic background. Thus motivated, in this paper we introduce an improved proposal refinement module, Cascaded Regression-Align-Classification (CRAC), which yields new state-of-the-art performances on many benchmarks.
First, having observed that the offsets from box regression can serve as guidance for proposal feature refinement, we design CRAC as a cascade of box regression, feature alignment and box classification. The key is to bridge box regression and classification via an alignment step, which leads to more accurate features for proposal classification with improved robustness. To address the variation in object appearance, we introduce an identification-discrimination component for box classification, which leverages offline reliable fine-grained template and online rich background information to distinguish the target from background. Moreover, we present pyramid RoIAlign that benefits CRAC by exploiting both the local and global cues of proposals. During inference, tracking proceeds by ranking all refined proposals and selecting the best one. In experiments on seven benchmarks including OTB-2015, UAV123, NfS, VOT-2018, TrackingNet, GOT-10k and LaSOT, our CRACT exhibits very promising results in comparison with state-of-the-art competitors and runs in real-time.
Submitted 24 November, 2020;
originally announced November 2020.
-
GMOT-40: A Benchmark for Generic Multiple Object Tracking
Authors:
Hexin Bai,
Wensheng Cheng,
Peng Chu,
Juehuan Liu,
Kai Zhang,
Haibin Ling
Abstract:
Multiple Object Tracking (MOT) has witnessed remarkable advances in recent years. However, existing studies dominantly request prior knowledge of the tracking target, and hence may not generalize well to unseen categories. In contrast, Generic Multiple Object Tracking (GMOT), which requires little prior information about the target, is largely under-explored. In this paper, we make contributions to boost the study of GMOT in three aspects. First, we construct the first public GMOT dataset, dubbed GMOT-40, which contains 40 carefully annotated sequences evenly distributed among 10 object categories. In addition, two tracking protocols are adopted to evaluate different characteristics of tracking algorithms. Second, by noting the lack of devoted tracking algorithms, we have designed a series of baseline GMOT algorithms. Third, we perform a thorough evaluation on GMOT-40, involving popular MOT algorithms (with necessary modifications) and the proposed baselines. We will release the GMOT-40 benchmark, the evaluation results, as well as the baseline algorithm to the public upon the publication of the paper.
Submitted 7 April, 2021; v1 submitted 23 November, 2020;
originally announced November 2020.
-
Transparent Object Tracking Benchmark
Authors:
Heng Fan,
Halady Akhilesha Miththanthaya,
Harshit,
Siranjiv Ramana Rajan,
Xiaoqiong Liu,
Zhilin Zou,
Yuewei Lin,
Haibin Ling
Abstract:
Visual tracking has achieved considerable progress in recent years. However, current research in the field mainly focuses on tracking of opaque objects, while little attention is paid to transparent object tracking. In this paper, we make the first attempt in exploring this problem by proposing a Transparent Object Tracking Benchmark (TOTB). Specifically, TOTB consists of 225 videos (86K frames) from 15 diverse transparent object categories. Each sequence is manually labeled with axis-aligned bounding boxes. To the best of our knowledge, TOTB is the first benchmark dedicated to transparent object tracking. In order to understand how existing trackers perform and to provide comparison for future research on TOTB, we extensively evaluate 25 state-of-the-art tracking algorithms. The evaluation results exhibit that more efforts are needed to improve transparent object tracking. Besides, we observe some nontrivial findings from the evaluation that are discrepant with some common beliefs in opaque object tracking. For example, we find that deeper features are not always good for improvements. Moreover, to encourage future research, we introduce a novel tracker, named TransATOM, which leverages transparency features for tracking and surpasses all 25 evaluated approaches by a large margin. By releasing TOTB, we expect to facilitate future research and application of transparent object tracking in both the academia and industry. The TOTB and evaluation results as well as TransATOM are available at https://hengfan2010.github.io/projects/TOTB.
Submitted 1 August, 2021; v1 submitted 21 November, 2020;
originally announced November 2020.
-
Pushing the Envelope of Rotation Averaging for Visual SLAM
Authors:
Xinyi Li,
Lin Yuan,
Longin Jan Latecki,
Haibin Ling
Abstract:
As an essential part of structure from motion (SfM) and Simultaneous Localization and Mapping (SLAM) systems, motion averaging has been extensively studied in the past years and continues to attract surging research attention. While canonical approaches such as bundle adjustment are predominantly inherited in most of state-of-the-art SLAM systems to estimate and update the trajectory in the robot navigation, the practical implementation of bundle adjustment in SLAM systems is intrinsically limited by the high computational complexity, unreliable convergence and strict requirements of ideal initializations. In this paper, we lift these limitations and propose a novel optimization backbone for visual SLAM systems, where we leverage rotation averaging to improve the accuracy, efficiency and robustness of conventional monocular SLAM pipelines. In our approach, we first decouple the rotational and translational parameters in the camera rigid body transformation and convert the high-dimensional non-convex nonlinear problem into tractable linear subproblems in lower dimensions, and show that the subproblems can be solved independently with proper constraints. We apply the scale parameter with $l_1$-norm in the pose-graph optimization to address the rotation averaging robustness against outliers. We further validate the global optimality of our proposed approach, revisit and address the initialization schemes, pure rotational scene handling and outlier treatments. We demonstrate that our approach can exhibit up to 10x faster speed with comparable accuracy against the state of the art on public benchmarks.
Submitted 2 November, 2020;
originally announced November 2020.
-
Pose Estimation of Specular and Symmetrical Objects
Authors:
Jiaming Hu,
Hongyi Ling,
Priyam Parashar,
Aayush Naik,
Henrik Christensen
Abstract:
In the robotic industry, specular and textureless metallic components are ubiquitous. The 6D pose estimation of such objects with only a monocular RGB camera is difficult because of the absence of rich texture features. Furthermore, the appearance of specularity heavily depends on the camera viewpoint and environmental light conditions, making traditional methods, like template matching, fail. In the last 30 years, pose estimation of specular objects has been a consistent challenge, and most related works require massive knowledge modeling effort for light setups, environment, or the object surface. On the other hand, recent works exhibit the feasibility of 6D pose estimation on a monocular camera with convolutional neural networks (CNNs); however, they mostly use opaque objects for evaluation. This paper provides a data-driven solution to estimate the 6D pose of specular objects for grasping them, proposes a cost function for handling symmetry, and demonstrates experimental results showing the system's feasibility.
Submitted 31 October, 2020;
originally announced November 2020.
-
Motion Planning Combines Psychological Safety and Motion Prediction for a Sense Motive Robot
Authors:
Hejing Ling,
Guoliang Liu,
Guohui Tian
Abstract:
Human safety is the most important demand for human robot interaction and collaboration (HRIC), which not only refers to physical safety, but also includes psychological safety. Although many robots with different configurations have entered our living and working environments, human safety is still an ongoing research problem in human-robot coexistence scenarios. This paper addresses the human safety issue by covering both the physical safety and psychological safety aspects. First, we introduce an adaptive robot velocity control and step size adjustment method according to human facial expressions, such that the robot can adjust its movement to maintain safety when the human emotion is unusual. Second, we predict the human motion by detecting sudden changes of human head pose and gaze direction, such that the robot can infer whether the human attention is distracted, predict the next move of the human and rebuild a repulsive force to avoid potential collision. Finally, we demonstrate our idea using a 7 DOF TIAGo robot in a dynamic HRIC environment, which shows that the robot becomes sense motive, and responds to human action and emotion changes quickly and efficiently.
Submitted 23 October, 2020; v1 submitted 29 September, 2020;
originally announced October 2020.
-
Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering
Authors:
Yuxuan Zhang,
Wenzheng Chen,
Huan Ling,
Jun Gao,
Yinan Zhang,
Antonio Torralba,
Sanja Fidler
Abstract:
Differentiable rendering has paved the way to training neural networks to perform "inverse graphics" tasks such as predicting 3D geometry from monocular photographs. To train high performing models, most of the current approaches rely on multi-view imagery which are not readily available in practice. Recent Generative Adversarial Networks (GANs) that synthesize images, in contrast, seem to acquire 3D knowledge implicitly during training: object viewpoints can be manipulated by simply manipulating the latent codes. However, these latent codes often lack further physical interpretation and thus GANs cannot easily be inverted to perform explicit 3D reasoning. In this paper, we aim to extract and disentangle 3D knowledge learned by generative models by utilizing differentiable renderers. Key to our approach is to exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties. The entire architecture is trained iteratively using cycle consistency losses. We show that our approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets, both quantitatively and via user studies. We further showcase the disentangled GAN as a controllable 3D "neural renderer", complementing traditional graphics renderers.
Submitted 20 April, 2021; v1 submitted 18 October, 2020;
originally announced October 2020.
-
Bone Feature Segmentation in Ultrasound Spine Image with Robustness to Speckle and Regular Occlusion Noise
Authors:
Zixun Huang,
Li-Wen Wang,
Frank H. F. Leung,
Sunetra Banerjee,
De Yang,
Timothy Lee,
Juan Lyu,
Sai Ho Ling,
Yong-Ping Zheng
Abstract:
3D ultrasound imaging shows great promise for scoliosis diagnosis thanks to its low-cost, radiation-free and real-time characteristics. The key to assessing scoliosis by ultrasound imaging is to accurately segment the bone area and measure the scoliosis degree based on the symmetry of the bone features. The ultrasound images tend to contain many speckles and regular occlusion noise, which makes it difficult, tedious and time-consuming for experts to find the bony features. In this paper, we propose a robust bone feature segmentation method based on the U-net structure for ultrasound spine Volume Projection Imaging (VPI) images. The proposed segmentation method introduces a total variance loss to reduce the sensitivity of the model to small-scale and regular occlusion noise. The proposed approach improves the Dice score by 2.3% and the AUC score by 1% as compared with the U-net model and shows high robustness to speckle and regular occlusion noise.
Submitted 7 October, 2020;
originally announced October 2020.
-
All van der Waals Integrated Nanophotonics
Authors:
Haonan Ling,
Renjie Li,
Artur R. Davoyan
Abstract:
Integrated optics is at the heart of a wide range of systems from remote sensing and communications to computing and quantum information processing. Demand for smaller and more energy efficient structures stimulates search for more advanced material platforms. Here, we propose a concept of an all van der Waals photonics, where we show that electronically bulk transition metal dichalcogenide (TMDC) semiconductors are well fitted for the design of key optical components for nanoscale and integrated photonics. Specifically, we demonstrate theoretically that owing to low optical loss and high refractive index across near-infrared and telecom frequency bands, components made of bulk TMDCs can potentially outperform counterparts made of conventional 3D semiconductors, such as Si and III/Vs. We discuss several key quantum and classical optical components and show that bulk TMDCs may pave the way to smaller footprint devices, more energy efficient electro-optical modulators, and stronger quantum light-materials interaction. Enhanced optical performance, ease of integration, and a wide selection of materials suggest that bulk TMDCs may complement and, potentially, replace existing integrated photonics systems.
Submitted 17 September, 2020;
originally announced September 2020.
-
LaSOT: A High-quality Large-scale Single Object Tracking Benchmark
Authors:
Heng Fan,
Hexin Bai,
Liting Lin,
Fan Yang,
Peng Chu,
Ge Deng,
Sijia Yu,
Harshit,
Mingzhen Huang,
Juehuan Liu,
Yong Xu,
Chunyuan Liao,
Lin Yuan,
Haibin Ling
Abstract:
Despite great recent advances in visual tracking, its further development, including both algorithm design and evaluation, is limited due to lack of dedicated large-scale benchmarks. To address this problem, we present LaSOT, a high-quality Large-scale Single Object Tracking benchmark. LaSOT contains a diverse selection of 85 object classes, and offers 1,550 videos totaling more than 3.87 million frames. Each video frame is carefully and manually annotated with a bounding box. This makes LaSOT, to our knowledge, the largest densely annotated tracking benchmark. Our goal in releasing LaSOT is to provide a dedicated high quality platform for both training and evaluation of trackers. The average video length of LaSOT is around 2,500 frames, where each video contains various challenge factors that exist in real world video footage, such as the targets disappearing and re-appearing. These longer video lengths allow for the assessment of long-term trackers. To take advantage of the close connection between visual appearance and natural language, we provide language specification for each video in LaSOT. We believe such additions will allow for future research to use linguistic features to improve tracking. Two protocols, full-overlap and one-shot, are designated for flexible assessment of trackers. We extensively evaluate 48 baseline trackers on LaSOT with in-depth analysis, and results reveal that there still exists significant room for improvement. The complete benchmark, tracking results as well as analysis are available at http://vision.cs.stonybrook.edu/~lasot/.
Submitted 11 September, 2020; v1 submitted 7 September, 2020;
originally announced September 2020.
-
ScribbleBox: Interactive Annotation Framework for Video Object Segmentation
Authors:
Bowen Chen,
Huan Ling,
Xiaohui Zeng,
Gao Jun,
Ziyue Xu,
Sanja Fidler
Abstract:
Manually labeling video datasets for segmentation tasks is extremely time consuming. In this paper, we introduce ScribbleBox, a novel interactive framework for annotating object instances with masks in videos. In particular, we split annotation into two steps: annotating objects with tracked boxes, and labeling masks inside these tracks. We introduce automation and interaction in both steps. Box tracks are annotated efficiently by approximating the trajectory using a parametric curve with a small number of control points which the annotator can interactively correct. Our approach tolerates a modest amount of noise in the box placements, thus typically only a few clicks are needed to annotate tracked boxes to a sufficient accuracy. Segmentation masks are corrected via scribbles which are efficiently propagated through time. We show significant performance gains in annotation efficiency over past work. We show that our ScribbleBox approach reaches 88.92% J&F on DAVIS2017 with 9.14 clicks per box track, and 4 frames of scribble annotation.
Submitted 21 August, 2020;
originally announced August 2020.
-
Feature Space Augmentation for Long-Tailed Data
Authors:
Peng Chu,
Xiao Bian,
Shaopeng Liu,
Haibin Ling
Abstract:
Real-world data often follow a long-tailed distribution as the frequency of each class is typically different. For example, a dataset can have a large number of under-represented classes and a few classes with more than sufficient data. However, a model to represent the dataset is usually expected to have reasonably homogeneous performances across classes. Introducing class-balanced loss and advanced methods on data re-sampling and augmentation are among the best practices to alleviate the data imbalance problem. However, the other part of the problem about the under-represented classes will have to rely on additional knowledge to recover the missing information.
In this work, we present a novel approach to address the long-tailed problem by augmenting the under-represented classes in the feature space with the features learned from the classes with ample samples. In particular, we decompose the features of each class into a class-generic component and a class-specific component using class activation maps. Novel samples of under-represented classes are then generated on the fly during training stages by fusing the class-specific features from the under-represented classes with the class-generic features from confusing classes. Our results on different datasets such as iNaturalist, ImageNet-LT, Places-LT and a long-tailed version of CIFAR have shown the state of the art performances.
Submitted 9 August, 2020;
originally announced August 2020.
-
End-to-end Full Projector Compensation
Authors:
Bingyao Huang,
Tao Sun,
Haibin Ling
Abstract:
Full projector compensation aims to modify a projector input image to compensate for both geometric and photometric disturbance of the projection surface. Traditional methods usually solve the two parts separately and may suffer from suboptimal solutions. In this paper, we propose the first end-to-end differentiable solution, named CompenNeSt++, to solve the two problems jointly. First, we propose a novel geometric correction subnet, named WarpingNet, which is designed with a cascaded coarse-to-fine structure to learn the sampling grid directly from sampling images. Second, we propose a novel photometric compensation subnet, named CompenNeSt, which is designed with a siamese architecture to capture the photometric interactions between the projection surface and the projected images, and to use such information to compensate the geometrically corrected images. By concatenating WarpingNet with CompenNeSt, CompenNeSt++ accomplishes full projector compensation and is end-to-end trainable. Third, to improve practicability, we propose a novel synthetic data-based pre-training strategy to significantly reduce the number of training images and training time. Moreover, we construct the first setup-independent full compensation benchmark to facilitate future studies. In thorough experiments, our method shows clear advantages over prior art with promising compensation quality and meanwhile being practically convenient.
Submitted 7 January, 2021; v1 submitted 30 July, 2020;
originally announced August 2020.