Search | arXiv e-print repository

Cavity Induced Extraordinary Optical Transmission and Active Modulation with Graphene

Authors: Yifei Zhang, Baoqing Zhang, Mingming Feng, Haotian Ling, Xijian Zhang, Yiming Wang, Xiaomu Wang, Qingpu Wang, Aimin Song

Abstract: Extraordinary optical transmission (EOT) is a phenomenon of exceptional light transmission through a metallic film with hole arrays enhanced by surface plasmon (SP) resonance, which stimulates renewed research hotspots in metamaterials, subwavelength optics, and plasmonics. Below the frequency of the first order SP mode, f_pl0, the metallic film typically shows strong reflection and no EOT. Here, we report an unusual EOT phenomenon below f_pl0, i.e., beyond the long-held spectral boundary of classic EOTs. It is induced by a novel bound surface state in a Fabry-Perot (F-P) cavity comprising a holey gold film and a silicon-air interface. By tailoring the cavity length, the EOT phenomenon has been pushed deep into the sub-wavelength region by a factor of as large as 20%, and an EOT frequency comb with cavity function has been achieved. Due to the enhanced slow-wave effect as the frequency approaches f_pl0, the cavity induced EOT gradually merges with the first order SP EOT. In contrast to the classic EOT phenomenon, no transmission zero is found between these two EOTs, which dramatically broadens the EOT bandwidth by a factor of 10 at terahertz (THz) frequencies. Furthermore, the EOT transmittance is actively modulated with graphene, achieving a large modulation range from 0.5 to 0.25 under a sub-volt bias from -0.3 to 0.5 V at 500 GHz. To the best of the authors' knowledge, both the modulation range and the low bias are among the best for active EOT devices with graphene to date. Such a structure provides a new strategy for miniaturizing sensing devices, high-power sources, and broadband photonics as well as their active control in the THz regime.

Submitted 16 December, 2021; originally announced December 2021.

Comments: 12 pages, 4 figures

arXiv:2112.07120 [pdf, other]

Simple Coding Techniques for Many-Hop Relaying

Authors: Yan Hao Ling, Jonathan Scarlett

Abstract: In this paper, we study the problem of relaying a single bit of information across a series of binary symmetric channels, and the associated trade-off between the number of hops $m$, the transmission time $n$, and the error probability. We introduce a simple, efficient, and deterministic protocol that attains positive information velocity (i.e., a non-vanishing ratio $\frac{m}{n}$ and small error probability) and is significantly simpler than existing protocols that do so. In addition, we characterize the optimal low-noise and high-noise scaling laws of the information velocity, and we adapt our 1-bit protocol to transmit $k$ bits over $m$ hops with $O(m+k)$ transmission time.
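To illustrate why many-hop relaying is difficult, the sketch below computes the end-to-end error probability when every relay simply forwards the bit it received over a binary symmetric channel with crossover probability p. This naive forwarding scheme is an assumption made purely for illustration and is not the protocol proposed in the paper; the function name is hypothetical.

```python
def naive_forwarding_error(p, m):
    """Probability that a bit arrives flipped after m independent BSC(p) hops.

    The bit is wrong when an odd number of hops flip it, which gives the
    closed form (1 - (1 - 2p)^m) / 2.
    """
    return 0.5 * (1.0 - (1.0 - 2.0 * p) ** m)


# The error probability drifts toward 1/2 as the number of hops grows.
one_hop = naive_forwarding_error(0.1, 1)
two_hops = naive_forwarding_error(0.1, 2)
many_hops = naive_forwarding_error(0.1, 100)
```

Because the error approaches 1/2 with plain forwarding, relays need some form of coding over time, which is the trade-off between $m$ and $n$ that the abstract describes.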

Submitted 7 December, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: IEEE Transactions on Information Theory, Volume 68, Issue 11, pp. 7043-7053, Nov. 2022

arXiv:2112.01932 [pdf, other]

doi 10.1109/TGRS.2021.3131221

Multi-Content Complementation Network for Salient Object Detection in Optical Remote Sensing Images

Authors: Gongyang Li, Zhi Liu, Weisi Lin, Haibin Ling

Abstract: In the computer vision community, great progress has been achieved in salient object detection from natural scene images (NSI-SOD); by contrast, salient object detection in optical remote sensing images (RSI-SOD) remains a challenging emerging topic. The unique characteristics of optical RSIs, such as scales, illuminations and imaging orientations, bring significant differences between NSI-SOD and RSI-SOD. In this paper, we propose a novel Multi-Content Complementation Network (MCCNet) to explore the complementarity of multiple content for RSI-SOD. Specifically, MCCNet is based on the general encoder-decoder architecture, and contains a novel key component named Multi-Content Complementation Module (MCCM), which bridges the encoder and the decoder. In MCCM, we consider multiple types of features that are critical to RSI-SOD, including foreground features, edge features, background features, and global image-level features, and exploit the content complementarity between them to highlight salient regions over various scales in RSI features through the attention mechanism. Besides, we comprehensively introduce pixel-level, map-level and metric-aware losses in the training phase. Extensive experiments on two popular datasets demonstrate that the proposed MCCNet outperforms 23 state-of-the-art methods, including both NSI-SOD and RSI-SOD methods. The code and results of our method are available at https://github.com/MathLee/MCCNet.

Submitted 1 December, 2021; originally announced December 2021.

Comments: 12 pages, 7 figures, Accepted by IEEE Transactions on Geoscience and Remote Sensing 2021

arXiv:2112.00995 [pdf, other]

SwinTrack: A Simple and Strong Baseline for Transformer Tracking

Authors: Liting Lin, Heng Fan, Zhipeng Zhang, Yong Xu, Haibin Ling

Abstract: Recently Transformer has been largely explored in tracking and shown state-of-the-art (SOTA) performance. However, existing efforts mainly focus on fusing and enhancing features generated by convolutional neural networks (CNNs). The potential of Transformer in representation learning remains under-explored. In this paper, we aim to further unleash the power of Transformer by proposing a simple yet efficient fully-attentional tracker, dubbed SwinTrack, within classic Siamese framework. In particular, both representation learning and feature fusion in SwinTrack leverage the Transformer architecture, enabling better feature interactions for tracking than pure CNN or hybrid CNN-Transformer frameworks. Besides, to further enhance robustness, we present a novel motion token that embeds historical target trajectory to improve tracking by providing temporal context. Our motion token is lightweight with negligible computation but brings clear gains. In our thorough experiments, SwinTrack exceeds existing approaches on multiple benchmarks. Particularly, on the challenging LaSOT, SwinTrack sets a new record with 0.713 SUC score. It also achieves SOTA results on other benchmarks. We expect SwinTrack to serve as a solid baseline for Transformer tracking and facilitate future research. Our codes and results are released at https://github.com/LitingLin/SwinTrack.

Submitted 13 October, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

Comments: 22 pages, 10 figures

Journal ref: Advances in Neural Information Processing Systems, 2022

arXiv:2111.14725 [pdf, other]

Searching the Search Space of Vision Transformer

Authors: Minghao Chen, Kan Wu, Bolin Ni, Houwen Peng, Bei Liu, Jianlong Fu, Hongyang Chao, Haibin Ling

Abstract: Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection, and thus been attracting fast-growing efforts on manually designing more effective architectures. In this paper, we propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space. The central idea is to gradually evolve different search dimensions guided by their E-T Error computed using a weight-sharing supernet. Moreover, we provide design guidelines of general vision transformers with extensive analysis according to the space searching process, which could promote the understanding of vision transformer. Remarkably, the searched models, named S3 (short for Searching the Search Space), from the searched space achieve superior performance to recently proposed models, such as Swin, DeiT and ViT, when evaluated on ImageNet. The effectiveness of S3 is also illustrated on object detection, semantic segmentation and visual question answering, demonstrating its generality to downstream vision and vision-language tasks. Code and models will be available at https://github.com/microsoft/Cream.

Submitted 29 November, 2021; originally announced November 2021.

Comments: Accepted to NIPS 2021

arXiv:2111.07566 [pdf]

Sweeping Plasma Frequency of Terahertz Surface Plasmon Polaritons with Graphene

Authors: Mingming Feng, Baoqing Zhang, Haotian Ling, Zihao Zhang, Yiming Wang, Yilin Wang, Xijian Zhang, **rang Hua, Qingpu Wang, Aimin Song, Yifei Zhang

Abstract: Plasma frequency is the spectral boundary for low-loss propagation and evanescent decay of surface plasmon polariton (SPP) waves, which corresponds to a high cut-off phenomenon and is typically utilized for identifying SPPs. At terahertz (THz) frequencies, a metal line with periodic metallic grooves can mimic the conventional optical SPPs, which is referred to as designer SPPs. Theoretically, the plasma frequency of THz SPPs decreases as the groove depth increases. Here, by replacing the metallic grooves with graphene sheets, dynamically sweeping SPP plasma frequency is demonstrated for the first time. The metal-graphene hybrid structure comprises a metal line with periodic graphene grooves, a thin-layer ion gel for gating graphene, and metallic tips for making the gate field uniform. As the chemical potential changes, the average conductivity of graphene is modulated so that the effective depth of the graphene grooves changes, which consequently sweeps the plasma frequency of THz SPPs. Both simulated and experimental data demonstrate a red shift of plasma frequency from 195 to 180 GHz at a low bias from -0.5 to 0.5 V. The proposed structure reveals a novel approach to control the on/off status of SPP propagation in the THz range.

Submitted 15 November, 2021; originally announced November 2021.

Comments: 19 pages, 6 figures

arXiv:2111.03186 [pdf, other]

EditGAN: High-Precision Semantic Image Editing

Authors: Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, Sanja Fidler

Abstract: Generative adversarial networks (GANs) have recently found applications in image editing. However, most GAN based image editing methods often require large scale datasets with semantic segmentation annotations for training, only provide high level control, or merely interpolate between different images. Here, we propose EditGAN, a novel method for high quality, high precision semantic image editing, allowing users to edit images by modifying their highly detailed part segmentation masks, e.g., drawing a new mask for the headlight of a car. EditGAN builds on a GAN framework that jointly models images and their semantic segmentations, requiring only a handful of labeled examples, making it a scalable tool for editing. Specifically, we embed an image into the GAN latent space and perform conditional latent code optimization according to the segmentation edit, which effectively also modifies the image. To amortize optimization, we find editing vectors in latent space that realize the edits. The framework allows us to learn an arbitrary number of editing vectors, which can then be directly applied on other images at interactive rates. We experimentally show that EditGAN can manipulate images with an unprecedented level of detail and freedom, while preserving full image quality. We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data. We demonstrate EditGAN on a wide variety of image types and quantitatively outperform several previous editing methods on standard editing benchmark tasks.

Submitted 4 November, 2021; originally announced November 2021.

arXiv:2110.09662 [pdf, other]

Osteoporosis Prescreening using Panoramic Radiographs through a Deep Convolutional Neural Network with Attention Mechanism

Authors: Heng Fan, Jiaxiang Ren, Jie Yang, Yi-Xian Qin, Haibin Ling

Abstract: Objectives. The aim of this study was to investigate whether a deep convolutional neural network (CNN) with an attention module can detect osteoporosis on panoramic radiographs. Study Design. A dataset of 70 panoramic radiographs (PRs) from 70 different subjects aged between 49 and 60 was used, including 49 subjects with osteoporosis and 21 normal subjects. We utilized the leave-one-out cross-validation approach to generate 70 training and test splits. Specifically, for each split, one image was used for testing and the remaining 69 images were used for training. A deep convolutional neural network (CNN) using the Siamese architecture was implemented through a fine-tuning process to classify a PR image using patches extracted from eight representative trabecular bone areas (Figure 1). In order to automatically learn the importance of different PR patches, an attention module was integrated into the deep CNN. Three metrics, including osteoporosis accuracy (OPA), non-osteoporosis accuracy (NOPA) and overall accuracy (OA), were utilized for performance evaluation. Results. The proposed baseline CNN approach achieved the OPA, NOPA and OA scores of 0.667, 0.878 and 0.814, respectively. With the help of the attention module, the OPA, NOPA and OA scores were further improved to 0.714, 0.939 and 0.871, respectively. Conclusions. The proposed method obtained promising results using deep CNN with an attention module, which might be applied to osteoporosis prescreening.
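The leave-one-out procedure described in the abstract can be sketched as follows. This is a minimal illustration of how 70 train/test splits arise from 70 images; the function name is hypothetical, and the model training itself is deliberately left out.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_index) pairs, one split per sample."""
    for test_index in range(n_samples):
        # Every sample except the held-out one is used for training.
        train_indices = [i for i in range(n_samples) if i != test_index]
        yield train_indices, test_index


# 70 radiographs give 70 splits, each with 69 training images and 1 test image.
splits = list(leave_one_out_splits(70))
```

Each image is tested exactly once, so the reported accuracies are averages over all 70 held-out predictions.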

Submitted 18 October, 2021; originally announced October 2021.

Comments: 9 pages

arXiv:2110.09057 [pdf, other]

Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization

Authors: Tao Sun, Huaming Ling, Zuoqiang Shi, Dongsheng Li, Bao Wang

Abstract: Heavy ball momentum is crucial in accelerating (stochastic) gradient-based optimization algorithms for machine learning. Existing heavy ball momentum is usually weighted by a uniform hyperparameter, which relies on excessive tuning. Moreover, the calibrated fixed hyperparameter may not lead to optimal performance. In this paper, to eliminate the effort for tuning the momentum-related hyperparameter, we propose a new adaptive momentum inspired by the optimal choice of the heavy ball momentum for quadratic optimization. Our proposed adaptive heavy ball momentum can improve stochastic gradient descent (SGD) and Adam. SGD and Adam with the newly designed adaptive momentum are more robust to large learning rates, converge faster, and generalize better than the baselines. We verify the efficiency of SGD and Adam with the new adaptive momentum on extensive machine learning benchmarks, including image classification, language modeling, and machine translation. Finally, we provide convergence guarantees for SGD and Adam with the proposed adaptive momentum.
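For context, the classical heavy ball method with a fixed momentum weight can be sketched on a simple quadratic. The example below uses Polyak's well-known optimal step size and momentum for a quadratic whose curvature lies between mu and L; it is not the adaptive scheme proposed in the paper, and the function name and the diagonal test problem are assumptions chosen for illustration.

```python
import math


def heavy_ball_quadratic(eigs, x0, steps):
    """Minimize f(x) = 0.5 * sum(h_i * x_i**2) with fixed heavy ball momentum."""
    mu, big_l = min(eigs), max(eigs)
    root_mu, root_l = math.sqrt(mu), math.sqrt(big_l)
    # Polyak's optimal parameters for quadratics with curvature in [mu, L].
    alpha = 4.0 / (root_l + root_mu) ** 2
    beta = ((root_l - root_mu) / (root_l + root_mu)) ** 2

    x_prev = list(x0)
    x = list(x0)
    for _ in range(steps):
        grad = [h * xi for h, xi in zip(eigs, x)]
        # Gradient step plus a momentum term built from the previous iterate.
        x_next = [xi - alpha * g + beta * (xi - xp)
                  for xi, g, xp in zip(x, grad, x_prev)]
        x_prev, x = x, x_next
    return x


# Condition number 100: the iterates still contract quickly toward the minimizer at 0.
x_final = heavy_ball_quadratic([1.0, 100.0], [1.0, 1.0], 200)
```

The fixed beta here depends on knowing mu and L in advance, which is the tuning burden that an adaptive momentum is meant to remove.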

Submitted 18 October, 2021; originally announced October 2021.

arXiv:2110.03139 [pdf, other]

doi 10.1103/PhysRevA.105.023319

Topological study of a Bogoliubov-de Gennes system of pseudo spin-$1/2$ bosons with conserved magnetization in a honeycomb lattice

Authors: Hong Y. Ling, Ben Kain

Abstract: We consider a Bogoliubov-de Gennes (BdG) Hamiltonian, which is a non-Hermitian Hamiltonian with pseudo-Hermiticity, for a system of (pseudo) spin-$1/2$ bosons in a honeycomb lattice under the condition that the population difference between the two spin components, i.e., magnetization, is a constant. Such a system is capable of acting as a topological amplifier, under time-reversal symmetry, with stable bulk bands but unstable edge modes which can be populated at an exponentially fast rate. We quantitatively study the topological properties of this model within the framework of the 38-fold way for non-Hermitian systems. We find, through the symmetry analysis of the Bloch Hamiltonian, that this model is classified either as two copies of symmetry class AIII+$η_-$ or two copies of symmetry class A+$η$ depending on whether the (total) system is time-reversal-symmetric, where $η$ is the matrix representing pseudo-Hermiticity and $η_-$ indicates that pseudo-Hermiticity and chiral symmetry operators anticommute. We prove, within the context of non-Hermitian physics where eigenstates obey the bi-orthonormality relation, that a stable bulk is characterized by a single topological invariant, the Chern number for the Haldane model, independent of pairing interactions. We construct a convenient analytical description for the edge modes of the Haldane model in semi-infinite planes, which is expected to be useful for models built upon copies of the Haldane model across a broad array of disciplines. We adapt the theorem in our recent work [Phys. Rev. A 104, 013305 (2021)] to pseudo-Hermitian Hamiltonians that are less restrictive than BdG Hamiltonians and apply it to highlight that the vanishing of an unconventional commutator between number-conserving and number-nonconserving parts of the Hamiltonian indicates whether a system can be made to act as a topological amplifier.

Submitted 8 June, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: 20 pages, 7 figures

Journal ref: Phys. Rev. A 105, 023319 (2022)

arXiv:2110.01676 [pdf, other]

Deep Learning Approach Protecting Privacy in Camera-Based Critical Applications

Authors: Gautham Ramajayam, Tao Sun, Chiu C. Tan, Lannan Luo, Haibin Ling

Abstract: Many critical applications rely on cameras to capture video footage for analytical purposes. This has led to concerns about these cameras accidentally capturing more information than is necessary. In this paper, we propose a deep learning approach towards protecting privacy in camera-based systems. Instead of specifying that particular objects (e.g., faces) are privacy sensitive, our technique distinguishes between salient (visually prominent) and non-salient objects based on the intuition that the latter are unlikely to be needed by the application.

Submitted 4 October, 2021; originally announced October 2021.

arXiv:2109.00240 [pdf, other]

Joint Graph Learning and Matching for Semantic Feature Correspondence

Authors: He Liu, Tao Wang, Yidong Li, Congyan Lang, Yi Jin, Haibin Ling

Abstract: In recent years, powered by the learned discriminative representation via graph neural network (GNN) models, deep graph matching methods have made great progress in the task of matching semantic features. However, these methods usually rely on heuristically generated graph patterns, which may introduce unreliable relationships to hurt the matching performance. In this paper, we propose a joint graph learning and matching network, named GLAM, to explore reliable graph structures for boosting graph matching. GLAM adopts a pure attention-based framework for both graph learning and graph matching. Specifically, it employs two types of attention mechanisms, self-attention and cross-attention, for the task. The self-attention discovers the relationships between features and further updates feature representations over the learnt structures; and the cross-attention computes cross-graph correlations between the two feature sets to be matched for feature reconstruction. Moreover, the final matching solution is directly derived from the output of the cross-attention layer, without employing a specific matching decision module. The proposed method is evaluated on three popular visual matching benchmarks (Pascal VOC, Willow Object and SPair-71k), and it outperforms previous state-of-the-art graph matching methods by significant margins on all benchmarks. Furthermore, the graph patterns learnt by our model are validated to be able to remarkably enhance previous deep graph matching methods by replacing their handcrafted graph structures with the learnt ones.

Submitted 17 November, 2021; v1 submitted 1 September, 2021; originally announced September 2021.

arXiv:2108.06017 [pdf, other]

AGKD-BML: Defense Against Adversarial Attack by Attention Guided Knowledge Distillation and Bi-directional Metric Learning

Authors: Hong Wang, Yuefan Deng, Shinjae Yoo, Haibin Ling, Yuewei Lin

Abstract: While deep neural networks have shown impressive performance in many tasks, they are fragile to carefully designed adversarial attacks. We propose a novel adversarial training-based model by Attention Guided Knowledge Distillation and Bi-directional Metric Learning (AGKD-BML). The attention knowledge is obtained from a weight-fixed model trained on a clean dataset, referred to as a teacher model, and transferred to a model that is under training on adversarial examples (AEs), referred to as a student model. In this way, the student model is able to focus on the correct region, as well as correcting the intermediate features corrupted by AEs to eventually improve the model accuracy. Moreover, to efficiently regularize the representation in feature space, we propose a bidirectional metric learning. Specifically, given a clean image, it is first attacked to its most confusing class to get the forward AE. A clean image in the most confusing class is then randomly picked and attacked back to the original class to get the backward AE. A triplet loss is then used to shorten the representation distance between the original image and its AE, while enlarging that between the forward and backward AEs. We conduct extensive adversarial robustness experiments on two widely used datasets with different attacks. Our proposed AGKD-BML model consistently outperforms the state-of-the-art approaches. The code of AGKD-BML will be available at: https://github.com/hongw579/AGKD-BML.
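The abstract mentions a triplet loss. A generic margin-based triplet loss on plain feature vectors might look like the sketch below; how the paper assigns the original image, the forward AE, and the backward AE to the anchor, positive, and negative roles is not stated here, so the roles, the margin value, and the function names are all assumptions for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Penalize triplets where the positive is not closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Positive is close and negative is far: the constraint is satisfied, so no loss.
loss_satisfied = triplet_margin_loss([0.0, 0.0], [0.1, 0.0], [3.0, 4.0])
# Positive is far and negative is close: the loss is 5 - 1 + 1 = 5.
loss_violated = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Minimizing this quantity pulls the anchor and positive representations together while pushing the negative away, which matches the "shorten" and "enlarge" behaviour described in the abstract.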

Submitted 12 August, 2021; originally announced August 2021.

Comments: ICCV 2021 paper

arXiv:2108.00616 [pdf, other]

RINDNet: Edge Detection for Discontinuity in Reflectance, Illumination, Normal and Depth

Authors: Mengyang Pu, Yaping Huang, Qingji Guan, Haibin Ling

Abstract: As a fundamental building block in computer vision, edges can be categorised into four types according to the discontinuity in surface-Reflectance, Illumination, surface-Normal or Depth. While great progress has been made in detecting generic or individual types of edges, it remains under-explored to comprehensively study all four edge types together. In this paper, we propose a novel neural network solution, RINDNet, to jointly detect all four types of edges. Taking into consideration the distinct attributes of each type of edges and the relationship between them, RINDNet learns effective representations for each of them and works in three stages. In stage I, RINDNet uses a common backbone to extract features shared by all edges. Then in stage II it branches to prepare discriminative features for each edge type by the corresponding decoder. In stage III, an independent decision head for each type aggregates the features from previous stages to predict the initial results. Additionally, an attention module learns attention maps for all types to capture the underlying relations between them, and these maps are combined with initial results to generate the final edge detection results. For training and evaluation, we construct the first public benchmark, BSDS-RIND, with all four types of edges carefully annotated. In our experiments, RINDNet yields promising results in comparison with state-of-the-art methods. Additional analysis is presented in supplementary material.

Submitted 1 August, 2021; originally announced August 2021.

Comments: Accepted by ICCV2021

arXiv:2107.13363 [pdf, other]

doi 10.13140/RG.2.2.11070.20807

From Monopoly to Competition: Optimal Contests Prevail

Authors: Xiaotie Deng, Yotam Gafni, Ron Lavi, Tao Lin, Hongyi Ling

Abstract: We study competition among contests in a general model that allows for an arbitrary and heterogeneous space of contest design, where the goal of the contest designers is to maximize the contestants' sum of efforts. Our main result shows that optimal contests in the monopolistic setting (i.e., those that maximize the sum of efforts in a model with a single contest) form an equilibrium in the model with competition among contests. Under a very natural assumption these contests are in fact dominant, and the equilibria that they form are unique. Moreover, equilibria with the optimal contests are Pareto-optimal even in cases where other equilibria emerge. In many natural cases, they also maximize the social welfare.

Submitted 28 July, 2021; originally announced July 2021.

arXiv:2107.08766 [pdf, other]

VisDrone-CC2020: The Vision Meets Drone Crowd Counting Challenge Results

Authors: Dawei Du, Longyin Wen, Pengfei Zhu, Heng Fan, Qinghua Hu, Haibin Ling, Mubarak Shah, Junwen Pan, Ali Al-Ali, Amr Mohamed, Bakour Imene, Bin Dong, Binyu Zhang, Bouchali Hadia Nesma, Chenfeng Xu, Chenzhen Duan, Ciro Castiello, Corrado Mencar, Dingkang Liang, Florian Krüger, Gennaro Vessio, Giovanna Castellano, Jieru Wang, Junyu Gao, Khalid Abualsaud, et al. (30 additional authors not shown)

Abstract: Crowd counting on the drone platform is an interesting topic in computer vision, which brings new challenges such as small object inference, background clutter and wide viewpoint. However, there are few algorithms focusing on crowd counting on the drone-captured data due to the lack of comprehensive datasets. To this end, we collect a large-scale dataset and organize the Vision Meets Drone Crowd Counting Challenge (VisDrone-CC2020) in conjunction with the 16th European Conference on Computer Vision (ECCV 2020) to promote the developments in the related fields. The collected dataset is formed by $3,360$ images, including $2,460$ images for training, and $900$ images for testing. Specifically, we manually annotate persons with points in each video frame. There are $14$ algorithms from $15$ institutes submitted to the VisDrone-CC2020 Challenge. We provide a detailed analysis of the evaluation results and conclude the challenge. More information can be found at the website: http://www.aiskyeye.com/.

Submitted 19 July, 2021; originally announced July 2021.

Comments: The method description of A7 Multi-Scale Aware based SFANet (M-SFANet) is updated and missing references are added

Journal ref: European Conference on Computer Vision. Springer, Cham, 2020: 675-691

arXiv:2107.00651 [pdf, other]

AutoFormer: Searching Transformers for Visual Recognition

Authors: Minghao Chen, Houwen Peng, Jianlong Fu, Haibin Ling

Abstract: Recently, pure transformer-based models have shown great potential for vision tasks such as image classification and detection. However, the design of transformer networks is challenging. It has been observed that the depth, embedding dimension, and number of heads can largely affect the performance of vision transformers. Previous models configure these dimensions based upon manual crafting. In this work, we propose a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search. AutoFormer entangles the weights of different blocks in the same layers during supernet training. Benefiting from the strategy, the trained supernet allows thousands of subnets to be very well-trained. Specifically, the performance of these subnets with weights inherited from the supernet is comparable to those retrained from scratch. Besides, the searched models, which we refer to as AutoFormers, surpass the recent state-of-the-arts such as ViT and DeiT. In particular, AutoFormer-tiny/small/base achieve 74.7%/81.7%/82.4% top-1 accuracy on ImageNet with 5.7M/22.9M/53.7M parameters, respectively. Lastly, we verify the transferability of AutoFormer by providing the performance on downstream benchmarks and distillation experiments. Code and models are available at https://github.com/microsoft/AutoML.

Submitted 1 July, 2021; originally announced July 2021.

Comments: Github: https://github.com/microsoft/AutoML

arXiv:2107.00420 [pdf, other]

doi 10.1109/TIP.2022.3216771

CBNet: A Composite Backbone Network Architecture for Object Detection

Authors: Tingting Liang, Xiaojie Chu, Yudong Liu, Yongtao Wang, Zhi Tang, Wei Chu, Jingdong Chen, Haibin Ling

Abstract: Modern top-performing object detectors depend heavily on backbone networks, whose advances bring consistent performance gains through exploring more effective network structures. In this paper, we propose a novel and flexible backbone framework, namely CBNetV2, to construct high-performance detectors using existing open-sourced pre-trained backbones under the pre-training fine-tuning paradigm. In particular, CBNetV2 architecture groups multiple identical backbones, which are connected through composite connections. Specifically, it integrates the high- and low-level features of multiple backbone networks and gradually expands the receptive field to more efficiently perform object detection. We also propose a better training strategy with assistant supervision for CBNet-based detectors. Without additional pre-training of the composite backbone, CBNetV2 can be adapted to various backbones (CNN-based vs. Transformer-based) and head designs of most mainstream detectors (one-stage vs. two-stage, anchor-based vs. anchor-free-based). Experiments provide strong evidence that, compared with simply increasing the depth and width of the network, CBNetV2 introduces a more efficient, effective, and resource-friendly way to build high-performance backbone networks.
Particularly, our Dual-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev under the single-model and single-scale testing protocol, which is significantly better than the state-of-the-art result (57.7% box AP and 50.2% mask AP) achieved by Swin-L, while the training schedule is reduced by 6$\times$. With multi-scale testing, we push the current best single model result to a new record of 60.1% box AP and 52.3% mask AP without using extra training data. Code is available at https://github.com/VDIGPKU/CBNetV2.

Submitted 18 October, 2022; v1 submitted 1 July, 2021; originally announced July 2021.

Comments: IEEE Transactions on Image Processing (TIP) camera ready

arXiv:2106.06744 [pdf, other]

DeepMMSA: A Novel Multimodal Deep Learning Method for Non-small Cell Lung Cancer Survival Analysis

Authors: Yujiao Wu, Jie Ma, Xiaoshui Huang, Sai Ho Ling, Steven Weidong Su

Abstract: Lung cancer is the leading cause of cancer death worldwide. The critical reason for the deaths is delayed diagnosis and poor prognosis. With the accelerated development of deep learning techniques, it has been successfully applied extensively in many real-world applications, including health sectors such as medical image interpretation and disease diagnosis. By combining more modalities that are engaged in the processing of information, multimodal learning can extract better features and improve predictive ability. The conventional methods for lung cancer survival analysis normally utilize clinical data and only provide a statistical probability. To improve the survival prediction accuracy and help prognostic decision-making in clinical practice for medical experts, we for the first time propose a multimodal deep learning method for non-small cell lung cancer (NSCLC) survival analysis, named DeepMMSA. This method leverages CT images in combination with clinical data, enabling the abundant information held within medical images to be associated with lung cancer survival information. We validate our method on the data of 422 NSCLC patients from The Cancer Imaging Archive (TCIA). Experimental results support our hypothesis that there is an underlying relationship between prognostic information and radiomic images.
Besides, quantitative results show that the established multimodal model can be applied to traditional methods and has the potential to break the bottleneck of existing methods and increase the percentage of concordant pairs (correctly predicted pairs) in the overall population by 4%.

Submitted 12 June, 2021; originally announced June 2021.

Comments: 7 Submitted to IEEE TBME

arXiv:2106.03432 [pdf, other]

Channel DropBlock: An Improved Regularization Method for Fine-Grained Visual Classification

Authors: Yifeng Ding, Shuwei Dong, Yujun Tong, Zhanyu Ma, Bo Xiao, Haibin Ling

Abstract: Classifying the sub-categories of an object from the same super-category (e.g., bird) in a fine-grained visual classification (FGVC) task highly relies on mining multiple discriminative features. Existing approaches mainly tackle this problem by introducing attention mechanisms to locate the discriminative parts or feature encoding approaches to extract the highly parameterized features in a weakly-supervised fashion. In this work, we propose a lightweight yet effective regularization method named Channel DropBlock (CDB), in combination with two alternative correlation metrics, to address this problem. The key idea is to randomly mask out a group of correlated channels during training to destruct features from co-adaptations and thus enhance feature representations. Extensive experiments on three benchmark FGVC datasets show that CDB effectively improves the performance.

Submitted 7 June, 2021; originally announced June 2021.

arXiv:2106.01217 [pdf, other]

doi 10.1109/IJCB52358.2021.9484387

DFGC 2021: A DeepFake Game Competition

Authors: Bo Peng, Hongxing Fan, Wei Wang, Jing Dong, Yuezun Li, Siwei Lyu, Qi Li, Zhenan Sun, Han Chen, Baoying Chen, Yanjie Hu, Shenghai Luo, Junrui Huang, Yutong Yao, Boyuan Liu, Hefei Ling, Guosheng Zhang, Zhiliang Xu, Changtao Miao, Changlei Lu, Shan He, Xiaoyan Wu, Wanyi Zhuang

Abstract: This paper presents a summary of the DFGC 2021 competition. DeepFake technology is developing fast, and realistic face-swaps are increasingly deceiving and hard to detect. At the same time, DeepFake detection methods are also improving. There is a two-party game between DeepFake creators and detectors. This competition provides a common platform for benchmarking the adversarial game between current state-of-the-art DeepFake creation and detection methods. In this paper, we present the organization, results and top solutions of this competition and also share our insights obtained during this event. We also release the DFGC-21 testing dataset collected from our participants to further benefit the research community.

Submitted 2 June, 2021; originally announced June 2021.

Journal ref: 2021 IEEE International Joint Conference on Biometrics (IJCB), 2021, pp. 1-8

arXiv:2105.14065 [pdf, other]

TransCamP: Graph Transformer for 6-DoF Camera Pose Estimation

Authors: Xinyi Li, Haibin Ling

Abstract: Camera pose estimation or camera relocalization is the centerpiece in numerous computer vision tasks such as visual odometry, structure from motion (SfM) and SLAM. In this paper we propose a neural network approach with a graph transformer backbone, namely TransCamP, to address the camera relocalization problem. In contrast with prior work where the pose regression is mainly guided by photometric consistency, TransCamP effectively fuses the image features, camera pose information and inter-frame relative camera motions into encoded graph attributes and is trained towards the graph consistency and accuracy instead, yielding significantly higher computational efficiency. By leveraging graph transformer layers with edge features and enabling tensorized adjacency matrix, TransCamP dynamically captures the global attention and thus endows the pose graph with evolving structures to achieve improved robustness and accuracy. In addition, optional temporal transformer layers actively enhance the spatiotemporal inter-frame relation for sequential inputs. Evaluation of the proposed network on various public benchmarks demonstrates that TransCamP outperforms state-of-the-art approaches.

Submitted 28 May, 2021; originally announced May 2021.

arXiv:2104.06565 [pdf, other]

Optimal Rates of Teaching and Learning Under Uncertainty

Authors: Yan Hao Ling, Jonathan Scarlett

Abstract: In this paper, we consider a recently-proposed model of teaching and learning under uncertainty, in which a teacher receives independent observations of a single bit corrupted by binary symmetric noise, and sequentially transmits to a student through another binary symmetric channel based on the bits observed so far. After a given number $n$ of transmissions, the student outputs an estimate of the unknown bit, and we are interested in the exponential decay rate of the error probability as $n$ increases. We propose a novel block-structured teaching strategy in which the teacher encodes the number of 1s received in each block, and show that the resulting error exponent is the binary relative entropy $D\big(\frac{1}{2}\|\max(p,q)\big)$, where $p$ and $q$ are the noise parameters. This matches a trivial converse result based on the data processing inequality, and settles two conjectures of [Jog and Loh, 2021] and [Huleihel, Polyanskiy, and Shayevitz, 2019]. In addition, we show that the computation time required by the teacher and student is linear in $n$. We also study a more general setting in which the binary symmetric channels are replaced by general binary-input discrete memoryless channels. We provide an achievability bound and a converse bound, and show that the two coincide in certain cases, including (i) when the two channels are identical, and (ii) when the student-teacher channel is a binary symmetric channel. More generally, we give sufficient conditions under which our learning rate is the best possible for block-structured protocols.

Submitted 7 December, 2022; v1 submitted 13 April, 2021; originally announced April 2021.

Comments: IEEE Transactions on Information Theory, Volume 67, Issue 11, pp. 7067-7080, Nov. 2021. This version slightly modifies/expands the 'Existing Results' section

arXiv:2104.06490 [pdf, other]

DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort

Authors: Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-Francois Lafleche, Adela Barriuso, Antonio Torralba, Sanja Fidler

Abstract: We introduce DatasetGAN: an automatic procedure to generate massive datasets of high-quality semantically segmented images requiring minimal human effort. Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets, which are time consuming to annotate. Our method relies on the power of recent GANs to generate realistic images. We show how the GAN latent code can be decoded to produce a semantic segmentation of the image. Training the decoder only needs a few labeled examples to generalize to the rest of the latent space, resulting in an infinite annotated dataset generator! These generated datasets can then be used for training any computer vision architecture just as real datasets are. As only a few images need to be manually segmented, it becomes possible to annotate images in extreme detail and generate datasets with rich object and part segmentations. To showcase the power of our approach, we generated datasets for 7 image segmentation tasks which include pixel-level labels for 34 human face parts, and 32 car parts. Our approach outperforms all semi-supervised baselines significantly and is on par with fully supervised methods, which in some cases require as much as 100x more annotated data than our method.

Submitted 19 April, 2021; v1 submitted 13 April, 2021; originally announced April 2021.

Comments: Accepted to CVPR 2021 as an Oral paper. Webpage: https://nv-tlabs.github.io/datasetGAN/

arXiv:2104.00597 [pdf, other]

One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking

Authors: Minghao Chen, Houwen Peng, Jianlong Fu, Haibin Ling

Abstract: Despite remarkable progress achieved, most neural architecture search (NAS) methods focus on searching for one single accurate and robust architecture. To further build models with better generalization capability and performance, model ensemble is usually adopted and performs better than stand-alone models. Inspired by the merits of model ensemble, we propose to search for multiple diverse models simultaneously as an alternative way to find powerful models. Searching for ensembles is non-trivial and has two key challenges: enlarged search space and potentially more complexity for the searched model. In this paper, we propose a one-shot neural ensemble architecture search (NEAS) solution that addresses the two challenges. For the first challenge, we introduce a novel diversity-based metric to guide search space shrinking, considering both the potentiality and diversity of candidate operators. For the second challenge, we enable a new search dimension to learn layer sharing among different models for efficiency purposes. The experiments on ImageNet clearly demonstrate that our solution can improve the supernet's capacity of ranking ensemble architectures, and further lead to better search results. The discovered architectures achieve superior performance compared with state-of-the-arts such as MobileNetV3 and EfficientNet families under aligned settings. Moreover, we evaluate the generalization ability and robustness of our searched architecture on the COCO detection benchmark and achieve a 3.1% improvement on AP compared with MobileNetV3.
Codes and models are available at https://github.com/researchmm/NEAS.

Submitted 16 July, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

Comments: Accepted to CVPR 2021

arXiv:2104.00194 [pdf, other]

TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

Authors: Peng Chu, Jiang Wang, Quanzeng You, Haibin Ling, Zicheng Liu

Abstract: Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named TransMOT, which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. TransMOT effectively models the interactions of a large number of objects by arranging the trajectories of the tracked objects as a set of sparse weighted graphs, and constructing a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial graph transformer decoder layer based on the graphs. TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy. To further improve the tracking speed and accuracy, we propose a cascade association framework to handle low-score detections and long-term occlusions that require large computational resources to model in TransMOT. The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20, and it achieves state-of-the-art performance on all the datasets.

Submitted 3 April, 2021; v1 submitted 31 March, 2021; originally announced April 2021.

arXiv:2103.15841 [pdf, other]

doi 10.1103/PhysRevB.105.035156

Magnetic skyrmion crystal at a topological insulator surface

Authors: Stefan Divic, Henry Ling, T. Pereg-Barnea, Arun Paramekanti

Abstract: We consider a magnetic skyrmion crystal formed at the surface of a topological insulator. Incorporating the exchange interaction between the helical Dirac surface states and the periodic Néel or Bloch skyrmion texture, we obtain the resulting electronic band structure and discuss the constraints that symmetries impose on the energies and Berry curvature. We find substantive qualitative differences between the Néel and Bloch cases, with the latter generically permitting a multiband low energy tight-binding representation whose parameters are tightly constrained by symmetries. We explicitly compute the associated Wannier orbitals, which resemble the ringlike chiral bound states of helical Dirac fermions coupled to a single skyrmion in a ferromagnetic background. We construct a two-band tight-binding model with real nearest-neighbor hoppings which captures the salient topological features of the low-energy bands. Our results are relevant to magnetic topological insulators (TIs), as well as to TI-magnetic thin film heterostructures, in which skyrmion crystals may be stabilized.

Submitted 13 February, 2022; v1 submitted 29 March, 2021; originally announced March 2021.

Comments: 21 pages, 7 figures

Journal ref: Phys. Rev. B 105, 035156 (2022)

arXiv:2103.14337 [pdf, other]

Hands-on Guidance for Distilling Object Detectors

Authors: Yangyang Qin, Hefei Ling, Zhenghai He, Yuxuan Shi, Lei Wu

Abstract: Knowledge distillation can lead to deploy-friendly networks against the plagued computational complexity problem, but previous methods neglect the feature hierarchy in detectors. Motivated by this, we propose a general framework for detection distillation. Our method, called Hands-on Guidance Distillation, distills the latent knowledge of all stage features for imposing more comprehensive supervision, and focuses on the essence simultaneously for promoting more intense knowledge absorption. Specifically, a series of novel mechanisms are designed elaborately, including correspondence establishment for consistency, hands-on imitation loss measure and re-weighted optimization from both micro and macro perspectives. We conduct extensive evaluations with different distillation configurations over VOC and COCO datasets, which show better performance on accuracy and speed trade-offs. Meanwhile, feasibility experiments on different structural networks further prove the robustness of our HGD.

Submitted 12 May, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: Accepted at ICME2021

arXiv:2103.14274 [pdf, other]

doi 10.1145/3386569.3392422

Character Controllers Using Motion VAEs

Authors: Hung Yu Ling, Fabio Zinno, George Cheng, Michiel van de Panne

Abstract: A fundamental problem in computer animation is that of realizing purposeful and realistic human movement given a sufficiently-rich set of motion capture clips. We learn data-driven generative models of human movement using autoregressive conditional variational autoencoders, or Motion VAEs. The latent variables of the learned autoencoder define the action space for the movement and thereby govern its evolution over time. Planning or control algorithms can then use this action space to generate desired motions. In particular, we use deep reinforcement learning to learn controllers that achieve goal-directed movements. We demonstrate the effectiveness of the approach on multiple tasks. We further evaluate system-design choices and describe the current limitations of Motion VAEs.

Submitted 26 March, 2021; originally announced March 2021.

Comments: Project page: https://www.cs.ubc.ca/~hyuling/projects/mvae/ ; Code: https://github.com/electronicarts/character-motion-vaes

arXiv:2103.14028 [pdf]

doi 10.1038/s41565-021-01023-x

Light-Matter Coupling in Scalable Van der Waals Superlattices

Authors: Pawan Kumar, Jason Lynch, Baokun Song, Haonan Ling, Francisco Barrera, Huiqin Zhang, Surendra B. Anantharaman, Jagrit Digani, Haoyue Zhu, Tanushree H. Choudhury, Clifford McAleese, Xiaochen Wang, Ben R. Conran, Oliver Whear, Michael J. Motala, Michael Snure, Christopher Muratore, Joan M. Redwing, Nicholas R. Glavin, Eric A. Stach, Artur R. Davoyan, Deep Jariwala

Abstract: Two-dimensional (2D) crystals have renewed opportunities in design and assembly of artificial lattices without the constraints of epitaxy. However, the lack of thickness control in exfoliated van der Waals (vdW) layers prevents realization of repeat units with high fidelity. Recent availability of uniform, wafer-scale samples permits engineering of both electronic and optical dispersions in stacks of disparate 2D layers with multiple repeating units. We present optical dispersion engineering in a superlattice structure comprised of alternating layers of 2D excitonic chalcogenides and dielectric insulators. By carefully designing the unit cell parameters, we demonstrate > 90 % narrowband absorption in < 4 nm active layer excitonic absorber medium at room temperature, concurrently with enhanced photoluminescence in cm² samples. These superlattices show evidence of strong light-matter coupling and exciton-polariton formation with geometry-tunable coupling constants. Our results demonstrate proof of concept structures with engineered optical properties and pave the way for a broad class of scalable, designer optical metamaterials from atomically-thin layers.

Submitted 25 March, 2021; originally announced March 2021.

Comments: 4 figures + supporting

arXiv:2103.04507 [pdf, other]

OPANAS: One-Shot Path Aggregation Network Architecture Search for Object Detection

Authors: Tingting Liang, Yongtao Wang, Zhi Tang, Guosheng Hu, Haibin Ling

Abstract: Recently, neural architecture search (NAS) has been exploited to design feature pyramid networks (FPNs) and achieved promising results for visual object detection. Encouraged by the success, we propose a novel One-Shot Path Aggregation Network Architecture Search (OPANAS) algorithm, which significantly improves both searching efficiency and detection accuracy. Specifically, we first introduce six heterogeneous information paths to build our search space, namely top-down, bottom-up, fusing-splitting, scale-equalizing, skip-connect and none. Second, we propose a novel search space of FPNs, in which each FPN candidate is represented by a densely-connected directed acyclic graph (each node is a feature pyramid and each edge is one of the six heterogeneous information paths). Third, we propose an efficient one-shot search method to find the optimal path aggregation architecture, that is, we first train a super-net and then find the optimal candidate with an evolutionary algorithm.
Experimental results demonstrate the efficacy of the proposed OPANAS for object detection: (1) OPANAS is more efficient than state-of-the-art methods (e.g., NAS-FPN and Auto-FPN), at significantly smaller searching cost (e.g., only 4 GPU days on MS-COCO); (2) the optimal architecture found by OPANAS significantly improves main-stream detectors including RetinaNet, Faster R-CNN and Cascade R-CNN, by 2.3-3.2 % mAP compared to their FPN counterparts; and (3) a new state-of-the-art accuracy-speed trade-off (52.2 % mAP at 7.6 FPS) at smaller training costs than comparable state-of-the-arts. Code will be released at https://github.com/VDIGPKU/OPANAS.

Submitted 11 March, 2021; v1 submitted 7 March, 2021; originally announced March 2021.

Comments: To appear in CVPR 2021

arXiv:2102.05454 [pdf, other]

On the Robustness of Multi-View Rotation Averaging

Authors: Xinyi Li, Haibin Ling

Abstract: Rotation averaging is a synchronization process on single or multiple rotation groups, and is a fundamental problem in many computer vision tasks such as multi-view structure from motion (SfM). Specifically, rotation averaging involves the recovery of an underlying pose-graph consistency from pairwise relative camera poses. Specifically, given pairwise motion in rotation groups, especially 3-dimensional rotation groups (e.g., $\mathbb{SO}(3)$), one is interested in recovering the original signal of multiple rotations with respect to a fixed frame. In this paper, we propose a robust framework to solve multiple rotation averaging problem, especially in the cases that a significant amount of noisy measurements are present. By introducing the $\epsilon$-cycle consistency term into the solver, we enable the robust initialization scheme to be implemented into the IRLS solver. Instead of conducting the costly edge removal, we implicitly constrain the negative effect of erroneous measurements by weight reducing, such that IRLS failures caused by poor initialization can be effectively avoided. Experiment results demonstrate that our proposed approach outperforms state of the arts on various benchmarks.

Submitted 9 February, 2021; originally announced February 2021.

arXiv:2101.09014 [pdf, other]

doi 10.1109/TIP.2020.3044440

Personal Fixations-Based Object Segmentation with Object Localization and Boundary Preservation

Authors: Gongyang Li, Zhi Liu, Ran Shi, Zheng Hu, Weijie Wei, Yong Wu, Mengke Huang, Haibin Ling

Abstract: As a natural way for human-computer interaction, fixation provides a promising solution for interactive image segmentation. In this paper, we focus on Personal Fixations-based Object Segmentation (PFOS) to address issues in previous studies, such as the lack of an appropriate dataset and the ambiguity in fixations-based interaction. In particular, we first construct a new PFOS dataset by carefully collecting pixel-level binary annotation data over an existing fixation prediction dataset; this dataset is expected to greatly facilitate study along this line. Then, considering characteristics of personal fixations, we propose a novel network based on Object Localization and Boundary Preservation (OLBP) to segment the gazed objects. Specifically, the OLBP network utilizes an Object Localization Module (OLM) to analyze personal fixations and locates the gazed objects based on the interpretation. Then, a Boundary Preservation Module (BPM) is designed to introduce additional boundary information to guard the completeness of the gazed objects. Moreover, OLBP is organized in a mixed bottom-up and top-down manner with multiple types of deep supervision. Extensive experiments on the constructed PFOS dataset show the superiority of the proposed OLBP network over 17 state-of-the-art methods, and demonstrate the effectiveness of the proposed OLM and BPM components. The constructed PFOS dataset and the proposed OLBP network are available at https://github.com/MathLee/OLBPNet4PFOS.

Submitted 22 January, 2021; originally announced January 2021.

Comments: Accepted by IEEE TIP. Code: https://github.com/MathLee/OLBPNet4PFOS

arXiv:2012.11803 [pdf, other]

Modeling Deep Learning Based Privacy Attacks on Physical Mail

Authors: Bingyao Huang, Ruyi Lian, Dimitris Samaras, Haibin Ling

Abstract: Mail privacy protection aims to prevent unauthorized access to hidden content within an envelope since normal paper envelopes are not as safe as we think. In this paper, for the first time, we show that with a well-designed deep learning model, the hidden content may be largely recovered without opening the envelope. We start by modeling deep learning-based privacy attacks on physical mail content as learning the mapping from the camera-captured envelope front face image to the hidden content; we then explicitly model the mapping as a combination of perspective transformation, image dehazing and denoising using a deep convolutional neural network, named Neural-STE (See-Through-Envelope). We show experimentally that hidden content details, such as texture and image structure, can be clearly recovered. Finally, our formulation and model allow us to design envelopes that can counter deep learning-based privacy attacks on physical mail.

Submitted 25 March, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

Comments: Source code: https://github.com/BingyaoHuang/Neural-STE

arXiv:2012.10728 [pdf, other]

Political Posters Identification with Appearance-Text Fusion

Authors: Xuan Qin, Meizhu Liu, Yifan Hu, Christina Moo, Christian M. Riblet, Changwei Hu, Kevin Yen, Haibin Ling

Abstract: In this paper, we propose a method that efficiently utilizes appearance features and text vectors to accurately distinguish political posters from other similar political images. The majority of this work focuses on political posters that are designed to promote a certain political event, the automated identification of which can lead to the generation of detailed statistics and meet the judgment needs in a variety of areas. First, starting with a comprehensive keyword list for politicians and political events, we curate for the first time an effective and practical political poster dataset containing 13K human-labeled political images, including 3K political posters that explicitly support a movement or a campaign. Second, we make a thorough case study for this dataset and analyze common patterns and outliers of political posters. Finally, we propose a model that combines the power of both appearance and text information to classify political posters with significantly high accuracy.

Submitted 19 December, 2020; originally announced December 2020.

arXiv:2012.05858 [pdf, other]

doi 10.1109/VR51125.2022.00073

SPAA: Stealthy Projector-based Adversarial Attacks on Deep Image Classifiers

Authors: Bingyao Huang, Haibin Ling

Abstract: Light-based adversarial attacks use spatial augmented reality (SAR) techniques to fool image classifiers by altering the physical light condition with a controllable light source, e.g., a projector. Compared with physical attacks that place hand-crafted adversarial objects, projector-based ones obviate modifying the physical entities, and can be performed transiently and dynamically by altering the projection pattern. However, subtle light perturbations are insufficient to fool image classifiers, due to the complex environment and project-and-capture process. Thus, existing approaches focus on projecting clearly perceptible adversarial patterns, while the more interesting yet challenging goal, stealthy projector-based attack, remains open. In this paper, for the first time, we formulate this problem as an end-to-end differentiable process and propose a Stealthy Projector-based Adversarial Attack (SPAA) solution. In SPAA, we approximate the real Project-and-Capture process using a deep neural network named PCNet; we then include PCNet in the optimization of projector-based attacks such that the generated adversarial projection is physically plausible. Finally, to generate both robust and stealthy adversarial projections, we propose an algorithm that uses minimum perturbation and adversarial confidence thresholds to alternate between the adversarial loss and stealthiness loss optimization. Our experimental evaluations show that SPAA clearly outperforms other methods by achieving higher attack success rates while being stealthier, for both targeted and untargeted attacks.

Submitted 17 March, 2022; v1 submitted 10 December, 2020; originally announced December 2020.

arXiv:2011.14935 [pdf, ps, other]

doi 10.1103/PhysRevA.104.013305

Selection Rule for Topological Amplifiers in Bogoliubov de Gennes Systems

Authors: Hong Y. Ling, Ben Kain

Abstract: Dynamical instability is an inherent feature of bosonic systems described by the Bogoliubov de Gennes (BdG) Hamiltonian. Since it causes the BdG system to collapse, it is generally thought that it should be avoided. Recently, there has been much effort to harness this instability for the benefit of creating a topological amplifier with stable bulk bands but unstable edge modes which can be populated at an exponentially fast rate. We present a theorem for determining the stability of states with energies sufficiently away from zero, in terms of an unconventional commutator between the number conserving part and number nonconserving part of the BdG Hamiltonian. We apply the theorem to a generalization of a model from Galilo et al. [Phys. Rev. Lett. 115, 245302 (2015)] for creating a topological amplifier in an interacting spin-1 atom system in a honeycomb lattice through a quench process. We use this model to illustrate how the vanishing of the unconventional commutator selects the symmetries for a system so that its bulk states are stable against (weak) pairing interactions. We find that as long as time reversal symmetry is preserved, our system can act like a topological amplifier, even in the presence of an onsite staggered potential which breaks the inversion symmetry.

Submitted 20 August, 2021; v1 submitted 30 November, 2020; originally announced November 2020.

Comments: 12 pages and 3 figures. q = 0.2t_1 in the caption of fig. 1 in our published paper [Phys. Rev. A 104, 013305 (2021)] has been changed to the correct one, q = 0.4t_1

Journal ref: Phys. Rev. A 104, 013305 (2021)

arXiv:2011.12483 [pdf, other]

CRACT: Cascaded Regression-Align-Classification for Robust Visual Tracking

Authors: Heng Fan, Haibin Ling

Abstract: High quality object proposals are crucial in visual tracking algorithms that utilize a region proposal network (RPN). Refinement of these proposals, typically by box regression and classification in parallel, has been popularly adopted to boost tracking performance. However, it still meets problems when dealing with complex and dynamic backgrounds. Thus motivated, in this paper we introduce an improved proposal refinement module, Cascaded Regression-Align-Classification (CRAC), which yields new state-of-the-art performance on many benchmarks. First, having observed that the offsets from box regression can serve as guidance for proposal feature refinement, we design CRAC as a cascade of box regression, feature alignment and box classification. The key is to bridge box regression and classification via an alignment step, which leads to more accurate features for proposal classification with improved robustness. To address the variation in object appearance, we introduce an identification-discrimination component for box classification, which leverages offline reliable fine-grained template and online rich background information to distinguish the target from background. Moreover, we present pyramid RoIAlign that benefits CRAC by exploiting both the local and global cues of proposals. During inference, tracking proceeds by ranking all refined proposals and selecting the best one. In experiments on seven benchmarks including OTB-2015, UAV123, NfS, VOT-2018, TrackingNet, GOT-10k and LaSOT, our CRACT exhibits very promising results in comparison with state-of-the-art competitors and runs in real-time.

Submitted 24 November, 2020; originally announced November 2020.

Comments: tech. report

arXiv:2011.11858 [pdf, other]

GMOT-40: A Benchmark for Generic Multiple Object Tracking

Authors: Hexin Bai, Wensheng Cheng, Peng Chu, Juehuan Liu, Kai Zhang, Haibin Ling

Abstract: Multiple Object Tracking (MOT) has witnessed remarkable advances in recent years. However, existing studies predominantly require prior knowledge of the tracking target, and hence may not generalize well to unseen categories. In contrast, Generic Multiple Object Tracking (GMOT), which requires little prior information about the target, is largely under-explored. In this paper, we make contributions to boost the study of GMOT in three aspects. First, we construct the first public GMOT dataset, dubbed GMOT-40, which contains 40 carefully annotated sequences evenly distributed among 10 object categories. In addition, two tracking protocols are adopted to evaluate different characteristics of tracking algorithms. Second, noting the lack of dedicated tracking algorithms, we have designed a series of baseline GMOT algorithms. Third, we perform a thorough evaluation on GMOT-40, involving popular MOT algorithms (with necessary modifications) and the proposed baselines. We will release the GMOT-40 benchmark, the evaluation results, as well as the baseline algorithm to the public upon the publication of the paper.

Submitted 7 April, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

arXiv:2011.10875 [pdf, other]

Transparent Object Tracking Benchmark

Authors: Heng Fan, Halady Akhilesha Miththanthaya, Harshit, Siranjiv Ramana Rajan, Xiaoqiong Liu, Zhilin Zou, Yuewei Lin, Haibin Ling

Abstract: Visual tracking has achieved considerable progress in recent years. However, current research in the field mainly focuses on tracking of opaque objects, while little attention is paid to transparent object tracking. In this paper, we make the first attempt in exploring this problem by proposing a Transparent Object Tracking Benchmark (TOTB). Specifically, TOTB consists of 225 videos (86K frames) from 15 diverse transparent object categories. Each sequence is manually labeled with axis-aligned bounding boxes. To the best of our knowledge, TOTB is the first benchmark dedicated to transparent object tracking. In order to understand how existing trackers perform and to provide comparison for future research on TOTB, we extensively evaluate 25 state-of-the-art tracking algorithms. The evaluation results show that more effort is needed to improve transparent object tracking. Besides, we observe some nontrivial findings from the evaluation that are at odds with some common beliefs in opaque object tracking. For example, we find that deeper features are not always good for improvements. Moreover, to encourage future research, we introduce a novel tracker, named TransATOM, which leverages transparency features for tracking and surpasses all 25 evaluated approaches by a large margin. By releasing TOTB, we expect to facilitate future research and application of transparent object tracking in both academia and industry. The TOTB and evaluation results as well as TransATOM are available at https://hengfan2010.github.io/projects/TOTB.

Submitted 1 August, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

Comments: Tech. Report

arXiv:2011.01163 [pdf, other]

Pushing the Envelope of Rotation Averaging for Visual SLAM

Authors: Xinyi Li, Lin Yuan, Longin Jan Latecki, Haibin Ling

Abstract: As an essential part of structure from motion (SfM) and Simultaneous Localization and Mapping (SLAM) systems, motion averaging has been extensively studied in the past years and continues to attract surging research attention. While canonical approaches such as bundle adjustment are predominantly inherited in most state-of-the-art SLAM systems to estimate and update the trajectory in robot navigation, the practical implementation of bundle adjustment in SLAM systems is intrinsically limited by the high computational complexity, unreliable convergence and strict requirements of ideal initializations. In this paper, we lift these limitations and propose a novel optimization backbone for visual SLAM systems, where we leverage rotation averaging to improve the accuracy, efficiency and robustness of conventional monocular SLAM pipelines. In our approach, we first decouple the rotational and translational parameters in the camera rigid body transformation and convert the high-dimensional non-convex nonlinear problem into tractable linear subproblems in lower dimensions, and show that the subproblems can be solved independently with proper constraints. We apply the scale parameter with $l_1$-norm in the pose-graph optimization to address the rotation averaging robustness against outliers. We further validate the global optimality of our proposed approach, revisit and address the initialization schemes, pure rotational scene handling and outlier treatments. We demonstrate that our approach can exhibit up to 10x faster speed with comparable accuracy against the state of the art on public benchmarks.

Submitted 2 November, 2020; originally announced November 2020.

arXiv:2011.00372 [pdf, other]

Pose Estimation of Specular and Symmetrical Objects

Authors: Jiaming Hu, Hongyi Ling, Priyam Parashar, Aayush Naik, Henrik Christensen

Abstract: In the robotic industry, specular and textureless metallic components are ubiquitous. The 6D pose estimation of such objects with only a monocular RGB camera is difficult because of the absence of rich texture features. Furthermore, the appearance of specularity heavily depends on the camera viewpoint and environmental light conditions, making traditional methods, like template matching, fail. In the last 30 years, pose estimation of specular objects has been a consistent challenge, and most related works require massive knowledge modeling effort for light setups, environment, or the object surface. On the other hand, recent works exhibit the feasibility of 6D pose estimation on a monocular camera with convolutional neural networks (CNNs); however, they mostly use opaque objects for evaluation. This paper provides a data-driven solution to estimate the 6D pose of specular objects for grasping them, proposes a cost function for handling symmetry, and demonstrates experimental results showing the system's feasibility.

Submitted 31 October, 2020; originally announced November 2020.

Comments: submitted to ICRA 2021

arXiv:2010.11671 [pdf, other]

Motion Planning Combines Psychological Safety and Motion Prediction for a Sense Motive Robot

Authors: He**g Ling, Guoliang Liu, Guohui Tian

Abstract: Human safety is the most important demand for human robot interaction and collaboration (HRIC), which not only refers to physical safety, but also includes psychological safety. Although many robots with different configurations have entered our living and working environments, the human safety problem is still an ongoing research problem in human-robot coexistence scenarios. This paper addresses the human safety issue by covering both the physical safety and psychological safety aspects. First, we introduce an adaptive robot velocity control and step size adjustment method according to human facial expressions, such that the robot can adjust its movement to maintain safety when the human emotion is unusual. Second, we predict the human motion by detecting sudden changes of human head pose and gaze direction, such that the robot can infer whether the human attention is distracted, predict the human's next move and rebuild a repulsive force to avoid potential collision. Finally, we demonstrate our idea using a 7 DOF TIAGo robot in a dynamic HRIC environment, which shows that the robot becomes sense motive, and responds to human action and emotion changes quickly and efficiently.

Submitted 23 October, 2020; v1 submitted 29 September, 2020; originally announced October 2020.

Comments: submitted to RAL/ICRA2021

arXiv:2010.09125 [pdf, other]

Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering

Authors: Yuxuan Zhang, Wenzheng Chen, Huan Ling, Jun Gao, Yinan Zhang, Antonio Torralba, Sanja Fidler

Abstract: Differentiable rendering has paved the way to training neural networks to perform "inverse graphics" tasks such as predicting 3D geometry from monocular photographs. To train high performing models, most of the current approaches rely on multi-view imagery which is not readily available in practice. Recent Generative Adversarial Networks (GANs) that synthesize images, in contrast, seem to acquire 3D knowledge implicitly during training: object viewpoints can be manipulated by simply manipulating the latent codes. However, these latent codes often lack further physical interpretation and thus GANs cannot easily be inverted to perform explicit 3D reasoning. In this paper, we aim to extract and disentangle 3D knowledge learned by generative models by utilizing differentiable renderers. Key to our approach is to exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties. The entire architecture is trained iteratively using cycle consistency losses. We show that our approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets, both quantitatively and via user studies. We further showcase the disentangled GAN as a controllable 3D "neural renderer", complementing traditional graphics renderers.

Submitted 20 April, 2021; v1 submitted 18 October, 2020; originally announced October 2020.

Comments: Accepted to ICLR 2021 as an Oral paper

arXiv:2010.03740 [pdf, other]

Bone Feature Segmentation in Ultrasound Spine Image with Robustness to Speckle and Regular Occlusion Noise

Authors: Zixun Huang, Li-Wen Wang, Frank H. F. Leung, Sunetra Banerjee, De Yang, Timothy Lee, Juan Lyu, Sai Ho Ling, Yong-** Zheng

Abstract: 3D ultrasound imaging shows great promise for scoliosis diagnosis thanks to its low-cost, radiation-free and real-time characteristics. The key to assessing scoliosis by ultrasound imaging is to accurately segment the bone area and measure the scoliosis degree based on the symmetry of the bone features. The ultrasound images tend to contain many speckles and regular occlusion noise, which make it difficult, tedious and time-consuming for experts to identify the bony features. In this paper, we propose a robust bone feature segmentation method based on the U-net structure for ultrasound spine Volume Projection Imaging (VPI) images. The proposed segmentation method introduces a total variance loss to reduce the sensitivity of the model to small-scale and regular occlusion noise. The proposed approach improves the Dice score by 2.3% and the AUC score by 1% compared with the U-net model and shows high robustness to speckle and regular occlusion noise.

Submitted 7 October, 2020; originally announced October 2020.

Comments: SMC2020

arXiv:2009.08105 [pdf]

doi 10.1021/acsphotonics.0c01964

All van der Waals Integrated Nanophotonics

Authors: Haonan Ling, Renjie Li, Artur R. Davoyan

Abstract: Integrated optics is at the heart of a wide range of systems from remote sensing and communications to computing and quantum information processing. Demand for smaller and more energy efficient structures stimulates the search for more advanced material platforms. Here, we propose a concept of an all van der Waals photonics, where we show that electronically bulk transition metal dichalcogenide (TMDC) semiconductors are well suited for the design of key optical components for nanoscale and integrated photonics. Specifically, we demonstrate theoretically that owing to low optical loss and high refractive index across near-infrared and telecom frequency bands, components made of bulk TMDCs can potentially outperform counterparts made of conventional 3D semiconductors, such as Si and III/Vs. We discuss several key quantum and classical optical components and show that bulk TMDCs may pave the way to smaller footprint devices, more energy efficient electro-optical modulators, and stronger quantum light-materials interaction. Enhanced optical performance, ease of integration, and a wide selection of materials suggest that bulk TMDCs may complement and, potentially, replace existing integrated photonics systems.

Submitted 17 September, 2020; originally announced September 2020.

Comments: 12 pages, 5 figures

arXiv:2009.03465 [pdf, other]

LaSOT: A High-quality Large-scale Single Object Tracking Benchmark

Authors: Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit, Mingzhen Huang, Juehuan Liu, Yong Xu, Chunyuan Liao, Lin Yuan, Haibin Ling

Abstract: Despite great recent advances in visual tracking, its further development, including both algorithm design and evaluation, is limited due to lack of dedicated large-scale benchmarks. To address this problem, we present LaSOT, a high-quality Large-scale Single Object Tracking benchmark. LaSOT contains a diverse selection of 85 object classes, and offers 1,550 videos totaling more than 3.87 million frames. Each video frame is carefully and manually annotated with a bounding box. This makes LaSOT, to our knowledge, the largest densely annotated tracking benchmark. Our goal in releasing LaSOT is to provide a dedicated high quality platform for both training and evaluation of trackers. The average video length of LaSOT is around 2,500 frames, where each video contains various challenge factors that exist in real world video footage, such as the targets disappearing and re-appearing. These longer video lengths allow for the assessment of long-term trackers. To take advantage of the close connection between visual appearance and natural language, we provide language specification for each video in LaSOT. We believe such additions will allow for future research to use linguistic features to improve tracking. Two protocols, full-overlap and one-shot, are designated for flexible assessment of trackers. We extensively evaluate 48 baseline trackers on LaSOT with in-depth analysis, and results reveal that there still exists significant room for improvement. The complete benchmark, tracking results as well as analysis are available at http://vision.cs.stonybrook.edu/~lasot/.

Submitted 11 September, 2020; v1 submitted 7 September, 2020; originally announced September 2020.

Comments: Tech Report. Update project website

arXiv:2008.09721 [pdf, other]

ScribbleBox: Interactive Annotation Framework for Video Object Segmentation

Authors: Bowen Chen, Huan Ling, Xiaohui Zeng, Gao Jun, Ziyue Xu, Sanja Fidler

Abstract: Manually labeling video datasets for segmentation tasks is extremely time-consuming. In this paper, we introduce ScribbleBox, a novel interactive framework for annotating object instances with masks in videos. In particular, we split annotation into two steps: annotating objects with tracked boxes, and labeling masks inside these tracks. We introduce automation and interaction in both steps. Box tracks are annotated efficiently by approximating the trajectory using a parametric curve with a small number of control points which the annotator can interactively correct. Our approach tolerates a modest amount of noise in the box placements, thus typically only a few clicks are needed to annotate tracked boxes to a sufficient accuracy. Segmentation masks are corrected via scribbles which are efficiently propagated through time. We show significant performance gains in annotation efficiency over past work. We show that our ScribbleBox approach reaches 88.92% J&F on DAVIS2017 with 9.14 clicks per box track, and 4 frames of scribble annotation.

Submitted 21 August, 2020; originally announced August 2020.

arXiv:2008.03673 [pdf, other]

Feature Space Augmentation for Long-Tailed Data

Authors: Peng Chu, Xiao Bian, Shaopeng Liu, Haibin Ling

Abstract: Real-world data often follow a long-tailed distribution as the frequency of each class is typically different. For example, a dataset can have a large number of under-represented classes and a few classes with more than sufficient data. However, a model to represent the dataset is usually expected to have reasonably homogeneous performance across classes. Introducing class-balanced loss and advanced methods on data re-sampling and augmentation are among the best practices to alleviate the data imbalance problem. However, the other part of the problem, concerning the under-represented classes, will have to rely on additional knowledge to recover the missing information. In this work, we present a novel approach to address the long-tailed problem by augmenting the under-represented classes in the feature space with the features learned from the classes with ample samples. In particular, we decompose the features of each class into a class-generic component and a class-specific component using class activation maps. Novel samples of under-represented classes are then generated on the fly during training stages by fusing the class-specific features from the under-represented classes with the class-generic features from confusing classes. Our results on different datasets such as iNaturalist, ImageNet-LT, Places-LT and a long-tailed version of CIFAR show state-of-the-art performance.

Submitted 9 August, 2020; originally announced August 2020.

Comments: To appear in ECCV 2020

arXiv:2008.00965 [pdf, other]

doi 10.1109/TPAMI.2021.3050124

End-to-end Full Projector Compensation

Authors: Bingyao Huang, Tao Sun, Haibin Ling

Abstract: Full projector compensation aims to modify a projector input image to compensate for both geometric and photometric disturbance of the projection surface. Traditional methods usually solve the two parts separately and may suffer from suboptimal solutions. In this paper, we propose the first end-to-end differentiable solution, named CompenNeSt++, to solve the two problems jointly. First, we propose a novel geometric correction subnet, named WarpingNet, which is designed with a cascaded coarse-to-fine structure to learn the sampling grid directly from sampling images. Second, we propose a novel photometric compensation subnet, named CompenNeSt, which is designed with a siamese architecture to capture the photometric interactions between the projection surface and the projected images, and to use such information to compensate the geometrically corrected images. By concatenating WarpingNet with CompenNeSt, CompenNeSt++ accomplishes full projector compensation and is end-to-end trainable. Third, to improve practicability, we propose a novel synthetic data-based pre-training strategy to significantly reduce the number of training images and training time. Moreover, we construct the first setup-independent full compensation benchmark to facilitate future studies. In thorough experiments, our method shows clear advantages over prior art, with promising compensation quality while remaining practically convenient.

Submitted 7 January, 2021; v1 submitted 30 July, 2020; originally announced August 2020.

Comments: Source code: https://github.com/BingyaoHuang/CompenNeSt-plusplus. arXiv admin note: text overlap with arXiv:1908.06246, arXiv:1904.04335

Showing 101–150 of 368 results for author: Ling, H