-
AutoSampling: Search for Effective Data Sampling Schedules
Authors:
Ming Sun,
Haoxuan Dou,
Baopu Li,
Lei Cui,
Junjie Yan,
Wanli Ouyang
Abstract:
Data sampling plays a pivotal role in training deep learning models. However, an effective sampling schedule is difficult to learn due to the inherently high dimension of parameters in learning the sampling schedule. In this paper, we propose an AutoSampling method to automatically learn sampling schedules for model training, which consists of the multi-exploitation step aiming for optimal local sampling schedules and the exploration step for the ideal sampling distribution. More specifically, we achieve sampling schedule search with shortened exploitation cycle to provide enough supervision. In addition, we periodically estimate the sampling distribution from the learned sampling schedules and perturb it to search in the distribution space. The combination of two searches allows us to learn a robust sampling schedule. We apply our AutoSampling method to a variety of image classification tasks illustrating the effectiveness of the proposed method.
Submitted 28 May, 2021;
originally announced May 2021.
-
ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search
Authors:
Lumin Xu,
Yingda Guan,
Sheng Jin,
Wentao Liu,
Chen Qian,
Ping Luo,
Wanli Ouyang,
Xiaogang Wang
Abstract:
Human pose estimation has achieved significant progress in recent years. However, most of the recent methods focus on improving accuracy using complicated models and ignore real-time efficiency. To achieve a better trade-off between accuracy and efficiency, we propose a novel neural architecture search (NAS) method, termed ViPNAS, to search networks in both spatial and temporal levels for fast online video pose estimation. In the spatial level, we carefully design the search space with five different dimensions including network depth, width, kernel size, group number, and attentions. In the temporal level, we search from a series of temporal feature fusions to optimize the total accuracy and speed across multiple video frames. To the best of our knowledge, we are the first to search for the temporal feature fusion and automatic computation allocation in videos. Extensive experiments demonstrate the effectiveness of our approach on the challenging COCO2017 and PoseTrack2018 datasets. Our discovered model family, S-ViPNAS and T-ViPNAS, achieves significantly higher inference speed (CPU real-time) without sacrificing the accuracy compared to the previous state-of-the-art methods.
Submitted 21 May, 2021;
originally announced May 2021.
-
Learning Graph Meta Embeddings for Cold-Start Ads in Click-Through Rate Prediction
Authors:
Wentao Ouyang,
Xiuwu Zhang,
Shukui Ren,
Li Li,
Kun Zhang,
Jinmei Luo,
Zhaojie Liu,
Yanlong Du
Abstract:
Click-through rate (CTR) prediction is one of the most central tasks in online advertising systems. Recent deep learning-based models that exploit feature embedding and high-order data nonlinearity have shown dramatic successes in CTR prediction. However, these models work poorly on cold-start ads with new IDs, whose embeddings are not well learned yet. In this paper, we propose Graph Meta Embedding (GME) models that can rapidly learn how to generate desirable initial embeddings for new ad IDs based on graph neural networks and meta learning. Previous works address this problem from the new ad itself, but ignore possibly useful information contained in existing old ads. In contrast, GMEs simultaneously consider two information sources: the new ad and existing old ads. For the new ad, GMEs exploit its associated attributes. For existing old ads, GMEs first build a graph to connect them with new ads, and then adaptively distill useful information. We propose three specific GMEs from different perspectives to explore what kind of information to use and how to distill information. In particular, GME-P uses Pre-trained neighbor ID embeddings, GME-G uses Generated neighbor ID embeddings and GME-A uses neighbor Attributes. Experimental results on three real-world datasets show that GMEs can significantly improve the prediction performance in both cold-start (i.e., no training data is available) and warm-up (i.e., a small number of training samples are collected) scenarios over five major deep learning-based CTR prediction models. GMEs can be applied to conversion rate (CVR) prediction as well.
Submitted 18 May, 2021;
originally announced May 2021.
-
Layerwise Optimization by Gradient Decomposition for Continual Learning
Authors:
Shixiang Tang,
Dapeng Chen,
Jinguo Zhu,
Shijie Yu,
Wanli Ouyang
Abstract:
Deep neural networks achieve state-of-the-art and sometimes super-human performance across various domains. However, when learning tasks sequentially, the networks easily forget the knowledge of previous tasks, known as "catastrophic forgetting". To achieve the consistencies between the old tasks and the new task, one effective solution is to modify the gradient for update. Previous methods enforce independent gradient constraints for different tasks, while we consider these gradients contain complex information, and propose to leverage inter-task information by gradient decomposition. In particular, the gradient of an old task is decomposed into a part shared by all old tasks and a part specific to that task. The gradient for update should be close to the gradient of the new task, consistent with the gradients shared by all old tasks, and orthogonal to the space spanned by the gradients specific to the old tasks. In this way, our approach encourages common knowledge consolidation without impairing the task-specific knowledge. Furthermore, the optimization is performed for the gradients of each layer separately rather than the concatenation of all gradients as in previous works. This effectively avoids the influence of the magnitude variation of the gradients in different layers. Extensive experiments validate the effectiveness of both gradient-decomposed optimization and layer-wise updates. Our proposed method achieves state-of-the-art results on various benchmarks of continual learning.
Submitted 16 May, 2021;
originally announced May 2021.
-
GMLP: Building Scalable and Flexible Graph Neural Networks with Feature-Message Passing
Authors:
Wentao Zhang,
Yu Shen,
Zheyu Lin,
Yang Li,
Xiaosen Li,
Wen Ouyang,
Yangyu Tao,
Zhi Yang,
Bin Cui
Abstract:
In recent studies, neural message passing has proved to be an effective way to design graph neural networks (GNNs), which have achieved state-of-the-art performance in many graph-based tasks. However, current neural message passing architectures typically need to perform an expensive recursive neighborhood expansion in multiple rounds and consequently suffer from a scalability issue. Moreover, most existing neural message passing schemes are inflexible since they are restricted to fixed-hop neighborhoods and insensitive to the actual demands of different nodes. We circumvent these limitations by a novel feature-message passing framework, called Graph Multi-layer Perceptron (GMLP), which separates the neural update from the message passing. With such separation, GMLP significantly improves the scalability and efficiency by performing the message passing procedure in a pre-compute manner, and is flexible and adaptive in leveraging node feature messages over various levels of localities. We further derive novel variants of scalable GNNs under this framework to achieve the best of both worlds in terms of performance and efficiency. We conduct extensive evaluations on 11 benchmark datasets, including large-scale datasets like ogbn-products and an industrial dataset, demonstrating that GMLP achieves not only state-of-the-art performance, but also high training scalability and efficiency.
Submitted 20 April, 2021;
originally announced April 2021.
-
M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection
Authors:
Junke Wang,
Zuxuan Wu,
Wenhao Ouyang,
Xintong Han,
Jingjing Chen,
Ser-Nam Lim,
Yu-Gang Jiang
Abstract:
The widespread dissemination of Deepfakes demands effective approaches that can detect perceptually convincing forged images. In this paper, we aim to capture the subtle manipulation artifacts at different scales using transformer models. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which operates on patches of different sizes to detect local inconsistencies in images at different spatial levels. M2TR further learns to detect forgery artifacts in the frequency domain to complement RGB information through a carefully designed cross modality fusion block. In addition, to stimulate Deepfake detection research, we introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 DeepFake videos generated by state-of-the-art face swapping and facial reenactment methods. We conduct extensive experiments to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods by clear margins.
Submitted 19 April, 2022; v1 submitted 20 April, 2021;
originally announced April 2021.
-
PyMAF: 3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop
Authors:
Hongwen Zhang,
Yating Tian,
Xinchi Zhou,
Wanli Ouyang,
Yebin Liu,
Limin Wang,
Zhenan Sun
Abstract:
Regression-based methods have recently shown promising results in reconstructing human meshes from monocular images. By directly mapping raw pixels to model parameters, these methods can produce parametric models in a feed-forward manner via neural networks. However, minor deviation in parameters may lead to noticeable misalignment between the estimated meshes and image evidences. To address this issue, we propose a Pyramidal Mesh Alignment Feedback (PyMAF) loop to leverage a feature pyramid and rectify the predicted parameters explicitly based on the mesh-image alignment status in our deep regressor. In PyMAF, given the currently predicted parameters, mesh-aligned evidences will be extracted from finer-resolution features accordingly and fed back for parameter rectification. To reduce noise and enhance the reliability of these evidences, an auxiliary pixel-wise supervision is imposed on the feature encoder, which provides mesh-image correspondence guidance for our network to preserve the most related information in spatial features. The efficacy of our approach is validated on several benchmarks, including Human3.6M, 3DPW, LSP, and COCO, where experimental results show that our approach consistently improves the mesh-image alignment of the reconstruction. The project page with code and video results can be found at https://hongwenzhang.github.io/pymaf.
Submitted 23 August, 2021; v1 submitted 30 March, 2021;
originally announced March 2021.
-
Delving into Localization Errors for Monocular 3D Object Detection
Authors:
Xinzhu Ma,
Yinmin Zhang,
Dan Xu,
Dongzhan Zhou,
Shuai Yi,
Haojie Li,
Wanli Ouyang
Abstract:
Estimating 3D bounding boxes from monocular images is an essential component in autonomous driving, while accurate 3D object detection from this kind of data is very challenging. In this work, by intensive diagnosis experiments, we quantify the impact introduced by each sub-task and find that the 'localization error' is the vital factor in restricting monocular 3D detection. Besides, we also investigate the underlying reasons behind localization errors, analyze the issues they might bring, and propose three strategies. First, we revisit the misalignment between the center of the 2D bounding box and the projected center of the 3D object, which is a vital factor leading to low localization accuracy. Second, we observe that accurately localizing distant objects with existing technologies is almost impossible, while those samples will mislead the learned network. To this end, we propose to remove such samples from the training set for improving the overall performance of the detector. Lastly, we also propose a novel 3D IoU oriented loss for the size estimation of the object, which is not affected by 'localization error'. We conduct extensive experiments on the KITTI dataset, where the proposed method achieves real-time detection and outperforms previous methods by a large margin. The code will be made available at: https://github.com/xinzhuma/monodle.
Submitted 30 March, 2021;
originally announced March 2021.
-
Gradient Regularized Contrastive Learning for Continual Domain Adaptation
Authors:
Shixiang Tang,
Peng Su,
Dapeng Chen,
Wanli Ouyang
Abstract:
Human beings can quickly adapt to environmental changes by leveraging learning experience. However, adapting deep neural networks to dynamic environments by machine learning algorithms remains a challenge. To better understand this issue, we study the problem of continual domain adaptation, where the model is presented with a labelled source domain and a sequence of unlabelled target domains. The obstacles in this problem are both domain shift and catastrophic forgetting. We propose Gradient Regularized Contrastive Learning (GRCL) to solve the obstacles. At the core of our method, gradient regularization plays two key roles: (1) enforcing the gradient not to harm the discriminative ability of source features which can, in turn, benefit the adaptation ability of the model to target domains; (2) constraining the gradient not to increase the classification loss on old target domains, which enables the model to preserve the performance on old target domains when adapting to an in-coming target domain. Experiments on Digits, DomainNet and Office-Caltech benchmarks demonstrate the strong performance of our approach when compared to the other state-of-the-art methods.
Submitted 23 March, 2021;
originally announced March 2021.
-
Real-Time Visual Object Tracking via Few-Shot Learning
Authors:
Jinghao Zhou,
Bo Li,
Peng Wang,
Peixia Li,
Weihao Gan,
Wei Wu,
Junjie Yan,
Wanli Ouyang
Abstract:
Visual Object Tracking (VOT) can be seen as an extended task of Few-Shot Learning (FSL). While the concept of FSL is not new in tracking and has been previously applied by prior works, most of them are tailored to fit specific types of FSL algorithms and may sacrifice running speed. In this work, we propose a generalized two-stage framework that is capable of employing a large variety of FSL algorithms while presenting faster adaptation speed. The first stage uses a Siamese Regional Proposal Network to efficiently propose the potential candidates and the second stage reformulates the task of classifying these candidates to a few-shot classification problem. Following such a coarse-to-fine pipeline, the first stage proposes informative sparse samples for the second stage, where a large variety of FSL algorithms can be conducted more conveniently and efficiently. As substantiation of the second stage, we systematically investigate several forms of optimization-based few-shot learners from previous works with different objective functions, optimization methods, or solution space. Beyond that, our framework also entails a direct application of the majority of other FSL algorithms to visual tracking, enabling mutual communication between researchers on these two topics. Extensive experiments on the major benchmarks, VOT2018, OTB2015, NFS, UAV123, TrackingNet, and GOT-10k are conducted, demonstrating a desirable performance gain and a real-time speed.
Submitted 18 March, 2021;
originally announced March 2021.
-
Higher Performance Visual Tracking with Dual-Modal Localization
Authors:
Jinghao Zhou,
Bo Li,
Lei Qiao,
Peng Wang,
Weihao Gan,
Wei Wu,
Junjie Yan,
Wanli Ouyang
Abstract:
Visual Object Tracking (VOT) has synchronous needs for both robustness and accuracy. While most existing works fail to operate simultaneously on both, we investigate in this work the problem of conflicting performance between accuracy and robustness. We first conduct a systematic comparison among existing methods and analyze their restrictions in terms of accuracy and robustness. Specifically, four formulations, namely offline classification (OFC), offline regression (OFR), online classification (ONC), and online regression (ONR), are considered, categorized by the existence of online update and the types of supervision signal. To account for the problem, we resort to the idea of ensemble and propose a dual-modal framework for target localization, consisting of robust localization suppressing distractors via ONR and the accurate localization attending to the target center precisely via OFC. To yield a final representation (i.e., bounding box), we propose a simple but effective score voting strategy to involve adjacent predictions such that the final representation does not commit to a single location. Operating beyond the real-time demand, our proposed method is further validated on 8 datasets (VOT2018, VOT2019, OTB2015, NFS, UAV123, LaSOT, TrackingNet, and GOT-10k), achieving state-of-the-art performance.
Submitted 18 March, 2021;
originally announced March 2021.
-
A Trident Quaternion Framework for Inertial-based Navigation Part II: Error Models and Application to Initial Alignment
Authors:
Wei Ouyang,
Yuanxin Wu
Abstract:
This work deals with error models for trident quaternion framework proposed in the companion paper (Part I) and further uses them to investigate the odometer-aided static/in-motion inertial navigation attitude alignment for land vehicles. By linearizing the trident quaternion kinematic equation, the left and right trident quaternion error models are obtained, which are found to be equivalent to those derived from profound group affine. The two error models are used to design their corresponding extended Kalman filters (EKF), namely, the left-quaternion EKF (LQEKF) and the right-quaternion EKF (RQEKF). Simulations and field tests are conducted to evaluate their actual performances. Owing to the high estimation consistency, the L/RQEKF converge much faster in the static alignment than the traditional error model-based EKF, even under arbitrary large heading initialization. For the in-motion alignment, the L/RQEKF possess much larger convergence region than the traditional EKF does, although they still require the aid of attitude initialization so as to avoid large initial attitude errors.
Submitted 16 May, 2021; v1 submitted 24 February, 2021;
originally announced February 2021.
-
A Trident Quaternion Framework for Inertial-based Navigation Part I: Rigid Motion Representation and Computation
Authors:
Wei Ouyang,
Yuanxin Wu
Abstract:
Strapdown inertial navigation research involves the parameterization and computation of the attitude, velocity and position of a rigid body in a chosen reference frame. The community has long been devoted to finding the most concise and efficient representation for the strapdown inertial navigation system (INS). The current work is motivated by simplifying the existing dual quaternion representation of the kinematic model. This paper proposes a compact and elegant representation of the body's attitude, velocity and position, with the aid of a devised trident quaternion tool in which the position is accounted for by adding a second imaginary part to the dual quaternion. Eventually, the kinematics of strapdown INS are cohesively unified in one concise differential equation, which bears the same form as the classical attitude quaternion equation. In addition, the computation of this trident quaternion-based kinematic equation is implemented with the recently proposed functional iterative integration approach. Numerical results verify the analysis and show that incorporating the new representation into the functional iterative integration scheme achieves high inertial navigation computation accuracy as well.
Submitted 24 February, 2021;
originally announced February 2021.
-
Continental-scale streamflow modeling of basins with reservoirs: towards a coherent deep-learning-based strategy
Authors:
Wenyu Ouyang,
Kathryn Lawson,
Dapeng Feng,
Lei Ye,
Chi Zhang,
Chaopeng Shen
Abstract:
A large fraction of major waterways have dams influencing streamflow, which must be accounted for in large-scale hydrologic modeling. However, daily streamflow prediction for basins with dams is challenging for various modeling approaches, especially at large scales. Here we examined which types of dammed basins could be well represented by long short-term memory (LSTM) models using readily-available information, and delineated the remaining challenges. We analyzed data from 3557 basins (83% dammed) over the contiguous United States and noted strong impacts of reservoir purposes, degree of regulation (dor), and diversion on streamflow modeling. While a model trained on a widely-used reference-basin dataset performed poorly for non-reference basins, the model trained on the whole dataset presented a median Nash-Sutcliffe efficiency coefficient (NSE) of 0.74. The zero-dor, small-dor (with storage of approximately a month of average streamflow or less), and large-dor basins were found to have distinct behaviors, so migrating models between categories yielded catastrophic results, which means we must not treat small-dor basins as reference ones. However, training with pooled data from different sets yielded optimal median NSEs of 0.72, 0.79, and 0.64 for these respective groups, noticeably stronger than existing models. These results support a coherent modeling strategy where smaller dams (storing about a month of average streamflow or less) are modeled implicitly as part of basin rainfall-runoff processes; then, large-dor reservoirs of certain types can be represented explicitly. However, dammed basins must be present in the training dataset. Future work should examine separate modeling of large reservoirs for fire protection and irrigation, hydroelectric power generation, and flood control.
Submitted 12 May, 2021; v1 submitted 12 January, 2021;
originally announced January 2021.
-
Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction
Authors:
Dan Xu,
Xavier Alameda-Pineda,
Wanli Ouyang,
Elisa Ricci,
Xiaogang Wang,
Nicu Sebe
Abstract:
Multi-scale representations deeply learned via convolutional neural networks have shown tremendous importance for various pixel-level prediction problems. In this paper we present a novel approach that advances the state of the art on pixel-level prediction in a fundamental aspect, i.e. structured multi-scale features learning and fusion. In contrast to previous works directly considering multi-scale feature maps obtained from the inner layers of a primary CNN architecture, and simply fusing the features with weighted averaging or concatenation, we propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model for learning and fusing multi-scale representations in a principled manner. In order to further improve the learning capacity of the network structure, we propose to exploit feature-dependent conditional kernels within the deep probabilistic framework. Extensive experiments are conducted on four publicly available datasets (i.e. BSDS500, NYUD-V2, KITTI, and Pascal-Context) and on three challenging pixel-wise prediction problems involving both discrete and continuous labels (i.e. monocular depth estimation, object contour prediction, and semantic segmentation). Quantitative and qualitative results demonstrate the effectiveness of the proposed latent AG-CRF model and the overall probabilistic graph attention network with feature conditional kernels for structured feature learning and pixel-wise prediction.
Submitted 13 March, 2022; v1 submitted 7 January, 2021;
originally announced January 2021.
-
Inception Convolution with Efficient Dilation Search
Authors:
Jie Liu,
Chuming Li,
Feng Liang,
Chen Lin,
Ming Sun,
Junjie Yan,
Wanli Ouyang,
Dong Xu
Abstract:
As a variant of standard convolution, a dilated convolution can control effective receptive fields and handle large scale variance of objects without introducing additional computational costs. To fully explore the potential of dilated convolution, we proposed a new type of dilated convolution (referred to as inception convolution), where the convolution operations have independent dilation patterns among different axes, channels and layers. To develop a practical method for learning complex inception convolution based on the data, a simple but effective search algorithm, referred to as efficient dilation optimization (EDO), is developed. Based on statistical optimization, the EDO method operates in a low-cost manner and is extremely fast when it is applied on large scale datasets. Empirical results validate that our method achieves consistent performance gains for image recognition, object detection, instance segmentation, human detection, and human pose estimation. For instance, by simply replacing the 3x3 standard convolution in the ResNet-50 backbone with inception convolution, we significantly improve the AP of Faster R-CNN from 36.4% to 39.2% on MS COCO.
Submitted 10 May, 2021; v1 submitted 25 December, 2020;
originally announced December 2020.
-
DETR for Crowd Pedestrian Detection
Authors:
Matthieu Lin,
Chuming Li,
Xingyuan Bu,
Ming Sun,
Chen Lin,
Junjie Yan,
Wanli Ouyang,
Zhidong Deng
Abstract:
Pedestrian detection in crowd scenes poses a challenging problem due to the heuristically defined mapping from anchors to pedestrians and the conflict between NMS and highly overlapped pedestrians. The recently proposed end-to-end detectors (ED), DETR and deformable DETR, replace hand-designed components such as NMS and anchors with the transformer architecture, which gets rid of duplicate predictions by computing all pairwise interactions between queries. Inspired by these works, we explore their performance on crowd pedestrian detection. Surprisingly, compared to Faster-RCNN with FPN, the results are opposite to those obtained on COCO. Furthermore, the bipartite matching of ED harms training efficiency due to the large number of ground truths in crowd scenes. In this work, we identify the underlying causes of ED's poor performance and propose a new decoder to address them. Moreover, we design a mechanism to leverage the less occluded visible parts of pedestrians specifically for ED, and achieve further improvements. A faster bipartite matching algorithm is also introduced to make ED training on crowd datasets more practical. The proposed detector PED (Pedestrian End-to-end Detector) outperforms both previous EDs and the baseline Faster-RCNN on CityPersons and CrowdHuman. It also achieves comparable performance with state-of-the-art pedestrian detection methods. Code will be released soon.
Submitted 18 February, 2021; v1 submitted 12 December, 2020;
originally announced December 2020.
-
Full Matching on Low Resolution for Disparity Estimation
Authors:
Hong Zhang,
Shenglun Chen,
Zhihui Wang,
Haojie Li,
Wanli Ouyang
Abstract:
A Multistage Full Matching disparity estimation scheme (MFM) is proposed in this work. We demonstrate that decoupling all similarity scores directly from the low-resolution 4D volume step by step, instead of estimating a low-resolution 3D cost volume by iteratively optimizing the low-resolution 4D volume, leads to more accurate disparity. To this end, we first propose to decompose the full matching task into multiple stages of the cost aggregation module. Specifically, we decompose the high-resolution predicted results into multiple groups, and every stage of the newly designed cost aggregation module learns only to estimate the results for a group of points. This alleviates the problem of internal feature competition when learning similarity scores of all candidates from one low-resolution 4D volume output by one stage. Then, we propose the strategy of Stages Mutual Aid, which takes advantage of the relationship between multiple stages to boost the similarity score estimation of each stage, to solve the unbalanced prediction across stages caused by the serial multistage framework. Experimental results demonstrate that the proposed method achieves more accurate disparity estimation results and outperforms state-of-the-art methods on the Scene Flow, KITTI 2012 and KITTI 2015 datasets.
Submitted 10 December, 2020;
originally announced December 2020.
-
Direct Depth Learning Network for Stereo Matching
Authors:
Hong Zhang,
Haojie Li,
Shenglun Chen,
Tiantian Yan,
Zhihui Wang,
Guo Lu,
Wanli Ouyang
Abstract:
Being a crucial task of autonomous driving, stereo matching has made great progress in recent years. Existing stereo matching methods estimate disparity instead of depth. They treat the disparity errors as the evaluation metric of the depth estimation errors, since the depth can be calculated from the disparity according to the triangulation principle. However, we find that the error of the depth depends not only on the error of the disparity but also on the depth range of the points. Therefore, even if the disparity error is low, the depth error can still be large, especially for distant points. In this paper, a novel Direct Depth Learning Network (DDL-Net) is designed for stereo matching. DDL-Net consists of two stages: the Coarse Depth Estimation stage and the Adaptive-Grained Depth Refinement stage, both of which are supervised by depth instead of disparity. Specifically, the Coarse Depth Estimation stage uniformly samples the matching candidates according to depth range to construct the cost volume and output a coarse depth. The Adaptive-Grained Depth Refinement stage performs further matching near the coarse depth to correct imprecise and wrong matches. To make the Adaptive-Grained Depth Refinement stage robust to the coarse depth and adaptive to the depth range of the points, Granularity Uncertainty is introduced to this stage. Granularity Uncertainty adjusts the matching range and selects the candidates' features according to coarse prediction confidence and depth range. We verify the performance of DDL-Net on the SceneFlow and DrivingStereo datasets using different depth metrics. Results show that DDL-Net achieves an average improvement of 25% on the SceneFlow dataset and 12% on the DrivingStereo dataset compared with classical methods. More importantly, we achieve state-of-the-art accuracy at a large distance.
Submitted 10 December, 2020;
originally announced December 2020.
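The claim in the abstract above, that a fixed disparity error produces a larger depth error for distant points, follows from the triangulation relation Z = f * B / d. The sketch below illustrates it numerically. The focal length, baseline, and disparity values are hypothetical and are not taken from the paper or from any dataset.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Triangulation for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(focal_px: float, baseline_m: float,
                disparity_px: float, disparity_err_px: float) -> float:
    """Absolute depth error caused by adding disparity_err_px to the true disparity."""
    true_depth = depth_from_disparity(focal_px, baseline_m, disparity_px)
    noisy_depth = depth_from_disparity(focal_px, baseline_m,
                                       disparity_px + disparity_err_px)
    return abs(noisy_depth - true_depth)


if __name__ == "__main__":
    focal_px, baseline_m = 1000.0, 0.5   # hypothetical camera parameters
    err_px = 0.5                         # same disparity error for both points
    for disparity_px in (50.0, 5.0):     # a near point and a far point
        z = depth_from_disparity(focal_px, baseline_m, disparity_px)
        dz = depth_error(focal_px, baseline_m, disparity_px, err_px)
        print(f"depth {z:.1f} m -> depth error {dz:.3f} m")
```

To first order the depth error scales as Z^2 / (f * B) times the disparity error, so it grows quadratically with distance.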
-
Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving
Authors:
Zhenxun Yuan,
Xiao Song,
Lei Bai,
Wengang Zhou,
Zhe Wang,
Wanli Ouyang
Abstract:
The strong demand for autonomous driving in the industry has led to strong interest in 3D object detection and resulted in many excellent 3D object detection algorithms. However, the vast majority of algorithms only model single-frame data, ignoring the temporal information of the sequence of data. In this work, we propose a new transformer, called Temporal-Channel Transformer, to model the spatial-temporal domain and channel domain relationships for video object detection from Lidar data. As a special design of this transformer, the information encoded in the encoder is different from that in the decoder, i.e. the encoder encodes temporal-channel information of multiple frames while the decoder decodes the spatial-channel information for the current frame in a voxel-wise manner. Specifically, the temporal-channel encoder of the transformer is designed to encode the information of different channels and frames by utilizing the correlation among features from different channels and frames. On the other hand, the spatial decoder of the transformer decodes the information for each location of the current frame. Before object detection is performed by the detection head, a gate mechanism is deployed to re-calibrate the features of the current frame, which filters out object-irrelevant information by repeatedly refining the representation of the target frame along with the up-sampling process. Experimental results show that we achieve state-of-the-art performance in grid voxel-based 3D object detection on the nuScenes benchmark.
Submitted 27 November, 2020;
originally announced November 2020.
-
Evolving Search Space for Neural Architecture Search
Authors:
Yuanzheng Ci,
Chen Lin,
Ming Sun,
Boyu Chen,
Hongwen Zhang,
Wanli Ouyang
Abstract:
The automation of neural architecture design has been a coveted alternative to human experts. Recent works use small search spaces, which are easier to optimize but limit the upper bound of the optimal solution. Extra human design is needed for those methods to propose a more suitable space with respect to the specific task and algorithm capacity. To further enhance the degree of automation for neural architecture search, we present a Neural Search-space Evolution (NSE) scheme that iteratively amplifies the results from the previous effort by maintaining an optimized search space subset. This design minimizes the necessity of a well-designed search space. We further extend the flexibility of obtainable architectures by introducing a learnable multi-branch setting. By employing the proposed method, a consistent performance gain is achieved during a progressive search over upcoming search spaces. We achieve 77.3% top-1 retrain accuracy on ImageNet with 333M FLOPs, which is state-of-the-art performance among previous auto-generated architectures that do not involve knowledge distillation or weight pruning. When the latency constraint is adopted, our result also performs better than the previous best-performing mobile models, with a 77.9% top-1 retrain accuracy.
Submitted 18 August, 2021; v1 submitted 21 November, 2020;
originally announced November 2020.
-
Parity-Dependent Moiré Superlattices in Graphene/h-BN Heterostructures: A Route to Mechanomutable Metamaterials
Authors:
Wengen Ouyang,
Oded Hod,
Michael Urbakh
Abstract:
The superlattice of alternating graphene/h-BN few-layered heterostructures is found to exhibit strong dependence on the parity of the number of layers within the stack. Odd-parity systems show a unique flamingo-like pattern, whereas their even-parity counterparts exhibit regular hexagonal or rectangular superlattices. When the alternating stack consists of seven layers or more, the flamingo pattern becomes favorable, regardless of parity. Notably, the out-of-plane corrugation of the system strongly depends on the shape of the superstructure resulting in significant parity dependence of its mechanical properties. The predicted phenomenon originates in an intricate competition between moiré patterns developing at the interface of consecutive layers. This mechanism is of general nature and is expected to occur in other alternating stacks of closely matched rigid layered materials as demonstrated for homogeneous alternating junctions of twisted graphene and h-BN. Our findings thus allow for the rational design of mechanomutable metamaterials based on Van der Waals heterostructures.
Submitted 14 November, 2020;
originally announced November 2020.
-
Transient Grating Spectroscopy of Photocarrier Dynamics in Semiconducting Polymer Thin Films
Authors:
Wenkai Ouyang,
Yu Li,
Brett Yurash,
Nora Schopp,
Alejandro Vega-Flick,
Viktor Brus,
Thuc-Quyen Nguyen,
Bolin Liao
Abstract:
While charge carrier dynamics and thermal management are both keys to the operational efficiency and stability for energy-related devices, experimental techniques that can simultaneously characterize both properties are still lacking. In this paper, we use laser-induced transient grating (TG) spectroscopy to characterize thin films of the archetypal organic semiconductor regioregular poly(3-hexylthiophene) (P3HT) and its blends with the electron acceptor [6,6]-phenyl-C61-butyric acid methyl ester (PCBM) on glass substrates. While the thermal response is determined to be dominated by the substrates, we show that the recombination dynamics of photocarriers in the organic semiconductor thin films occur on a similar timescale and can be separated from the thermal response. Our measurements indicate that the photocarrier dynamics are determined by multiple recombination processes and our extracted recombination rates are in good agreement with previous reports using other techniques. We further apply TG spectroscopy to characterize another conjugated polymer and a molecular fluorescent material to demonstrate its general applicability. Our study indicates the potential of transient grating spectroscopy to simultaneously characterize thermal transport and photocarrier dynamics in organic optoelectronic devices.
Submitted 29 October, 2020;
originally announced October 2020.
-
Adaptive Gradient Method with Resilience and Momentum
Authors:
Jie Liu,
Chen Lin,
Chuming Li,
Lu Sheng,
Ming Sun,
Junjie Yan,
Wanli Ouyang
Abstract:
Several variants of stochastic gradient descent (SGD) have been proposed to improve learning effectiveness and efficiency when training deep neural networks, among which some recent influential attempts adaptively control the parameter-wise learning rate (e.g., Adam and RMSProp). Although they show a large improvement in convergence speed, most adaptive learning rate methods suffer from compromised generalization compared with SGD. In this paper, we propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem), motivated by the observation that oscillations of network parameters slow the training, and give a theoretical proof of convergence. For each parameter, AdaRem adjusts the parameter-wise learning rate according to whether the direction of that parameter's past changes is aligned with the direction of the current gradient, and thus encourages long-term consistent parameter updating with far fewer oscillations. Comprehensive experiments have been conducted to verify the effectiveness of AdaRem when training various models on a large-scale image recognition dataset, i.e., ImageNet, and they also demonstrate that our method outperforms previous adaptive learning rate-based algorithms in terms of both training speed and test error.
Submitted 21 October, 2020;
originally announced October 2020.
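The idea of scaling each parameter's learning rate by whether its past change agrees with the current update direction can be illustrated with a toy rule. The sketch below is an assumption made for illustration only: the function name, the `boost` and `decay` hyperparameters, and the sign-agreement rule itself are hypothetical and are not the AdaRem update from the paper.

```python
def toy_alignment_step(params, grads, past_change,
                       base_lr=0.1, boost=0.5, decay=0.9):
    """One update of a toy alignment-aware optimizer (NOT the AdaRem rule).

    past_change holds, per parameter, a running average of earlier updates.
    The learning rate is raised when the descent direction (-grad) agrees in
    sign with that average and lowered when it disagrees.
    """
    new_params, new_past = [], []
    for p, g, m in zip(params, grads, past_change):
        step_dir = -g                      # plain gradient-descent direction
        agreement = m * step_dir
        if agreement > 0:                  # consistent with past movement
            lr = base_lr * (1.0 + boost)
        elif agreement < 0:                # would reverse past movement
            lr = base_lr * (1.0 - boost)
        else:                              # no history or zero gradient
            lr = base_lr
        delta = lr * step_dir
        new_params.append(p + delta)
        new_past.append(decay * m + (1.0 - decay) * delta)
    return new_params, new_past


if __name__ == "__main__":
    params, past = toy_alignment_step(
        params=[1.0, 1.0], grads=[1.0, 1.0], past_change=[-1.0, 1.0])
    print(params, past)
```

With identical gradients, the parameter whose history points the same way as the descent step moves further than the one whose history points the opposite way, which damps oscillation in this toy setting.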
-
Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search
Authors:
Mingzhu Shen,
Feng Liang,
Ruihao Gong,
Yuhang Li,
Chuming Li,
Chen Lin,
Fengwei Yu,
Junjie Yan,
Wanli Ouyang
Abstract:
Quantization Neural Networks (QNN) have attracted a lot of attention due to their high efficiency. To enhance the quantization accuracy, prior works mainly focus on designing advanced quantization algorithms but still fail to achieve satisfactory results under the extremely low-bit case. In this work, we take an architecture perspective to investigate the potential of high-performance QNN. Therefore, we propose to combine Network Architecture Search methods with quantization to enjoy the merits of the two sides. However, a naive combination inevitably faces unacceptable time consumption or unstable training. To alleviate these problems, we first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models. Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and meanwhile improves the quantization accuracy. Equipped with this overall framework, dubbed Once Quantization-Aware Training (OQAT), our searched model family, OQATNets, achieves a new state-of-the-art compared with various architectures under different bit-widths. In particular, OQAT-2bit-M achieves 61.6% ImageNet Top-1 accuracy, outperforming its 2-bit MobileNetV3 counterpart by a large margin of 9% with 10% less computation cost. A series of quantization-friendly architectures are identified easily and extensive analysis can be made to summarize the interaction between quantization and neural architectures. Codes and models are released at https://github.com/LaVieEnRoseSMZ/OQA
Submitted 28 September, 2021; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Improving Auto-Augment via Augmentation-Wise Weight Sharing
Authors:
Keyu Tian,
Chen Lin,
Ming Sun,
Lu** Zhou,
Junjie Yan,
Wanli Ouyang
Abstract:
The recent progress on automatically searching augmentation policies has boosted the performance substantially for various tasks. A key component of automatic augmentation search is the evaluation process for a particular augmentation policy, which is utilized to return reward and usually runs thousands of times. A plain evaluation process, which includes full model training and validation, would be time-consuming. To achieve efficiency, many choose to sacrifice evaluation reliability for speed. In this paper, we dive into the dynamics of augmented training of the model. This inspires us to design a powerful and efficient proxy task based on the Augmentation-Wise Weight Sharing (AWS) to form a fast yet accurate evaluation process in an elegant way. Comprehensive analysis verifies the superiority of this approach in terms of effectiveness and efficiency. The augmentation policies found by our method achieve superior accuracies compared with existing auto-augmentation search methods. On CIFAR-10, we achieve a top-1 error rate of 1.24%, which is currently the best performing single model without extra training data. On ImageNet, we get a top-1 error rate of 20.36% for ResNet-50, which leads to 3.34% absolute error rate reduction over the baseline augmentation.
Submitted 22 October, 2020; v1 submitted 30 September, 2020;
originally announced September 2020.
-
SAMOT: Switcher-Aware Multi-Object Tracking and Still Another MOT Measure
Authors:
Weitao Feng,
Zhihao Hu,
Baopu Li,
Weihao Gan,
Wei Wu,
Wanli Ouyang
Abstract:
Multi-Object Tracking (MOT) is a popular topic in computer vision. However, the identity issue, i.e., an object being wrongly associated with another object of a different identity, remains a challenging problem. To address it, more attention should be paid to switchers, i.e., confusing targets that may cause identity issues. Based on this motivation, this paper proposes a novel switcher-aware framework for multi-object tracking, which consists of a Spatial Conflict Graph model (SCG) and Switcher-Aware Association (SAA). The SCG eliminates spatial switchers within one frame by building a conflict graph and working out the optimal subgraph. The SAA utilizes additional information from potential temporal switchers across frames, enabling more accurate data association. Besides, we propose a new MOT evaluation measure, Still Another IDF score (SAIDF), aiming to focus more on identity issues. This new measure may overcome some problems of the previous measures and provide better insight into identity issues in MOT. Finally, the proposed framework is tested under both the traditional measures and the new measure we propose. Extensive experiments show that our method achieves competitive results on all measures.
Submitted 22 September, 2020;
originally announced September 2020.
-
Improving Deep Video Compression by Resolution-adaptive Flow Coding
Authors:
Zhihao Hu,
Zhenghao Chen,
Dong Xu,
Guo Lu,
Wanli Ouyang,
Shuhang Gu
Abstract:
In learning-based video compression approaches, it is an essential issue to compress pixel-level optical flow maps by developing new motion vector (MV) encoders. In this work, we propose a new framework called Resolution-adaptive Flow Coding (RaFC) to effectively compress the flow maps globally and locally, in which we use multi-resolution representations instead of single-resolution representations for both the input flow maps and the output motion features of the MV encoder. To handle complex or simple motion patterns globally, our frame-level scheme RaFC-frame automatically decides the optimal flow map resolution for each video frame. To cope with different types of motion patterns locally, our block-level scheme called RaFC-block can also select the optimal resolution for each local block of motion features. In addition, the rate-distortion criterion is applied to both RaFC-frame and RaFC-block to select the optimal motion coding mode for effective flow coding. Comprehensive experiments on four benchmark datasets HEVC, VTL, UVG and MCL-JCV clearly demonstrate the effectiveness of our overall RaFC framework after combining RaFC-frame and RaFC-block for video compression.
Submitted 13 September, 2020;
originally announced September 2020.
-
Exploring the Hierarchy in Relation Labels for Scene Graph Generation
Authors:
Yi Zhou,
Shuyang Sun,
Chao Zhang,
Yikang Li,
Wanli Ouyang
Abstract:
By assigning each relationship a single label, current approaches formulate the relationship detection as a classification problem. Under this formulation, predicate categories are treated as completely different classes. However, different from the object labels where different classes have explicit boundaries, predicates usually have overlaps in their semantic meanings. For example, sit_on and stand_on have common meanings in vertical relationships but different details of how these two objects are vertically placed. In order to leverage the inherent structures of the predicate categories, we propose to first build the language hierarchy and then utilize the Hierarchy Guided Feature Learning (HGFL) strategy to learn better region features of both the coarse-grained level and the fine-grained level. Besides, we also propose the Hierarchy Guided Module (HGM) to utilize the coarse-grained level to guide the learning of fine-grained level features. Experiments show that the proposed simple yet effective method can improve several state-of-the-art baselines by a large margin (up to 33% relative gain) in terms of Recall@50 on the task of Scene Graph Generation in different datasets.
Submitted 12 September, 2020;
originally announced September 2020.
-
BriNet: Towards Bridging the Intra-class and Inter-class Gaps in One-Shot Segmentation
Authors:
Xianghui Yang,
Bairun Wang,
Kaige Chen,
Xinchi Zhou,
Shuai Yi,
Wanli Ouyang,
Lu** Zhou
Abstract:
Few-shot segmentation focuses on the generalization of models to segment unseen object instances with limited training samples. Although tremendous improvements have been achieved, existing methods are still constrained by two factors. (1) The information interaction between query and support images is not adequate, leaving an intra-class gap. (2) The object categories at the training and inference stages have no overlap, leaving an inter-class gap. Thus, we propose a framework, BriNet, to bridge these gaps. First, more information interactions are encouraged between the extracted features of the query and support images, i.e., using an Information Exchange Module to emphasize the common objects. Furthermore, to precisely localize the query objects, we design a multi-path fine-grained strategy which is able to make better use of the support feature representations. Second, a new online refinement strategy is proposed to help the trained model adapt to unseen classes, achieved by switching the roles of the query and the support images at the inference stage. The effectiveness of our framework is demonstrated by experimental results, in which it outperforms other competitive methods and sets a new state-of-the-art on both the PASCAL VOC and MSCOCO datasets.
Submitted 14 August, 2020;
originally announced August 2020.
-
Rethinking Pseudo-LiDAR Representation
Authors:
Xinzhu Ma,
Shinan Liu,
Zhiyi Xia,
Hongwen Zhang,
Xingyu Zeng,
Wanli Ouyang
Abstract:
The recently proposed pseudo-LiDAR based 3D detectors greatly improve the benchmark of the monocular/stereo 3D detection task. However, the underlying mechanism remains obscure to the research community. In this paper, we perform an in-depth investigation and observe that the efficacy of pseudo-LiDAR representation comes from the coordinate transformation, instead of the data representation itself. Based on this observation, we design an image-based CNN detector named PatchNet, which is more generalized and can be instantiated as pseudo-LiDAR based 3D detectors. Moreover, the pseudo-LiDAR data in our PatchNet is organized as the image representation, which means existing 2D CNN designs can be easily utilized for extracting deep features from input data and boosting 3D detection performance. We conduct extensive experiments on the challenging KITTI dataset, where the proposed PatchNet outperforms all existing pseudo-LiDAR based counterparts. Code has been made available at: https://github.com/xinzhuma/patchnet.
Submitted 11 August, 2020;
originally announced August 2020.
-
MiNet: Mixed Interest Network for Cross-Domain Click-Through Rate Prediction
Authors:
Wentao Ouyang,
Xiuwu Zhang,
Lei Zhao,
**mei Luo,
Yu Zhang,
Heng Zou,
Zhaojie Liu,
Yanlong Du
Abstract:
Click-through rate (CTR) prediction is a critical task in online advertising systems. Existing works mainly address the single-domain CTR prediction problem and model aspects such as feature interaction, user behavior history and contextual information. Nevertheless, ads are usually displayed with natural content, which offers an opportunity for cross-domain CTR prediction. In this paper, we address this problem and leverage auxiliary data from a source domain to improve the CTR prediction performance of a target domain. Our study is based on UC Toutiao (a news feed service integrated with the UC Browser App, serving hundreds of millions of users daily), where the source domain is the news and the target domain is the ad. In order to effectively leverage news data for predicting CTRs of ads, we propose the Mixed Interest Network (MiNet) which jointly models three types of user interest: 1) long-term interest across domains, 2) short-term interest from the source domain and 3) short-term interest in the target domain. MiNet contains two levels of attentions, where the item-level attention can adaptively distill useful information from clicked news / ads and the interest-level attention can adaptively fuse different interest representations. Offline experiments show that MiNet outperforms several state-of-the-art methods for CTR prediction. We have deployed MiNet in UC Toutiao and the A/B test results show that the online CTR is also improved substantially. MiNet now serves the main ad traffic in UC Toutiao.
Submitted 6 August, 2020;
originally announced August 2020.
-
Sliding Over Graphene Grain Boundaries: A Step Towards Macroscale Superlubricity
Authors:
Xiang Gao,
Wengen Ouyang,
Oded Hod,
Michael Urbakh
Abstract:
In light of the race towards macroscale superlubricity of graphitic contacts, the effect of grain boundaries on their frictional properties becomes of central importance. Here, we elucidate the unique frictional mechanisms characterizing topological defects along typical grain boundaries that can vary from being nearly flat to highly corrugated, depending on the boundary misfit angle. We find that frictional energy dissipation over grain boundaries can originate from variations of compressibility along the surface, heat produced during defect (un)buckling events, and elastic energy storage in irreversible buckling processes. These may lead to atypical non-monotonic dependence of the averaged friction on the normal load. The knowledge gained in the present study constitutes an important step towards the realization of superlubricity in macroscopic graphitic contacts.
Submitted 3 August, 2020;
originally announced August 2020.
-
Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation
Authors:
Sheng **,
Wentao Liu,
Enze Xie,
Wenhai Wang,
Chen Qian,
Wanli Ouyang,
** Luo
Abstract:
Multi-person pose estimation is challenging because it localizes body keypoints for multiple persons simultaneously. Previous methods can be divided into two streams, i.e. top-down and bottom-up methods. The top-down methods localize keypoints after human detection, while the bottom-up methods localize keypoints directly and then cluster/group them for different persons, and are generally more efficient than top-down methods. However, in existing bottom-up methods, keypoint grouping is usually solved independently of keypoint detection, which makes them not end-to-end trainable and leads to sub-optimal performance. In this paper, we investigate a new perspective on human part grouping and reformulate it as a graph clustering task. Specifically, we propose a novel differentiable Hierarchical Graph Grouping (HGG) method to learn the graph grouping in the bottom-up multi-person pose estimation task. Moreover, HGG is easily embedded into mainstream bottom-up methods. It takes human keypoint candidates as graph nodes and clusters keypoints in a multi-layer graph neural network model. The modules of HGG can be trained end-to-end with the keypoint detection network and are able to supervise the grouping process in a hierarchical manner. To improve the discrimination of the clustering, we add a set of edge discriminators and macro-node discriminators. Extensive experiments on both COCO and OCHuman datasets demonstrate that the proposed method improves the performance of bottom-up pose estimation methods.
Submitted 23 July, 2020;
originally announced July 2020.
-
Whole-Body Human Pose Estimation in the Wild
Authors:
Sheng **,
Lumin Xu,
** Xu,
Can Wang,
Wentao Liu,
Chen Qian,
Wanli Ouyang,
** Luo
Abstract:
This paper investigates the task of 2D human whole-body pose estimation, which aims to localize dense landmarks on the entire human body including face, hands, body, and feet. As existing datasets do not have whole-body annotations, previous methods have to assemble different deep models trained independently on different datasets of the human face, hand, and body, struggling with dataset biases and large model complexity. To fill in this blank, we introduce COCO-WholeBody which extends COCO dataset with whole-body annotations. To our best knowledge, it is the first benchmark that has manual annotations on the entire human body, including 133 dense landmarks with 68 on the face, 42 on hands and 23 on the body and feet. A single-network model, named ZoomNet, is devised to take into account the hierarchical structure of the full human body to solve the scale variation of different body parts of the same person. ZoomNet is able to significantly outperform existing methods on the proposed COCO-WholeBody dataset. Extensive experiments show that COCO-WholeBody not only can be used to train deep models from scratch for whole-body pose estimation but also can serve as a powerful pre-training dataset for many different tasks such as facial landmark detection and hand keypoint estimation. The dataset is publicly available at https://github.com/**-s13/COCO-WholeBody.
Submitted 23 July, 2020;
originally announced July 2020.
-
Controllable Thermal Conductivity in Twisted Homogeneous Interfaces of Graphene and Hexagonal Boron Nitride
Authors:
Wengen Ouyang,
Huasong Qin,
Michael Urbakh,
Oded Hod
Abstract:
Thermal conductivity of homogeneous twisted stacks of graphite is found to strongly depend on the misfit angle. The underlying mechanism relies on the angle dependence of phonon-phonon couplings across the twisted interface. Excellent agreement between the calculated thermal conductivity of narrow graphitic stacks and corresponding experimental results indicates the validity of the predictions. This is attributed to the accuracy of interlayer interactions descriptions obtained by the dedicated registry-dependent interlayer potential used. Similar results for h-BN stacks indicate overall higher conductivity and reduced misfit angle variation. This opens the way for the design of tunable heterogeneous junctions with controllable heat-transport properties ranging from substrate-isolation to efficient heat evacuation.
Submitted 21 July, 2020;
originally announced July 2020.
-
INS/Odometer Land Navigation by Accurate Measurement Modeling and Multiple-Model Adaptive Estimation
Authors:
Wei Ouyang,
Yuanxin Wu,
Hongyue Chen
Abstract:
Land vehicle navigation based on an inertial navigation system (INS) and odometers is a classical autonomous navigation application and has been extensively studied over the past several decades. In this work, we carefully analyze the error characteristics of the odometer (OD) pulses and investigate three types of odometer measurement models in the INS/OD integrated system. Specifically, in the pulse velocity model, a preliminary Kalman filter is designed to obtain accurate vehicle velocity from the accumulated pulses; the pulse increment model is accordingly obtained by integrating the pulse velocity; a new pulse accumulation model is proposed by augmenting the travelled distance into the system state. The three types of measurements, along with the nonholonomic constraint (NHC), are implemented in the standard extended Kalman filter. In view of the motion-related pulse error characteristics, the multiple model adaptive estimation (MMAE) approach is exploited to further enhance the performance. Simulations and long-distance experiments are conducted to verify the feasibility and effectiveness of the proposed methods. It is shown that the standard pulse velocity measurement achieves superior performance, whereas the accumulated pulse measurement is most favorable with the MMAE enhancement.
Submitted 20 July, 2020;
originally announced July 2020.
-
Anderson Acceleration for Nonconvex ADMM Based on Douglas-Rachford Splitting
Authors:
Wenqing Ouyang,
Yue Peng,
Yuxin Yao,
Juyong Zhang,
Bailin Deng
Abstract:
The alternating direction multiplier method (ADMM) is widely used in computer graphics for solving optimization problems that can be nonsmooth and nonconvex. It converges quickly to an approximate solution, but can take a long time to converge to a solution of high-accuracy. Previously, Anderson acceleration has been applied to ADMM, by treating it as a fixed-point iteration for the concatenation of the dual variables and a subset of the primal variables. In this paper, we note that the equivalence between ADMM and Douglas-Rachford splitting reveals that ADMM is in fact a fixed-point iteration in a lower-dimensional space. By applying Anderson acceleration to such lower-dimensional fixed-point iteration, we obtain a more effective approach for accelerating ADMM. We analyze the convergence of the proposed acceleration method on nonconvex problems, and verify its effectiveness on a variety of computer graphics problems including geometry processing and physical simulation.
Submitted 26 June, 2020; v1 submitted 25 June, 2020;
originally announced June 2020.
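The abstract above treats ADMM as a fixed-point iteration and accelerates it with Anderson acceleration. As a rough illustration of the acceleration step only, the following minimal sketch applies Anderson acceleration with a memory of one previous iterate to a generic fixed-point map. The function name `anderson_m1`, the memory size, the stopping rule, and the toy map are assumptions made for illustration; this is not the authors' ADMM or Douglas-Rachford formulation.

```python
import math


def anderson_m1(g, x0, tol=1e-10, max_iter=100):
    """Anderson acceleration with memory 1 for the fixed-point problem x = g(x).

    Illustrative sketch: `g` maps a list of floats to a list of floats.
    Returns the final iterate and the number of loop iterations used.
    """
    x_prev = list(x0)
    gx_prev = g(x_prev)
    f_prev = [a - b for a, b in zip(gx_prev, x_prev)]  # residual g(x) - x
    x = list(gx_prev)  # the first step is a plain fixed-point update
    for it in range(1, max_iter + 1):
        gx = g(x)
        f = [a - b for a, b in zip(gx, x)]
        if math.sqrt(sum(v * v for v in f)) < tol:
            return x, it
        # Choose theta to minimize || f - theta * (f - f_prev) ||^2.
        df = [a - b for a, b in zip(f, f_prev)]
        denom = sum(v * v for v in df)
        theta = sum(a * b for a, b in zip(f, df)) / denom if denom > 0 else 0.0
        # Accelerated iterate: (1 - theta) * g(x) + theta * g(x_prev).
        x_new = [a - theta * (a - b) for a, b in zip(gx, gx_prev)]
        x_prev, gx_prev, f_prev = x, gx, f
        x = x_new
    return x, max_iter


# Toy usage: solve x = cos(x) starting from x = 1.
solution, iterations = anderson_m1(lambda v: [math.cos(v[0])], [1.0])
```

With `theta = 0` the update reduces to the unaccelerated iteration `x = g(x)`, which is why the sketch falls back to that value when the two residuals coincide.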
-
3D Human Mesh Regression with Dense Correspondence
Authors:
Wang Zeng,
Wanli Ouyang,
** Luo,
Wentao Liu,
Xiaogang Wang
Abstract:
Estimating the 3D mesh of the human body from a single 2D image is an important task with many applications such as augmented reality and human-robot interaction. However, prior works reconstructed the 3D mesh from global image features extracted by a convolutional neural network (CNN), where the dense correspondences between the mesh surface and the image pixels are missing, leading to suboptimal solutions. This paper proposes a model-free 3D human mesh estimation framework, named DecoMR, which explicitly establishes the dense correspondence between the mesh and the local image features in the UV space (i.e. a 2D space used for texture mapping of 3D meshes). DecoMR first predicts a pixel-to-surface dense correspondence map (i.e., IUV image), with which we transfer local features from the image space to the UV space. Then the transferred local image features are processed in the UV space to regress a location map, which is well aligned with the transferred features. Finally, we reconstruct the 3D human mesh from the regressed location map with a predefined mapping function. We also observe that existing discontinuous UV maps are unfriendly to network learning. Therefore, we propose a novel UV map that maintains most of the neighboring relations on the original mesh surface. Experiments demonstrate that our proposed local feature alignment and continuous UV map outperform existing 3D mesh based methods on multiple public benchmarks. Code will be made available at https://github.com/zengwang430521/DecoMR
Submitted 6 June, 2021; v1 submitted 10 June, 2020;
originally announced June 2020.
-
Nonmonotone Globalization for Anderson Acceleration via Adaptive Regularization
Authors:
Wenqing Ouyang,
Jiong Tao,
Andre Milzarek,
Bailin Deng
Abstract:
Anderson acceleration (AA) is a popular method for accelerating fixed-point iterations, but may suffer from instability and stagnation. We propose a globalization method for AA to improve stability and achieve unified global and local convergence. Unlike existing AA globalization approaches that rely on safeguarding operations and might hinder fast local convergence, we adopt a nonmonotone trust-region framework and introduce an adaptive quadratic regularization together with a tailored acceptance mechanism. We prove global convergence and show that our algorithm attains the same local convergence as AA under appropriate assumptions. The effectiveness of our method is demonstrated in several numerical experiments.
Submitted 2 May, 2023; v1 submitted 3 June, 2020;
originally announced June 2020.
-
Scope Head for Accurate Localization in Object Detection
Authors:
Geng Zhan,
Dan Xu,
Guo Lu,
Wei Wu,
Chunhua Shen,
Wanli Ouyang
Abstract:
Existing anchor-based and anchor-free object detectors in multi-stage or one-stage pipelines have achieved very promising detection performance. However, they still face the design difficulty of hand-crafted 2D anchor definition and the learning complexity of 1D direct location regression. To tackle these issues, in this paper, we propose a novel detector coined ScopeNet, which models the anchors of each location as a mutually dependent relationship. This approach quantizes the prediction space and employs a coarse-to-fine strategy for localization. It achieves the flexibility of regression-based anchor-free methods while producing more precise predictions. In addition, an inherent anchor selection score is learned to indicate the localization quality of the detection result, and we propose to better represent the confidence of a detection box by combining the category-classification score and the anchor-selection score. With our concise and effective design, the proposed ScopeNet achieves state-of-the-art results on COCO.
Submitted 11 May, 2020; v1 submitted 11 May, 2020;
originally announced May 2020.
-
Cheaper Pre-training Lunch: An Efficient Paradigm for Object Detection
Authors:
Dongzhan Zhou,
Xinchi Zhou,
Hongwen Zhang,
Shuai Yi,
Wanli Ouyang
Abstract:
In this paper, we propose a general and efficient pre-training paradigm, Montage pre-training, for object detection. Montage pre-training needs only the target detection dataset while taking only 1/4 of the computational resources compared to the widely adopted ImageNet pre-training. To build such an efficient paradigm, we reduce the potential redundancy by carefully extracting useful samples from the original images, assembling samples in a Montage manner as input, and using an ERF-adaptive dense classification strategy for model pre-training. These designs include not only a new input pattern to improve the spatial utilization but also a novel learning objective to expand the effective receptive field of the pretrained model. The efficiency and effectiveness of Montage pre-training are validated by extensive experiments on the MS-COCO dataset, where the results indicate that the models using Montage pre-training are able to achieve on-par or even better detection performance compared with ImageNet pre-training.
Submitted 31 August, 2020; v1 submitted 25 April, 2020;
originally announced April 2020.
-
Location-Aware Feature Selection Text Detection Network
Authors:
Zengyuan Guo,
Zilin Wang,
Zhihui Wang,
Wanli Ouyang,
Haojie Li,
Wen Gao
Abstract:
Regression-based text detection methods have already achieved promising performance with simple network structures and high efficiency. However, they lag behind recent segmentation-based text detectors in accuracy. In this work, we find that one important reason for this is that regression-based methods usually use a fixed feature selection scheme, i.e. selecting features in a single location or in neighboring regions, to predict components of the bounding box, such as the distances to the boundaries or the rotation angle. The features selected in this way are sometimes not the best choices for predicting every component of a text bounding box and thus degrade accuracy. To address this issue, we propose a novel Location-Aware feature Selection text detection Network (LASNet). LASNet selects suitable features from different locations to separately predict the five components of a bounding box and obtains the final bounding box by combining these components. Specifically, instead of using the classification score map to select one feature for predicting the whole bounding box as most existing methods do, the proposed LASNet first learns five new confidence score maps to indicate the prediction accuracy of the respective bounding box components. Then, a Location-Aware Feature Selection mechanism (LAFS) is designed to fuse the top-$K$ prediction results for each component, weighted by their confidence scores, and to combine all five fused components into a final bounding box. As a result, LASNet predicts more accurate bounding boxes by using learnable feature selection. The experimental results demonstrate that our LASNet achieves state-of-the-art performance with single-model and single-scale testing, outperforming all existing regression-based detectors.
Submitted 25 May, 2020; v1 submitted 23 April, 2020;
originally announced April 2020.
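The abstract above describes fusing the top-$K$ predictions of one bounding box component according to their confidence scores. One plausible reading is a confidence-weighted average over the $K$ most confident predictions, sketched below. The function name `fuse_top_k` and the normalization by the sum of the selected confidences are assumptions for illustration, not details confirmed by the abstract.

```python
def fuse_top_k(predictions, confidences, k):
    """Confidence-weighted average of the k most confident predictions.

    Illustrative sketch: `predictions` holds candidate values for a single
    box component (for example a distance to one boundary) and
    `confidences` holds one non-negative score per candidate.
    """
    if len(predictions) != len(confidences):
        raise ValueError("predictions and confidences must have equal length")
    # Rank candidates by confidence and keep the k best.
    ranked = sorted(zip(confidences, predictions), reverse=True)[:k]
    total = sum(c for c, _ in ranked)
    if total <= 0:
        raise ValueError("selected confidences must sum to a positive value")
    # Assumed fusion rule: normalize the selected confidences to sum to 1.
    return sum(c * p for c, p in ranked) / total


# Toy usage: three candidate values for one component, fused with K = 2.
fused = fuse_top_k([10.0, 20.0, 30.0], [0.1, 0.3, 0.6], k=2)
```

In this toy call the least confident candidate is dropped and the result lies between the two remaining values, closer to the more confident one.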
-
Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition
Authors:
Ziyu Liu,
Hongwen Zhang,
Zhenghao Chen,
Zhiyong Wang,
Wanli Ouyang
Abstract:
Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.
Submitted 19 May, 2020; v1 submitted 31 March, 2020;
originally announced March 2020.
-
Content Adaptive and Error Propagation Aware Deep Video Compression
Authors:
Guo Lu,
Chunlei Cai,
Xiaoyun Zhang,
Li Chen,
Wanli Ouyang,
Dong Xu,
Zhiyong Gao
Abstract:
Recently, learning based video compression methods attract increasing attention. However, the previous works suffer from error propagation due to the accumulation of reconstructed error in inter predictive coding. Meanwhile, the previous learning based video codecs are also not adaptive to different video contents. To address these two problems, we propose a content adaptive and error propagation aware video compression system. Specifically, our method employs a joint training strategy by considering the compression performance of multiple consecutive frames instead of a single frame. Based on the learned long-term temporal information, our approach effectively alleviates error propagation in reconstructed frames. More importantly, instead of using the hand-crafted coding modes in the traditional compression systems, we design an online encoder updating scheme in our system. The proposed approach updates the parameters for encoder according to the rate-distortion criterion but keeps the decoder unchanged in the inference stage. Therefore, the encoder is adaptive to different video contents and achieves better compression performance by reducing the domain gap between the training and testing datasets. Our method is simple yet effective and outperforms the state-of-the-art learning based video codecs on benchmark datasets without increasing the model size or decreasing the decoding speed.
Submitted 25 March, 2020;
originally announced March 2020.
-
Channel Pruning Guided by Classification Loss and Feature Importance
Authors:
**yang Guo,
Wanli Ouyang,
Dong Xu
Abstract:
In this work, we propose a new layer-by-layer channel pruning method called Channel Pruning guided by classification Loss and feature Importance (CPLI). In contrast to the existing layer-by-layer channel pruning approaches that only consider how to reconstruct the features from the next layer, our approach additionally takes the classification loss into account in the channel pruning process. We also observe that some reconstructed features will be removed at the next pruning stage, so it is unnecessary to reconstruct these features. To this end, we propose a new strategy to suppress the influence of unimportant features (i.e., the features that will be removed at the next pruning stage). Our comprehensive experiments on three benchmark datasets, i.e., CIFAR-10, ImageNet, and UCF-101, demonstrate the effectiveness of our CPLI method.
Submitted 15 March, 2020;
originally announced March 2020.
-
Equalization Loss for Long-Tailed Object Recognition
Authors:
**gru Tan,
Changbao Wang,
Buyu Li,
Quanquan Li,
Wanli Ouyang,
Changqing Yin,
Junjie Yan
Abstract:
Object recognition techniques using convolutional neural networks (CNN) have achieved great success. However, state-of-the-art object detection methods still perform poorly on large vocabulary and long-tailed datasets, e.g. LVIS. In this work, we analyze this problem from a novel perspective: each positive sample of one category can be seen as a negative sample for other categories, making the tail categories receive more discouraging gradients. Based on this observation, we propose a simple but effective loss, named equalization loss, to tackle the problem of long-tailed rare categories by simply ignoring those gradients for rare categories. The equalization loss protects the learning of rare categories from being at a disadvantage during the network parameter updating. Thus the model is capable of learning better discriminative features for objects of rare classes. Without any bells and whistles, our method achieves AP gains of 4.1% and 4.8% for the rare and common categories on the challenging LVIS benchmark, compared to the Mask R-CNN baseline. With the effective equalization loss, we won 1st place in the LVIS Challenge 2019. Code has been made available at: https://github.com/tztztztztz/eql.detectron2
Submitted 14 April, 2020; v1 submitted 11 March, 2020;
originally announced March 2020.
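The abstract above describes ignoring the discouraging (negative-sample) gradients that rare categories would otherwise receive. A minimal sketch of that idea is given below as a per-category sigmoid cross-entropy in which the negative term is dropped for categories flagged as rare. The sigmoid formulation, the boolean `rare` mask, and the function name are assumptions for illustration and are not taken from the abstract or from the released code.

```python
import math


def equalization_style_loss(logits, target, rare):
    """Per-category sigmoid cross-entropy that skips negatives of rare classes.

    Illustrative sketch: `logits` has one score per category, `target` is the
    index of the ground-truth category, and `rare[j]` is True when category j
    is treated as rare (a hypothetical, externally supplied flag).
    """
    loss = 0.0
    for j, z in enumerate(logits):
        p = 1.0 / (1.0 + math.exp(-z))
        y = 1.0 if j == target else 0.0
        # Weight 0 removes the negative-sample term (and hence its gradient)
        # for rare categories; positive samples are always kept.
        w = 0.0 if (rare[j] and y == 0.0) else 1.0
        loss -= w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return loss


# Toy usage: category 2 is rare and is not the target, so it contributes nothing.
masked = equalization_style_loss([0.0, 0.0, 0.0], target=0, rare=[False, False, True])
plain = equalization_style_loss([0.0, 0.0, 0.0], target=0, rare=[False, False, False])
```

Because the masked term is removed from the sum, a frequent-category sample no longer pushes down the score of the rare category, which is the effect the abstract attributes to the loss.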
-
EcoNAS: Finding Proxies for Economical Neural Architecture Search
Authors:
Dongzhan Zhou,
Xinchi Zhou,
Wenwei Zhang,
Chen Change Loy,
Shuai Yi,
Xuesen Zhang,
Wanli Ouyang
Abstract:
Neural Architecture Search (NAS) has achieved significant progress in many computer vision tasks. While many methods have been proposed to improve the efficiency of NAS, the search process is still laborious because training and evaluating plausible architectures over a large search space is time-consuming. Assessing network candidates under a proxy (i.e., a computationally reduced setting) thus becomes inevitable. In this paper, we observe that most existing proxies exhibit different behaviors in maintaining the rank consistency among network candidates. In particular, some proxies can be more reliable -- the rank of candidates does not differ much between their reduced-setting performance and final performance. We systematically investigate some widely adopted reduction factors and report our observations. Inspired by these observations, we present a reliable proxy and further formulate a hierarchical proxy strategy. The strategy spends more computation on candidate networks that are potentially more accurate, while discarding unpromising ones at an early stage with a fast proxy. This leads to an economical evolutionary-based NAS (EcoNAS), which achieves an impressive 400x search time reduction in comparison to the evolutionary-based state of the art (8 vs. 3150 GPU days). Some new proxies led by our observations can also be applied to accelerate other NAS methods while still being able to discover good candidate networks with performance matching those found by previous proxy strategies.
Submitted 26 February, 2020; v1 submitted 5 January, 2020;
originally announced January 2020.
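The hierarchical proxy strategy in the abstract above discards unpromising candidates early with a fast proxy and spends more computation on the survivors. The sketch below shows only that filtering pattern as a generic staged selection loop. The function name, the `evaluate(candidate, budget)` callback, the budget schedule, and the keep fraction are all hypothetical, and the evolutionary search that EcoNAS builds on is not modeled.

```python
def hierarchical_proxy_search(candidates, evaluate, budgets, keep_fraction=0.5):
    """Filter candidates through increasingly expensive evaluation stages.

    Illustrative sketch: `evaluate(candidate, budget)` returns a score where
    higher is better, and `budgets` lists the evaluation budgets from the
    cheapest (fastest proxy) to the most expensive.
    """
    survivors = list(candidates)
    for budget in budgets:
        # Score every surviving candidate under the current proxy.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        # Keep the best fraction, but never fewer than one candidate.
        keep = max(1, int(len(scored) * keep_fraction))
        survivors = scored[:keep]
    return survivors


# Toy usage: each candidate is represented by its (assumed known) true quality,
# and the proxy returns that quality regardless of budget.
best = hierarchical_proxy_search([0.1, 0.9, 0.5, 0.7], lambda c, b: c, budgets=[1, 4])
```

In this toy run the cheap stage halves the pool and only the two survivors are evaluated at the larger budget, which mirrors the cost-saving argument in the abstract.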
-
Learning 3D Human Shape and Pose from Dense Body Parts
Authors:
Hongwen Zhang,
Jie Cao,
Guo Lu,
Wanli Ouyang,
Zhenan Sun
Abstract:
Reconstructing 3D human shape and pose from monocular images is challenging despite the promising results achieved by the most recent learning-based methods. The commonly occurring misalignment stems from the facts that the mapping from images to the model space is highly non-linear and that the rotation-based pose representation of body models is prone to drift in joint positions. In this work, we investigate learning 3D human shape and pose from dense correspondences of body parts and propose a Decompose-and-aggregate Network (DaNet) to address these issues. DaNet adopts the dense correspondence maps, which densely build a bridge between 2D pixels and 3D vertices, as intermediate representations to facilitate the learning of the 2D-to-3D mapping. The prediction modules of DaNet are decomposed into one global stream and multiple local streams to enable global and fine-grained perceptions for the shape and pose predictions, respectively. Messages from local streams are further aggregated to enhance the robust prediction of the rotation-based poses, where a position-aided rotation feature refinement strategy is proposed to exploit spatial relationships between body joints. Moreover, a Part-based Dropout (PartDrop) strategy is introduced to drop out dense information from intermediate representations during training, encouraging the network to focus on more complementary body parts as well as neighboring position features. The efficacy of the proposed method is validated on both indoor and real-world datasets including Human3.6M, UP3D, COCO, and 3DPW, showing that our method can significantly improve the reconstruction performance in comparison with previous state-of-the-art methods. Our code is publicly available at https://hongwenzhang.github.io/dense2mesh .
Submitted 6 December, 2020; v1 submitted 31 December, 2019;
originally announced December 2019.
-
Computation Reallocation for Object Detection
Authors:
Feng Liang,
Chen Lin,
Ronghao Guo,
Ming Sun,
Wei Wu,
Junjie Yan,
Wanli Ouyang
Abstract:
The allocation of computation resources in the backbone is a crucial issue in object detection. However, the allocation pattern designed for classification is usually adopted directly for object detectors, which has proved to be sub-optimal. In order to reallocate the engaged computation resources in a more efficient way, we present CR-NAS (Computation Reallocation Neural Architecture Search), which can learn computation reallocation strategies across different feature resolutions and spatial positions directly on the target detection dataset. A two-level reallocation space is proposed for both stage and spatial reallocation. A novel hierarchical search procedure is adopted to cope with the complex search space. We apply CR-NAS to multiple backbones and achieve consistent improvements. Our CR-ResNet50 and CR-MobileNetV2 outperform the baselines by 1.9% and 1.7% COCO AP respectively without any additional computation budget. The models discovered by CR-NAS can be equipped with other powerful detection necks/heads and be easily transferred to other datasets, e.g. PASCAL VOC, and other vision tasks, e.g. instance segmentation. Our CR-NAS can be used as a plugin to improve the performance of various networks, which is in demand.
Submitted 24 December, 2019;
originally announced December 2019.