Rethinking IoU-based Optimization for Single-stage 3D Object Detection

Authors: Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Jianqiang Huang, Xian-Sheng Hua, Min-Jian Zhao, Gim Hee Lee

Abstract: Since Intersection-over-Union (IoU) based optimization maintains the consistency of the final IoU prediction metric and losses, it has been widely used in both regression and classification branches of single-stage 2D object detectors. Recently, several 3D object detection methods adopt IoU-based optimization and directly replace the 2D IoU with 3D IoU. However, such a direct computation in 3D is very costly due to the complex implementation and inefficient backward operations. Moreover, 3D IoU-based optimization is sub-optimal as it is sensitive to rotation and thus can cause training instability and detection performance deterioration. In this paper, we propose a novel Rotation-Decoupled IoU (RDIoU) method that can mitigate the rotation-sensitivity issue, and produce more efficient optimization objectives compared with 3D IoU during the training stage. Specifically, our RDIoU simplifies the complex interactions of regression parameters by decoupling the rotation variable as an independent term, yet preserving the geometry of 3D IoU. By incorporating RDIoU into both the regression and classification branches, the network is encouraged to learn more precise bounding boxes and concurrently overcome the misalignment issue between classification and regression. Extensive experiments on the benchmark KITTI and Waymo Open Dataset validate that our RDIoU method can bring substantial improvement for the single-stage 3D object detection.

Submitted 20 July, 2022; v1 submitted 19 July, 2022; originally announced July 2022.

Comments: Accepted by ECCV2022. The code is available at https://github.com/hlsheng1/RDIoU

arXiv:2207.04892 [pdf, other]

Adversarial Style Augmentation for Domain Generalized Urban-Scene Segmentation

Authors: Zhun Zhong, Yuyang Zhao, Gim Hee Lee, Nicu Sebe

Abstract: In this paper, we consider the problem of domain generalization in semantic segmentation, which aims to learn a robust model using only labeled synthetic (source) data. The model is expected to perform well on unseen real (target) domains. Our study finds that the image style variation can largely influence the model's performance and the style features can be well represented by the channel-wise mean and standard deviation of images. Inspired by this, we propose a novel adversarial style augmentation (AdvStyle) approach, which can dynamically generate hard stylized images during training and thus can effectively prevent the model from overfitting on the source domain. Specifically, AdvStyle regards the style feature as a learnable parameter and updates it by adversarial training. The learned adversarial style feature is used to construct an adversarial image for robust model training. AdvStyle is easy to implement and can be readily applied to different models. Experiments on two synthetic-to-real semantic segmentation benchmarks demonstrate that AdvStyle can significantly improve the model performance on unseen real domains and show that we can achieve the state of the art. Moreover, AdvStyle can be employed to domain generalized image classification and produces a clear improvement on the considered datasets.

Submitted 12 October, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

Comments: NeurIPS 2022
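
The abstract above notes that image style can be represented by the channel-wise mean and standard deviation. As a rough illustration of that representation only, the sketch below computes these statistics for a toy image and re-stylizes the image with a different mean and standard deviation. The function names, the nested-list image layout and the `eps` value are assumptions made for this example; the adversarial update of the style feature described in the abstract is not shown.

```python
import math


def channel_stats(image):
    """Return a (mean, std) pair per channel.

    `image` is a list of channels; each channel is a list of rows of floats.
    This layout is an assumption made for the sketch.
    """
    stats = []
    for channel in image:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, math.sqrt(var)))
    return stats


def restyle(image, new_stats, eps=1e-6):
    """Normalize each channel, then rescale it to the given (mean, std)."""
    old_stats = channel_stats(image)
    out = []
    for channel, (mu, sigma), (new_mu, new_sigma) in zip(image, old_stats, new_stats):
        out.append(
            [[new_sigma * (v - mu) / (sigma + eps) + new_mu for v in row] for row in channel]
        )
    return out


# Toy single-channel 2x2 image, re-stylized to mean 10 and std 1.
toy_image = [[[0.0, 2.0], [4.0, 6.0]]]
restyled = restyle(toy_image, [(10.0, 1.0)])
```

In a setting like the one the abstract describes, the target `(mean, std)` pair would presumably be treated as a learnable quantity rather than a fixed constant as it is here.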

arXiv:2205.13579 [pdf, other]

CA-UDA: Class-Aware Unsupervised Domain Adaptation with Optimal Assignment and Pseudo-Label Refinement

Authors: Can Zhang, Gim Hee Lee

Abstract: Recent works on unsupervised domain adaptation (UDA) focus on the selection of good pseudo-labels as surrogates for the missing labels in the target data. However, source domain bias that deteriorates the pseudo-labels can still exist since the shared network of the source and target domains is typically used for the pseudo-label selections. The suboptimal feature space source-to-target domain alignment can also result in unsatisfactory performance. In this paper, we propose CA-UDA to improve the quality of the pseudo-labels and UDA results with optimal assignment, a pseudo-label refinement strategy and class-aware domain alignment. We use an auxiliary network to mitigate the source domain bias for pseudo-label refinement. Our intuition is that the underlying semantics in the target domain can be fully exploited to help refine the pseudo-labels that are inferred from the source features under domain shift. Furthermore, our optimal assignment can optimally align features in the source-to-target domains and our class-aware domain alignment can simultaneously close the domain gap while preserving the classification decision boundaries. Extensive experiments on several benchmark datasets show that our method can achieve state-of-the-art performance in the image classification task.

Submitted 30 May, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

arXiv:2205.09068 [pdf, other]

VRAG: Region Attention Graphs for Content-Based Video Retrieval

Authors: Kennard Ng, Ser-Nam Lim, Gim Hee Lee

Abstract: Content-based Video Retrieval (CBVR) is used on media-sharing platforms for applications such as video recommendation and filtering. To manage databases that scale to billions of videos, video-level approaches that use fixed-size embeddings are preferred due to their efficiency. In this paper, we introduce Video Region Attention Graph Networks (VRAG) that improves the state-of-the-art of video-level methods. We represent videos at a finer granularity via region-level features and encode video spatio-temporal dynamics through region-level relations. Our VRAG captures the relationships between regions based on their semantic content via self-attention and the permutation invariant aggregation of Graph Convolution. In addition, we show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video retrieval. We evaluate our VRAG over several video retrieval tasks and achieve a new state-of-the-art for video-level retrieval. Furthermore, our shot-level VRAG shows higher retrieval precision than other existing video-level methods, and closer performance to frame-level methods at faster evaluation speeds. Finally, our code will be made publicly available.

Submitted 18 May, 2022; originally announced May 2022.

arXiv:2205.04042 [pdf, other]

Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning

Authors: Na Dong, Yongqiang Zhang, Mingli Ding, Gim Hee Lee

Abstract: Incremental few-shot object detection aims at detecting novel classes without forgetting knowledge of the base classes with only a few labeled training data from the novel classes. Most related prior works are on incremental object detection that rely on the availability of abundant training samples per novel class that substantially limits the scalability to real-world setting where novel data can be scarce. In this paper, we propose the Incremental-DETR that does incremental few-shot object detection via fine-tuning and self-supervised learning on the DETR object detector. To alleviate severe over-fitting with few novel class data, we first fine-tune the class-specific components of DETR with self-supervision from additional object proposals generated using Selective Search as pseudo labels. We further introduce an incremental few-shot fine-tuning strategy with knowledge distillation on the class-specific components of DETR to encourage the network in detecting novel classes without forgetting the base classes. Extensive experiments conducted on standard incremental object detection and incremental few-shot object detection settings show that our approach significantly outperforms state-of-the-art methods by a large margin.

Submitted 27 February, 2023; v1 submitted 9 May, 2022; originally announced May 2022.

Comments: Accepted by AAAI2023

arXiv:2204.02548 [pdf, other]

Style-Hallucinated Dual Consistency Learning for Domain Generalized Semantic Segmentation

Authors: Yuyang Zhao, Zhun Zhong, Na Zhao, Nicu Sebe, Gim Hee Lee

Abstract: In this paper, we study the task of synthetic-to-real domain generalized semantic segmentation, which aims to learn a model that is robust to unseen real-world scenes using only synthetic data. The large domain shift between synthetic and real-world data, including the limited source environmental variations and the large distribution gap between synthetic and real-world data, significantly hinders the model performance on unseen real-world scenes. In this work, we propose the Style-HAllucinated Dual consistEncy learning (SHADE) framework to handle such domain shift. Specifically, SHADE is constructed based on two consistency constraints, Style Consistency (SC) and Retrospection Consistency (RC). SC enriches the source situations and encourages the model to learn consistent representation across style-diversified samples. RC leverages real-world knowledge to prevent the model from overfitting to synthetic data and thus largely keeps the representation consistent between the synthetic and real-world models. Furthermore, we present a novel style hallucination module (SHM) to generate style-diversified samples that are essential to consistency learning. SHM selects basis styles from the source distribution, enabling the model to dynamically generate diverse and realistic samples during training. Experiments show that our SHADE yields significant improvement and outperforms state-of-the-art methods by 5.05% and 8.35% on the average mIoU of three real-world datasets on single- and multi-source settings, respectively.

Submitted 19 July, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: ECCV 2022

arXiv:2203.14517 [pdf, other]

REGTR: End-to-end Point Cloud Correspondences with Transformers

Authors: Zi Jian Yew, Gim Hee Lee

Abstract: Despite recent success in incorporating learning into point cloud registration, many works focus on learning feature descriptors and continue to rely on nearest-neighbor feature matching and outlier filtering through RANSAC to obtain the final set of correspondences for pose estimation. In this work, we conjecture that attention mechanisms can replace the role of explicit feature matching and RANSAC, and thus propose an end-to-end framework to directly predict the final set of correspondences. We use a network architecture consisting primarily of transformer layers containing self and cross attentions, and train it to predict the probability each point lies in the overlapping region and its corresponding position in the other point cloud. The required rigid transformation can then be estimated directly from the predicted correspondences without further post-processing. Despite its simplicity, our approach achieves state-of-the-art performance on 3DMatch and ModelNet benchmarks. Our source code can be found at https://github.com/yewzijian/RegTR .

Submitted 28 March, 2022; originally announced March 2022.

Comments: 15 pages, 11 figures, CVPR2022

arXiv:2203.03498 [pdf, other]

Weakly Supervised Learning of Keypoints for 6D Object Pose Estimation

Authors: Meng Tian, Gim Hee Lee

Abstract: State-of-the-art approaches for 6D object pose estimation require large amounts of labeled data to train the deep networks. However, the acquisition of 6D object pose annotations is tedious and labor-intensive in large quantity. To alleviate this problem, we propose a weakly supervised 6D object pose estimation approach based on 2D keypoint detection. Our method trains only on image pairs with known relative transformations between their viewpoints. Specifically, we assign a set of arbitrarily chosen 3D keypoints to represent each unknown target 3D object and learn a network to detect their 2D projections that comply with the relative camera viewpoints. During inference, our network first infers the 2D keypoints from the query image and a given labeled reference image. We then use these 2D keypoints and the arbitrarily chosen 3D keypoints retained from training to infer the 6D object pose. Extensive experiments demonstrate that our approach achieves comparable performance with state-of-the-art fully supervised approaches.

Submitted 7 March, 2022; originally announced March 2022.

arXiv:2112.07241 [pdf, other]

Static-Dynamic Co-Teaching for Class-Incremental 3D Object Detection

Authors: Na Zhao, Gim Hee Lee

Abstract: Deep learning-based approaches have shown remarkable performance in the 3D object detection task. However, they suffer from a catastrophic performance drop on the originally trained classes when incrementally learning new classes without revisiting the old data. This "catastrophic forgetting" phenomenon impedes the deployment of 3D object detection approaches in real-world scenarios, where continuous learning systems are needed. In this paper, we study the unexplored yet important class-incremental 3D object detection problem and present the first solution - SDCoT, a novel static-dynamic co-teaching method. Our SDCoT alleviates the catastrophic forgetting of old classes via a static teacher, which provides pseudo annotations for old classes in the new samples and regularizes the current model by extracting previous knowledge with a distillation loss. At the same time, SDCoT consistently learns the underlying knowledge from new data via a dynamic teacher. We conduct extensive experiments on two benchmark datasets and demonstrate the superior performance of our SDCoT over baseline approaches in several incremental learning scenarios.

Submitted 14 December, 2021; originally announced December 2021.

Comments: Accepted at AAAI 2022

arXiv:2112.01900 [pdf, other]

Novel Class Discovery in Semantic Segmentation

Authors: Yuyang Zhao, Zhun Zhong, Nicu Sebe, Gim Hee Lee

Abstract: We introduce a new setting of Novel Class Discovery in Semantic Segmentation (NCDSS), which aims at segmenting unlabeled images containing new classes given prior knowledge from a labeled set of disjoint classes. In contrast to existing approaches that look at novel class discovery in image classification, we focus on the more challenging semantic segmentation. In NCDSS, we need to distinguish the objects and background, and to handle the existence of multiple classes within an image, which increases the difficulty in using the unlabeled data. To tackle this new setting, we leverage the labeled base data and a saliency model to coarsely cluster novel classes for model training in our basic framework. Additionally, we propose the Entropy-based Uncertainty Modeling and Self-training (EUMS) framework to overcome noisy pseudo-labels, further improving the model performance on the novel classes. Our EUMS utilizes an entropy ranking technique and a dynamic reassignment to distill clean labels, thereby making full use of the noisy data via self-supervised learning. We build the NCDSS benchmark on the PASCAL-5$^i$ dataset and COCO-20$^i$ dataset. Extensive experiments demonstrate the feasibility of the basic framework (achieving an average mIoU of 49.81% on PASCAL-5$^i$) and the effectiveness of EUMS framework (outperforming the basic framework by 9.28% mIoU on PASCAL-5$^i$).

Submitted 28 March, 2022; v1 submitted 3 December, 2021; originally announced December 2021.

Comments: CVPR 2022
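
The abstract above mentions an entropy ranking technique for separating clean from noisy pseudo-labels. The sketch below shows one generic way such a ranking could work: compute the Shannon entropy of each sample's predicted class distribution, sort by it, and treat the lowest-entropy fraction as "clean". The function names and the `clean_ratio` parameter are hypothetical, and this is not claimed to be the procedure used in EUMS; the dynamic reassignment step is not shown.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_entropy(predictions, clean_ratio=0.5):
    """Rank samples by prediction entropy and split them into two index lists.

    `clean_ratio` is a hypothetical knob: the fraction of lowest-entropy
    (most confident) samples treated as clean.
    """
    order = sorted(range(len(predictions)), key=lambda i: entropy(predictions[i]))
    n_clean = int(len(order) * clean_ratio)
    return order[:n_clean], order[n_clean:]


# Toy per-sample class probabilities for a two-class problem.
toy_predictions = [[0.5, 0.5], [0.99, 0.01], [0.7, 0.3], [0.9, 0.1]]
clean_idx, noisy_idx = split_by_entropy(toy_predictions, clean_ratio=0.5)
```

With the toy data, the two most confident predictions (indices 1 and 3) land in the clean split, and the remaining two are left for further handling.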

arXiv:2111.10946 [pdf, other]

A General Framework for Lifelong Localization and Mapping in Changing Environment

Authors: Min Zhao, Xin Guo, Le Song, Baoxing Qin, Xuesong Shi, Gim Hee Lee, Guanghui Sun

Abstract: The environment of most real-world scenarios such as malls and supermarkets changes at all times. A pre-built map that does not account for these changes becomes out-of-date easily. Therefore, it is necessary to have an up-to-date model of the environment to facilitate long-term operation of a robot. To this end, this paper presents a general lifelong simultaneous localization and mapping (SLAM) framework. Our framework uses a multiple session map representation, and exploits an efficient map updating strategy that includes map building, pose graph refinement and sparsification. To mitigate the unbounded increase of memory usage, we propose a map-trimming method based on the Chow-Liu maximum-mutual-information spanning tree. The proposed SLAM framework has been comprehensively validated by over a month of robot deployment in real supermarket environment. Furthermore, we release the dataset collected from the indoor and outdoor changing environment with the hope to accelerate lifelong SLAM research in the community. Our dataset is available at https://github.com/sanduan168/lifelong-SLAM-dataset.

Submitted 21 November, 2021; originally announced November 2021.

arXiv:2111.08176 [pdf, other]

Coarse-to-fine Animal Pose and Shape Estimation

Authors: Chen Li, Gim Hee Lee

Abstract: Most existing animal pose and shape estimation approaches reconstruct animal meshes with a parametric SMAL model. This is because the low-dimensional pose and shape parameters of the SMAL model makes it easier for deep networks to learn the high-dimensional animal meshes. However, the SMAL model is learned from scans of toy animals with limited pose and shape variations, and thus may not be able to represent highly varying real animals well. This may result in poor fittings of the estimated meshes to the 2D evidences, e.g. 2D keypoints or silhouettes. To mitigate this problem, we propose a coarse-to-fine approach to reconstruct 3D animal mesh from a single image. The coarse estimation stage first estimates the pose, shape and translation parameters of the SMAL model. The estimated meshes are then used as a starting point by a graph convolutional network (GCN) to predict a per-vertex deformation in the refinement stage. This combination of SMAL-based and vertex-based representations benefits from both parametric and non-parametric representations. We design our mesh refinement GCN (MRGCN) as an encoder-decoder structure with hierarchical feature representations to overcome the limited receptive field of traditional GCNs. Moreover, we observe that the global image feature used by existing animal mesh reconstruction works is unable to capture detailed shape information for mesh refinement. We thus introduce a local feature extractor to retrieve a vertex-level feature and use it together with the global feature as the input of the MRGCN.
We test our approach on the StanfordExtra dataset and achieve state-of-the-art results. Furthermore, we test the generalization capacity of our approach on the Animal Pose and BADJA datasets. Our code is available at the project website.

Submitted 15 November, 2021; originally announced November 2021.

Comments: Accepted by NeurIPS 2021

arXiv:2111.00728 [pdf, other]

Learning Iterative Robust Transformation Synchronization

Authors: Zi Jian Yew, Gim Hee Lee

Abstract: Transformation Synchronization is the problem of recovering absolute transformations from a given set of pairwise relative motions. Despite its usefulness, the problem remains challenging due to the influences from noisy and outlier relative motions, and the difficulty to model analytically and suppress them with high fidelity. In this work, we avoid handcrafting robust loss functions, and propose to use graph neural networks (GNNs) to learn transformation synchronization. Unlike previous works which use complicated multi-stage pipelines, we use an iterative approach where each step consists of a single weight-shared message passing layer that refines the absolute poses from the previous iteration by predicting an incremental update in the tangent space. To reduce the influence of outliers, the messages are weighted before aggregation. Our iterative approach alleviates the need for an explicit initialization step and performs well with identity initial poses. Although our approach is simple, we show that it performs favorably against existing handcrafted and learned synchronization methods through experiments on both SO(3) and SE(3) synchronization.

Submitted 1 November, 2021; originally announced November 2021.

Comments: To appear in 3DV2021

arXiv:2110.15017 [pdf, other]

Bridging Non Co-occurrence with Unlabeled In-the-wild Data for Incremental Object Detection

Authors: Na Dong, Yongqiang Zhang, Mingli Ding, Gim Hee Lee

Abstract: Deep networks have shown remarkable results in the task of object detection. However, their performance suffers critical drops when they are subsequently trained on novel classes without any sample from the base classes originally used to train the model. This phenomenon is known as catastrophic forgetting. Recently, several incremental learning methods are proposed to mitigate catastrophic forgetting for object detection. Despite the effectiveness, these methods require co-occurrence of the unlabeled base classes in the training data of the novel classes. This requirement is impractical in many real-world settings since the base classes do not necessarily co-occur with the novel classes. In view of this limitation, we consider a more practical setting of complete absence of co-occurrence of the base and novel classes for the object detection task. We propose the use of unlabeled in-the-wild data to bridge the non co-occurrence caused by the missing base classes during the training of additional novel classes. To this end, we introduce a blind sampling strategy based on the responses of the base-class model and pre-trained novel-class model to select a smaller relevant dataset from the large in-the-wild dataset for incremental learning. We then design a dual-teacher distillation framework to transfer the knowledge distilled from the base- and novel-class teacher models to the student model using the sampled in-the-wild data.
Experimental results on the PASCAL VOC and MS COCO datasets show that our proposed method significantly outperforms other state-of-the-art class-incremental object detection methods when there is no co-occurrence between the base and novel classes during training.

Submitted 28 October, 2021; originally announced October 2021.

Comments: Accepted paper at NeurIPS 2021

arXiv:2108.09936 [pdf, other]

Voxel-based Network for Shape Completion by Leveraging Edge Generation

Authors: Xiaogang Wang, Marcelo H Ang Jr, Gim Hee Lee

Abstract: Deep learning technique has yielded significant improvements in point cloud completion with the aim of completing missing object shapes from partial inputs. However, most existing methods fail to recover realistic structures due to over-smoothing of fine-grained details. In this paper, we develop a voxel-based network for point cloud completion by leveraging edge generation (VE-PCN). We first embed point clouds into regular voxel grids, and then generate complete objects with the help of the hallucinated shape edges. This decoupled architecture together with a multi-scale grid feature learning is able to generate more realistic on-surface details. We evaluate our model on the publicly available completion datasets and show that it outperforms existing state-of-the-art approaches quantitatively and qualitatively. Our source code is available at https://github.com/xiaogangw/VE-PCN.

Submitted 23 August, 2021; originally announced August 2021.

Comments: ICCV 2021

arXiv:2106.03422 [pdf, other]

Source-Free Open Compound Domain Adaptation in Semantic Segmentation

Authors: Yuyang Zhao, Zhun Zhong, Zhiming Luo, Gim Hee Lee, Nicu Sebe

Abstract: In this work, we introduce a new concept, named source-free open compound domain adaptation (SF-OCDA), and study it in semantic segmentation. SF-OCDA is more challenging than the traditional domain adaptation but it is more practical. It jointly considers (1) the issues of data privacy and data storage and (2) the scenario of multiple target domains and unseen open domains. In SF-OCDA, only the source pre-trained model and the target data are available to learn the target model. The model is evaluated on the samples from the target and unseen open domains. To solve this problem, we present an effective framework by separating the training process into two stages: (1) pre-training a generalized source model and (2) adapting a target model with self-supervised learning. In our framework, we propose the Cross-Patch Style Swap (CPSS) to diversify samples with various patch styles in the feature-level, which can benefit the training of both stages. First, CPSS can significantly improve the generalization ability of the source model, providing more accurate pseudo-labels for the latter stage. Second, CPSS can reduce the influence of noisy pseudo-labels and also avoid the model overfitting to the target domain during self-supervised learning, consistently boosting the performance on the target and open domains. Experiments demonstrate that our method produces state-of-the-art results on the C-Driving dataset. Furthermore, our model also achieves the leading performance on CityScapes for domain generalization.

Submitted 7 June, 2021; originally announced June 2021.

arXiv:2105.11636 [pdf, other]

FILTRA: Rethinking Steerable CNN by Filter Transform

Authors: Bo Li, Qili Wang, Gim Hee Lee

Abstract: Steerable CNN imposes the prior knowledge of transformation invariance or equivariance in the network architecture to enhance the network's robustness to geometric transformations of data and to reduce overfitting. For decades, an intuitive and widely used technique for constructing a steerable filter has been to augment a filter with its transformed copies, which is named filter transform in this paper. Recently, the problem of steerable CNN has been studied from the aspect of group representation theory, which reveals the function space structure of a steerable kernel function. However, it is not yet clear how this theory is related to the filter transform technique. In this paper, we show that the kernel constructed by filter transform can also be interpreted in group representation theory. This interpretation helps complete the puzzle of steerable CNN theory and provides a novel and simple approach to implement steerable convolution operators. Experiments are executed on multiple datasets to verify the feasibility of the proposed approach.

Submitted 15 February, 2022; v1 submitted 24 May, 2021; originally announced May 2021.

Comments: ICML 2021

arXiv:2104.03501 [pdf, other]

DeepI2P: Image-to-Point Cloud Registration via Deep Classification

Authors: Jiaxin Li, Gim Hee Lee

Abstract: This paper presents DeepI2P: a novel approach for cross-modality registration between an image and a point cloud. Given an image (e.g. from a rgb-camera) and a general point cloud (e.g. from a 3D Lidar scanner) captured at different locations in the same scene, our method estimates the relative rigid transformation between the coordinate frames of the camera and Lidar. Learning common feature descriptors to establish correspondences for the registration is inherently challenging due to the lack of appearance and geometric correlations across the two modalities. We circumvent the difficulty by converting the registration problem into a classification and inverse camera projection optimization problem. A classification neural network is designed to label whether the projection of each point in the point cloud is within or beyond the camera frustum. These labeled points are subsequently passed into a novel inverse camera projection solver to estimate the relative pose. Extensive experimental results on Oxford Robotcar and KITTI datasets demonstrate the feasibility of our approach. Our source code is available at https://github.com/lijx10/DeepI2P

Submitted 8 April, 2021; originally announced April 2021.

Comments: CVPR 2021. Main paper and supplementary materials

arXiv:2104.02385 [pdf, other]

Learning Spatial Context with Graph Neural Network for Multi-Person Pose Grouping

Authors: Jiahao Lin, Gim Hee Lee

Abstract: Bottom-up approaches for image-based multi-person pose estimation consist of two stages: (1) keypoint detection and (2) grouping of the detected keypoints to form person instances. Current grouping approaches rely on learned embedding from only visual features that completely ignore the spatial configuration of human poses. In this work, we formulate the grouping task as a graph partitioning problem, where we learn the affinity matrix with a Graph Neural Network (GNN). More specifically, we design a Geometry-aware Association GNN that utilizes spatial information of the keypoints and learns local affinity from the global context. The learned geometry-based affinity is further fused with appearance-based affinity to achieve robust keypoint association. Spectral clustering is used to partition the graph for the formation of the pose instances. Experimental results on two benchmark datasets show that our proposed method outperforms existing appearance-only grouping frameworks, which shows the effectiveness of utilizing spatial context for robust grouping. Source code is available at: https://github.com/jiahaoLjh/PoseGrou**.

Submitted 6 April, 2021; originally announced April 2021.

Comments: 7 pages, 4 figures. Accepted in ICRA 2021

arXiv:2104.02273 [pdf, other]

Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo

Authors: Jiahao Lin, Gim Hee Lee

Abstract: Existing approaches for multi-view multi-person 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views and solve for the 3D pose estimation for each person. Establishing cross-view correspondences is challenging in multi-person scenes, and incorrect correspondences will lead to sub-optimal performance for the multi-stage pipeline. In this work, we present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot. Specifically, we propose to perform depth regression for each joint of each 2D pose in a target camera view. Cross-view consistency constraints are implicitly enforced by multiple reference camera views via the plane sweep algorithm to facilitate accurate depth regression. We adopt a coarse-to-fine scheme to first regress the person-level depth followed by a per-person joint-level relative depth estimation. 3D poses are obtained from a simple back-projection given the estimated depths. We evaluate our approach on benchmark datasets where it outperforms previous state-of-the-art methods while being remarkably efficient. Our code is available at https://github.com/jiahaoLjh/PlaneSweepPose.

Submitted 5 April, 2021; originally announced April 2021.

Comments: 10 pages, 5 figures. Accepted in CVPR 2021

arXiv:2103.14910 [pdf, other]

MINE: Towards Continuous Depth MPI with NeRF for Novel View Synthesis

Authors: Jiaxin Li, Zijian Feng, Qi She, Henghui Ding, Changhu Wang, Gim Hee Lee

Abstract: In this paper, we propose MINE to perform novel view synthesis and depth estimation via dense 3D reconstruction from a single image. Our approach is a continuous depth generalization of the Multiplane Images (MPI) by introducing the NEural radiance fields (NeRF). Given a single image as input, MINE predicts a 4-channel image (RGB and volume density) at arbitrary depth values to jointly reconstruct the camera frustum and fill in occluded contents. The reconstructed and inpainted frustum can then be easily rendered into novel RGB or depth views using differentiable rendering. Extensive experiments on RealEstate10K, KITTI and Flowers Light Fields show that our MINE outperforms state-of-the-art by a large margin in novel view synthesis. We also achieve competitive results in depth estimation on iBims-1 and NYU-v2 without annotated depth supervision. Our source code is available at https://github.com/vincentfung13/MINE

Submitted 30 July, 2021; v1 submitted 27 March, 2021; originally announced March 2021.

Comments: ICCV 2021. Main paper and supplementary materials

arXiv:2103.14843 [pdf, other]

From Synthetic to Real: Unsupervised Domain Adaptation for Animal Pose Estimation

Authors: Chen Li, Gim Hee Lee

Abstract: Animal pose estimation is an important field that has received increasing attention in recent years. The main challenge for this task is the lack of labeled data. Existing works circumvent this problem with pseudo labels generated from data of other easily accessible domains such as synthetic data. However, these pseudo labels are noisy even with consistency checks or confidence-based filtering due to the domain shift in the data. To solve this problem, we design a multi-scale domain adaptation module (MDAM) to reduce the domain gap between the synthetic and real data. We further introduce an online coarse-to-fine pseudo label updating strategy. Specifically, we propose a self-distillation module in an inner coarse-update loop and a mean-teacher in an outer fine-update loop to generate new pseudo labels that gradually replace the old ones. Consequently, our model is able to learn from the old pseudo labels at the early stage, and gradually switch to the new pseudo labels to prevent overfitting in the later stage. We evaluate our approach on the TigDog and VisDA 2019 datasets, where we outperform existing approaches by a large margin. We also demonstrate the generalization ability of our model by testing extensively on both unseen domains and unseen animal categories. Our code is available at the project website.

Submitted 27 March, 2021; originally announced March 2021.

Comments: CVPR2021

arXiv:2103.14314 [pdf, other]

City-scale Scene Change Detection using Point Clouds

Authors: Zi Jian Yew, Gim Hee Lee

Abstract: We propose a method for detecting structural changes in a city using images captured from vehicle-mounted cameras over traversals at two different times. We first generate 3D point clouds for each traversal from the images and approximate GNSS/INS readings using Structure-from-Motion (SfM). A direct comparison of the two point clouds for change detection is not ideal due to inaccurate geo-location information and possible drifts in the SfM. To circumvent this problem, we propose a deep learning-based non-rigid registration on the point clouds which allows us to compare the point clouds for structural change detection in the scene. Furthermore, we introduce a dual thresholding check and post-processing step to enhance the robustness of our method. We collect two datasets for the evaluation of our approach. Experiments show that our method is able to detect scene changes effectively, even in the presence of viewpoint and illumination differences.

Submitted 26 March, 2021; originally announced March 2021.

Comments: 8 pages, 10 figures. To be presented at ICRA2021

arXiv:2010.08719 [pdf, other]

Cascaded Refinement Network for Point Cloud Completion with Self-supervision

Authors: Xiaogang Wang, Marcelo H Ang Jr, Gim Hee Lee

Abstract: Point clouds are often sparse and incomplete, which imposes difficulties for real-world applications. Existing shape completion methods tend to generate rough shapes without fine-grained details. Considering this, we introduce a two-branch network for shape completion. The first branch is a cascaded shape completion sub-network to synthesize complete objects, where we propose to use the partial input together with the coarse output to preserve the object details during the dense point reconstruction. The second branch is an auto-encoder to reconstruct the original partial input. The two branches share the same feature extractor to learn an accurate global feature for shape completion. Furthermore, we propose two strategies to enable the training of our network when ground truth data are not available. This is to mitigate the dependence of existing approaches on large amounts of ground truth training data that are often difficult to obtain in real-world applications. Additionally, our proposed strategies are also able to improve the reconstruction quality for fully supervised learning. We verify our approach in self-supervised, semi-supervised and fully supervised settings with superior performance. Quantitative and qualitative results on different datasets demonstrate that our method achieves more realistic outputs than state-of-the-art approaches on the point cloud completion task.

Submitted 26 August, 2021; v1 submitted 17 October, 2020; originally announced October 2020.

Comments: Accepted by PAMI. Extended version of the following paper: Cascaded Refinement Network for Point Cloud Completion. CVPR 2020. arXiv link: arXiv:2004.03327

arXiv:2008.05770 [pdf, other]

Weakly Supervised Generative Network for Multiple 3D Human Pose Hypotheses

Authors: Chen Li, Gim Hee Lee

Abstract: 3D human pose estimation from a single image is an inverse problem due to the inherent ambiguity of the missing depth. Several previous works addressed the inverse problem by generating multiple hypotheses. However, these works are strongly supervised and require ground truth 2D-to-3D correspondences which can be difficult to obtain. In this paper, we propose a weakly supervised deep generative network to address the inverse problem and circumvent the need for ground truth 2D-to-3D correspondences. To this end, we design our network to model a proposal distribution which we use to approximate the unknown multi-modal target posterior distribution. We achieve the approximation by minimizing the KL divergence between the proposal and target distributions, and this leads to a 2D reprojection error and a prior loss term that can be weakly supervised. Furthermore, we determine the most probable solution as the conditional mode of the samples using the mean-shift algorithm. We evaluate our method on three benchmark datasets -- Human3.6M, MPII and MPI-INF-3DHP. Experimental results show that our approach is capable of generating multiple feasible hypotheses and achieves state-of-the-art results compared to existing weakly supervised approaches. Our source code is available at the project website.

Submitted 13 August, 2020; originally announced August 2020.

Comments: Accepted to BMVC2020

arXiv:2008.00394 [pdf, other]

Point Cloud Completion by Learning Shape Priors

Authors: Xiaogang Wang, Marcelo H Ang Jr, Gim Hee Lee

Abstract: In view of the difficulty in reconstructing object details in point cloud completion, we propose a shape prior learning method for object completion. The shape priors include geometric information in both the complete and the partial point clouds. We design a feature alignment strategy to learn the shape prior from complete points, and a coarse-to-fine strategy to incorporate the partial prior in the fine stage. To learn the complete objects prior, we first train a point cloud auto-encoder to extract the latent embeddings from complete points. Then we learn a mapping to transfer the point features from partial points to that of the complete points by optimizing feature alignment losses. The feature alignment losses consist of an L2 distance and an adversarial loss obtained by Maximum Mean Discrepancy Generative Adversarial Network (MMD-GAN). The L2 distance optimizes the partial features towards the complete ones in the feature space, and MMD-GAN decreases the statistical distance of two point features in a Reproducing Kernel Hilbert Space. We achieve state-of-the-art performance on the point cloud completion task. Our code is available at https://github.com/xiaogangw/point-cloud-completion-shape-prior.

Submitted 15 July, 2021; v1 submitted 2 August, 2020; originally announced August 2020.

Comments: IROS 2020

arXiv:2007.10986 [pdf, other]

Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry

Authors: He Chen, Pengfei Guo, Pengfei Li, Gim Hee Lee, Gregory Chirikjian

Abstract: Epipolar constraints are at the core of feature matching and depth estimation in current multi-person multi-camera 3D human pose estimation methods. Despite the satisfactory performance of this formulation in sparser crowd scenes, its effectiveness is frequently challenged under denser crowd circumstances mainly due to two sources of ambiguity. The first is the mismatch of human joints resulting from the simple cues provided by the Euclidean distances between joints and epipolar lines. The second is the lack of robustness from the naive formulation of the problem as a least squares minimization. In this paper, we depart from the multi-person 3D pose estimation formulation, and instead reformulate it as crowd pose estimation. Our method consists of two key components: a graph model for fast cross-view matching, and a maximum a posteriori (MAP) estimator for the reconstruction of the 3D human poses. We demonstrate the effectiveness and superiority of our proposed method on four benchmark datasets.

Submitted 21 July, 2020; originally announced July 2020.

arXiv:2007.08943 [pdf, other]

HDNet: Human Depth Estimation for Multi-Person Camera-Space Localization

Authors: Jiahao Lin, Gim Hee Lee

Abstract: Current works on multi-person 3D pose estimation mainly focus on the estimation of the 3D joint locations relative to the root joint and ignore the absolute locations of each pose. In this paper, we propose the Human Depth Estimation Network (HDNet), an end-to-end framework for absolute root joint localization in the camera coordinate space. Our HDNet first estimates the 2D human pose with heatmaps of the joints. These estimated heatmaps serve as attention masks for pooling features from image regions corresponding to the target person. A skeleton-based Graph Neural Network (GNN) is utilized to propagate features among joints. We formulate the target depth regression as a bin index estimation problem, which can be transformed with a soft-argmax operation from the classification output of our HDNet. We evaluate our HDNet on the root joint localization and root-relative 3D pose estimation tasks with two benchmark datasets, i.e., Human3.6M and MuPoTS-3D. The experimental results show that we outperform the previous state-of-the-art consistently under multiple evaluation metrics. Our source code is available at: https://github.com/jiahaoLjh/HumanDepth.

Submitted 17 July, 2020; originally announced July 2020.

Comments: 16 pages, 5 figures. Accepted in ECCV 2020

arXiv:2007.08454 [pdf, other]

Shape Prior Deformation for Categorical 6D Object Pose and Size Estimation

Authors: Meng Tian, Marcelo H Ang Jr, Gim Hee Lee

Abstract: We present a novel learning approach to recover the 6D poses and sizes of unseen object instances from an RGB-D image. To handle the intra-class shape variation, we propose a deep network to reconstruct the 3D object model by explicitly modeling the deformation from a pre-learned categorical shape prior. Additionally, our network infers the dense correspondences between the depth observation of the object instance and the reconstructed 3D model to jointly estimate the 6D object pose and size. We design an autoencoder that trains on a collection of object models and compute the mean latent embedding for each category to learn the categorical shape priors. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms the state of the art. Our code is available at https://github.com/mentian/object-deformnet.

Submitted 16 July, 2020; originally announced July 2020.

Comments: Accepted at ECCV 2020

arXiv:2007.07686 [pdf, other]

Relative Pose Estimation of Calibrated Cameras with Known $\mathrm{SE}(3)$ Invariants

Authors: Bo Li, Evgeniy Martyushev, Gim Hee Lee

Abstract: The $\mathrm{SE}(3)$ invariants of a pose include its rotation angle and screw translation. In this paper, we present a comprehensive study of the relative pose estimation problem for a calibrated camera constrained by known $\mathrm{SE}(3)$ invariants, which involves 5 minimal problems in total. These problems reduce the minimal number of point pairs for relative pose estimation and improve the estimation efficiency and robustness. The $\mathrm{SE}(3)$ invariant constraints can come from extra sensor measurements or motion assumptions. Different from conventional relative pose estimation with extra constraints, no extrinsic calibration is required to transform the constraints to the camera frame. This advantage comes from the invariance of $\mathrm{SE}(3)$ invariants across different coordinate systems on a rigid body and makes the solvers more convenient and flexible in practical applications. Besides proposing the concept of relative pose estimation constrained by $\mathrm{SE}(3)$ invariants, we present a comprehensive study of existing polynomial formulations for relative pose estimation and discover their relationship. Different formulations are carefully chosen for each proposed problem to achieve the best efficiency. Experiments on synthetic and real data show performance improvement compared to conventional relative pose estimation methods.

Submitted 15 July, 2020; originally announced July 2020.

arXiv:2007.00860 [pdf]

Enhanced graphitic domains of unreduced graphene oxide and the interplay of hydration behaviour and catalytic activity

Authors: Tobias Foller, Rahman Daiyan, Xiaoheng **, Joshua Leverett, Hangyel Kim, Richard Webster, Jeaniffer E. Yap, Xinyue Wen, Aditya Rawal, K. Kanishka H. DeSilva, Masamichi Yoshimura, Heriberto Bustamante, Shery L. Y. Chang, Priyank Kumar, Yi You, Gwan Hyoung Lee, Rose Amal, Rakesh Joshi

Abstract: Previous studies indicate that the properties of graphene oxide (GO) can be significantly improved by enhancing its graphitic domain size through thermal diffusion and clustering of functional groups. Remarkably, this transition takes place below the decomposition temperature of the functional groups and thus allows fine-tuning of graphitic domains without compromising the functionality of GO. By studying the transformation of GO under mild thermal treatment, we directly observe this size enhancement of graphitic domains from originally 40 nm$^2$ to 200 nm$^2$ through an extensive transmission electron microscopy (TEM) study. Additionally, we confirm the integrity of the functional groups during this process by comprehensive chemical analysis. A closer look into the process confirms the theoretically predicted relevance for the room temperature stability of GO. We further investigate the influence of enlarged graphitic domains on the hydration behaviour of GO and catalytic performance of single-atom catalysts supported by GO.

Submitted 21 May, 2021; v1 submitted 1 July, 2020; originally announced July 2020.

arXiv:2006.12052 [pdf, other]

Few-shot 3D Point Cloud Semantic Segmentation

Authors: Na Zhao, Tat-Seng Chua, Gim Hee Lee

Abstract: Many existing approaches for 3D point cloud semantic segmentation are fully supervised. These fully supervised approaches heavily rely on large amounts of labeled training data that are difficult to obtain and cannot segment new classes after training. To mitigate these limitations, we propose a novel attention-aware multi-prototype transductive few-shot point cloud semantic segmentation method to segment new classes given a few labeled examples. Specifically, each class is represented by multiple prototypes to model the complex data distribution of labeled points. Subsequently, we employ a transductive label propagation method to exploit the affinities between labeled multi-prototypes and unlabeled points, and among the unlabeled points. Furthermore, we design an attention-aware multi-level feature learning network to learn the discriminative features that capture the geometric dependencies and semantic correlations between points. Our proposed method shows significant and consistent improvements compared to baselines in different few-shot point cloud semantic segmentation settings (i.e., 2/3-way 1/5-shot) on two benchmark datasets. Our code is available at https://github.com/Na-Z/attMPTI.

Submitted 29 March, 2021; v1 submitted 22 June, 2020; originally announced June 2020.

Comments: CVPR 2021

ACM Class: I.2.10; I.4.6

arXiv:2005.14686 [pdf, other]

doi 10.1103/PhysRevC.102.064905

Production of $π^0$ and $η$ mesons in U$+$U collisions at $\sqrt{s_{_{NN}}}=192$ GeV

Authors: U. Acharya, C. Aidala, N. N. Ajitanand, Y. Akiba, R. Akimoto, J. Alexander, K. Aoki, N. Apadula, H. Asano, E. T. Atomssa, T. C. Awes, B. Azmoun, V. Babintsev, M. Bai, X. Bai, B. Bannier, K. N. Barish, S. Bathe, V. Baublis, C. Baumann, S. Baumgart, A. Bazilevsky, M. Beaumier, R. Belmont, A. Berdnikov , et al. (378 additional authors not shown)

Abstract: The PHENIX experiment at the Relativistic Heavy Ion Collider measured $π^0$ and $η$ mesons at midrapidity in U$+$U collisions at $\sqrt{s_{_{NN}}}=192$ GeV in a wide transverse momentum range. Measurements were performed in the $π^0(η)\rightarrowγγ$ decay modes. A strong suppression of $π^0$ and $η$ meson production at high transverse momentum was observed in central U$+$U collisions relative to binary scaled $p$$+$$p$ results. Yields of $π^0$ and $η$ mesons measured in U$+$U collisions show a suppression pattern similar to the ones measured in Au$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV for similar numbers of participant nucleons. The $η$/$π^0$ ratios do not show dependence on centrality or transverse momentum, and are consistent with previously measured values in hadron-hadron, hadron-nucleus, nucleus-nucleus, and $e^+e^-$ collisions.

Submitted 13 November, 2020; v1 submitted 29 May, 2020; originally announced May 2020.

Comments: 403 authors from 72 institutions, 13 pages, 6 figures, 7 tables, 2012 data. v2 is version accepted by Physical Review C. Plain text data tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

Journal ref: Phys. Rev. C 102, 064905 (2020)

arXiv:2004.04091 [pdf, other]

Weakly Supervised Semantic Point Cloud Segmentation: Towards 10X Fewer Labels

Authors: Xun Xu, Gim Hee Lee

Abstract: Point cloud analysis has received much attention recently, and segmentation is one of the most important tasks. The success of existing approaches is attributed to deep network design and large amounts of labelled training data, where the latter is assumed to be always available. However, obtaining 3D point cloud segmentation labels is often very costly in practice. In this work, we propose a weakly supervised point cloud segmentation approach which requires only a tiny fraction of points to be labelled in the training stage. This is made possible by learning gradient approximation and exploitation of additional spatial and color smoothness constraints. Experiments are done on three public datasets with different degrees of weak supervision. In particular, our proposed method can produce results that are close to and sometimes even better than its fully supervised counterpart with 10$\times$ fewer labels.

Submitted 8 April, 2020; originally announced April 2020.

Comments: CVPR2020

arXiv:2004.03327 [pdf, other]

Cascaded Refinement Network for Point Cloud Completion

Authors: Xiaogang Wang, Marcelo H Ang Jr, Gim Hee Lee

Abstract: Point clouds are often sparse and incomplete. Existing shape completion methods are incapable of generating details of objects or learning the complex point distributions. To this end, we propose a cascaded refinement network together with a coarse-to-fine strategy to synthesize the detailed object shapes. Considering the local details of partial input with the global shape information together, we can preserve the existing details in the incomplete point set and generate the missing parts with high fidelity. We also design a patch discriminator that guarantees every local area has the same pattern with the ground truth to learn the complicated point distribution. Quantitative and qualitative experiments on different datasets show that our method achieves superior results compared to existing state-of-the-art approaches on the 3D point cloud completion task. Our source code is available at https://github.com/xiaogangw/cascaded-point-completion.git.

Submitted 5 June, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

Comments: CVPR2020

arXiv:2003.13479 [pdf, other]

doi 10.1109/CVPR42600.2020.01184

RPM-Net: Robust Point Matching using Learned Features

Authors: Zi Jian Yew, Gim Hee Lee

Abstract: Iterative Closest Point (ICP) solves the rigid point cloud registration problem iteratively in two steps: (1) make hard assignments of spatially closest point correspondences, and then (2) find the least-squares rigid transformation. The hard assignments of closest point correspondences based on spatial distances are sensitive to the initial rigid transformation and noisy/outlier points, which often cause ICP to converge to wrong local minima. In this paper, we propose the RPM-Net -- a less sensitive to initialization and more robust deep learning-based approach for rigid point cloud registration. To this end, our network uses the differentiable Sinkhorn layer and annealing to get soft assignments of point correspondences from hybrid features learned from both spatial coordinates and local geometry. To further improve registration performance, we introduce a secondary network to predict optimal annealing parameters. Unlike some existing methods, our RPM-Net handles missing correspondences and point clouds with partial visibility. Experimental results show that our RPM-Net achieves state-of-the-art performance compared to existing non-deep learning and recent deep learning methods. Our source code is available at the project website https://github.com/yewzijian/RPMNet .

Submitted 30 March, 2020; originally announced March 2020.

Comments: 10 pages, 4 figures. To appear in CVPR2020

arXiv:2003.13255 [pdf, other]

Joint Orthogonal Band and Power Allocation for Energy Fairness in WPT System with Nonlinear Logarithmic Energy Harvesting Model

Authors: Jaeseob Han, Gyeong Ho Lee, Sangdon Park, Jun Kyun Choi

Abstract: Wireless power transmission (WPT) is expected to play an important role in the Internet of Things services by providing the perpetual operation of IoT sensors. However, to prolong the IoT network's lifetime, the efficient resource allocation algorithm is required, in particular, the energy fairness issue among IoT sensors has been a critical challenge of the WPT system. In this paper, considering energy fairness as the minimum received energy of all energy poverty IoT sensors (EPISs), we allocate orthogonal frequency bands to several EPISs and transfer the RF power on each orthogonal band, using energy beamforming. Based on the energy poverty, we propose orthogonal frequency bands assignment rule, granting the priority to the EPISs with less received energy. We also formulate two transmission power allocation problems, incorporated the nonlinear logarithm-energy harvesting (EH) model. First, the total received power maximization (TRPM) problem is presented and solved by combining the well-known Karush-Kuhn-Tucker (KKT) conditions with the modified water-filling algorithm. Second, the common received power maximization (CRPM) problem is formulated and the optimal solution is derived using the iterative bisection search method. To apply the bisection search method to the problem, this paper proposes a method of specifying the scope of the solution for the objective function defined by the sum of monotonous functions.
In numerical results, assuming the mobility of EPISs by the one-dimensional random walk model, the effectiveness of the mobility of EPISs on the minimum received energy of all EPISs is presented. Finally, the performance of the proposed resource allocation schemes is verified by comparing other resources allocation schemes, such as Round robin and equal power distribution.

Submitted 30 March, 2020; originally announced March 2020.

Comments: 12 pages, 27 figures

arXiv:2003.00188 [pdf, other]

Robust 6D Object Pose Estimation by Learning RGB-D Features

Authors: Meng Tian, Liang Pan, Marcelo H Ang Jr, Gim Hee Lee

Abstract: Accurate 6D object pose estimation is fundamental to robotic manipulation and grasping. Previous methods follow a local optimization approach which minimizes the distance between closest point pairs to handle the rotation ambiguity of symmetric objects. In this work, we propose a novel discrete-continuous formulation for rotation regression to resolve this local-optimum problem. We uniformly sample rotation anchors in SO(3), and predict a constrained deviation from each anchor to the target, as well as uncertainty scores for selecting the best prediction. Additionally, the object location is detected by aggregating point-wise vectors pointing to the 3D center. Experiments on two benchmarks: LINEMOD and YCB-Video, show that the proposed method outperforms state-of-the-art approaches. Our code is available at https://github.com/mentian/object-posenet.

Submitted 9 March, 2020; v1 submitted 29 February, 2020; originally announced March 2020.

Comments: Accepted at ICRA 2020

arXiv:1912.11803 [pdf, other]

SESS: Self-Ensembling Semi-Supervised 3D Object Detection

Authors: Na Zhao, Tat-Seng Chua, Gim Hee Lee

Abstract: The performance of existing point cloud-based 3D object detection methods heavily relies on large-scale high-quality 3D annotations. However, such annotations are often tedious and expensive to collect. Semi-supervised learning is a good alternative to mitigate the data annotation issue, but has remained largely unexplored in 3D object detection. Inspired by the recent success of self-ensembling technique in semi-supervised image classification task, we propose SESS, a self-ensembling semi-supervised 3D object detection framework. Specifically, we design a thorough perturbation scheme to enhance generalization of the network on unlabeled and new unseen data. Furthermore, we propose three consistency losses to enforce the consistency between two sets of predicted 3D object proposals, to facilitate the learning of structure and semantic invariances of objects. Extensive experiments conducted on SUN RGB-D and ScanNet datasets demonstrate the effectiveness of SESS in both inductive and transductive semi-supervised 3D object detection. Our SESS achieves competitive performance compared to the state-of-the-art fully-supervised method by using only 50% labeled data. Our code is available at https://github.com/Na-Z/sess.

Submitted 17 March, 2021; v1 submitted 26 December, 2019; originally announced December 2019.

Comments: CVPR 2020 Oral

arXiv:1909.12555 [pdf, other]

Identifying through Flows for Recovering Latent Representations

Authors: Shen Li, Bryan Hooi, Gim Hee Lee

Abstract: Identifiability, or recovery of the true latent representations from which the observed data originates, is de facto a fundamental goal of representation learning. Yet, most deep generative models do not address the question of identifiability, and thus fail to deliver on the promise of the recovery of the true latent sources that generate the observations. Recent work proposed identifiable generative modelling using variational autoencoders (iVAE) with a theory of identifiability. Due to the intractability of KL divergence between variational approximate posterior and the true posterior, however, iVAE has to maximize the evidence lower bound (ELBO) of the marginal likelihood, leading to suboptimal solutions in both theory and practice. In contrast, we propose an identifiable framework for estimating latent representations using a flow-based model (iFlow). Our approach directly maximizes the marginal likelihood, allowing for theoretical guarantees on identifiability, thereby dispensing with variational approximations. We derive its optimization objective in analytical form, making it possible to train iFlow in an end-to-end manner. Simulations on synthetic data validate the correctness and effectiveness of our proposed method and demonstrate its practical advantages over other existing methods.

Submitted 26 April, 2020; v1 submitted 27 September, 2019; originally announced September 2019.

arXiv:1908.08289 [pdf, other]

Trajectory Space Factorization for Deep Video-Based 3D Human Pose Estimation

Authors: Jiahao Lin, Gim Hee Lee

Abstract: Existing deep learning approaches on 3d human pose estimation for videos are either based on Recurrent or Convolutional Neural Networks (RNNs or CNNs). However, RNN-based frameworks can only tackle sequences with limited frames because sequential models are sensitive to bad frames and tend to drift over long sequences. Although existing CNN-based temporal frameworks attempt to address the sensitivity and drift problems by concurrently processing all input frames in the sequence, the existing state-of-the-art CNN-based framework is limited to 3d pose estimation of a single frame from a sequential input. In this paper, we propose a deep learning-based framework that utilizes matrix factorization for sequential 3d human poses estimation. Our approach processes all input frames concurrently to avoid the sensitivity and drift problems, and yet outputs the 3d pose estimates for every frame in the input sequence. More specifically, the 3d poses in all frames are represented as a motion matrix factorized into a trajectory bases matrix and a trajectory coefficient matrix. The trajectory bases matrix is precomputed from matrix factorization approaches such as Singular Value Decomposition (SVD) or Discrete Cosine Transform (DCT), and the problem of sequential 3d pose estimation is reduced to training a deep network to regress the trajectory coefficient matrix. We demonstrate the effectiveness of our framework on long sequences by achieving state-of-the-art performances on multiple benchmark datasets.
Our source code is available at: https://github.com/jiahaoLjh/trajectory-pose-3d.

Submitted 22 August, 2019; originally announced August 2019.

Comments: 13 pages, 5 figures. Accepted in BMVC 2019

arXiv:1908.05425 [pdf, other]

doi 10.1007/978-3-030-68787-8

PS^2-Net: A Locally and Globally Aware Network for Point-Based Semantic Segmentation

Authors: Na Zhao, Tat-Seng Chua, Gim Hee Lee

Abstract: In this paper, we present the PS^2-Net -- a locally and globally aware deep learning framework for semantic segmentation on 3D scene-level point clouds. In order to deeply incorporate local structures and global context to support 3D scene segmentation, our network is built on four repeatedly stacked encoders, where each encoder has two basic components: EdgeConv that captures local structures and NetVLAD that models global context. Different from existing state-of-the-art methods for point-based scene semantic segmentation that either violate or do not achieve permutation invariance, our PS^2-Net is designed to be permutation invariant which is an essential property of any deep network used to process unordered point clouds. We further provide theoretical proof to guarantee the permutation invariance property of our network. We perform extensive experiments on two large-scale 3D indoor scene datasets and demonstrate that our PS^2-Net is able to achieve state-of-the-art performances as compared to existing approaches.

Submitted 15 August, 2019; originally announced August 2019.

arXiv:1907.13185 [pdf, other]

Degeneracy in Self-Calibration Revisited and a Deep Learning Solution for Uncalibrated SLAM

Authors: Bingbing Zhuang, Quoc-Huy Tran, Pan Ji, Gim Hee Lee, Loong Fah Cheong, Manmohan Chandraker

Abstract: Self-calibration of camera intrinsics and radial distortion has a long history of research in the computer vision community. However, it remains rare to see real applications of such techniques to modern Simultaneous Localization And Mapping (SLAM) systems, especially in driving scenarios. In this paper, we revisit the geometric approach to this problem, and provide a theoretical proof that explicitly shows the ambiguity between radial distortion and scene depth when two-view geometry is used to self-calibrate the radial distortion. In view of such geometric degeneracy, we propose a learning approach that trains a convolutional neural network (CNN) on a large amount of synthetic data. We demonstrate the utility of our proposed method by applying it as a checkerboard-free calibration tool for SLAM, achieving comparable or superior performance to previous learning and hand-crafted methods.

Submitted 30 July, 2019; originally announced July 2019.

Comments: To appear at IROS 2019

arXiv:1907.09798 [pdf, other]

PointAtrousGraph: Deep Hierarchical Encoder-Decoder with Point Atrous Convolution for Unorganized 3D Points

Authors: Liang Pan, Chee-Meng Chew, Gim Hee Lee

Abstract: Motivated by the success of encoding multi-scale contextual information for image analysis, we propose our PointAtrousGraph (PAG) - a deep permutation-invariant hierarchical encoder-decoder for efficiently exploiting multi-scale edge features in point clouds. Our PAG is constructed by several novel modules, such as Point Atrous Convolution (PAC), Edge-preserved Pooling (EP) and Edge-preserved Unpooling (EU). Similar to atrous convolution, our PAC can effectively enlarge receptive fields of filters and thus densely learn multi-scale point features. Following the idea of non-overlapping max-pooling operations, we propose our EP to preserve critical edge features during subsampling. Correspondingly, our EU modules gradually recover spatial information for edge features. In addition, we introduce chained skip subsampling/upsampling modules that directly propagate edge features to the final stage. Particularly, our proposed auxiliary loss functions can further improve our performance. Experimental results show that our PAG outperforms previous state-of-the-art methods on various 3D semantic perception applications.

Submitted 13 September, 2019; v1 submitted 23 July, 2019; originally announced July 2019.

Comments: 11 pages, 10 figures

arXiv:1905.09634 [pdf, other]

Robust Point Cloud Based Reconstruction of Large-Scale Outdoor Scenes

Authors: Ziquan Lan, Zi Jian Yew, Gim Hee Lee

Abstract: Outlier feature matches and loop-closures that survived front-end data association can lead to catastrophic failures in the back-end optimization of large-scale point cloud based 3D reconstruction. To alleviate this problem, we propose a probabilistic approach for robust back-end optimization in the presence of outliers. More specifically, we model the problem as a Bayesian network and solve it using the Expectation-Maximization algorithm. Our approach leverages on a long-tail Cauchy distribution to suppress outlier feature matches in the odometry constraints, and a Cauchy-Uniform mixture model with a set of binary latent variables to simultaneously suppress outlier loop-closure constraints and outlier feature matches in the inlier loop-closure constraints. Furthermore, we show that by using a Gaussian-Uniform mixture model, our approach degenerates to the formulation of a state-of-the-art approach for robust indoor reconstruction. Experimental results demonstrate that our approach has comparable performance with the state-of-the-art on a benchmark indoor dataset, and outperforms it on a large-scale outdoor dataset. Our source code can be found on the project website.

Submitted 23 May, 2019; originally announced May 2019.

Comments: CVPR 2019, 8 pages, 5 figures

arXiv:1904.10300 [pdf, other]

Transferable Semi-supervised 3D Object Detection from RGB-D Data

Authors: Yew Siang Tang, Gim Hee Lee

Abstract: We investigate the direction of training a 3D object detector for new object classes from only 2D bounding box labels of these new classes, while simultaneously transferring information from 3D bounding box labels of the existing classes. To this end, we propose a transferable semi-supervised 3D object detection model that learns a 3D object detector network from training data with two disjoint sets of object classes - a set of strong classes with both 2D and 3D box labels, and another set of weak classes with only 2D box labels. In particular, we suggest a relaxed reprojection loss, box prior loss and a Box-to-Point Cloud Fit network that allow us to effectively transfer useful 3D information from the strong classes to the weak classes during training, and consequently, enable the network to detect 3D objects in the weak classes during inference. Experimental results show that our proposed algorithm outperforms baseline approaches and achieves promising results compared to fully-supervised approaches on the SUN-RGBD and KITTI datasets. Furthermore, we show that our Box-to-Point Cloud Fit network improves performances of the fully-supervised approaches on both datasets.

Submitted 23 April, 2019; originally announced April 2019.

arXiv:1904.09742 [pdf, other]

2D3D-MatchNet: Learning to Match Keypoints Across 2D Image and 3D Point Cloud

Authors: Mengdan Feng, Sixing Hu, Marcelo Ang, Gim Hee Lee

Abstract: Large-scale point cloud generated from 3D sensors is more accurate than its image-based counterpart. However, it is seldom used in visual pose estimation due to the difficulty in obtaining 2D-3D image to point cloud correspondences. In this paper, we propose the 2D3D-MatchNet - an end-to-end deep network architecture to jointly learn the descriptors for 2D and 3D keypoint from image and point cloud, respectively. As a result, we are able to directly match and establish 2D-3D correspondences from the query image and 3D point cloud reference map for visual pose estimation. We create our Oxford 2D-3D Patches dataset from the Oxford Robotcar dataset with the ground truth camera poses and 2D-3D image to point cloud correspondences for training and testing the deep network. Experimental results verify the feasibility of our approach.

Submitted 22 April, 2019; originally announced April 2019.

arXiv:1904.05547 [pdf, other]

Generating Multiple Hypotheses for 3D Human Pose Estimation with Mixture Density Network

Authors: Chen Li, Gim Hee Lee

Abstract: 3D human pose estimation from a monocular image or 2D joints is an ill-posed problem because of depth ambiguity and occluded joints. We argue that 3D human pose estimation from a monocular input is an inverse problem where multiple feasible solutions can exist. In this paper, we propose a novel approach to generate multiple feasible hypotheses of the 3D pose from 2D joints. In contrast to existing deep learning approaches which minimize a mean square error based on an unimodal Gaussian distribution, our method is able to generate multiple feasible hypotheses of 3D pose based on a multimodal mixture density network. Our experiments show that the 3D poses estimated by our approach from an input of 2D joints are consistent in 2D reprojections, which supports our argument that multiple solutions exist for the 2D-to-3D inverse problem. Furthermore, we show state-of-the-art performance on the Human3.6M dataset in both best hypothesis and multi-view settings, and we demonstrate the generalization capacity of our model by testing on the MPII and MPI-INF-3DHP datasets. Our code is available at the project website.

Submitted 11 April, 2019; originally announced April 2019.

Comments: CVPR 2019

arXiv:1904.00319 [pdf, other]

Discrete Rotation Equivariance for Point Cloud Recognition

Authors: Jiaxin Li, Yingcai Bi, Gim Hee Lee

Abstract: Despite the recent active research on processing point clouds with deep networks, little attention has been paid to the sensitivity of the networks to rotations. In this paper, we propose a deep learning architecture that achieves discrete $\mathbf{SO}(2)$/$\mathbf{SO}(3)$ rotation equivariance for point cloud recognition. Specifically, the rotation of an input point cloud with elements of a rotation group is similar to shuffling the feature vectors generated by our approach. The equivariance is easily reduced to invariance by eliminating the permutation with operations such as maximum or average. Our method can be directly applied to any existing point cloud based networks, resulting in significant improvements in their performance for rotated inputs. We show state-of-the-art results in the classification tasks with various datasets under both $\mathbf{SO}(2)$ and $\mathbf{SO}(3)$ rotations. In addition, we further analyze the necessary conditions of applying our approach to PointNet based networks. Source codes at https://github.com/lijx10/rot-equ-net

Submitted 30 March, 2019; originally announced April 2019.

Comments: The 2019 International Conference on Robotics and Automation (ICRA)

arXiv:1904.00229 [pdf, other]

USIP: Unsupervised Stable Interest Point Detection from 3D Point Clouds

Authors: Jiaxin Li, Gim Hee Lee

Abstract: In this paper, we propose the USIP detector: an Unsupervised Stable Interest Point detector that can detect highly repeatable and accurately localized keypoints from 3D point clouds under arbitrary transformations without the need for any ground truth training data. Our USIP detector consists of a feature proposal network that learns stable keypoints from input 3D point clouds and their respective transformed pairs from randomly generated transformations. We provide degeneracy analysis of our USIP detector and suggest solutions to prevent it. We encourage high repeatability and accurate localization of the keypoints with a probabilistic chamfer loss that minimizes the distances between the detected keypoints from the training point cloud pairs. Extensive experimental results of repeatability tests on several simulated and real-world 3D point cloud datasets from Lidar, RGB-D and CAD models show that our USIP detector significantly outperforms existing hand-crafted and deep learning-based 3D keypoint detectors. Our code is available at the project website. https://github.com/lijx10/USIP

Submitted 30 March, 2019; originally announced April 2019.

Comments: 19 pages

Showing 51–100 of 137 results for author: Lee, G H