Search | arXiv e-print repository

Machine Learning-Guided Design of Non-Reciprocal and Asymmetric Elastic Chiral Metamaterials

Authors: Lingxiao Yuan, Emma Lejeune, Harold S. Park

Abstract: There has been significant recent interest in the mechanics community to design structures that can either violate reciprocity, or exhibit elastic asymmetry or odd elasticity. While these properties are highly desirable to enable mechanical metamaterials to exhibit novel wave propagation phenomena, it remains an open question as to how to design passive structures that exhibit both significant non… ▽ More There has been significant recent interest in the mechanics community to design structures that can either violate reciprocity, or exhibit elastic asymmetry or odd elasticity. While these properties are highly desirable to enable mechanical metamaterials to exhibit novel wave propagation phenomena, it remains an open question as to how to design passive structures that exhibit both significant non-reciprocity and elastic asymmetry. In this paper, we first define several design spaces for chiral metamaterials leveraging specific design parameters, including the ligament contact angles, the ligament shape, and circle radius. Having defined the design spaces, we then leverage machine learning approaches, and specifically Bayesian optimization, to determine optimally performing designs within each design space satisfying maximal non-reciprocity or stiffness asymmetry. Finally, we perform multi-objective optimization by determining the Pareto optimum and find chiral metamaterials that simultaneously exhibit high non-reciprocity and stiffness asymmetry. Our analysis of the underlying mechanisms reveals that chiral metamaterials that can display multiple different contact states under loading in different directions are able to simultaneously exhibit both high non-reciprocity and stiffness asymmetry. Overall, this work demonstrates the effectiveness of employing ML to bring insights to a novel domain with limited prior information, and more generally will pave the way for metamaterials with unique properties and functionality in directing and guiding mechanical wave energy. △ Less

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2402.11909 [pdf, other]

One2Avatar: Generative Implicit Head Avatar For Few-shot User Adaptation

Authors: Zhixuan Yu, Ziqian Bai, Abhimitra Meka, Feitong Tan, Qiangeng Xu, Rohit Pandey, Sean Fanello, Hyun Soo Park, Yinda Zhang

Abstract: Traditional methods for constructing high-quality, personalized head avatars from monocular videos demand extensive face captures and training time, posing a significant challenge for scalability. This paper introduces a novel approach to create high quality head avatar utilizing only a single or a few images per user. We learn a generative model for 3D animatable photo-realistic head avatar from… ▽ More Traditional methods for constructing high-quality, personalized head avatars from monocular videos demand extensive face captures and training time, posing a significant challenge for scalability. This paper introduces a novel approach to create high quality head avatar utilizing only a single or a few images per user. We learn a generative model for 3D animatable photo-realistic head avatar from a multi-view dataset of expressions from 2407 subjects, and leverage it as a prior for creating personalized avatar from few-shot images. Different from previous 3D-aware face generative models, our prior is built with a 3DMM-anchored neural radiance field backbone, which we show to be more effective for avatar creation through auto-decoding based on few-shot inputs. We also handle unstable 3DMM fitting by jointly optimizing the 3DMM fitting and camera calibration that leads to better few-shot adaptation. Our method demonstrates compelling results and outperforms existing state-of-the-art methods for few-shot avatar adaptation, paving the way for more efficient and personalized avatar creation. △ Less

Submitted 19 February, 2024; originally announced February 2024.

arXiv:2311.18259 [pdf, other]

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, **g Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, **g Huang, Md Mohaiminul Islam, Suyog Jain , et al. (76 additional authors not shown)

Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from… ▽ More We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/ △ Less

Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

arXiv:2311.05828 [pdf, other]

Diffusion Shape Prior for Wrinkle-Accurate Cloth Registration

Authors: **gfan Guo, Fabian Prada, Donglai Xiang, Javier Romero, Chenglei Wu, Hyun Soo Park, Takaaki Shiratori, Shunsuke Saito

Abstract: Registering clothes from 4D scans with vertex-accurate correspondence is challenging, yet important for dynamic appearance modeling and physics parameter estimation from real-world data. However, previous methods either rely on texture information, which is not always reliable, or achieve only coarse-level alignment. In this work, we present a novel approach to enabling accurate surface registrati… ▽ More Registering clothes from 4D scans with vertex-accurate correspondence is challenging, yet important for dynamic appearance modeling and physics parameter estimation from real-world data. However, previous methods either rely on texture information, which is not always reliable, or achieve only coarse-level alignment. In this work, we present a novel approach to enabling accurate surface registration of texture-less clothes with large deformation. Our key idea is to effectively leverage a shape prior learned from pre-captured clothing using diffusion models. We also propose a multi-stage guidance scheme based on learned functional maps, which stabilizes registration for large-scale deformation even when they vary significantly from training data. Using high-fidelity real captured clothes, our experiments show that the proposed approach based on diffusion models generalizes better than surface registration with VAE or PCA-based priors, outperforming both optimization-based and learning-based non-rigid registration methods for both interpolation and extrapolation tests. △ Less

Submitted 9 November, 2023; originally announced November 2023.

Comments: Project page: https://www-users.cse.umn.edu/~guo00109/projects/3dv2024/

arXiv:2307.14579 [pdf, other]

Neural Representation-Based Method for Metal-induced Artifact Reduction in Dental CBCT Imaging

Authors: Hyoung Suk Park, Kiwan Jeon, ** Keun Seo

Abstract: This study introduces a novel reconstruction method for dental cone-beam computed tomography (CBCT), focusing on effectively reducing metal-induced artifacts commonly encountered in the presence of prevalent metallic implants. Despite significant progress in metal artifact reduction techniques, challenges persist owing to the intricate physical interactions between polychromatic X-ray beams and me… ▽ More This study introduces a novel reconstruction method for dental cone-beam computed tomography (CBCT), focusing on effectively reducing metal-induced artifacts commonly encountered in the presence of prevalent metallic implants. Despite significant progress in metal artifact reduction techniques, challenges persist owing to the intricate physical interactions between polychromatic X-ray beams and metal objects, which are further compounded by the additional effects associated with metal-tooth interactions and factors specific to the dental CBCT data environment. To overcome these limitations, we propose an implicit neural network that generates two distinct and informative tomographic images. One image represents the monochromatic attenuation distribution at a specific energy level, whereas the other captures the nonlinear beam-hardening factor resulting from the polychromatic nature of X-ray beams. In contrast to existing CT reconstruction techniques, the proposed method relies exclusively on the Beer--Lambert law, effectively preventing the generation of metal-induced artifacts during the backprojection process commonly implemented in conventional methods. Extensive experimental evaluations demonstrate that the proposed method effectively reduces metal artifacts while providing high-quality image reconstructions, thus emphasizing the significance of the second image in capturing the nonlinear beam-hardening factor. △ Less

Submitted 26 July, 2023; originally announced July 2023.

Comments: 10 pages, 5 figures

arXiv:2305.10132 [pdf, other]

Automatic 3D Registration of Dental CBCT and Face Scan Data using 2D Projection Images

Authors: Hyoung Suk Park, Chang Min Hyun, Sang-Hwy Lee, ** Keun Seo, Kiwan Jeon

Abstract: This paper presents a fully automatic registration method of dental cone-beam computed tomography (CBCT) and face scan data. It can be used for a digital platform of 3D jaw-teeth-face models in a variety of applications, including 3D digital treatment planning and orthognathic surgery. Difficulties in accurately merging facial scans and CBCT images are due to the different image acquisition method… ▽ More This paper presents a fully automatic registration method of dental cone-beam computed tomography (CBCT) and face scan data. It can be used for a digital platform of 3D jaw-teeth-face models in a variety of applications, including 3D digital treatment planning and orthognathic surgery. Difficulties in accurately merging facial scans and CBCT images are due to the different image acquisition methods and limited area of correspondence between the two facial surfaces. In addition, it is difficult to use machine learning techniques because they use face-related 3D medical data with radiation exposure, which are difficult to obtain for training. The proposed method addresses these problems by reusing an existing machine-learning-based 2D landmark detection algorithm in an open-source library and develo** a novel mathematical algorithm that identifies paired 3D landmarks from knowledge of the corresponding 2D landmarks. A main contribution of this study is that the proposed method does not require annotated training data of facial landmarks because it uses a pre-trained facial landmark detection algorithm that is known to be robust and generalized to various 2D face image models. Note that this reduces a 3D landmark detection problem to a 2D problem of identifying the corresponding landmarks on two 2D projection images generated from two different projection angles. Here, the 3D landmarks for registration were selected from the sub-surfaces with the least geometric change under the CBCT and face scan environments. For the final fine-tuning of the registration, the Iterative Closest Point method was applied, which utilizes geometrical information around the 3D landmarks. The experimental results show that the proposed method achieved an averaged surface distance error of 0.74 mm for three pairs of CBCT and face scan datasets. △ Less

Submitted 26 July, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: 8 pages, 6 figures, 2 tables

MSC Class: 92C55; 15A04; 62F10

arXiv:2305.09986 [pdf, other]

A robust multi-domain network for short-scanning amyloid PET reconstruction

Authors: Hyoung Suk Park, Young ** Jeong, Kiwan Jeon

Abstract: This paper presents a robust multi-domain network designed to restore low-quality amyloid PET images acquired in a short period of time. The proposed method is trained on pairs of PET images from short (2 minutes) and standard (20 minutes) scanning times, sourced from multiple domains. Learning relevant image features between these domains with a single network is challenging. Our key contribution… ▽ More This paper presents a robust multi-domain network designed to restore low-quality amyloid PET images acquired in a short period of time. The proposed method is trained on pairs of PET images from short (2 minutes) and standard (20 minutes) scanning times, sourced from multiple domains. Learning relevant image features between these domains with a single network is challenging. Our key contribution is the introduction of a map** label, which enables effective learning of specific representations between different domains. The network, trained with various map** labels, can efficiently correct amyloid PET datasets in multiple training domains and unseen domains, such as those obtained with new radiotracers, acquisition protocols, or PET scanners. Internal, temporal, and external validations demonstrate the effectiveness of the proposed method. Notably, for external validation datasets from unseen domains, the proposed method achieved comparable or superior results relative to methods trained with these datasets, in terms of quantitative metrics such as normalized root mean-square error and structure similarity index measure. Two nuclear medicine physicians evaluated the amyloid status as positive or negative for the external validation datasets, with accuracies of 0.970 and 0.930 for readers 1 and 2, respectively. △ Less

Submitted 17 May, 2023; originally announced May 2023.

Comments: 21 pages, 7 figures, 3 tables

MSC Class: 92C55; 68T05; 15A29; 65F22

arXiv:2303.06504 [pdf, other]

Normal-guided Garment UV Prediction for Human Re-texturing

Authors: Yasamin Jafarian, Tuanfeng Y. Wang, Duygu Ceylan, Jimei Yang, Nathan Carr, Yi Zhou, Hyun Soo Park

Abstract: Clothes undergo complex geometric deformations, which lead to appearance changes. To edit human videos in a physically plausible way, a texture map must take into account not only the garment transformation induced by the body movements and clothes fitting, but also its 3D fine-grained surface geometry. This poses, however, a new challenge of 3D reconstruction of dynamic clothes from an image or a… ▽ More Clothes undergo complex geometric deformations, which lead to appearance changes. To edit human videos in a physically plausible way, a texture map must take into account not only the garment transformation induced by the body movements and clothes fitting, but also its 3D fine-grained surface geometry. This poses, however, a new challenge of 3D reconstruction of dynamic clothes from an image or a video. In this paper, we show that it is possible to edit dressed human images and videos without 3D reconstruction. We estimate a geometry aware texture map between the garment region in an image and the texture space, a.k.a, UV map. Our UV map is designed to preserve isometry with respect to the underlying 3D surface by making use of the 3D surface normals predicted from the image. Our approach captures the underlying geometry of the garment in a self-supervised way, requiring no ground truth annotation of UV maps and can be readily extended to predict temporally coherent UV maps. We demonstrate that our method outperforms the state-of-the-art human UV map estimation approaches on both real and synthetic data. △ Less

Submitted 11 March, 2023; originally announced March 2023.

arXiv:2303.01678 [pdf, other]

Nonlinear ill-posed problem in low-dose dental cone-beam computed tomography

Authors: Hyoung Suk Park, Chang Min Hyun, ** Keun Seo

Abstract: This paper describes the mathematical structure of the ill-posed nonlinear inverse problem of low-dose dental cone-beam computed tomography (CBCT) and explains the advantages of a deep learning-based approach to the reconstruction of computed tomography images over conventional regularization methods. This paper explains the underlying reasons why dental CBCT is more ill-posed than standard comput… ▽ More This paper describes the mathematical structure of the ill-posed nonlinear inverse problem of low-dose dental cone-beam computed tomography (CBCT) and explains the advantages of a deep learning-based approach to the reconstruction of computed tomography images over conventional regularization methods. This paper explains the underlying reasons why dental CBCT is more ill-posed than standard computed tomography. Despite this severe ill-posedness, the demand for dental CBCT systems is rapidly growing because of their cost competitiveness and low radiation dose. We then describe the limitations of existing methods in the accurate restoration of the morphological structures of teeth using dental CBCT data severely damaged by metal implants. We further discuss the usefulness of panoramic images generated from CBCT data for accurate tooth segmentation. We also discuss the possibility of utilizing radiation-free intra-oral scan data as prior information in CBCT image reconstruction to compensate for the damage to data caused by metal implants. △ Less

Submitted 2 March, 2023; originally announced March 2023.

arXiv:2209.05432 [pdf, other]

Self-supervised Wide Baseline Visual Servoing via 3D Equivariance

Authors: **wook Huh, Jungseok Hong, Suveer Garg, Hyun Soo Park, Volkan Isler

Abstract: One of the challenging input settings for visual servoing is when the initial and goal camera views are far apart. Such settings are difficult because the wide baseline can cause drastic changes in object appearance and cause occlusions. This paper presents a novel self-supervised visual servoing method for wide baseline images which does not require 3D ground truth supervision. Existing approache… ▽ More One of the challenging input settings for visual servoing is when the initial and goal camera views are far apart. Such settings are difficult because the wide baseline can cause drastic changes in object appearance and cause occlusions. This paper presents a novel self-supervised visual servoing method for wide baseline images which does not require 3D ground truth supervision. Existing approaches that regress absolute camera pose with respect to an object require 3D ground truth data of the object in the forms of 3D bounding boxes or meshes. We learn a coherent visual representation by leveraging a geometric property called 3D equivariance-the representation is transformed in a predictable way as a function of 3D transformation. To ensure that the feature-space is faithful to the underlying geodesic space, a geodesic preserving constraint is applied in conjunction with the equivariance. We design a Siamese network that can effectively enforce these two geometric properties without requiring 3D supervision. With the learned model, the relative transformation can be inferred simply by following the gradient in the learned space and used as feedback for closed-loop visual servoing. Our method is evaluated on objects from the YCB dataset, showing meaningful outperformance on a visual servoing task, or object alignment task with respect to state-of-the-art approaches that use 3D supervision. Ours yields more than 35% average distance error reduction and more than 90% success rate with 3cm error tolerance. △ Less

Submitted 12 September, 2022; originally announced September 2022.

Comments: Accepted at the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2207.07077 [pdf, other]

Egocentric Scene Understanding via Multimodal Spatial Rectifier

Authors: Tien Do, Khiem Vuong, Hyun Soo Park

Abstract: In this paper, we study a problem of egocentric scene understanding, i.e., predicting depths and surface normals from an egocentric image. Egocentric scene understanding poses unprecedented challenges: (1) due to large head movements, the images are taken from non-canonical viewpoints (i.e., tilted images) where existing models of geometry prediction do not apply; (2) dynamic foreground objects in… ▽ More In this paper, we study a problem of egocentric scene understanding, i.e., predicting depths and surface normals from an egocentric image. Egocentric scene understanding poses unprecedented challenges: (1) due to large head movements, the images are taken from non-canonical viewpoints (i.e., tilted images) where existing models of geometry prediction do not apply; (2) dynamic foreground objects including hands constitute a large proportion of visual scenes. These challenges limit the performance of the existing models learned from large indoor datasets, such as ScanNet and NYUv2, which comprise predominantly upright images of static scenes. We present a multimodal spatial rectifier that stabilizes the egocentric images to a set of reference directions, which allows learning a coherent visual representation. Unlike unimodal spatial rectifier that often produces excessive perspective warp for egocentric images, the multimodal spatial rectifier learns from multiple directions that can minimize the impact of the perspective warp. To learn visual representations of the dynamic foreground objects, we present a new dataset called EDINA (Egocentric Depth on everyday INdoor Activities) that comprises more than 500K synchronized RGBD frames and gravity directions. Equipped with the multimodal spatial rectifier and the EDINA dataset, our proposed method on single-view depth and surface normal estimation significantly outperforms the baselines not only on our EDINA dataset, but also on other popular egocentric datasets, such as First Person Hand Action (FPHA) and EPIC-KITCHENS. △ Less

Submitted 14 July, 2022; originally announced July 2022.

Comments: Appearing in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

arXiv:2206.14917 [pdf, other]

doi 10.1016/j.cma.2022.115569

Towards out of distribution generalization for problems in mechanics

Authors: Lingxiao Yuan, Harold S. Park, Emma Lejeune

Abstract: There has been a massive increase in research interest towards applying data driven methods to problems in mechanics. While traditional machine learning (ML) methods have enabled many breakthroughs, they rely on the assumption that the training (observed) data and testing (unseen) data are independent and identically distributed (i.i.d). Thus, traditional ML approaches often break down when applie… ▽ More There has been a massive increase in research interest towards applying data driven methods to problems in mechanics. While traditional machine learning (ML) methods have enabled many breakthroughs, they rely on the assumption that the training (observed) data and testing (unseen) data are independent and identically distributed (i.i.d). Thus, traditional ML approaches often break down when applied to real world mechanics problems with unknown test environments and data distribution shifts. In contrast, out-of-distribution (OOD) generalization assumes that the test data may shift (i.e., violate the i.i.d. assumption). To date, multiple methods have been proposed to improve the OOD generalization of ML methods. However, because of the lack of benchmark datasets for OOD regression problems, the efficiency of these OOD methods on regression problems, which dominate the mechanics field, remains unknown. To address this, we investigate the performance of OOD generalization methods for regression problems in mechanics. Specifically, we identify three OOD problems: covariate shift, mechanism shift, and sampling bias. For each problem, we create two benchmark examples that extend the Mechanical MNIST dataset collection, and we investigate the performance of popular OOD generalization methods on these mechanics-specific regression problems. Our numerical experiments show that in most cases, while the OOD generalization algorithms perform better compared to traditional ML methods on these OOD problems, there is a compelling need to develop more robust OOD generalization methods that are effective across multiple OOD scenarios. Overall, we expect that this study, as well as the associated open access benchmark datasets, will enable further development of OOD generalization methods for mechanics specific regression problems. △ Less

Submitted 13 August, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

Journal ref: Comput. Methods Appl. Mech. Engrg. 400 (2022) 115569

arXiv:2203.12780 [pdf, other]

Learning Motion-Dependent Appearance for High-Fidelity Rendering of Dynamic Humans from a Single Camera

Authors: Jae Shin Yoon, Duygu Ceylan, Tuanfeng Y. Wang, **gwan Lu, Jimei Yang, Zhixin Shu, Hyun Soo Park

Abstract: Appearance of dressed humans undergoes a complex geometric transformation induced not only by the static pose but also by its dynamics, i.e., there exists a number of cloth geometric configurations given a pose depending on the way it has moved. Such appearance modeling conditioned on motion has been largely neglected in existing human rendering methods, resulting in rendering of physically implau… ▽ More Appearance of dressed humans undergoes a complex geometric transformation induced not only by the static pose but also by its dynamics, i.e., there exists a number of cloth geometric configurations given a pose depending on the way it has moved. Such appearance modeling conditioned on motion has been largely neglected in existing human rendering methods, resulting in rendering of physically implausible motion. A key challenge of learning the dynamics of the appearance lies in the requirement of a prohibitively large amount of observations. In this paper, we present a compact motion representation by enforcing equivariance -- a representation is expected to be transformed in the way that the pose is transformed. We model an equivariant encoder that can generate the generalizable representation from the spatial and temporal derivatives of the 3D body surface. This learned representation is decoded by a compositional multi-task decoder that renders high fidelity time-varying appearance. Our experiments show that our method can generate a temporally coherent video of dynamic humans for unseen body poses and novel views given a single view video. △ Less

Submitted 23 March, 2022; originally announced March 2022.

Comments: CVPR accepted. 15 pages. 17 figures, 5 tables

Journal ref: IEEE Computer Vision and Pattern Recognition (CVPR) 2022

arXiv:2202.03571 [pdf, other]

Metal Artifact Reduction with Intra-Oral Scan Data for 3D Low Dose Maxillofacial CBCT Modeling

Authors: Chang Min Hyun, Taigyntuya Bayaraa, Hye Sun Yun, Tae Jun Jang, Hyoung Suk Park, ** Keun Seo

Abstract: Low-dose dental cone beam computed tomography (CBCT) has been increasingly used for maxillofacial modeling. However, the presence of metallic inserts, such as implants, crowns, and dental filling, causes severe streaking and shading artifacts in a CBCT image and loss of the morphological structures of the teeth, which consequently prevents accurate segmentation of bones. A two-stage metal artifact… ▽ More Low-dose dental cone beam computed tomography (CBCT) has been increasingly used for maxillofacial modeling. However, the presence of metallic inserts, such as implants, crowns, and dental filling, causes severe streaking and shading artifacts in a CBCT image and loss of the morphological structures of the teeth, which consequently prevents accurate segmentation of bones. A two-stage metal artifact reduction method is proposed for accurate 3D low-dose maxillofacial CBCT modeling, where a key idea is to utilize explicit tooth shape prior information from intra-oral scan data whose acquisition does not require any extra radiation exposure. In the first stage, an image-to-image deep learning network is employed to mitigate metal-related artifacts. To improve the learning ability, the proposed network is designed to take advantage of the intra-oral scan data as side-inputs and perform multi-task learning of auxiliary tooth segmentation. In the second stage, a 3D maxillofacial model is constructed by segmenting the bones from the dental CBCT image corrected in the first stage. For accurate bone segmentation, weighted thresholding is applied, wherein the weighting region is determined depending on the geometry of the intra-oral scan data. Because acquiring a paired training dataset of metal-artifact-free and metal artifact-affected dental CBCT images is challenging in clinical practice, an automatic method of generating a realistic dataset according to the CBCT physics model is introduced. Numerical simulations and clinical experiments show the feasibility of the proposed method, which takes advantage of tooth surface information from intra-oral scan data in 3D low dose maxillofacial CBCT modeling. △ Less

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2112.00216 [pdf, other]

PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound

Authors: Zhijian Yang, Xiaoran Fan, Volkan Isler, Hyun Soo Park

Abstract: Reconstructing the 3D pose of a person in metric scale from a single view image is a geometrically ill-posed problem. For example, we can not measure the exact distance of a person to the camera from a single view image without additional scene assumptions (e.g., known height). Existing learning based approaches circumvent this issue by reconstructing the 3D pose up to scale. However, there are ma… ▽ More Reconstructing the 3D pose of a person in metric scale from a single view image is a geometrically ill-posed problem. For example, we can not measure the exact distance of a person to the camera from a single view image without additional scene assumptions (e.g., known height). Existing learning based approaches circumvent this issue by reconstructing the 3D pose up to scale. However, there are many applications such as virtual telepresence, robotics, and augmented reality that require metric scale reconstruction. In this paper, we show that audio signals recorded along with an image, provide complementary information to reconstruct the metric 3D pose of the person. The key insight is that as the audio signals traverse across the 3D space, their interactions with the body provide metric information about the body's pose. Based on this insight, we introduce a time-invariant transfer function called pose kernel -- the impulse response of audio signals induced by the body pose. The main properties of the pose kernel are that (1) its envelope highly correlates with 3D pose, (2) the time response corresponds to arrival time, indicating the metric distance to the microphone, and (3) it is invariant to changes in the scene geometry configurations. Therefore, it is readily generalizable to unseen scenes. We design a multi-stage 3D CNN that fuses audio and visual signals and learns to reconstruct 3D pose in a metric scale. We show that our multi-modal method produces accurate metric reconstruction in real world scenes, which is not possible with state-of-the-art lifting approaches including parametric mesh regression and depth regression. △ Less

Submitted 2 December, 2021; v1 submitted 30 November, 2021; originally announced December 2021.

arXiv:2110.07058 [pdf, other]

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do , et al. (60 additional authors not shown)

Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons… ▽ More We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/ △ Less

Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

arXiv:2110.00543 [pdf, other]

Self-supervised Secondary Landmark Detection via 3D Representation Learning

Authors: Praneet C. Bala, Jan Zimmermann, Hyun Soo Park, Benjamin Y. Hayden

Abstract: Recent technological developments have spurred great advances in the computerized tracking of joints and other landmarks in moving animals, including humans. Such tracking promises important advances in biology and biomedicine. Modern tracking models depend critically on labor-intensive annotated datasets of primary landmarks by non-expert humans. However, such annotation approaches can be costly… ▽ More Recent technological developments have spurred great advances in the computerized tracking of joints and other landmarks in moving animals, including humans. Such tracking promises important advances in biology and biomedicine. Modern tracking models depend critically on labor-intensive annotated datasets of primary landmarks by non-expert humans. However, such annotation approaches can be costly and impractical for secondary landmarks, that is, ones that reflect fine-grained geometry of animals, and that are often specific to customized behavioral tasks. Due to visual and geometric ambiguity, nonexperts are often not qualified for secondary landmark annotation, which can require anatomical and zoological knowledge. These barriers significantly impede downstream behavioral studies because the learned tracking models exhibit limited generalizability. We hypothesize that there exists a shared representation between the primary and secondary landmarks because the range of motion of the secondary landmarks can be approximately spanned by that of the primary landmarks. We present a method to learn this spatial relationship of the primary and secondary landmarks in three dimensional space, which can, in turn, self-supervise the secondary landmark detector. This 3D representation learning is generic, and can therefore be applied to various multiview settings across diverse organisms, including macaques, flies, and humans. △ Less

Submitted 1 October, 2021; originally announced October 2021.

arXiv:2110.00119 [pdf, other]

HUMBI: A Large Multiview Dataset of Human Body Expressions and Benchmark Challenge

Authors: Jae Shin Yoon, Zhixuan Yu, Jaesik Park, Hyun Soo Park

Abstract: This paper presents a new large multiview dataset called HUMBI for human body expressions with natural clothing. The goal of HUMBI is to facilitate modeling view-specific appearance and geometry of five primary body signals including gaze, face, hand, body, and garment from assorted people. 107 synchronized HD cameras are used to capture 772 distinctive subjects across gender, ethnicity, age, and… ▽ More This paper presents a new large multiview dataset called HUMBI for human body expressions with natural clothing. The goal of HUMBI is to facilitate modeling view-specific appearance and geometry of five primary body signals including gaze, face, hand, body, and garment from assorted people. 107 synchronized HD cameras are used to capture 772 distinctive subjects across gender, ethnicity, age, and style. With the multiview image streams, we reconstruct high fidelity body expressions using 3D mesh models, which allows representing view-specific appearance. We demonstrate that HUMBI is highly effective in learning and reconstructing a complete human model and is complementary to the existing datasets of human body expressions with limited views and subjects such as MPII-Gaze, Multi-PIE, Human3.6M, and Panoptic Studio datasets. Based on HUMBI, we formulate a new benchmark challenge of a pose-guided appearance rendering task that aims to substantially extend photorealism in modeling diverse human expressions in 3D, which is the key enabling factor of authentic social tele-presence. HUMBI is publicly available at http://humbi-data.net △ Less

Submitted 20 December, 2021; v1 submitted 30 September, 2021; originally announced October 2021.

Comments: 18 pages; Accepted to TPAMI

arXiv:2109.11248 [pdf]

Encryption Device Based on Wave-Chaos for Enhanced Physical Security of Wireless Wave Transmission

Authors: Hong Soo Park, Sun K. Hong

Abstract: We introduce an encryption device based on wave-chaos to enhance the physical security of wireless wave transmission. The proposed encryption device is composed of a compact quasi-2D disordered cavity, where transmit signals pass through to be distorted in time before transmission. On the receiving end, the signals can only be decrypted when they pass through an identical cavity. In the absence of… ▽ More We introduce an encryption device based on wave-chaos to enhance the physical security of wireless wave transmission. The proposed encryption device is composed of a compact quasi-2D disordered cavity, where transmit signals pass through to be distorted in time before transmission. On the receiving end, the signals can only be decrypted when they pass through an identical cavity. In the absence of a proper decryption device, the signals cannot be properly decrypted. If a cavity with a different shape is used on the receiving end, vastly different wave dynamics will prevent the signals from being decrypted, causing them to appear as noise. We experimentally demonstrate the proposed concept in an apparatus representing a wireless link. △ Less

Submitted 23 September, 2021; originally announced September 2021.

arXiv:2109.09299 [pdf, other]

Semi-supervised Dense Keypoints Using Unlabeled Multiview Images

Authors: Zhixuan Yu, Haozheng Yu, Long Sha, Sujoy Ganguly, Hyun Soo Park

Abstract: This paper presents a new end-to-end semi-supervised framework to learn a dense keypoint detector using unlabeled multiview images. A key challenge lies in finding the exact correspondences between the dense keypoints in multiple views since the inverse of the keypoint map** can be neither analytically derived nor differentiated. This limits applying existing multiview supervision approaches use… ▽ More This paper presents a new end-to-end semi-supervised framework to learn a dense keypoint detector using unlabeled multiview images. A key challenge lies in finding the exact correspondences between the dense keypoints in multiple views since the inverse of the keypoint map** can be neither analytically derived nor differentiated. This limits applying existing multiview supervision approaches used to learn sparse keypoints that rely on the exact correspondences. To address this challenge, we derive a new probabilistic epipolar constraint that encodes the two desired properties. (1) Soft correspondence: we define a matchability, which measures a likelihood of a point matching to the other image's corresponding point, thus relaxing the requirement of the exact correspondences. (2) Geometric consistency: every point in the continuous correspondence fields must satisfy the multiview consistency collectively. We formulate a probabilistic epipolar constraint using a weighted average of epipolar errors through the matchability thereby generalizing the point-to-point geometric error to the field-to-field geometric error. This generalization facilitates learning a geometrically coherent dense keypoint detection model by utilizing a large number of unlabeled multiview images. Additionally, to prevent degenerative cases, we employ a distillation-based regularization by using a pretrained model. Finally, we design a new neural network architecture, made of twin networks, that effectively minimizes the probabilistic epipolar errors of all possible correspondences between two view images by building affinity matrices. Our method shows superior performance compared to existing methods, including non-differentiable bootstrap** in terms of keypoint accuracy, multiview consistency, and 3D reconstruction accuracy. △ Less

Submitted 19 February, 2024; v1 submitted 20 September, 2021; originally announced September 2021.

Comments: Published as a conference paper at NeurIPS 2021

arXiv:2103.03319 [pdf, other]

Self-supervised 3D Representation Learning of Dressed Humans from Social Media Videos

Authors: Yasamin Jafarian, Hyun Soo Park

Abstract: A key challenge of learning a visual representation for the 3D high fidelity geometry of dressed humans lies in the limited availability of the ground truth data (e.g., 3D scanned models), which results in the performance degradation of 3D human reconstruction when applying to real-world imagery. We address this challenge by leveraging a new data resource: a number of social media dance videos tha… ▽ More A key challenge of learning a visual representation for the 3D high fidelity geometry of dressed humans lies in the limited availability of the ground truth data (e.g., 3D scanned models), which results in the performance degradation of 3D human reconstruction when applying to real-world imagery. We address this challenge by leveraging a new data resource: a number of social media dance videos that span diverse appearance, clothing styles, performances, and identities. Each video depicts dynamic movements of the body and clothes of a single person while lacking the 3D ground truth geometry. To learn a visual representation from these videos, we present a new self-supervised learning method to use the local transformation that warps the predicted local geometry of the person from an image to that of another image at a different time instant. This allows self-supervision by enforcing a temporal coherence over the predictions. In addition, we jointly learn the depths along with the surface normals that are highly responsive to local texture, wrinkle, and shade by maximizing their geometric consistency. Our method is end-to-end trainable, resulting in high fidelity depth estimation that predicts fine geometry faithful to the input real image. We further provide a theoretical bound of self-supervised learning via an uncertainty analysis that characterizes the performance of the self-supervised learning without training. We demonstrate that our method outperforms the state-of-the-art human depth estimation and human shape recovery approaches on both real and rendered images. △ Less

Submitted 27 December, 2022; v1 submitted 4 March, 2021; originally announced March 2021.

arXiv:2102.00062 [pdf, other]

Neural 3D Clothes Retargeting from a Single Image

Authors: Jae Shin Yoon, Kihwan Kim, Jan Kautz, Hyun Soo Park

Abstract: In this paper, we present a method of clothes retargeting; generating the potential poses and deformations of a given 3D clothing template model to fit onto a person in a single RGB image. The problem is fundamentally ill-posed as attaining the ground truth data is impossible, i.e., images of people wearing the different 3D clothing template model at exact same pose. We address this challenge by u… ▽ More In this paper, we present a method of clothes retargeting; generating the potential poses and deformations of a given 3D clothing template model to fit onto a person in a single RGB image. The problem is fundamentally ill-posed as attaining the ground truth data is impossible, i.e., images of people wearing the different 3D clothing template model at exact same pose. We address this challenge by utilizing large-scale synthetic data generated from physical simulation, allowing us to map 2D dense body pose to 3D clothing deformation. With the simulated data, we propose a semi-supervised learning framework that validates the physical plausibility of the 3D deformation by matching with the prescribed body-to-cloth contact points and clothing silhouette to fit onto the unlabeled real images. A new neural clothes retargeting network (CRNet) is designed to integrate the semi-supervised retargeting task in an end-to-end fashion. In our evaluation, we show that our method can predict the realistic 3D pose and deformation field needed for retargeting clothes models in real-world examples. △ Less

Submitted 29 January, 2021; originally announced February 2021.

Comments: 20 pages, 21 figures

arXiv:2012.03796 [pdf, other]

Pose-Guided Human Animation from a Single Image in the Wild

Authors: Jae Shin Yoon, Lingjie Liu, Vladislav Golyanik, Kripasindhu Sarkar, Hyun Soo Park, Christian Theobalt

Abstract: We present a new pose transfer method for synthesizing a human animation from a single image of a person controlled by a sequence of body poses. Existing pose transfer methods exhibit significant visual artifacts when applying to a novel scene, resulting in temporal inconsistency and failures in preserving the identity and textures of the person. To address these limitations, we design a compositi… ▽ More We present a new pose transfer method for synthesizing a human animation from a single image of a person controlled by a sequence of body poses. Existing pose transfer methods exhibit significant visual artifacts when applying to a novel scene, resulting in temporal inconsistency and failures in preserving the identity and textures of the person. To address these limitations, we design a compositional neural network that predicts the silhouette, garment labels, and textures. Each modular network is explicitly dedicated to a subtask that can be learned from the synthetic data. At the inference time, we utilize the trained network to produce a unified representation of appearance and its labels in UV coordinates, which remains constant across poses. The unified representation provides an incomplete yet strong guidance to generating the appearance in response to the pose change. We use the trained network to complete the appearance and render it with the background. With these strategies, we are able to synthesize human animations that can preserve the identity and appearance of the person in a temporally coherent way without any fine-tuning of the network on the testing scene. Experiments show that our method outperforms the state-of-the-arts in terms of synthesis quality, temporal coherence, and generalization ability. △ Less

Submitted 21 November, 2021; v1 submitted 7 December, 2020; originally announced December 2020.

Comments: 14 pages including Appendix

arXiv:2007.09264 [pdf, other]

Surface Normal Estimation of Tilted Images via Spatial Rectifier

Authors: Tien Do, Khiem Vuong, Stergios I. Roumeliotis, Hyun Soo Park

Abstract: In this paper, we present a spatial rectifier to estimate surface normals of tilted images. Tilted images are of particular interest as more visual data are captured by arbitrarily oriented sensors such as body-/robot-mounted cameras. Existing approaches exhibit bounded performance on predicting surface normals because they were trained using gravity-aligned images. Our two main hypotheses are: (1… ▽ More In this paper, we present a spatial rectifier to estimate surface normals of tilted images. Tilted images are of particular interest as more visual data are captured by arbitrarily oriented sensors such as body-/robot-mounted cameras. Existing approaches exhibit bounded performance on predicting surface normals because they were trained using gravity-aligned images. Our two main hypotheses are: (1) visual scene layout is indicative of the gravity direction; and (2) not all surfaces are equally represented by a learned estimator due to the structured distribution of the training data, thus, there exists a transformation for each tilted image that is more responsive to the learned estimator than others. We design a spatial rectifier that is learned to transform the surface normal distribution of a tilted image to the rectified one that matches the gravity-aligned training data distribution. Along with the spatial rectifier, we propose a novel truncated angular loss that offers a stronger gradient at smaller angular errors and robustness to outliers. The resulting estimator outperforms the state-of-the-art methods including data augmentation baselines not only on ScanNet and NYUv2 but also on a new dataset called Tilt-RGBD that includes considerable roll and pitch camera motion. △ Less

Submitted 14 July, 2022; v1 submitted 17 July, 2020; originally announced July 2020.

Comments: Appearing in the European Conference on Computer Vision 2020. This version fixes a typo on the L2 loss function

arXiv:2004.01294 [pdf, other]

Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera

Authors: Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, Jan Kautz

Abstract: This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene. A key challenge for the novel view synthesis arises from dynamic scene reconstruction where epipolar geometry does not apply to the local motion of dynamic contents. To address this challenge, we propose to combine the depth from single view (DSV) and the depth fr… ▽ More This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene. A key challenge for the novel view synthesis arises from dynamic scene reconstruction where epipolar geometry does not apply to the local motion of dynamic contents. To address this challenge, we propose to combine the depth from single view (DSV) and the depth from multi-view stereo (DMV), where DSV is complete, i.e., a depth is assigned to every pixel, yet view-variant in its scale, while DMV is view-invariant yet incomplete. Our insight is that although its scale and quality are inconsistent with other views, the depth estimation from a single view can be used to reason about the globally coherent geometry of dynamic contents. We cast this problem as learning to correct the scale of DSV, and to refine each depth with locally consistent motions between views to form a coherent depth estimation. We integrate these tasks into a depth fusion network in a self-supervised fashion. Given the fused depth maps, we synthesize a photorealistic virtual view in a specific location and time with our deep blending network that completes the scene and renders the virtual view. We evaluate our method of depth estimation and view synthesis on diverse real-world dynamic scenes and show the outstanding performance over existing methods. △ Less

Submitted 2 April, 2020; originally announced April 2020.

Comments: This paper is accepted to CVPR 2020

arXiv:1907.10815 [pdf, other]

Self-Supervised Adaptation of High-Fidelity Face Models for Monocular Performance Tracking

Authors: Jae Shin Yoon, Takaaki Shiratori, Shoou-I Yu, Hyun Soo Park

Abstract: Improvements in data-capture and face modeling techniques have enabled us to create high-fidelity realistic face models. However, driving these realistic face models requires special input data, e.g. 3D meshes and unwrapped textures. Also, these face models expect clean input data taken under controlled lab environments, which is very different from data collected in the wild. All these constraint… ▽ More Improvements in data-capture and face modeling techniques have enabled us to create high-fidelity realistic face models. However, driving these realistic face models requires special input data, e.g. 3D meshes and unwrapped textures. Also, these face models expect clean input data taken under controlled lab environments, which is very different from data collected in the wild. All these constraints make it challenging to use the high-fidelity models in tracking for commodity cameras. In this paper, we propose a self-supervised domain adaptation approach to enable the animation of high-fidelity face models from a commodity camera. Our approach first circumvents the requirement for special input data by training a new network that can directly drive a face model just from a single 2D image. Then, we overcome the domain mismatch between lab and uncontrolled environments by performing self-supervised domain adaptation based on "consecutive frame texture consistency" based on the assumption that the appearance of the face is consistent over consecutive frames, avoiding the necessity of modeling the new environment such as lighting or background. Experiments show that we are able to drive a high-fidelity face model to perform complex facial motion from a cellphone camera without requiring any labeled data from the new domain. △ Less

Submitted 24 July, 2019; originally announced July 2019.

Comments: This work is accepted by CVPR 2019

arXiv:1903.06257 [pdf, other]

doi 10.1109/ACCESS.2019.2934178

Unpaired image denoising using a generative adversarial network in X-ray CT

Authors: Hyoung Suk Park, **eon Baek, Sun Kyoung You, Jae Kyu Choi, ** Keun Seo

Abstract: This paper proposes a deep learning-based denoising method for noisy low-dose computerized tomography (CT) images in the absence of paired training data. The proposed method uses a fidelity-embedded generative adversarial network (GAN) to learn a denoising function from unpaired training data of low-dose CT (LDCT) and standard-dose CT (SDCT) images, where the denoising function is the optimal gene… ▽ More This paper proposes a deep learning-based denoising method for noisy low-dose computerized tomography (CT) images in the absence of paired training data. The proposed method uses a fidelity-embedded generative adversarial network (GAN) to learn a denoising function from unpaired training data of low-dose CT (LDCT) and standard-dose CT (SDCT) images, where the denoising function is the optimal generator in the GAN framework. This paper analyzes the f-GAN objective to derive a suitable generator that is optimized by minimizing a weighted sum of two losses: the Kullback-Leibler divergence between an SDCT data distribution and a generated distribution, and the $\ell_2$ loss between the LDCT image and the corresponding generated images (or denoised image). The computed generator reflects the prior belief about SDCT data distribution through training. We observed that the proposed method allows the preservation of fine anomalous features while eliminating noise. The experimental results show that the proposed deep-learning method with unpaired datasets performs comparably to a method using paired datasets. A clinical experiment was also performed to show the validity of the proposed method for noise arising in the low-dose X-ray CT. △ Less

Submitted 8 August, 2019; v1 submitted 4 March, 2019; originally announced March 2019.

Journal ref: IEEE Access, 2019

arXiv:1901.09478 [pdf, other]

doi 10.3390/e21010080

Efficient High-dimensional Quantum Key Distribution with Hybrid Encoding

Authors: Yonggi Jo, Hee Su Park, Seung-Woo Lee, Wonmin Son

Abstract: We propose a schematic setup of quantum key distribution (QKD) with an improved secret key rate based on high-dimensional quantum states. Two degrees-of-freedom of a single photon, orbital angular momentum modes, and multi-path modes, are used to encode secret key information. Its practical implementation consists of optical elements that are within the reach of current technologies such as a mult… ▽ More We propose a schematic setup of quantum key distribution (QKD) with an improved secret key rate based on high-dimensional quantum states. Two degrees-of-freedom of a single photon, orbital angular momentum modes, and multi-path modes, are used to encode secret key information. Its practical implementation consists of optical elements that are within the reach of current technologies such as a multiport interferometer. We show that the proposed feasible protocol has improved the secret key rate with much sophistication compared to the previous 2-dimensional protocol known as the detector-device-independent QKD. △ Less

Submitted 27 January, 2019; originally announced January 2019.

Comments: 10 pages, 6 figures

Journal ref: Entropy 2019, 21(1), 80

arXiv:1812.01738 [pdf, other]

Multiview Cross-supervision for Semantic Segmentation

Authors: Yuan Yao, Hyun Soo Park

Abstract: This paper presents a semi-supervised learning framework for a customized semantic segmentation task using multiview image streams. A key challenge of the customized task lies in the limited accessibility of the labeled data due to the requirement of prohibitive manual annotation effort. We hypothesize that it is possible to leverage multiview image streams that are linked through the underlying 3… ▽ More This paper presents a semi-supervised learning framework for a customized semantic segmentation task using multiview image streams. A key challenge of the customized task lies in the limited accessibility of the labeled data due to the requirement of prohibitive manual annotation effort. We hypothesize that it is possible to leverage multiview image streams that are linked through the underlying 3D geometry, which can provide an additional supervisionary signal to train a segmentation model. We formulate a new cross-supervision method using a shape belief transfer---the segmentation belief in one image is used to predict that of the other image through epipolar geometry analogous to shape-from-silhouette. The shape belief transfer provides the upper and lower bounds of the segmentation for the unlabeled data where its gap approaches asymptotically to zero as the number of the labeled views increases. We integrate this theory to design a novel network that is agnostic to camera calibration, network model, and semantic category and bypasses the intermediate process of suboptimal 3D reconstruction. We validate this network by recognizing a customized semantic category per pixel from realworld visual data including non-human species and a subject of interest in social videos where attaining large-scale annotation data is infeasible. △ Less

Submitted 4 December, 2018; originally announced December 2018.

arXiv:1812.00312 [pdf, other]

ECO: Egocentric Cognitive Map**

Authors: Jayant Sharma, Zixing Wang, Alberto Speranzon, Vijay Venkataraman, Hyun Soo Park

Abstract: We present a new method to localize a camera within a previously unseen environment perceived from an egocentric point of view. Although this is, in general, an ill-posed problem, humans can effortlessly and efficiently determine their relative location and orientation and navigate into a previously unseen environments, e.g., finding a specific item in a new grocery store. To enable such a capabil… ▽ More We present a new method to localize a camera within a previously unseen environment perceived from an egocentric point of view. Although this is, in general, an ill-posed problem, humans can effortlessly and efficiently determine their relative location and orientation and navigate into a previously unseen environments, e.g., finding a specific item in a new grocery store. To enable such a capability, we design a new egocentric representation, which we call ECO (Egocentric COgnitive map). ECO is biologically inspired, by the cognitive map that allows human navigation, and it encodes the surrounding visual semantics with respect to both distance and orientation. ECO possesses three main properties: (1) reconfigurability: complex semantics and geometry is captured via the synthesis of atomic visual representations (e.g., image patch); (2) robustness: the visual semantics are registered in a geometrically consistent way (e.g., aligning with respect to the gravity vector, frontalizing, and rescaling to canonical depth), thus enabling us to learn meaningful atomic representations; (3) adaptability: a domain adaptation framework is designed to generalize the learned representation without manual calibration. As a proof-of-concept, we use ECO to localize a camera within real-world scenes---various grocery stores---and demonstrate performance improvements when compared to existing semantic localization approaches. △ Less

Submitted 1 December, 2018; originally announced December 2018.

arXiv:1812.00281 [pdf, other]

HUMBI: A Large Multiview Dataset of Human Body Expressions

Authors: Zhixuan Yu, Jae Shin Yoon, In Kyu Lee, Prashanth Venkatesh, Jaesik Park, Jihun Yu, Hyun Soo Park

Abstract: This paper presents a new large multiview dataset called HUMBI for human body expressions with natural clothing. The goal of HUMBI is to facilitate modeling view-specific appearance and geometry of gaze, face, hand, body, and garment from assorted people. 107 synchronized HD cameras are used to capture 772 distinctive subjects across gender, ethnicity, age, and physical condition. With the multivi… ▽ More This paper presents a new large multiview dataset called HUMBI for human body expressions with natural clothing. The goal of HUMBI is to facilitate modeling view-specific appearance and geometry of gaze, face, hand, body, and garment from assorted people. 107 synchronized HD cameras are used to capture 772 distinctive subjects across gender, ethnicity, age, and physical condition. With the multiview image streams, we reconstruct high fidelity body expressions using 3D mesh models, which allows representing view-specific appearance using their canonical atlas. We demonstrate that HUMBI is highly effective in learning and reconstructing a complete human model and is complementary to the existing datasets of human body expressions with limited views and subjects such as MPII-Gaze, Multi-PIE, Human3.6M, and Panoptic Studio datasets. △ Less

Submitted 22 May, 2020; v1 submitted 1 December, 2018; originally announced December 2018.

arXiv:1811.11251 [pdf, other]

Multiview Supervision By Registration

Authors: Yilun Zhang, Hyun Soo Park

Abstract: This paper presents a semi-supervised learning framework to train a keypoint detector using multiview image streams given the limited labeled data (typically $<$4\%). We leverage the complementary relationship between multiview geometry and visual tracking to provide three types of supervisionary signals to utilize the unlabeled data: (1) keypoint detection in one view can be supervised by other v… ▽ More This paper presents a semi-supervised learning framework to train a keypoint detector using multiview image streams given the limited labeled data (typically $<$4\%). We leverage the complementary relationship between multiview geometry and visual tracking to provide three types of supervisionary signals to utilize the unlabeled data: (1) keypoint detection in one view can be supervised by other views via the epipolar geometry; (2) a keypoint moves smoothly over time where its optical flow can be used to temporally supervise consecutive image frames to each other; (3) visible keypoint in one view is likely to be visible in the adjacent view. We integrate these three signals in a differentiable fashion to design a new end-to-end neural network composed of three pathways. This design allows us to extensively use the unlabeled data to train the keypoint detector. We show that our approach outperforms existing detectors including DeepLabCut tailored to the keypoint detection of non-human species such as monkeys, dogs, and mice. △ Less

Submitted 23 March, 2019; v1 submitted 27 November, 2018; originally announced November 2018.

arXiv:1806.00104 [pdf, other]

MONET: Multiview Semi-supervised Keypoint Detection via Epipolar Divergence

Authors: Yuan Yao, Yasamin Jafarian, Hyun Soo Park

Abstract: This paper presents MONET -- an end-to-end semi-supervised learning framework for a keypoint detector using multiview image streams. In particular, we consider general subjects such as non-human species where attaining a large scale annotated dataset is challenging. While multiview geometry can be used to self-supervise the unlabeled data, integrating the geometry into learning a keypoint detector… ▽ More This paper presents MONET -- an end-to-end semi-supervised learning framework for a keypoint detector using multiview image streams. In particular, we consider general subjects such as non-human species where attaining a large scale annotated dataset is challenging. While multiview geometry can be used to self-supervise the unlabeled data, integrating the geometry into learning a keypoint detector is challenging due to representation mismatch. We address this mismatch by formulating a new differentiable representation of the epipolar constraint called epipolar divergence---a generalized distance from the epipolar lines to the corresponding keypoint distribution. Epipolar divergence characterizes when two view keypoint distributions produce zero reprojection error. We design a twin network that minimizes the epipolar divergence through stereo rectification that can significantly alleviate computational complexity and sampling aliasing in training. We demonstrate that our framework can localize customized keypoints of diverse species, e.g., humans, dogs, and monkeys. △ Less

Submitted 16 August, 2019; v1 submitted 31 May, 2018; originally announced June 2018.

arXiv:1802.02987 [pdf, ps, other]

A Generalization Method of Partitioned Activation Function for Complex Number

Authors: HyeonSeok Lee, Hyo Seon Park

Abstract: A method to convert real number partitioned activation function into complex number one is provided. The method has 4em variations; 1 has potential to get holomorphic activation, 2 has potential to conserve complex angle, and the last 1 guarantees interaction between real and imaginary parts. The method has been applied to LReLU and SELU as examples. The complex number activation function is an bu… ▽ More A method to convert real number partitioned activation function into complex number one is provided. The method has 4em variations; 1 has potential to get holomorphic activation, 2 has potential to conserve complex angle, and the last 1 guarantees interaction between real and imaginary parts. The method has been applied to LReLU and SELU as examples. The complex number activation function is an building block of complex number ANN, which has potential to properly deal with complex number problems. But the complex activation is not well established yet. Therefore, we propose a way to extend the partitioned real activation to complex number. △ Less

Submitted 8 February, 2018; originally announced February 2018.

Comments: Complex Activation Function, Holomorphic, Phase-preserving, real-complex interaction

arXiv:1712.01359 [pdf, other]

3D Semantic Trajectory Reconstruction from 3D Pixel Continuum

Authors: Jae Shin Yoon, Ziwei Li, Hyun Soo Park

Abstract: This paper presents a method to reconstruct dense semantic trajectory stream of human interactions in 3D from synchronized multiple videos. The interactions inherently introduce self-occlusion and illumination/appearance/shape changes, resulting in highly fragmented trajectory reconstruction with noisy and coarse semantic labels. Our conjecture is that among many views, there exists a set of views… ▽ More This paper presents a method to reconstruct dense semantic trajectory stream of human interactions in 3D from synchronized multiple videos. The interactions inherently introduce self-occlusion and illumination/appearance/shape changes, resulting in highly fragmented trajectory reconstruction with noisy and coarse semantic labels. Our conjecture is that among many views, there exists a set of views that can confidently recognize the visual semantic label of a 3D trajectory. We introduce a new representation called 3D semantic map---a probability distribution over the semantic labels per trajectory. We construct the 3D semantic map by reasoning about visibility and 2D recognition confidence based on view-pooling, i.e., finding the view that best represents the semantics of the trajectory. Using the 3D semantic map, we precisely infer all trajectory labels jointly by considering the affinity between long range trajectories via estimating their local rigid transformations. This inference quantitatively outperforms the baseline approaches in terms of predictive validity, representation robustness, and affinity effectiveness. We demonstrate that our algorithm can robustly compute the semantic labels of a large scale trajectory set involving real-world human interactions with object, scenes, and people. △ Less

Submitted 4 December, 2017; originally announced December 2017.

arXiv:1708.00607 [pdf, other]

doi 10.1002/mp.13199

CT sinogram-consistency learning for metal-induced beam hardening correction

Authors: Hyung Suk Park, Sung Min Lee, Hwa Pyung Kim, ** Keun Seo

Abstract: This paper proposes a sinogram consistency learning method to deal with beam-hardening related artifacts in polychromatic computerized tomography (CT). The presence of highly attenuating materials in the scan field causes an inconsistent sinogram, that does not match the range space of the Radon transform. When the mismatched data are entered into the range space during CT reconstruction, streakin… ▽ More This paper proposes a sinogram consistency learning method to deal with beam-hardening related artifacts in polychromatic computerized tomography (CT). The presence of highly attenuating materials in the scan field causes an inconsistent sinogram, that does not match the range space of the Radon transform. When the mismatched data are entered into the range space during CT reconstruction, streaking and shading artifacts are generated owing to the inherent nature of the inverse Radon transform. The proposed learning method aims to repair inconsistent sinograms by removing the primary metal-induced beam-hardening factors along the metal trace in the sinogram. Taking account of the fundamental difficulty in obtaining sufficient training data in a medical environment, the learning method is designed to use simulated training data and a patient-type specific learning model is used to simplify the learning process. The feasibility of the proposed method is investigated using a dataset, consisting of real CT scan of pelvises containing hip prostheses. The anatomical areas in training and test data are different, in order to demonstrate that the proposed method extracts the beam hardening features, selectively. The results show that our method successfully corrects sinogram inconsistency by extracting beam-hardening sources by means of deep learning. This paper proposed a deep learning method of sinogram correction for beam hardening reduction in CT for the first time. Conventional methods for beam hardening reduction are based on regularizations, and have the fundamental drawback of being not easily able to use manifold CT images, while a deep learning approach has the potential to do so. △ Less

Submitted 12 January, 2018; v1 submitted 2 August, 2017; originally announced August 2017.

Comments: 17 pages, 8 figures

arXiv:1704.00098 [pdf, other]

Customizing First Person Image Through Desired Actions

Authors: Shan Su, Jianbo Shi, Hyun Soo Park

Abstract: This paper studies a problem of inverse visual path planning: creating a visual scene from a first person action. Our conjecture is that the spatial arrangement of a first person visual scene is deployed to afford an action, and therefore, the action can be inversely used to synthesize a new scene such that the action is feasible. As a proof-of-concept, we focus on linking visual experiences induc… ▽ More This paper studies a problem of inverse visual path planning: creating a visual scene from a first person action. Our conjecture is that the spatial arrangement of a first person visual scene is deployed to afford an action, and therefore, the action can be inversely used to synthesize a new scene such that the action is feasible. As a proof-of-concept, we focus on linking visual experiences induced by walking. A key innovation of this paper is a concept of ActionTunnel---a 3D virtual tunnel along the future trajectory encoding what the wearer will visually experience as moving into the scene. This connects two distinctive first person images through similar walking paths. Our method takes a first person image with a user defined future trajectory and outputs a new image that can afford the future motion. The image is created by combining present and future ActionTunnels in 3D where the missing pixels in adjoining area are computed by a generative adversarial network. Our work can provide a travel across different first person experiences in diverse real world scenes. △ Less

Submitted 31 March, 2017; originally announced April 2017.

arXiv:1611.09464 [pdf, other]

Social Behavior Prediction from First Person Videos

Authors: Shan Su, Jung Pyo Hong, Jianbo Shi, Hyun Soo Park

Abstract: This paper presents a method to predict the future movements (location and gaze direction) of basketball players as a whole from their first person videos. The predicted behaviors reflect an individual physical space that affords to take the next actions while conforming to social behaviors by engaging to joint attention. Our key innovation is to use the 3D reconstruction of multiple first person… ▽ More This paper presents a method to predict the future movements (location and gaze direction) of basketball players as a whole from their first person videos. The predicted behaviors reflect an individual physical space that affords to take the next actions while conforming to social behaviors by engaging to joint attention. Our key innovation is to use the 3D reconstruction of multiple first person cameras to automatically annotate each other's the visual semantics of social configurations. We leverage two learning signals uniquely embedded in first person videos. Individually, a first person video records the visual semantics of a spatial and social layout around a person that allows associating with past similar situations. Collectively, first person videos follow joint attention that can link the individuals to a group. We learn the egocentric visual semantics of group movements using a Siamese neural network to retrieve future trajectories. We consolidate the retrieved trajectories from all players by maximizing a measure of social compatibility---the gaze alignment towards joint attention predicted by their social formation, where the dynamics of joint attention is learned by a long-term recurrent convolutional network. This allows us to characterize which social configuration is more plausible and predict future group trajectories. △ Less

Submitted 28 November, 2016; originally announced November 2016.

arXiv:1611.05365 [pdf, other]

Am I a Baller? Basketball Performance Assessment from First-Person Videos

Authors: Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

Abstract: This paper presents a method to assess a basketball player's performance from his/her first-person video. A key challenge lies in the fact that the evaluation metric is highly subjective and specific to a particular evaluator. We leverage the first-person camera to address this challenge. The spatiotemporal visual semantics provided by a first-person view allows us to reason about the camera weare… ▽ More This paper presents a method to assess a basketball player's performance from his/her first-person video. A key challenge lies in the fact that the evaluation metric is highly subjective and specific to a particular evaluator. We leverage the first-person camera to address this challenge. The spatiotemporal visual semantics provided by a first-person view allows us to reason about the camera wearer's actions while he/she is participating in an unscripted basketball game. Our method takes a player's first-person video and provides a player's performance measure that is specific to an evaluator's preference. To achieve this goal, we first use a convolutional LSTM network to detect atomic basketball events from first-person videos. Our network's ability to zoom-in to the salient regions addresses the issue of a severe camera wearer's head movement in first-person videos. The detected atomic events are then passed through the Gaussian mixtures to construct a highly non-linear visual spatiotemporal basketball assessment feature. Finally, we use this feature to learn a basketball assessment model from pairs of labeled first-person basketball videos, for which a basketball expert indicates, which of the two players is better. We demonstrate that despite not knowing the basketball evaluator's criterion, our model learns to accurately assess the players in real-world games. Furthermore, our model can also discover basketball events that contribute positively and negatively to a player's performance. △ Less

Submitted 2 August, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

arXiv:1611.05335 [pdf, other]

Unsupervised Learning of Important Objects from First-Person Videos

Authors: Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

Abstract: A first-person camera, placed at a person's head, captures, which objects are important to the camera wearer. Most prior methods for this task learn to detect such important objects from the manually labeled first-person data in a supervised fashion. However, important objects are strongly related to the camera wearer's internal state such as his intentions and attention, and thus, only the person… ▽ More A first-person camera, placed at a person's head, captures, which objects are important to the camera wearer. Most prior methods for this task learn to detect such important objects from the manually labeled first-person data in a supervised fashion. However, important objects are strongly related to the camera wearer's internal state such as his intentions and attention, and thus, only the person wearing the camera can provide the importance labels. Such a constraint makes the annotation process costly and limited in scalability. In this work, we show that we can detect important objects in first-person images without the supervision by the camera wearer or even third-person labelers. We formulate an important detection problem as an interplay between the 1) segmentation and 2) recognition agents. The segmentation agent first proposes a possible important object segmentation mask for each image, and then feeds it to the recognition agent, which learns to predict an important object mask using visual semantics and spatial features. We implement such an interplay between both agents via an alternating cross-pathway supervision scheme inside our proposed Visual-Spatial Network (VSN). Our VSN consists of spatial ("where") and visual ("what") pathways, one of which learns common visual semantics while the other focuses on the spatial location cues. Our unsupervised learning is accomplished via a cross-pathway supervision, where one pathway feeds its predictions to a segmentation agent, which proposes a candidate important object segmentation mask that is then used by the other pathway as a supervisory signal. We show our method's success on two different important object datasets, where our method achieves similar or better results as the supervised methods. △ Less

Submitted 2 August, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

arXiv:1603.04908 [pdf, other]

First Person Action-Object Detection with EgoNet

Authors: Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

Abstract: Unlike traditional third-person cameras mounted on robots, a first-person camera, captures a person's visual sensorimotor object interactions from up close. In this paper, we study the tight interplay between our momentary visual attention and motor action with objects from a first-person camera. We propose a concept of action-objects---the objects that capture person's conscious visual (watching… ▽ More Unlike traditional third-person cameras mounted on robots, a first-person camera, captures a person's visual sensorimotor object interactions from up close. In this paper, we study the tight interplay between our momentary visual attention and motor action with objects from a first-person camera. We propose a concept of action-objects---the objects that capture person's conscious visual (watching a TV) or tactile (taking a cup) interactions. Action-objects may be task-dependent but since many tasks share common person-object spatial configurations, action-objects exhibit a characteristic 3D spatial distance and orientation with respect to the person. We design a predictive model that detects action-objects using EgoNet, a joint two-stream network that holistically integrates visual appearance (RGB) and 3D spatial layout (depth and height) cues to predict per-pixel likelihood of action-objects. Our network also incorporates a first-person coordinate embedding, which is designed to learn a spatial distribution of the action-objects in the first-person data. We demonstrate EgoNet's predictive power, by showing that it consistently outperforms previous baseline approaches. Furthermore, EgoNet also exhibits a strong generalization ability, i.e., it predicts semantically meaningful objects in novel first-person datasets. Our method's ability to effectively detect action-objects could be used to improve robots' understanding of human-object interactions. △ Less

Submitted 10 June, 2017; v1 submitted 15 March, 2016; originally announced March 2016.

arXiv:1511.02682 [pdf, other]

Exploiting Egocentric Object Prior for 3D Saliency Detection

Authors: Gedas Bertasius, Hyun Soo Park, Jianbo Shi

Abstract: On a minute-to-minute basis people undergo numerous fluid interactions with objects that barely register on a conscious level. Recent neuroscientific research demonstrates that humans have a fixed size prior for salient objects. This suggests that a salient object in 3D undergoes a consistent transformation such that people's visual system perceives it with an approximately fixed size. This findin… ▽ More On a minute-to-minute basis people undergo numerous fluid interactions with objects that barely register on a conscious level. Recent neuroscientific research demonstrates that humans have a fixed size prior for salient objects. This suggests that a salient object in 3D undergoes a consistent transformation such that people's visual system perceives it with an approximately fixed size. This finding indicates that there exists a consistent egocentric object prior that can be characterized by shape, size, depth, and location in the first person view. In this paper, we develop an EgoObject Representation, which encodes these characteristics by incorporating shape, location, size and depth features from an egocentric RGBD image. We empirically show that this representation can accurately characterize the egocentric object prior by testing it on an egocentric RGBD dataset for three tasks: the 3D saliency detection, future saliency prediction, and interaction classification. This representation is evaluated on our new Egocentric RGBD Saliency dataset that includes various activities such as cooking, dining, and shop**. By using our EgoObject representation, we outperform previously proposed models for saliency detection (relative 30% improvement for 3D saliency detection task) on our dataset. Additionally, we demonstrate that this representation allows us to predict future salient objects based on the gaze cue and classify people's interactions with objects. △ Less

Submitted 9 November, 2015; originally announced November 2015.

arXiv:1509.02094 [pdf, other]

Future Localization from an Egocentric Depth Image

Authors: Hyun Soo Park, Yedong Niu, Jianbo Shi

Abstract: This paper presents a method for future localization: to predict a set of plausible trajectories of ego-motion given a depth image. We predict paths avoiding obstacles, between objects, even paths turning around a corner into space behind objects. As a byproduct of the predicted trajectories of ego-motion, we discover in the image the empty space occluded by foreground objects. We use no image bas… ▽ More This paper presents a method for future localization: to predict a set of plausible trajectories of ego-motion given a depth image. We predict paths avoiding obstacles, between objects, even paths turning around a corner into space behind objects. As a byproduct of the predicted trajectories of ego-motion, we discover in the image the empty space occluded by foreground objects. We use no image based features such as semantic labeling/segmentation or object detection/recognition for this algorithm. Inspired by proxemics, we represent the space around a person using an EgoSpace map, akin to an illustrated tourist map, that measures a likelihood of occlusion at the egocentric coordinate system. A future trajectory of ego-motion is modeled by a linear combination of compact trajectory bases allowing us to constrain the predicted trajectory. We learn the relationship between the EgoSpace map and trajectory from the EgoMotion dataset providing in-situ measurements of the future trajectory. A cost function that takes into account partial occlusion due to foreground objects is minimized to predict a trajectory. This cost function generates a trajectory that passes through the occluded space, which allows us to discover the empty space behind the foreground objects. We quantitatively evaluate our method to show predictive validity and apply to various real world scenes including walking, shop**, and social interactions. △ Less

Submitted 7 September, 2015; originally announced September 2015.

Comments: 9 pages

arXiv:cmp-lg/9505029 [pdf, ps]

Map** Scrambled Korean Sentences into English Using Synchronous TAGs

Authors: Hyun S. Park

Abstract: Synchronous Tree Adjoining Grammars can be used for Machine Translation. However, translating a free order language such as Korean to English is complicated. I present a mechanism to translate scrambled Korean sentences into English by combining the concepts of Multi-Component TAGs (MC-TAGs) and Synchronous TAGs (STAGs). Synchronous Tree Adjoining Grammars can be used for Machine Translation. However, translating a free order language such as Korean to English is complicated. I present a mechanism to translate scrambled Korean sentences into English by combining the concepts of Multi-Component TAGs (MC-TAGs) and Synchronous TAGs (STAGs). △ Less

Submitted 13 May, 1995; originally announced May 1995.

Comments: uuencoded compressed ps file. 3 pages. To appear ACL95

arXiv:cmp-lg/9410023 [pdf, ps]

Korean to English Translation Using Synchronous TAGs

Authors: Dania Egedi, Martha Palmer, Hyun S. Park, Aravind K. Joshi

Abstract: It is often argued that accurate machine translation requires reference to contextual knowledge for the correct treatment of linguistic phenomena such as dropped arguments and accurate lexical selection. One of the historical arguments in favor of the interlingua approach has been that, since it revolves around a deep semantic representation, it is better able to handle the types of linguistic p… ▽ More It is often argued that accurate machine translation requires reference to contextual knowledge for the correct treatment of linguistic phenomena such as dropped arguments and accurate lexical selection. One of the historical arguments in favor of the interlingua approach has been that, since it revolves around a deep semantic representation, it is better able to handle the types of linguistic phenomena that are seen as requiring a knowledge-based approach. In this paper we present an alternative approach, exemplified by a prototype system for machine translation of English and Korean which is implemented in Synchronous TAGs. This approach is essentially transfer based, and uses semantic feature unification for accurate lexical selection of polysemous verbs. The same semantic features, when combined with a discourse model which stores previously mentioned entities, can also be used for the recovery of topicalized arguments. In this paper we concentrate on the translation of Korean to English. △ Less

Submitted 24 October, 1994; originally announced October 1994.

Comments: ps file. 8 pages

Journal ref: Proceedings of AMTA 94

Showing 1–45 of 45 results for author: Park, H S