Search | arXiv e-print repository

What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models

Authors: Jeongrok Yu, Seong Ug Kim, Jacob Choi, Jinho D. Choi

Abstract: Bias is a disproportionate prejudice in favor of one side against another. Due to the success of transformer-based Masked Language Models (MLMs) and their impact on many NLP tasks, a systematic evaluation of bias in these models is needed more than ever. While many studies have evaluated gender bias in English MLMs, only a few works have been conducted for the task in other languages. This paper proposes a multilingual approach to estimate gender bias in MLMs from 5 languages: Chinese, English, German, Portuguese, and Spanish. Unlike previous work, our approach does not depend on parallel corpora coupled with English to detect gender bias in other languages using multilingual lexicons. Moreover, a novel model-based method is presented to generate sentence pairs for a more robust analysis of gender bias, compared to the traditional lexicon-based method. For each language, both the lexicon-based and model-based methods are applied to create two datasets respectively, which are used to evaluate gender bias in an MLM specifically trained for that language using one existing and 3 new scoring metrics. Our results show that the previous approach is data-sensitive and not stable as it does not remove contextual dependencies irrelevant to gender. In fact, the results often flip when different scoring metrics are used on the same dataset, suggesting that gender bias should be studied on a large dataset using multiple evaluation metrics for best practice.

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2403.17420 [pdf, other]

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Authors: Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim

Abstract: The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object but also distinguish between different objects and backgrounds. It enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single and multi-source. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL

Submitted 26 March, 2024; originally announced March 2024.

Comments: Accepted at CVPR 2024

arXiv:2308.09303 [pdf, other]

Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning

Authors: Jun-Yeong Moon, Keon-Hee Park, Jung Uk Kim, Gyeong-Moon Park

Abstract: Continual learning aims to learn a model from a continuous stream of data, but it mainly assumes a fixed number of data and tasks with clear task boundaries. However, in real-world scenarios, the number of input data and tasks is constantly changing in a statistical way, not a static way. Although recently introduced incremental learning scenarios having blurry task boundaries somewhat address the above issues, they still do not fully reflect the statistical properties of real-world situations because of the fixed ratio of disjoint and blurry samples. In this paper, we propose a new Stochastic incremental Blurry task boundary scenario, called Si-Blurry, which reflects the stochastic properties of the real-world. We find that there are two major challenges in the Si-Blurry scenario: (1) inter- and intra-task forgettings and (2) class imbalance problem. To alleviate them, we introduce Mask and Visual Prompt tuning (MVP). In MVP, to address the inter- and intra-task forgetting issues, we propose a novel instance-wise logit masking and contrastive visual prompt tuning loss. Both of them help our model discern the classes to be learned in the current batch. It results in consolidating the previous knowledge. In addition, to alleviate the class imbalance problem, we introduce a new gradient similarity-based focal loss and adaptive feature scaling to ease overfitting to the major classes and underfitting to the minor classes. Extensive experiments show that our proposed MVP significantly outperforms the existing state-of-the-art methods in our challenging Si-Blurry scenario.

Submitted 18 August, 2023; originally announced August 2023.

arXiv:2308.06087 [pdf, other]

Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization

Authors: Sung Jin Um, Dongjin Kim, Jung Uk Kim

Abstract: The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches only use audio as an auxiliary role to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic human behavior of iterative focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose audio-visual pair matching loss and spatial region alignment loss. By utilizing the spatial cues of audio-visual modalities and recursively focusing objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL

Submitted 17 August, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

Comments: Camera-Ready, ACM MM 2023

arXiv:2306.14289 [pdf, other]

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

Authors: Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong

Abstract: Segment Anything Model (SAM) has attracted significant attention due to its impressive zero-shot transfer performance and high versatility for numerous vision applications (like image editing with fine-grained control). Many of such applications need to be run on resource-constrained edge devices, like mobile phones. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. A naive way to train such a new SAM as in the original SAM paper leads to unsatisfactory performance, especially when limited training sources are available. We find that this is mainly caused by the coupled optimization of the image encoder and mask decoder, motivated by which we propose decoupled distillation. Concretely, we distill the knowledge from the heavy image encoder (ViT-H in the original SAM) to a lightweight image encoder, which can be automatically compatible with the mask decoder in the original SAM. The training can be completed on a single GPU within less than one day, and the resulting lightweight SAM is termed MobileSAM, which is more than 60 times smaller yet performs on par with the original SAM. For inference speed, with a single GPU, MobileSAM runs around 10ms per image: 8ms on the image encoder and 4ms on the mask decoder. With superior performance, our MobileSAM is around 5 times faster than the concurrent FastSAM and 7 times smaller, making it more suitable for mobile applications. The code for our project is provided at https://github.com/ChaoningZhang/MobileSAM, with a demo showing that MobileSAM can run relatively smoothly on CPU.

Submitted 1 July, 2023; v1 submitted 25 June, 2023; originally announced June 2023.

Comments: First work to make SAM lightweight for mobile applications

arXiv:2304.06488 [pdf, other]

One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era

Authors: Chaoning Zhang, Chenshuang Zhang, Chenghao Li, Yu Qiao, Sheng Zheng, Sumit Kumar Dam, Mengchun Zhang, Jung Uk Kim, Seong Tae Kim, Jinwoo Choi, Gyeong-Moon Park, Sung-Ho Bae, Lik-Hang Lee, Pan Hui, In So Kweon, Choong Seon Hong

Abstract: OpenAI has recently released GPT-4 (a.k.a. ChatGPT plus), which is demonstrated to be one small step for generative AI (GAI), but one giant leap for artificial general intelligence (AGI). Since its official release in November 2022, ChatGPT has quickly attracted numerous users with extensive media coverage. Such unprecedented attention has also motivated numerous researchers to investigate ChatGPT from various aspects. According to Google scholar, there are more than 500 articles with ChatGPT in their titles or mentioning it in their abstracts. Considering this, a review is urgently needed, and our work fills this gap. Overall, this work is the first to survey ChatGPT with a comprehensive review of its underlying technology, applications, and challenges. Moreover, we present an outlook on how ChatGPT might evolve to realize general-purpose AIGC (a.k.a. AI-generated content), which will be a significant milestone for the development of AGI.

Submitted 4 April, 2023; originally announced April 2023.

Comments: A Survey on ChatGPT and GPT-4, 29 pages. Feedback is appreciated ([email protected])

arXiv:2210.16788 [pdf, other]

Image-free Domain Generalization via CLIP for 3D Hand Pose Estimation

Authors: Seongyeong Lee, Hansoo Park, Dong Uk Kim, Jihyeon Kim, Muhammadjon Boboev, Seungryul Baek

Abstract: RGB-based 3D hand pose estimation has been successful for decades thanks to large-scale databases and deep learning. However, the hand pose estimation network does not operate well for hand pose images whose characteristics are far different from the training data. This is caused by various factors such as illumination, camera angles, and diverse backgrounds in the input images. Many existing methods tried to solve it by supplying additional large-scale unconstrained/target domain images to augment the data space; however, collecting such large-scale images takes a lot of labor. In this paper, we present a simple image-free domain generalization approach for the hand pose estimation framework that uses only source domain data. We try to manipulate the image features of the hand pose estimation network by adding the features from text descriptions using the CLIP (Contrastive Language-Image Pre-training) model. The manipulated image features are then exploited to train the hand pose estimation network via the contrastive learning framework. In experiments with STB and RHD datasets, our algorithm shows improved performance over the state-of-the-art domain generalization approaches.

Submitted 30 October, 2022; originally announced October 2022.

arXiv:2209.09486 [pdf, other]

Self-supervised 3D Object Detection from Monocular Pseudo-LiDAR

Authors: Curie Kim, Ue-Hwan Kim, Jong-Hwan Kim

Abstract: There have been attempts to detect 3D objects by fusion of stereo camera images and LiDAR sensor data or using LiDAR for pre-training and only monocular images for testing, but there have been fewer attempts to use only monocular image sequences due to low accuracy. In addition, when depth is predicted using only monocular images, only scale-inconsistent depth can be predicted, which is the reason why researchers are reluctant to use monocular images alone. Therefore, we propose a method for predicting absolute depth and detecting 3D objects using only monocular image sequences by enabling end-to-end learning of detection networks and depth prediction networks. As a result, the proposed method surpasses other existing methods in performance on the KITTI 3D dataset. Even when monocular image and 3D LiDAR are used together during training in an attempt to improve performance, ours exhibits the best performance compared to other methods using the same input. In addition, end-to-end learning not only improves depth prediction performance, but also enables absolute depth prediction, because our network utilizes the fact that the size of a 3D object such as a car is determined by the approximate size.

Submitted 20 September, 2022; originally announced September 2022.

Comments: Accepted for the 2022 IEEE International Conference on Multisensor Fusion and Integration (MFI 2022)

arXiv:2209.08844 [pdf, other]

A Dual-Cycled Cross-View Transformer Network for Unified Road Layout Estimation and 3D Object Detection in the Bird's-Eye-View

Authors: Curie Kim, Ue-Hwan Kim

Abstract: The bird's-eye-view (BEV) representation allows robust learning of multiple tasks for autonomous driving including road layout estimation and 3D object detection. However, contemporary methods for unified road layout estimation and 3D object detection rarely handle the class imbalance of the training dataset and multi-class learning to reduce the total number of networks required. To overcome these limitations, we propose a unified model for road layout estimation and 3D object detection inspired by the transformer architecture and the CycleGAN learning framework. The proposed model deals with the performance degradation due to the class imbalance of the dataset utilizing the focal loss and the proposed dual cycle loss. Moreover, we set up extensive learning scenarios to study the effect of multi-class learning for road layout estimation in various situations. To verify the effectiveness of the proposed model and the learning scheme, we conduct a thorough ablation study and a comparative study. The experiment results attest to the effectiveness of our model; we achieve state-of-the-art performance in both the road layout estimation and 3D object detection tasks.

Submitted 19 September, 2022; originally announced September 2022.

arXiv:2207.10324 [pdf, other]

Enhancing Generative Networks for Chest Anomaly Localization through Automatic Registration-Based Unpaired-to-Pseudo-Paired Training Data Translation

Authors: Kyungsu Kim, Seong Je Oh, Chae Yeon Lim, Ju Hwan Lee, Tae Uk Kim, Myung Jin Chung

Abstract: Image translation based on a generative adversarial network (GAN-IT) is a promising method for the precise localization of abnormal regions in chest X-ray images (AL-CXR) even without the pixel-level annotation. However, heterogeneous unpaired datasets undermine existing methods to extract key features and distinguish normal from abnormal cases, resulting in inaccurate and unstable AL-CXR. To address this problem, we propose an improved two-stage GAN-IT involving registration and data augmentation. For the first stage, we introduce an advanced deep-learning-based registration technique that virtually and reasonably converts unpaired data into paired data for learning registration maps, by sequentially utilizing linear-based global and uniform coordinate transformation and AI-based non-linear coordinate fine-tuning. This approach enables independent and complex coordinate transformation of each detailed location of the lung while recognizing the entire lung structure, thereby achieving higher registration performance with resolving inherent artifacts caused by unpaired conditions. For the second stage, we apply data augmentation to diversify anomaly locations by swapping the left and right lung regions on the uniform registered frames, further improving the performance by alleviating imbalance in data distribution showing left and right lung lesions. The proposed method is model agnostic and shows consistent AL-CXR performance improvement in representative AI models. Therefore, we believe GAN-IT for AL-CXR can be clinically implemented by using our basis framework, even if learning data are scarce or difficult for the pixel-level disease annotation.

Submitted 15 June, 2024; v1 submitted 21 July, 2022; originally announced July 2022.

arXiv:2108.09030 [pdf, other]

Type Anywhere You Want: An Introduction to Invisible Mobile Keyboard

Authors: Sahng-Min Yoo, Ue-Hwan Kim, Yewon Hwang, Jong-Hwan Kim

Abstract: Contemporary soft keyboards possess limitations: the lack of physical feedback results in an increase of typos, and the interface of soft keyboards degrades the utility of the screen. To overcome these limitations, we propose an Invisible Mobile Keyboard (IMK), which lets users freely type on the desired area without any constraints. To facilitate a data-driven IMK decoding task, we have collected the most extensive text-entry dataset (approximately 2M pairs of typing positions and the corresponding characters). Additionally, we propose our baseline decoder along with a semantic typo correction mechanism based on self-attention, which decodes such unconstrained inputs with high accuracy (96.0%). Moreover, the user study reveals that the users could type faster and felt convenience and satisfaction with IMK using our decoder. Lastly, we make the source code and the dataset public to contribute to the research community.

Submitted 20 August, 2021; originally announced August 2021.

Comments: Accepted by IJCAI 2021

arXiv:2104.09021 [pdf, other]

Writing in The Air: Unconstrained Text Recognition from Finger Movement Using Spatio-Temporal Convolution

Authors: Ue-Hwan Kim, Yewon Hwang, Sun-Kyung Lee, Jong-Hwan Kim

Abstract: In this paper, we introduce a new benchmark dataset for the challenging writing in the air (WiTA) task -- an elaborate task bridging vision and NLP. WiTA implements an intuitive and natural writing method with finger movement for human-computer interaction (HCI). Our WiTA dataset will facilitate the development of data-driven WiTA systems which thus far have displayed unsatisfactory performance -- due to lack of dataset as well as traditional statistical models they have adopted. Our dataset consists of five sub-datasets in two languages (Korean and English) and amounts to 209,926 video instances from 122 participants. We capture finger movement for WiTA with RGB cameras to ensure wide accessibility and cost-efficiency. Next, we propose spatio-temporal residual network architectures inspired by 3D ResNet. These models perform unconstrained text recognition from finger movement, guarantee a real-time operation by processing 435 and 697 decoding frames-per-second for Korean and English, respectively, and will serve as an evaluation standard. Our dataset and the source codes are available at https://github.com/Uehwan/WiTA.

Submitted 18 April, 2021; originally announced April 2021.

Comments: 10 pages, 6 figures, 6 tables

arXiv:2103.12496 [pdf, other]

Revisiting Self-Supervised Monocular Depth Estimation

Authors: Ue-Hwan Kim, Jong-Hwan Kim

Abstract: Self-supervised learning of depth map prediction and motion estimation from monocular video sequences is of vital importance -- since it realizes a broad range of tasks in robotics and autonomous vehicles. A large number of research efforts have enhanced the performance by tackling illumination variation, occlusions, and dynamic objects, to name a few. However, each of those efforts targets individual goals and endures as separate works. Moreover, most of previous works have adopted the same CNN architecture, not reaping architectural benefits. Therefore, the need to investigate the inter-dependency of the previous methods and the effect of architectural factors remains. To achieve these objectives, we revisit numerous previously proposed self-supervised methods for joint learning of depth and motion, perform a comprehensive empirical study, and unveil multiple crucial insights. Furthermore, we remarkably enhance the performance as a result of our study -- outperforming previous state-of-the-art performance.

Submitted 23 March, 2021; originally announced March 2021.

Comments: 14 pages, 3 figures, 4 tables

arXiv:2103.05368 [pdf, other]

ChangeSim: Towards End-to-End Online Scene Change Detection in Industrial Indoor Environments

Authors: Jin-Man Park, Jae-Hyuk Jang, Sahng-Min Yoo, Sun-Kyung Lee, Ue-Hwan Kim, Jong-Hwan Kim

Abstract: We present a challenging dataset, ChangeSim, aimed at online scene change detection (SCD) and more. The data is collected in photo-realistic simulation environments with the presence of environmental non-targeted variations, such as air turbidity and light condition changes, as well as targeted object changes in industrial indoor environments. By collecting data in simulations, multi-modal sensor data and precise ground truth labels are obtainable such as the RGB image, depth image, semantic segmentation, change segmentation, camera poses, and 3D reconstructions. While the previous online SCD datasets evaluate models given well-aligned image pairs, ChangeSim also provides raw unpaired sequences that present an opportunity to develop an online SCD model in an end-to-end manner, considering both pairing and detection. Experiments show that even the latest pair-based SCD models suffer from the bottleneck of the pairing process, and it gets worse when the environment contains the non-targeted variations. Our dataset is available at http://sammica.github.io/ChangeSim/.

Submitted 22 July, 2021; v1 submitted 9 March, 2021; originally announced March 2021.

Comments: Accepted to IROS 2021

arXiv:2009.10868 [pdf, other]

A Real-Time Predictive Pedestrian Collision Warning Service for Cooperative Intelligent Transportation Systems Using 3D Pose Estimation

Authors: Ue-Hwan Kim, Dongho Ka, Hwasoo Yeo, Jong-Hwan Kim

Abstract: Minimizing traffic accidents between vehicles and pedestrians is one of the primary research goals in intelligent transportation systems. To achieve the goal, pedestrian orientation recognition and prediction of pedestrian's crossing or not-crossing intention play a central role. Contemporary approaches do not guarantee satisfactory performance due to limited field-of-view, lack of generalization, and high computational complexity. To overcome these limitations, we propose a real-time predictive pedestrian collision warning service (P2CWS) for two tasks: pedestrian orientation recognition (100.53 FPS) and intention prediction (35.76 FPS). Our framework obtains satisfying generalization over multiple sites because of the proposed site-independent features. At the center of the feature extraction lies 3D pose estimation. The 3D pose analysis enables robust and accurate recognition of pedestrian orientations and prediction of intentions over multiple sites. The proposed vision framework realizes 89.3% accuracy in the behavior recognition task on the TUD dataset without any training process and 91.28% accuracy in intention prediction on our dataset achieving new state-of-the-art performance. To contribute to the corresponding research community, we make our source codes public which are available at https://github.com/Uehwan/VisionForPedestrian

Submitted 21 February, 2022; v1 submitted 22 September, 2020; originally announced September 2020.

Comments: 12 pages, 8 figures, 4 tables

arXiv:2007.08154 [pdf, other]

Comprehensive Facial Expression Synthesis using Human-Interpretable Language

Authors: Joanna Hong, Jung Uk Kim, Sangmin Lee, Yong Man Ro

Abstract: Recent advances in facial expression synthesis have shown promising results using diverse expression representations including facial action units. Facial action units for an elaborate facial expression synthesis need to be intuitively represented for human comprehension, not a numeric categorization of facial action units. To address this issue, we utilize a human-friendly approach: use of natural language, where language helps humans grasp conceptual contexts. In this paper, therefore, we propose a new facial expression synthesis model from language-based facial expression description. Our method can synthesize the facial image with detailed expressions. In addition, by effectively embedding language features on facial features, our method can control individual words to handle each part of facial movement. Extensive qualitative and quantitative evaluations were conducted to verify the effectiveness of the natural language.

Submitted 16 July, 2020; originally announced July 2020.

Comments: ICIP 2020

arXiv:2005.14390 [pdf, ps, other]

doi 10.1109/ACCESS.2021.3113186

Privacy-Protection Drone Patrol System based on Face Anonymization

Authors: Harim Lee, Myeung Un Kim, Yeongjun Kim, Hyeonsu Lyu, Hyun Jong Yang

Abstract: The robot market has been growing significantly and is expected to become 1.5 times larger in 2024 than what it was in 2019. Robots have attracted attention of security companies thanks to their mobility. These days, for security robots, unmanned aerial vehicles (UAVs) have quickly emerged by highlighting their advantage: they can even go to any hazardous place that humans cannot access. For UAVs, Drone has been a representative model and has several merits to consist of various sensors such as high-resolution cameras. Therefore, Drone is the most suitable as a mobile surveillance robot. These attractive advantages such as high-resolution cameras and mobility can be a double-edged sword, i.e., privacy infringement. Surveillance drones take videos with high-resolution to fulfill their role, however, those contain a lot of privacy sensitive information. The indiscriminate shooting is a critical issue for those who are very reluctant to be exposed. To tackle the privacy infringement, this work proposes face-anonymizing drone patrol system. In this system, one person's face in a video is transformed into a different face with facial components maintained. To construct our privacy-preserving system, we have adopted the latest generative adversarial networks frameworks and have some modifications on losses of those frameworks. Our face-anonymizing approach is evaluated with various public face-image and video dataset. Moreover, our system is evaluated with a customized drone consisting of a high-resolution camera, a companion computer, and a drone control computer. Finally, we confirm that our system can protect privacy sensitive information with our face-anonymizing algorithm while preserving the performance of robot perception, i.e., simultaneous localization and mapping.

Submitted 29 May, 2020; originally announced May 2020.

arXiv:2005.10987 [pdf, other]

Investigating Vulnerability to Adversarial Examples on Multimodal Data Fusion in Deep Learning

Authors: Youngjoon Yu, Hong Joo Lee, Byeong Cheon Kim, Jung Uk Kim, Yong Man Ro

Abstract: The success of multimodal data fusion in deep learning appears to be attributed to the use of complementary information between multiple input data. Compared to their predictive performance, relatively less attention has been devoted to the robustness of multimodal fusion models. In this paper, we investigated whether the current multimodal fusion model utilizes the complementary intelligence to defend against adversarial attacks. We applied gradient based white-box attacks such as FGSM and PGD on MFNet, which is a major multispectral (RGB, Thermal) fusion deep learning model for semantic segmentation. We verified that the multimodal fusion model optimized for better prediction is still vulnerable to adversarial attack, even if only one of the sensors is attacked. Thus, it is hard to say that existing multimodal data fusion models are fully utilizing complementary relationships between multiple modalities in terms of adversarial robustness. We believe that our observations open a new horizon for adversarial attack research on multimodal data fusion.

Submitted 21 May, 2020; originally announced May 2020.

arXiv:2005.10750 [pdf, other]

Revisiting Role of Autoencoders in Adversarial Settings

Authors: Byeong Cheon Kim, Jung Uk Kim, Hakmin Lee, Yong Man Ro

Abstract: To combat against adversarial attacks, autoencoder structure is widely used to perform denoising which is regarded as gradient masking. In this paper, we revisit the role of autoencoders in adversarial settings. Through the comprehensive experimental results and analysis, this paper presents the inherent property of adversarial robustness in the autoencoders. We also found that autoencoders may use robust features that cause inherent adversarial robustness. We believe that our discovery of the adversarial robustness of the autoencoders can provide clues to the future research and applications for adversarial defense.

Submitted 21 May, 2020; originally announced May 2020.

Comments: Accepted at ICIP 2020

arXiv:1912.08541 [pdf, other]

s-DRN: Stabilized Developmental Resonance Network

Authors: In-Ug Yoon, Ue-Hwan Kim, Jong-Hwan

Abstract: Online incremental clustering of sequentially incoming data without prior knowledge suffers from changing cluster numbers and tends to fall into local extrema according to given data order. To overcome these limitations, we propose a stabilized developmental resonance network (s-DRN). First, we analyze the instability of the conventional choice function during the node activation process and design a scalable activation function to make clustering performance stable over all input data scales. Next, we devise three criteria for the node grouping algorithm: distance, intersection over union (IoU) and size criteria. The proposed node grouping algorithm effectively excludes unnecessary clusters from incrementally created clusters, diminishes the performance dependency on vigilance parameters and makes the clustering process robust. To verify the performance of the proposed s-DRN model, comparative studies are conducted on six real-world datasets whose statistical characteristics are distinctive. The comparative studies demonstrate the proposed s-DRN outperforms baselines in terms of stability and accuracy.

Submitted 15 July, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

Comments: Under review

arXiv:1911.05939 [pdf, other]

SimVODIS: Simultaneous Visual Odometry, Object Detection, and Instance Segmentation

Authors: Ue-Hwan Kim, Se-Ho Kim, Jong-Hwan Kim

Abstract: Intelligent agents need to understand the surrounding environment to provide meaningful services to or interact intelligently with humans. The agents should perceive geometric features as well as semantic entities inherent in the environment. Contemporary methods in general provide one type of information regarding the environment at a time, making it difficult to conduct high-level tasks. Moreover, running two types of methods and associating two resultant information requires a lot of computation and complicates the software architecture. To overcome these limitations, we propose a neural architecture that simultaneously performs both geometric and semantic tasks in a single thread: simultaneous visual odometry, object detection, and instance segmentation (SimVODIS). Training SimVODIS requires unlabeled video sequences and the photometric consistency between input image frames generates self-supervision signals. The performance of SimVODIS outperforms or matches the state-of-the-art performance in pose estimation, depth map prediction, object detection, and instance segmentation tasks while completing all the tasks in a single thread. We expect SimVODIS would enhance the autonomy of intelligent agents and let the agents provide effective services to humans.

Submitted 16 November, 2019; v1 submitted 14 November, 2019; originally announced November 2019.

Comments: Submitted to TPAMI

arXiv:1908.08204 [pdf, other]

Convolutional Recurrent Reconstructive Network for Spatiotemporal Anomaly Detection in Solder Paste Inspection

Authors: Yong-Ho Yoo, Ue-Hwan Kim, Jong-Hwan Kim

Abstract: Surface mount technology (SMT) is a process for producing printed circuit boards. Solder paste printer (SPP), package mounter, and solder reflow oven are used for SMT. The board on which the solder paste is deposited from the SPP is monitored by solder paste inspector (SPI). If SPP malfunctions due to the printer defects, the SPP produces defective products, and then abnormal patterns are detected by SPI. In this paper, we propose a convolutional recurrent reconstructive network (CRRN), which decomposes the anomaly patterns generated by the printer defects, from SPI data. CRRN learns only normal data and detects anomaly pattern through reconstruction error. CRRN consists of a spatial encoder (S-Encoder), a spatiotemporal encoder and decoder (ST-Encoder-Decoder), and a spatial decoder (S-Decoder). The ST-Encoder-Decoder consists of multiple convolutional spatiotemporal memories (CSTMs) with ST-Attention mechanism. CSTM is developed to extract spatiotemporal patterns efficiently. Additionally, a spatiotemporal attention (ST-Attention) mechanism is designed to facilitate transmitting information from the ST-Encoder to the ST-Decoder, which can solve the long-term dependency problem. We demonstrate the proposed CRRN outperforms the other conventional models in anomaly detection. Moreover, we show the discriminative power of the anomaly map decomposed by the proposed CRRN through the printer defect classification.

Submitted 22 August, 2019; originally announced August 2019.

arXiv:1908.04929 [pdf]

doi 10.1109/TCYB.2019.2931042

3-D Scene Graph: A Sparse and Semantic Representation of Physical Environments for Intelligent Agents

Authors: Ue-Hwan Kim, **-Man Park, Taek-** Song, Jong-Hwan Kim

Abstract: Intelligent agents gather information and perceive semantics within the environments before taking on given tasks. The agents store the collected information in the form of environment models that compactly represent the surrounding environments. The agents, however, can only conduct limited tasks without an efficient and effective environment model. Thus, such an environment model takes a crucial role for the autonomy systems of intelligent agents. We claim the following characteristics for a versatile environment model: accuracy, applicability, usability, and scalability. Although a number of researchers have attempted to develop such models that represent environments precisely to a certain degree, they lack broad applicability, intuitive usability, and satisfactory scalability. To tackle these limitations, we propose 3-D scene graph as an environment model and the 3-D scene graph construction framework. The concise and widely used graph structure readily guarantees usability as well as scalability for 3-D scene graph. We demonstrate the accuracy and applicability of the 3-D scene graph by exhibiting the deployment of the 3-D scene graph in practical applications. Moreover, we verify the performance of the proposed 3-D scene graph and the framework by conducting a series of comprehensive experiments under various conditions.

Submitted 13 August, 2019; originally announced August 2019.

Comments: Early Access

arXiv:1907.13285 [pdf, other]

I-Keyboard: Fully Imaginary Keyboard on Touch Devices Empowered by Deep Neural Decoder

Authors: Ue-Hwan Kim, Sahng-Min Yoo, Jong-Hwan Kim

Abstract: Text-entry aims to provide an effective and efficient pathway for humans to deliver their messages to computers. With the advent of mobile computing, the recent focus of text-entry research has moved from physical keyboards to soft keyboards. Current soft keyboards, however, increase the typo rate due to lack of tactile feedback and degrade the usability of mobile devices due to their large portion on screens. To tackle these limitations, we propose a fully imaginary keyboard (I-Keyboard) with a deep neural decoder (DND). The invisibility of I-Keyboard maximizes the usability of mobile devices and DND empowered by a deep neural architecture allows users to start typing from any position on the touch screens at any angle. To the best of our knowledge, the eyes-free ten-finger typing scenario of I-Keyboard which does not necessitate both a calibration step and a predefined region for typing is first explored in this work. For the purpose of training DND, we collected the largest user data in the process of developing I-Keyboard. We verified the performance of the proposed I-Keyboard and DND by conducting a series of comprehensive simulations and experiments under various conditions. I-Keyboard showed 18.95% and 4.06% increases in typing speed (45.57 WPM) and accuracy (95.84%), respectively over the baseline.

Submitted 30 July, 2019; originally announced July 2019.

Comments: Submitted to IEEE TRANSACTIONS ON CYBERNETICS

arXiv:1907.13274 [pdf]

doi 10.1109/TCYB.2018.2882921

A Stabilized Feedback Episodic Memory (SF-EM) and Home Service Provision Framework for Robot and IoT Collaboration

Authors: Ue-Hwan Kim, Jong-Hwan Kim

Abstract: The automated home referred to as Smart Home is expected to offer fully customized services to its residents, reducing the amount of home labor, thus improving human beings' welfare. Service robots and Internet of Things (IoT) play the key roles in the development of Smart Home. The service provision with these two main components in a Smart Home environment requires: 1) learning and reasoning algorithms and 2) the integration of robot and IoT systems. Conventional computational intelligence-based learning and reasoning algorithms do not successfully manage dynamic changes in the Smart Home data, and the simple integrations fail to fully draw the synergies from the collaboration of the two systems. To tackle these limitations, we propose: 1) a stabilized memory network with a feedback mechanism which can learn user behaviors in an incremental manner and 2) a robot-IoT service provision framework for a Smart Home which utilizes the proposed memory architecture as a learning and reasoning module and exploits synergies between the robot and IoT systems. We conduct a set of comprehensive experiments under various conditions to verify the performance of the proposed memory architecture and the service provision framework and analyze the experiment results.

Submitted 30 July, 2019; originally announced July 2019.

Comments: Accepted (Early Access)

arXiv:1809.05001 [pdf]

Reductive property of new fuzzy reasoning method based on distance measure

Authors: Son-il Kwak, Gum-ju Kim, Michio Sugeno, Gwang-chol Li, Myong-suk Son, Hyok-chol Kim, Un-ha Kim

Abstract: Firstly in this paper we propose a new criterion function for evaluation of the reductive property about the fuzzy reasoning result for fuzzy modus ponens and fuzzy modus tollens. Secondly unlike fuzzy reasoning methods based on the similarity measure, we propose a new fuzzy reasoning method based on distance measure. Thirdly the reductive property for 5 fuzzy reasoning methods are checked with respect to fuzzy modus ponens and fuzzy modus tollens. Through the experiment, we show that proposed method is better than the previous methods in accordance with human thinking.

Submitted 7 September, 2018; originally announced September 2018.

arXiv:1708.03431 [pdf]

Iterative Deep Convolutional Encoder-Decoder Network for Medical Image Segmentation

Authors: Jung Uk Kim, Hak Gu Kim, Yong Man Ro

Abstract: In this paper, we propose a novel medical image segmentation using iterative deep learning framework. We have combined an iterative learning approach and an encoder-decoder network to improve segmentation results, which enables to precisely localize the regions of interest (ROIs) including complex shapes or detailed textures of medical images in an iterative manner. The proposed iterative deep convolutional encoder-decoder network consists of two main paths: convolutional encoder path and convolutional decoder path with iterative learning. Experimental results show that the proposed iterative deep learning framework is able to yield excellent medical image segmentation performances for various medical images. The effectiveness of the proposed method has been proved by comparing with other state-of-the-art medical image segmentation methods.

Submitted 11 August, 2017; originally announced August 2017.

Comments: accepted at EMBC 2017

Showing 1–27 of 27 results for author: Kim, U