THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation

Authors: Seo Hyun Kim, Kai Tzu-iunn Ong, Taeyoon Kwon, Namyoung Kim, Keummin Ka, SeongHyeon Bae, Yohan Jo, Seung-won Hwang, Dongha Lee, Jinyoung Yeo

Abstract: Large language models (LLMs) are capable of processing lengthy dialogue histories during prolonged interaction with users without additional memory modules; however, their responses tend to overlook or incorrectly recall information from the past. In this paper, we revisit memory-augmented response generation in the era of LLMs. While prior work focuses on getting rid of outdated memories, we argue that such memories can provide contextual cues that help dialogue systems understand the development of past events and, therefore, benefit response generation. We present Theanine, a framework that augments LLMs' response generation with memory timelines -- series of memories that demonstrate the development and causality of relevant past events. Along with Theanine, we introduce TeaFarm, a counterfactual-driven question-answering pipeline addressing the limitation of G-Eval in long-term conversations. Supplementary videos of our methods and the TeaBag dataset for TeaFarm evaluation are in https://theanine-693b0.web.app/.

Submitted 16 June, 2024; originally announced June 2024.

Comments: Under Review

arXiv:2403.04787 [pdf, other]

Ever-Evolving Memory by Blending and Refining the Past

Authors: Seo Hyun Kim, Keummin Ka, Yohan Jo, Seung-won Hwang, Dongha Lee, Jinyoung Yeo

Abstract: For a human-like chatbot, constructing a long-term memory is crucial. However, current large language models often lack this capability, leading to instances of missing important user information or redundantly asking for the same information, thereby diminishing conversation quality. To effectively construct memory, it is crucial to seamlessly connect past and present information, while also possessing the ability to forget obstructive information. To address these challenges, we propose CREEM, a novel memory system for long-term conversation. Improving upon existing approaches that construct memory based solely on current sessions, CREEM blends past memories during memory formation. Additionally, we introduce a refining process to handle redundant or outdated information. Unlike traditional paradigms, we view responding and memory construction as inseparable tasks. The blending process, which creates new memories, also serves as a reasoning step for response generation by informing the connection between past and present. Through evaluation, we demonstrate that CREEM enhances both memory and response qualities in multi-session personalized dialogues.

Submitted 7 April, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: 17 pages, 4 figures, 7 tables

arXiv:2312.13822 [pdf, other]

Universal Noise Annotation: Unveiling the Impact of Noisy annotation on Object Detection

Authors: Kwangrok Ryoo, Yeonsik Jo, Seungjun Lee, Mira Kim, Ahra Jo, Seung Hwan Kim, Seungryong Kim, Soonyoung Lee

Abstract: For the object detection task with noisy labels, it is important to consider not only categorization noise, as in image classification, but also localization noise, missing annotations, and bogus bounding boxes. However, previous studies have only addressed certain types of noise (e.g., localization or categorization). In this paper, we propose Universal-Noise Annotation (UNA), a more practical setting that encompasses all types of noise that can occur in object detection, and analyze how UNA affects the performance of the detector. We analyze the development direction of previous work on detection algorithms and examine the factors that impact the robustness of detection model learning methods. We open-source the code for injecting UNA into the dataset, and all the training logs and weights are also shared.

Submitted 21 December, 2023; originally announced December 2023.

Comments: appendix and code : https://github.com/Ryoo72/UNA

arXiv:2312.12661 [pdf, other]

Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining

Authors: Bumsoo Kim, Yeonsik Jo, Jinhyung Kim, Seung Hwan Kim

Abstract: Contrastive Language-Image Pretraining has emerged as a prominent approach for training vision and text encoders with uncurated image-text pairs from the web. To enhance data-efficiency, recent efforts have introduced additional supervision terms that involve random-augmented views of the image. However, since the image augmentation process is unaware of its text counterpart, this procedure could cause various degrees of image-text misalignments during training. Prior methods either disregarded this discrepancy or introduced external models to mitigate the impact of misalignments during training. In contrast, we propose a novel metric learning approach that capitalizes on these misalignments as an additional training source, which we term "Misalign, Contrast then Distill (MCD)". Unlike previous methods that treat augmented images and their text counterparts as simple positive pairs, MCD predicts the continuous scales of misalignment caused by the augmentation. Our extensive experimental results show that our proposed MCD achieves state-of-the-art transferability in multiple classification and retrieval downstream datasets.

Submitted 19 December, 2023; originally announced December 2023.

Comments: ICCV 2023

arXiv:2312.12659 [pdf, other]

Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders

Authors: Bumsoo Kim, Jinhyung Kim, Yeonsik Jo, Seung Hwan Kim

Abstract: Recent advances in vision language pretraining (VLP) have been largely attributed to the large-scale data collected from the web. However, uncurated datasets contain weakly correlated image-text pairs, causing data inefficiency. To address the issue, knowledge distillation has been explored at the expense of extra image and text momentum encoders to generate teaching signals for misaligned image-text pairs. In this paper, our goal is to resolve the misalignment problem with an efficient distillation framework. To this end, we propose ECLIPSE: Expediting Contrastive Language-Image Pretraining with Self-distilled Encoders. ECLIPSE features a distinctive distillation architecture wherein a shared text encoder is utilized between an online image encoder and a momentum image encoder. This strategic design choice enables the distillation to operate within a unified projected space of text embedding, resulting in better performance. Based on the unified text embedding space, ECLIPSE compensates for the additional computational cost of the momentum image encoder by expediting the online image encoder. Through our extensive experiments, we validate that there is a sweet spot between expedition and distillation where the partial view from the expedited online image encoder interacts complementarily with the momentum teacher. As a result, ECLIPSE outperforms its counterparts while achieving substantial acceleration in inference speed.

Submitted 19 December, 2023; originally announced December 2023.

Comments: AAAI 2024

arXiv:2310.15263 [pdf, other]

One-hot Generalized Linear Model for Switching Brain State Discovery

Authors: Chengrui Li, Soon Ho Kim, Chris Rodgers, Hannah Choi, Anqi Wu

Abstract: Exposing meaningful and interpretable neural interactions is critical to understanding neural circuits. Inferred neural interactions from neural signals primarily reflect functional interactions. In a long experiment, subject animals may experience different stages defined by the experiment, stimuli, or behavioral states, and hence functional interactions can change over time. To model dynamically changing functional interactions, prior work employs state-switching generalized linear models with hidden Markov models (i.e., HMM-GLMs). However, we argue they lack biological plausibility, as functional interactions are shaped and confined by the underlying anatomical connectome. Here, we propose a novel prior-informed state-switching GLM. We introduce both a Gaussian prior and a one-hot prior over the GLM in each state. The priors are learnable. We will show that the learned prior should capture the state-constant interaction, shedding light on the underlying anatomical connectome and revealing more likely physical neuron interactions. The state-dependent interaction modeled by each GLM offers traceability to capture functional variations across multiple brain states. Our methods effectively recover true interaction structures in simulated data, achieve the highest predictive likelihood with real neural datasets, and render interaction structures and hidden states more interpretable when applied to real neural data.

Submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.08221 [pdf, other]

SimCKP: Simple Contrastive Learning of Keyphrase Representations

Authors: Minseok Choi, Chaeheon Gwak, Seho Kim, Si Hyeong Kim, Jaegul Choo

Abstract: Keyphrase generation (KG) aims to generate a set of summarizing words or phrases given a source document, while keyphrase extraction (KE) aims to identify them from the text. Because the search space is much smaller in KE, it is often combined with KG to predict keyphrases that may or may not exist in the corresponding document. However, current unified approaches adopt sequence labeling and maximization-based generation that primarily operate at a token level, falling short in observing and scoring keyphrases as a whole. In this work, we propose SimCKP, a simple contrastive learning framework that consists of two stages: 1) An extractor-generator that extracts keyphrases by learning context-aware phrase-level representations in a contrastive manner while also generating keyphrases that do not appear in the document; 2) A reranker that adapts scores for each generated phrase by likewise aligning their representations with the corresponding document. Experimental results on multiple benchmark datasets demonstrate the effectiveness of our proposed approach, which outperforms the state-of-the-art models by a significant margin.

Submitted 12 October, 2023; originally announced October 2023.

Comments: Accepted to Findings of EMNLP 2023

arXiv:2309.01961 [pdf, other]

NICE: CVPR 2023 Challenge on Zero-shot Image Captioning

Authors: Taehoon Kim, Pyunghwan Ahn, Sangyun Kim, Sihaeng Lee, Mark Marsden, Alessandra Sala, Seung Hwan Kim, Bohyung Han, Kyoung Mu Lee, Honglak Lee, Kyounghoon Bae, Xiangyu Wu, Yi Gao, Hailiang Zhang, Yang Yang, Weili Guo, Jianfeng Lu, Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim, Wooyoung Kang, Won Young Jhoo, Byungseok Roh, et al. (17 additional authors not shown)

Abstract: In this report, we introduce the NICE (New frontiers for zero-shot Image Captioning Evaluation) project and share the results and outcomes of the 2023 challenge. This project is designed to challenge the computer vision community to develop robust image captioning models that advance the state-of-the-art both in terms of accuracy and fairness. Through the challenge, the image captioning models were tested using a new evaluation dataset that includes a large variety of visual concepts from many domains. There was no specific training data provided for the challenge, and therefore the challenge entries were required to adapt to new types of image descriptions that had not been seen during training. This report includes information on the newly proposed NICE dataset, evaluation methods, challenge results, and technical details of top-ranking entries. We expect that the outcomes of the challenge will contribute to the improvement of AI models on various vision-language tasks.

Submitted 10 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

Comments: Tech report, project page https://nice.lgresearch.ai/

arXiv:2308.07575 [pdf, other]

Story Visualization by Online Text Augmentation with Context Memory

Authors: Daechul Ahn, Daneul Kim, Gwangmo Song, Seung Hwan Kim, Honglak Lee, Dongyeop Kang, Jonghyun Choi

Abstract: Story visualization (SV) is a challenging text-to-image generation task because of the difficulty of not only rendering visual details from the text descriptions but also encoding a long-term context across multiple sentences. While prior efforts mostly focus on generating a semantically relevant image for each sentence, encoding a context spread across the given paragraph to generate contextually convincing images (e.g., with a correct character or with a proper background of the scene) remains a challenge. To this end, we propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training for better generalization to the language variation at inference. In extensive experiments on the two popular SV benchmarks, i.e., the Pororo-SV and Flintstones-SV, the proposed method significantly outperforms the state of the art in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision with similar or less computational complexity.

Submitted 19 August, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

Comments: ICCV 2023, Project page: https://dcahn12.github.io/projects/CMOTA/

arXiv:2305.16713 [pdf, other]

ReConPatch : Contrastive Patch Representation Learning for Industrial Anomaly Detection

Authors: Jeeho Hyun, Sangyun Kim, Giyoung Jeon, Seung Hwan Kim, Kyunghoon Bae, Byung Jun Kang

Abstract: Anomaly detection is crucial to the advanced identification of product defects such as incorrect parts, misaligned components, and damages in industrial manufacturing. Due to the rare observations and unknown types of defects, anomaly detection is considered to be challenging in machine learning. To overcome this difficulty, recent approaches utilize the common visual representations pre-trained from natural image datasets and distill the relevant features. However, existing approaches still have the discrepancy between the pre-trained feature and the target data, or require the input augmentation which should be carefully designed, particularly for the industrial dataset. In this paper, we introduce ReConPatch, which constructs discriminative features for anomaly detection by training a linear modulation of patch features extracted from the pre-trained model. ReConPatch employs contrastive representation learning to collect and distribute features in a way that produces a target-oriented and easily separable representation. To address the absence of labeled pairs for the contrastive learning, we utilize two similarity measures between data representations, pairwise and contextual similarities, as pseudo-labels. Our method achieves the state-of-the-art anomaly detection performance (99.72%) for the widely used and challenging MVTec AD dataset. Additionally, we achieved a state-of-the-art anomaly detection performance (95.8%) for the BTAD dataset.

Submitted 10 January, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: Accepted on WACV 2024

arXiv:2304.01576 [pdf, other]

MESAHA-Net: Multi-Encoders based Self-Adaptive Hard Attention Network with Maximum Intensity Projections for Lung Nodule Segmentation in CT Scan

Authors: Muhammad Usman, Azka Rehman, Abdullah Shahid, Siddique Latif, Shi Sub Byon, Sung Hyun Kim, Tariq Mahmood Khan, Yeong Gil Shin

Abstract: Accurate lung nodule segmentation is crucial for early-stage lung cancer diagnosis, as it can substantially enhance patient survival rates. Computed tomography (CT) images are widely employed for early diagnosis in lung nodule analysis. However, the heterogeneity of lung nodules, size diversity, and the complexity of the surrounding environment pose challenges for developing robust nodule segmentation methods. In this study, we propose an efficient end-to-end framework, the multi-encoder-based self-adaptive hard attention network (MESAHA-Net), for precise lung nodule segmentation in CT scans. MESAHA-Net comprises three encoding paths, an attention block, and a decoder block, facilitating the integration of three types of inputs: CT slice patches, forward and backward maximum intensity projection (MIP) images, and region of interest (ROI) masks encompassing the nodule. By employing a novel adaptive hard attention mechanism, MESAHA-Net iteratively performs slice-by-slice 2D segmentation of lung nodules, focusing on the nodule region in each slice to generate 3D volumetric segmentation of lung nodules. The proposed framework has been comprehensively evaluated on the LIDC-IDRI dataset, the largest publicly available dataset for lung nodule segmentation. The results demonstrate that our approach is highly robust for various lung nodule types, outperforming previous state-of-the-art techniques in terms of segmentation accuracy and computational complexity, rendering it suitable for real-time clinical implementation.

Submitted 4 April, 2023; originally announced April 2023.

arXiv:2303.09917 [pdf, other]

Vision Transformer for Action Units Detection

Authors: Tu Vu, Van Thong Huynh, Soo Hyung Kim

Abstract: Facial Action Units detection (FAUs) represents a fine-grained classification problem that involves identifying different units on the human face, as defined by the Facial Action Coding System. In this paper, we present a simple yet efficient Vision Transformer-based approach for addressing the task of Action Units (AU) detection in the context of the Affective Behavior Analysis in-the-wild (ABAW) competition. We employ the Video Vision Transformer (ViViT) network to capture the temporal facial change in the video. Besides, to reduce the massive size of the Vision Transformer model, we replace the ViViT feature extraction layers with a CNN backbone (RegNet). Our model outperforms the baseline model of the ABAW 2023 challenge, with a notable 14% difference in result. Furthermore, the achieved results are comparable to those of the top three teams in the previous ABAW 2022 challenge.

Submitted 20 March, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

Comments: Will be updated

arXiv:2302.05811 [pdf, other]

Hierarchical control and learning of a foraging CyberOctopus

Authors: Chia-Hsien Shih, Noel Naughton, Udit Halder, Heng-Sheng Chang, Seung Hyun Kim, Rhanor Gillette, Prashant G. Mehta, Mattia Gazzola

Abstract: Inspired by the unique neurophysiology of the octopus, we propose a hierarchical framework that simplifies the coordination of multiple soft arms by decomposing control into high-level decision making, low-level motor activation, and local reflexive behaviors via sensory feedback. When evaluated in the illustrative problem of a model octopus foraging for food, this hierarchical decomposition results in significant improvements relative to end-to-end methods. Performance is achieved through a mixed-modes approach, whereby qualitatively different tasks are addressed via complementary control schemes. Here, model-free reinforcement learning is employed for high-level decision-making, while model-based energy shaping takes care of arm-level motor execution. To render the pairing computationally tenable, a novel neural-network energy shaping (NN-ES) controller is developed, achieving accurate motions with time-to-solutions 200 times faster than previous attempts. Our hierarchical framework is then successfully deployed in increasingly challenging foraging scenarios, including an arena littered with obstacles in 3D space, demonstrating the viability of our approach.

Submitted 11 February, 2023; originally announced February 2023.

Comments: 16 pages, 7 figures

arXiv:2302.02506 [pdf]

Generating Dispatching Rules for the Interrupting Swap-Allowed Blocking Job Shop Problem Using Graph Neural Network and Reinforcement Learning

Authors: Vivian W. H. Wong, Sang Hun Kim, Junyoung Park, Jinkyoo Park, Kincho H. Law

Abstract: The interrupting swap-allowed blocking job shop problem (ISBJSSP) is a complex scheduling problem that is able to model many manufacturing planning and logistics applications realistically by addressing both the lack of storage capacity and unforeseen production interruptions. Subjected to random disruptions due to machine malfunction or maintenance, industry production settings often choose to adopt dispatching rules to enable adaptive, real-time re-scheduling, rather than traditional methods that require costly re-computation on the new configuration every time the problem condition changes dynamically. To generate dispatching rules for the ISBJSSP problem, we introduce a dynamic disjunctive graph formulation characterized by nodes and edges subjected to continuous deletions and additions. This formulation enables the training of an adaptive scheduler utilizing graph neural networks and reinforcement learning. Furthermore, a simulator is developed to simulate interruption, swapping, and blocking in the ISBJSSP setting. Employing a set of reported benchmark instances, we conduct a detailed experimental study on ISBJSSP instances with a range of machine shutdown probabilities to show that the scheduling policies generated can outperform or are at least as competitive as existing dispatching rules with predetermined priority. This study shows that the ISBJSSP, which requires real-time adaptive solutions, can be scheduled efficiently with the proposed method when production interruptions occur with random machine shutdowns.

Submitted 28 September, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

Comments: 14 pages, 10 figures. Supplementary Material not included

arXiv:2212.07050 [pdf, other]

Significantly Improving Zero-Shot X-ray Pathology Classification via Fine-tuning Pre-trained Image-Text Encoders

Authors: Jongseong Jang, Daeun Kyung, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae, Edward Choi

Abstract: Deep neural networks have been successfully adopted in diverse domains including pathology classification based on medical images. However, large-scale and high-quality data to train powerful neural networks are rare in the medical domain as the labeling must be done by qualified experts. Researchers recently tackled this problem with some success by taking advantage of models pre-trained on large-scale general domain data. Specifically, researchers took contrastive image-text encoders (e.g., CLIP) and fine-tuned them with chest X-ray images and paired reports to perform zero-shot pathology classification, thus completely removing the need for pathology-annotated images to train a classification model. Existing studies, however, fine-tuned the pre-trained model with the same contrastive learning objective, and failed to exploit the multi-labeled nature of medical image-report pairs. In this paper, we propose a new fine-tuning strategy based on sentence sampling and positive pair loss relaxation for improving the downstream zero-shot pathology classification performance, which can be applied to any pre-trained contrastive image-text encoders. Our method consistently showed dramatically improved zero-shot pathology classification performance on four different chest X-ray datasets and three different pre-trained models (5.77% average AUROC increase). In particular, CLIP fine-tuned with our method performed comparably to, or marginally better than, board-certified radiologists (0.619 vs 0.625 in F1 score and 0.530 vs 0.544 in MCC) in zero-shot classification of five prominent diseases from the CheXpert dataset.

Submitted 16 March, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

arXiv:2211.06774 [pdf, other]

Large-Scale Bidirectional Training for Zero-Shot Image Captioning

Authors: Taehoon Kim, Mark Marsden, Pyunghwan Ahn, Sangyun Kim, Sihaeng Lee, Alessandra Sala, Seung Hwan Kim

Abstract: When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark which comprises high-quality datasets and an extensive set of metrics to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of the large-scale training set and model architecture is the key to achieving zero-shot image captioning.

Submitted 1 October, 2023; v1 submitted 12 November, 2022; originally announced November 2022.

Comments: Arxiv Preprint. Work in progress

arXiv:2211.03279 [pdf, other]

A Context-Aware Computational Approach for Measuring Vocal Entrainment in Dyadic Conversations

Authors: Rimita Lahiri, Md Nasir, Catherine Lord, So Hyun Kim, Shrikanth Narayanan

Abstract: Vocal entrainment is a social adaptation mechanism in human interaction, knowledge of which can offer useful insights into an individual's cognitive-behavioral characteristics. We propose a context-aware approach for measuring vocal entrainment in dyadic conversations. We use conformers (a combination of convolutional network and transformer) for capturing both short-term and long-term conversational context to model entrainment patterns in interactions across different domains. Specifically, we use cross-subject attention layers to learn intra- as well as inter-personal signals from dyadic conversations. We first validate the proposed method based on classification experiments to distinguish between real (consistent) and fake (inconsistent/shuffled) conversations. Experimental results on interactions involving individuals with Autism Spectrum Disorder also show evidence of a statistically significant association between the introduced entrainment measure and clinical scores relevant to symptoms, including across gender and age groups.

Submitted 6 November, 2022; originally announced November 2022.

arXiv:2211.00003 [pdf, other]

MEDS-Net: Self-Distilled Multi-Encoders Network with Bi-Direction Maximum Intensity projections for Lung Nodule Detection

Authors: Muhammad Usman, Azka Rehman, Abdullah Shahid, Siddique Latif, Shi Sub Byon, Byoung Dai Lee, Sung Hyun Kim, Byung il Lee, Yeong Gil Shin

Abstract: In this study, we propose a lung nodule detection scheme which fully incorporates the clinical workflow of radiologists. Particularly, we exploit Bi-Directional Maximum intensity projection (MIP) images of various thicknesses (i.e., 3, 5 and 10 mm) along with a 3D patch of CT scan, consisting of 10 adjacent slices, to feed into a self-distillation-based Multi-Encoders Network (MEDS-Net). The proposed architecture first condenses the 3D patch input to three channels by using a dense block, which consists of dense units that effectively examine the nodule presence from 2D axial slices. This condensed information, along with the forward and backward MIP images, is fed to three different encoders to learn the most meaningful representation, which is forwarded into the decoder block at various levels. At the decoder block, we employ a self-distillation mechanism by connecting the distillation block, which contains five lung nodule detectors. It helps to expedite the convergence and improves the learning ability of the proposed architecture. Finally, the proposed scheme reduces the false positives by complementing the main detector with auxiliary detectors. The proposed scheme has been rigorously evaluated on 888 scans of the LUNA16 dataset and obtained a CPM score of 93.6%. The results demonstrate that incorporating bi-directional MIP images enables MEDS-Net to effectively distinguish nodules from their surroundings, which helps to achieve sensitivities of 91.5% and 92.8% at false positive rates of 0.25 and 0.5 per scan, respectively.

Submitted 26 December, 2022; v1 submitted 30 October, 2022; originally announced November 2022.

arXiv:2210.03739 [pdf, other]

Dual-Stage Deeply Supervised Attention-based Convolutional Neural Networks for Mandibular Canal Segmentation in CBCT Scans

Authors: Azka Rehman, Muhammad Usman, Rabeea Jawaid, Amal Muhammad Saleem, Shi Sub Byon, Sung Hyun Kim, Byoung Dai Lee, Byung il Lee, Yeong Gil Shin

Abstract: Accurate segmentation of mandibular canals in lower jaws is important in dental implantology. Medical experts determine the implant position and dimensions manually from 3D CT images to avoid damaging the mandibular nerve inside the canal. In this paper, we propose a novel dual-stage deep learning-based scheme for the automatic segmentation of the mandibular canal. Particularly, we first enhance the CBCT scans by employing a novel histogram-based dynamic windowing scheme, which improves the visibility of mandibular canals. After enhancement, we design a 3D deeply supervised attention U-Net architecture for localizing the volumes of interest (VOIs), which contain the mandibular canals (i.e., left and right canals). Finally, we employ the multi-scale input residual U-Net architecture (MS-R-UNet) to accurately segment the mandibular canals using the VOIs. The proposed method has been rigorously evaluated on 500 scans. The results demonstrate that our technique outperforms current state-of-the-art methods in segmentation performance and robustness.

Submitted 2 November, 2022; v1 submitted 6 October, 2022; originally announced October 2022.

Comments: 7 Pages

arXiv:2209.13430 [pdf, other]

UniCLIP: Unified Framework for Contrastive Language-Image Pre-training

Authors: Janghyeon Lee, Jongsuk Kim, Hyounguk Shon, Bumsoo Kim, Seung Hwan Kim, Honglak Lee, Junmo Kim

Abstract: Pre-training vision-language models with contrastive objectives has shown promising results that are both scalable to large uncurated datasets and transferable to many downstream applications. Some follow-up works have aimed to improve data efficiency by adding self-supervision terms, but inter-domain (image-text) contrastive loss and intra-domain (image-image) contrastive loss are defined on individual spaces in those works, so many feasible combinations of supervision are overlooked. To overcome this issue, we propose UniCLIP, a Unified framework for Contrastive Language-Image Pre-training. UniCLIP integrates the contrastive loss of both inter-domain pairs and intra-domain pairs into a single universal space. The discrepancies that occur when integrating contrastive loss between different domains are resolved by the three key components of UniCLIP: (1) augmentation-aware feature embedding, (2) MP-NCE loss, and (3) domain dependent similarity measure. UniCLIP outperforms previous vision-language pre-training methods on various single- and multi-modality downstream tasks. In our experiments, we show that each component that comprises UniCLIP contributes well to the final performance.

Submitted 31 October, 2022; v1 submitted 27 September, 2022; originally announced September 2022.

Comments: Neural Information Processing Systems (NeurIPS) 2022

arXiv:2208.08112 [pdf, other]

DLCFT: Deep Linear Continual Fine-Tuning for General Incremental Learning

Authors: Hyounguk Shon, Janghyeon Lee, Seung Hwan Kim, Junmo Kim

Abstract: Pre-trained representation is one of the key elements in the success of modern deep learning. However, existing works on continual learning methods have mostly focused on learning models incrementally from scratch. In this paper, we explore an alternative framework to incremental learning where we continually fine-tune the model from a pre-trained representation. Our method takes advantage of linearization technique of a pre-trained neural network for simple and effective continual learning. We show that this allows us to design a linear model where quadratic parameter regularization method is placed as the optimal continual learning policy, and at the same time enjoying the high performance of neural networks. We also show that the proposed algorithm enables parameter regularization methods to be applied to class-incremental problems. Additionally, we provide a theoretical reason why the existing parameter-space regularization algorithms such as EWC underperform on neural networks trained with cross-entropy loss. We show that the proposed method can prevent forgetting while achieving high continual fine-tuning performance on image classification tasks. To show that our method can be applied to general continual learning settings, we evaluate our method in data-incremental, task-incremental, and class-incremental learning problems.

Submitted 17 August, 2022; originally announced August 2022.

Comments: European Conference on Computer Vision (ECCV) 2022

arXiv:2203.07682 [pdf, other]

Enriched CNN-Transformer Feature Aggregation Networks for Super-Resolution

Authors: **su Yoo, Taehoon Kim, Sihaeng Lee, Seung Hwan Kim, Honglak Lee, Tae Hyun Kim

Abstract: Recent transformer-based super-resolution (SR) methods have achieved promising results against conventional CNN-based methods. However, these approaches suffer from essential shortsightedness created by only utilizing the standard self-attention-based reasoning. In this paper, we introduce an effective hybrid SR network to aggregate enriched features, including local features from CNNs and long-range multi-scale dependencies captured by transformers. Specifically, our network comprises transformer and convolutional branches, which synergetically complement each representation during the restoration procedure. Furthermore, we propose a cross-scale token attention module, allowing the transformer branch to exploit the informative relationships among tokens across different scales efficiently. Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.

Submitted 20 October, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

Comments: WACV 2023

arXiv:2202.07741 [pdf, other]

Disentangling Successor Features for Coordination in Multi-agent Reinforcement Learning

Authors: Seung Hyun Kim, Neale Van Stralen, Girish Chowdhary, Huy T. Tran

Abstract: Multi-agent reinforcement learning (MARL) is a promising framework for solving complex tasks with many agents. However, a key challenge in MARL is defining private utility functions that ensure coordination when training decentralized agents. This challenge is especially prevalent in unstructured tasks with sparse rewards and many agents. We show that successor features can help address this challenge by disentangling an individual agent's impact on the global value function from that of all other agents. We use this disentanglement to compactly represent private utilities that support stable training of decentralized agents in unstructured tasks. We implement our approach using a centralized training, decentralized execution architecture and test it in a variety of multi-agent environments. Our results show improved performance and training time relative to existing methods and suggest that disentanglement of successor features offers a promising approach to coordination in MARL.

Submitted 15 February, 2022; originally announced February 2022.

Comments: The paper is accepted in AAMAS 2022 (International Conference on Autonomous Agents and Multiagent Systems)

arXiv:2112.00343 [pdf, other]

Camera Motion Agnostic 3D Human Pose Estimation

Authors: Seong Hyun Kim, Sunwon Jeong, Sungbum Park, Ju Yong Chang

Abstract: Although the performance of 3D human pose and shape estimation methods has improved significantly in recent years, existing approaches typically generate 3D poses defined in a camera- or human-centered coordinate system. This makes it difficult to estimate a person's pure pose and motion in the world coordinate system for a video captured using a moving camera. To address this issue, this paper presents a camera motion agnostic approach for predicting 3D human pose and mesh defined in the world coordinate system. The core idea of the proposed approach is to estimate the difference between two adjacent global poses (i.e., global motion), which is invariant to the choice of coordinate system, instead of the global pose coupled to the camera motion. To this end, we propose a network based on bidirectional gated recurrent units (GRUs), called the global motion regressor (GMR), that predicts the global motion sequence from the local pose sequence consisting of relative rotations of joints. We use 3DPW and synthetic datasets, which are constructed in a moving-camera environment, for evaluation. We conduct extensive experiments and prove the effectiveness of the proposed method empirically. Code and datasets are available at https://github.com/seonghyunkim1212/GMR

Submitted 1 December, 2021; originally announced December 2021.
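
Note on the invariance claim in the abstract above: it can be checked in one line under a simplifying assumption that is not taken from the paper. Suppose the global orientation at frame t is a rotation matrix R_t, and a change of world coordinate system acts as left-multiplication by a fixed rotation Q. The relative rotation between adjacent frames is then unchanged:

```latex
\[
R'_t = Q R_t
\quad\Longrightarrow\quad
{R'_t}^{\top} R'_{t+1}
= R_t^{\top} Q^{\top} Q \, R_{t+1}
= R_t^{\top} R_{t+1},
\qquad \text{since } Q^{\top} Q = I .
\]
```

The paper's actual parameterization of global motion may differ; this only illustrates why a frame-to-frame difference can be independent of the chosen coordinate system.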

arXiv:2111.11133 [pdf, other]

L-Verse: Bidirectional Generation Between Image and Text

Authors: Taehoon Kim, Gwangmo Song, Sihaeng Lee, Sangyun Kim, Yewon Seo, Soonyoung Lee, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae

Abstract: Far beyond learning long-range interactions of natural language, transformers are becoming the de-facto standard for many vision tasks with their power and scalability. Especially with cross-modal tasks between image and text, vector quantized variational autoencoders (VQ-VAEs) are widely used to make a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of feature-augmented variational autoencoder (AugVAE) and bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. Our AugVAE shows the state-of-the-art reconstruction performance on ImageNet1K validation set, along with the robustness to unseen images in the wild. Unlike other models, BiART can distinguish between image (or text) as a conditional reference and a generation target. L-Verse can be directly used for image-to-text or text-to-image generation without any finetuning or extra object detection framework. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of L-Verse architecture on Conceptual Captions and present the initial result of bidirectional vision-language representation learning on general domain.

Submitted 6 April, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

Comments: Accepted to CVPR 2022 as Oral Presentation (18 pages, 14 figures, 4 tables)

arXiv:2110.14874 [pdf, other]

Sayer: Using Implicit Feedback to Optimize System Policies

Authors: Mathias Lécuyer, Sang Hoon Kim, Mihir Nanavati, Junchen Jiang, Siddhartha Sen, Amit Sharma, Aleksandrs Slivkins

Abstract: We observe that many system policies that make threshold decisions involving a resource (e.g., time, memory, cores) naturally reveal additional, or implicit feedback. For example, if a system waits X min for an event to occur, then it automatically learns what would have happened if it waited <X min, because time has a cumulative property. This feedback tells us about alternative decisions, and can be used to improve the system policy. However, leveraging implicit feedback is difficult because it tends to be one-sided or incomplete, and may depend on the outcome of the event. As a result, existing practices for using feedback, such as simply incorporating it into a data-driven model, suffer from bias. We develop a methodology, called Sayer, that leverages implicit feedback to evaluate and train new system policies. Sayer builds on two ideas from reinforcement learning -- randomized exploration and unbiased counterfactual estimators -- to leverage data collected by an existing policy to estimate the performance of new candidate policies, without actually deploying those policies. Sayer uses implicit exploration and implicit data augmentation to generate implicit feedback in an unbiased form, which is then used by an implicit counterfactual estimator to evaluate and train new policies. The key idea underlying these techniques is to assign implicit probabilities to decisions that are not actually taken but whose feedback can be inferred; these probabilities are carefully calculated to ensure statistical unbiasedness. We apply Sayer to two production scenarios in Azure, and show that it can evaluate arbitrary policies accurately, and train new policies that outperform the production policies.

Submitted 28 October, 2021; originally announced October 2021.
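
Note on the waiting example in the abstract above: the cumulative property of time can be made concrete with a small sketch. This is only an illustration of the observation, not Sayer's estimator; the function name `implicit_feedback` and the integer-minute thresholds are assumptions made for the example.

```python
from typing import Dict, Optional


def implicit_feedback(wait_limit: int, event_time: Optional[int]) -> Dict[int, bool]:
    """Outcomes revealed for every threshold up to the one actually used.

    wait_limit: how many minutes the system actually waited (hypothetical unit).
    event_time: minute at which the event occurred, or None if it was not
                observed within wait_limit.
    Returns a map threshold -> "would the event have been seen by then?".
    Thresholds above wait_limit are absent: that feedback is not revealed,
    which is the one-sidedness the abstract mentions.
    """
    outcomes = {}
    for threshold in range(1, wait_limit + 1):
        # Waiting wait_limit minutes also tells us what any shorter wait would have seen.
        outcomes[threshold] = event_time is not None and event_time <= threshold
    return outcomes


if __name__ == "__main__":
    # The system waited 5 minutes and the event arrived at minute 3.
    print(implicit_feedback(5, 3))
    # The system waited 3 minutes and saw nothing; longer waits remain unknown.
    print(implicit_feedback(3, None))
```

A single logged decision therefore yields feedback for several alternative thresholds, but never for thresholds longer than the wait that was actually taken.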

arXiv:2109.08372 [pdf, other]

A physics-informed, vision-based method to reconstruct all deformation modes in slender bodies

Authors: Seung Hyun Kim, Heng-Sheng Chang, Chia-Hsien Shih, Naveen Kumar Uppalapati, Udit Halder, Girish Krishnan, Prashant G. Mehta, Mattia Gazzola

Abstract: This paper is concerned with the problem of estimating (interpolating and smoothing) the shape (pose and the six modes of deformation) of a slender flexible body from multiple camera measurements. This problem is important in both biology, where slender, soft, and elastic structures are ubiquitously encountered across species, and in engineering, particularly in the area of soft robotics. The proposed mathematical formulation for shape estimation is physics-informed, based on the use of the special Cosserat rod theory whose equations encode slender body mechanics in the presence of bending, shearing, twisting and stretching. The approach is used to derive numerical algorithms which are experimentally demonstrated for fiber reinforced and cable-driven soft robot arms. These experimental demonstrations show that the methodology is accurate (<5 mm error, three times less than the arm diameter) and robust to noise and uncertainties.

Submitted 17 September, 2021; originally announced September 2021.

Comments: This work has been submitted to the IEEE RA-L with ICRA 2022 for possible publication. Copyright may be transferred without notice. For associated data and code, see https://github.com/GazzolaLab/BR2-vision-based-smoothing

arXiv:2106.01086 [pdf, other]

doi 10.1080/00207543.2020.1870013

Learning to schedule job-shop problems: Representation and policy learning using graph neural network and reinforcement learning

Authors: Junyoung Park, Jaehyeong Chun, Sang Hun Kim, Youngkook Kim, **kyoo Park

Abstract: We propose a framework to learn to schedule a job-shop problem (JSSP) using a graph neural network (GNN) and reinforcement learning (RL). We formulate the scheduling process of JSSP as a sequential decision-making problem with graph representation of the state to consider the structure of JSSP. In solving the formulated problem, the proposed framework employs a GNN to learn node features that embed the spatial structure of the JSSP represented as a graph (representation learning) and derive the optimum scheduling policy that maps the embedded node features to the best scheduling action (policy learning). We employ a Proximal Policy Optimization (PPO) based RL strategy to train these two modules in an end-to-end fashion. We empirically demonstrate that the GNN scheduler, due to its superb generalization capability, outperforms practically favored dispatching rules and RL-based schedulers on various benchmark JSSP. We also confirmed that the proposed framework learns a transferable scheduling policy that can be employed to schedule a completely new JSSP (in terms of size and parameters) without further training.

Submitted 2 June, 2021; originally announced June 2021.

Comments: 16 pages, 8 figures

Journal ref: International Journal of Production Research, Volume 59, 2021 - Issue 11, Pages 3360-3377

arXiv:2008.02043 [pdf, other]

Learning Boost by Exploiting the Auxiliary Task in Multi-task Domain

Authors: Jonghwa Yim, Sang Hwan Kim

Abstract: Learning two tasks in a single shared function has some benefits. Firstly, by acquiring information from the second task, the shared function leverages useful information that could have been neglected or underestimated in the first task. Secondly, it helps to generalize the function that can be learned using generally applicable information for both tasks. To fully enjoy these benefits, Multi-task Learning (MTL) has long been researched in various domains such as computer vision, language understanding, and speech synthesis. While MTL benefits from the positive transfer of information from multiple tasks, in a real environment, tasks inevitably have a conflict between them during the learning phase, called negative transfer. The negative transfer hampers the function from achieving optimality and degrades the performance. To solve the problem of the task conflict, previous works only suggested partial solutions that are not fundamental, but ad-hoc. A common approach is using a weighted sum of losses. The weights are adjusted to induce positive transfer. Paradoxically, this kind of solution acknowledges the problem of negative transfer and cannot remove it unless the weight of the task is set to zero. Therefore, these previous methods had limited success. In this paper, we introduce a novel approach that can drive positive transfer and suppress negative transfer by leveraging class-wise weights in the learning process. The weights act as an arbitrator of the fundamental unit of information to determine its positive or negative status to the main task.

Submitted 5 August, 2020; originally announced August 2020.

arXiv:2007.09635 [pdf, other]

Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization

Authors: Monisankha Pal, Manoj Kumar, Raghuveer Peri, Tae ** Park, So Hyun Kim, Catherine Lord, Somer Bishop, Shrikanth Narayanan

Abstract: The performance of most speaker diarization systems with x-vector embeddings is both vulnerable to noisy environments and lacks domain robustness. Earlier work on speaker diarization using generative adversarial network (GAN) with an encoder network (ClusterGAN) to project input x-vectors into a latent space has shown promising performance on meeting data. In this paper, we extend the ClusterGAN network to improve diarization robustness and enable rapid generalization across various challenging domains. To this end, we fetch the pre-trained encoder from the ClusterGAN and fine-tune it by using prototypical loss (meta-ClusterGAN or MCGAN) under the meta-learning paradigm. Experiments are conducted on CALLHOME telephonic conversations, AMI meeting data, DIHARD II (dev set) which includes challenging multi-domain corpus, and two child-clinician interaction corpora (ADOS, BOSCC) related to the autism spectrum disorder domain. Extensive analyses of the experimental data are done to investigate the effectiveness of the proposed ClusterGAN and MCGAN embeddings over x-vectors. The results show that the proposed embeddings with normalized maximum eigengap spectral clustering (NME-SC) back-end consistently outperform the Kaldi state-of-the-art x-vector diarization system. Finally, we employ embedding fusion with x-vectors to provide further improvement in diarization performance. We achieve a relative diarization error rate (DER) improvement of 6.67% to 53.93% on the aforementioned datasets using the proposed fused embeddings over x-vectors. Besides, the MCGAN embeddings provide better performance in the number of speakers estimation and short speech segment diarization as compared to x-vectors and ClusterGAN in telephonic data.

Submitted 19 July, 2020; originally announced July 2020.

Comments: Submitted to IEEE/ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

arXiv:2004.01084 [pdf]

doi 10.1088/1748-9326/ab8847

Patterns of population displacement during mega-fires in California detected using Facebook Disaster Maps

Authors: Shenyue Jia, Seung Hee Kim, Son V. Nghiem, Paul Doherty, Menas Kafatos

Abstract: Facebook Disaster Maps (FBDM) is the first platform providing analysis-ready population change products derived from crowdsourced data targeting disaster relief practices. We evaluate the representativeness of FBDM data using the Mann-Kendall test and emerging hot and cold spots in an anomaly analysis to reveal the trend, magnitude, and agglomeration of population displacement during the Mendocino Complex and Woolsey fires in California, USA. Our results show that the distribution of FBDM pre-crisis users fits well with the total population from different sources. Due to usage habits, the elder population is underrepresented in FBDM data. During the two mega-fires in California, FBDM data effectively captured the temporal change of population arising from the placing and lifting of evacuation orders. Coupled with monotonic trends, the fall and rise of cold and hot spots of population revealed the areas with the greatest population drop and potential places to house the displaced residents. A comparison between the Mendocino Complex and Woolsey fires indicates that a densely populated region can be evacuated faster than a sparsely populated one, possibly due to the better access to transportation. In sparsely populated fire-prone areas, resources should be prioritized to move people to shelters as the displaced residents do not have many alternative options, while their counterparts in densely populated areas can utilize their social connections to seek temporary stay at nearby locations during an evacuation. Integrated with an assessment on underrepresented communities, FBDM data and the derivatives can provide much needed information of near real-time population displacement for crisis response and disaster relief. As applications and data generation mature, FBDM will harness crowdsourced data and aid first responder decision-making.

Submitted 2 April, 2020; originally announced April 2020.

Comments: 16 pages with supplemental information

arXiv:1912.13335 [pdf, other]

Volumetric Lung Nodule Segmentation using Adaptive ROI with Multi-View Residual Learning

Authors: Muhammad Usman, Byoung-Dai Lee, Shi Sub Byon, Sung Hyun Kim, Byung-il Lee

Abstract: Accurate quantification of pulmonary nodules can greatly assist the early diagnosis of lung cancer, which can enhance patient survival possibilities. A number of nodule segmentation techniques have been proposed; however, all of the existing techniques rely on radiologist 3-D volume of interest (VOI) input or use a constant region of interest (ROI) and only investigate the presence of nodule voxels within the given VOI. Such approaches prevent the solutions from investigating the nodule presence outside the given VOI and also include redundant structures in the VOI, which may lead to inaccurate nodule segmentation. In this work, a novel semi-automated approach for 3-D segmentation of nodules in volumetric computerized tomography (CT) lung scans has been proposed. The proposed technique can be divided into two stages. In the first stage, it takes a 2-D ROI containing the nodule as input and performs patch-wise investigation along the axial axis with a novel adaptive ROI strategy. The adaptive ROI algorithm enables the solution to dynamically select the ROI for the surrounding slices to investigate the presence of the nodule using a deep residual U-Net architecture. The first stage provides the initial estimation of the nodule, which is further utilized to extract the VOI. In the second stage, the extracted VOI is further investigated along the coronal and sagittal axes with two different networks and, finally, all the estimated masks are fed into the consensus module to produce the final volumetric segmentation of the nodule. The proposed approach has been rigorously evaluated on the LIDC dataset, which is the largest publicly available dataset. The result suggests that the approach is significantly robust and accurate as compared to the previous state of the art techniques.

Submitted 3 February, 2020; v1 submitted 31 December, 2019; originally announced December 2019.

Comments: The manuscript is currently under review and copyright shall be transferred to the publisher upon acceptance

arXiv:1910.11400 [pdf, other]

Meta-learning for robust child-adult classification from speech

Authors: Nithin Rao Koluguri, Manoj Kumar, So Hyun Kim, Catherine Lord, Shrikanth Narayanan

Abstract: Computational modeling of naturalistic conversations in clinical applications has seen growing interest in the past decade. An important use-case involves child-adult interactions within the autism diagnosis and intervention domain. In this paper, we address a specific sub-problem of speaker diarization, namely child-adult speaker classification in such dyadic conversations with specified roles. Training a speaker classification system robust to speaker and channel conditions is challenging due to inherent variability in the speech within children and the adult interlocutors. In this work, we propose the use of meta-learning, in particular, prototypical networks which optimize a metric space across multiple tasks. By modeling every child-adult pair in the training set as a separate task during meta-training, we learn a representation with improved generalizability compared to conventional supervised learning. We demonstrate improvements over state-of-the-art speaker embeddings (x-vectors) under two evaluation settings: weakly supervised classification (up to 14.53% relative improvement in F1-scores) and clustering (up to 9.66% relative improvement in cluster purity). Our results show that protonets can potentially extract robust speaker embeddings for child-adult classification from speech.

Submitted 28 October, 2019; v1 submitted 24 October, 2019; originally announced October 2019.

arXiv:1910.11398 [pdf, ps, other]

Speaker diarization using latent space clustering in generative adversarial network

Authors: Monisankha Pal, Manoj Kumar, Raghuveer Peri, Tae ** Park, So Hyun Kim, Catherine Lord, Somer Bishop, Shrikanth Narayanan

Abstract: In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with GAN loss, latent variable recovery loss, and a clustering-specific loss. It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker labels. We benchmark our proposed system on the AMI meeting corpus, and two child-clinician interaction corpora (ADOS and BOSCC) from the autism diagnosis domain. ADOS and BOSCC contain diagnostic and treatment outcome sessions, respectively, obtained in clinical settings for verbal children and adolescents with autism. Experimental results show that our proposed system significantly outperforms the state-of-the-art x-vector based diarization system on these databases. Further, we perform embedding fusion with x-vectors to achieve a relative DER improvement of 31%, 36% and 49% on AMI eval, ADOS and BOSCC corpora, respectively, when compared to the x-vector baseline using oracle speech segmentation.

Submitted 24 October, 2019; originally announced October 2019.

Comments: Submitted to ICASSP 2020

arXiv:1908.05007 [pdf, other]

doi 10.1109/TASE.2019.2935792

Robust Translational Force Control of Multi-Rotor UAV for Precise Acceleration Tracking

Authors: Seung Jae Lee, Seung Hyun Kim, H. ** Kim

Abstract: In this paper, we introduce a translational force control method with disturbance observer (DOB)-based force disturbance cancellation for precise three-dimensional acceleration control of a multi-rotor UAV. The acceleration control of the multi-rotor requires conversion of the desired acceleration signal to the desired roll, pitch, and total thrust. But because the attitude dynamics and the thrust dynamics are different, simple kinematic signal conversion without consideration of those differences can cause serious performance degradation in acceleration tracking. Unlike most existing translational force control techniques that are based on such simple inversion, our new method allows controlling the acceleration of the multi-rotor more precisely by considering the dynamics of the multi-rotor during the kinematic inversion. By combining the DOB with the translational force system that includes the improved conversion technique, we achieve robustness with respect to the external force disturbances that hinder accurate acceleration control. mu-analysis is performed to ensure the robust stability of the overall closed-loop system, considering the combined effect of various possible model uncertainties. Both simulation and experiment are conducted to validate the proposed technique, and they confirm its satisfactory performance in tracking the desired acceleration of the multi-rotor.

Submitted 14 August, 2019; originally announced August 2019.

Comments: 11 pages, 14 figures, Accepted in the T-ASE Journal on Aug. 10th, 2019

arXiv:1807.08903 [pdf, ps, other]

Traffic-Aware Backscatter Communications in Wireless-Powered Heterogeneous Networks

Authors: Sung Hoon Kim, Dong In Kim

Abstract: With the emerging Internet-of-Things services, massive machine-to-machine (M2M) communication will be deployed on top of human-to-human (H2H) communication in the near future. Due to the coexistence of M2M and H2H communications, the performance of M2M (i.e., secondary) network depends largely on the H2H (i.e., primary) network. In this paper, we propose ambient backscatter communication for the M2M network which exploits the energy (signal) sources of the H2H network, referring to traffic applications and popularity. In order to maximize the harvesting and transmission opportunities offered by varying traffic sources of the H2H network, we adopt a Bayesian nonparametric (BNP) learning algorithm to classify traffic applications (patterns) for secondary user (SU). We then analyze the performance of SU using the stochastic geometrical approach, based on a criterion for optimal traffic pattern selection. Results are presented to validate the performance of the proposed BNP classification algorithm and the criterion, as well as the impact of traffic sources and popularity.

Submitted 24 July, 2018; originally announced July 2018.

Comments: 14 pages, 10 figures

arXiv:1710.03299 [pdf]

A Review on the Applications of Crowdsourcing in Human Pathology

Authors: Roshanak Alialy, Sasan Tavakkol, Elham Tavakkol, Amir Ghorbani-Aghbologhi, Alireza Ghaffarieh, Seon Ho Kim, Cyrus Shahabi

Abstract: The advent of digital pathology has introduced new avenues of diagnostic medicine. Among them, crowdsourcing has attracted researchers' attention in recent years, allowing them to engage thousands of untrained individuals in research and diagnosis. While there exist several articles in this regard, prior works have not collectively documented them. We, therefore, aim to review the applications of crowdsourcing in human pathology in a semi-systematic manner. We first introduce a novel method to do a systematic search of the literature. Utilizing this method, we then collect hundreds of articles and screen them against a pre-defined set of criteria. Furthermore, we crowdsource part of the screening process to examine another potential application of crowdsourcing. Finally, we review the selected articles and characterize the prior uses of crowdsourcing in pathology.

Submitted 20 November, 2017; v1 submitted 9 October, 2017; originally announced October 2017.

arXiv:1705.02009 [pdf, ps, other]

On Identifying Disaster-Related Tweets: Matching-based or Learning-based?

Authors: Hien To, Sumeet Agrawal, Seon Ho Kim, Cyrus Shahabi

Abstract: Social media such as tweets are emerging as platforms contributing to situational awareness during disasters. Information shared on Twitter by both the affected population (e.g., requesting assistance, warning) and those outside the impact zone (e.g., providing assistance) would help first responders, decision makers, and the public to understand the situation first-hand. Effective use of such information requires timely selection and analysis of tweets that are relevant to a particular disaster. Even though abundant tweets are promising as a data source, it is challenging to automatically identify relevant messages since tweets are short and unstructured, resulting in unsatisfactory classification performance of conventional learning-based approaches. Thus, we propose a simple yet effective algorithm to identify relevant messages based on matching keywords and hashtags, and provide a comparison between matching-based and learning-based approaches. To evaluate the two approaches, we put them into a framework specifically proposed for analyzing disaster-related tweets. Analysis results on eleven datasets with various disaster types show that our technique provides relevant tweets of higher quality and more interpretable results of sentiment analysis tasks when compared to the learning-based approach.

Submitted 4 May, 2017; originally announced May 2017.

arXiv:1502.06654 [pdf, ps, other]

doi 10.1109/LCOMM.2015.2398866

Variable-Length Feedback Codes under a Strict Delay Constraint

Authors: Seong Hwan Kim, Dan Keun Sung, Tho Le-Ngoc

Abstract: We study variable-length feedback (VLF) codes under a strict delay constraint to maximize their average transmission rate (ATR) in a discrete memoryless channel (DMC) while considering periodic decoding attempts. We first derive a lower bound on the maximum achievable ATR, and confirm that the VLF code can outperform non-feedback codes with a larger delay constraint. We show that for a given decoding period, as the strict delay constraint, L, increases, the gap between the ATR of the VLF code and the DMC capacity scales at most on the order of O(L^{-1}) instead of O(L^{-1/2}) for non-feedback codes as shown in Polyanskiy et al. ["Channel coding rate in the finite blocklength regime," IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307-2359, May 2010.]. We also develop an approximation indicating that, for a given L, the achievable ATR increases as the decoding period decreases.

Submitted 23 February, 2015; originally announced February 2015.

Comments: 5 pages, 1 figure, Accepted for publication in IEEE Communications Letters

arXiv:1308.6217 [pdf, other]

doi 10.2514/1.D0067

Numerical Analysis of Gate Conflict Duration and Passenger Transit Time in Airport

Authors: Sang Hyun Kim, Eric Feron

Abstract: Robustness is as important as efficiency in air transportation. All components in the air traffic system are connected to form an interactive network. So, a disturbance that occurs in one component, for example, a severe delay at an airport, can influence the entire network. Delays are easily propagated between flights through gates, but the propagation can be reduced if gate assignments are robust against stochastic delays. In this paper, we analyze gate delays and suggest an approach that involves assigning gates while making them robust against stochastic delays. We extract an example flight schedule from data source and generate schedules with increased traffic to analyze how the compact flight schedules impact the robustness of gate assignment. Simulation results show that our approach improves the robustness of gate assignment. Particularly, the robust gate assignment reduces average duration of gate conflicts by 96.3% and the number of gate conflicts by 96.7% compared to the baseline assignment. However, the robust gate assignment results in longer transit time for passengers, and a trade-off between the robustness of gate assignment and passenger transit time is presented.

Submitted 28 August, 2013; originally announced August 2013.

Comments: Submitted to Transportation Research Part B, and presented at AIAA Guidance, Navigation, and Control Conference in 2011 in part

arXiv:1306.3429 [pdf, other]

doi 10.1109/TITS.2013.2285499

Impact of Gate Assignment on Gate-Holding Departure Control Strategies

Authors: Sang Hyun Kim, Eric Feron

Abstract: Gate holding reduces congestion by reducing the number of aircraft present on the airport surface at any time, while not starving the runway. Because some departing flights are held at gates, there is a possibility that arriving flights cannot access the gates and have to wait until the gates are cleared. This is called a gate conflict. Robust gate assignment is an assignment that minimizes gate conflicts by assigning gates to aircraft to maximize the time gap between two consecutive flights at the same gate; it makes gate assignment robust, but passengers may walk longer to transfer flights. In order to simulate the airport departure process, a queuing model is introduced. The model is calibrated and validated with actual data from New York La Guardia Airport (LGA) and a U.S. hub airport. Then, the model simulates the airport departure process with the current gate assignment and a robust gate assignment to assess the impact of gate assignment on gate-holding departure control. The results show that the robust gate assignment reduces the number of gate conflicts caused by gate holding compared to the current gate assignment. Therefore, robust gate assignment can be combined with gate-holding departure control to improve operations at congested airports with limited gate resources.

Submitted 14 June, 2013; originally announced June 2013.

Comments: Submitted to IEEE Transactions on Intelligent Transportation Systems

arXiv:1306.3426 [pdf, other]

doi 10.1109/TITS.2013.2286271

Valuating Surface Surveillance Technology for Collaborative Multiple-Spot Control of Airport Departure Operations

Authors: Pierrick Burgain, Sang Hyun Kim, Eric Feron

Abstract: Airport departure operations are a source of airline delays and passenger frustration. Excessive surface traffic is a cause of increased controller and pilot workload. It is also a source of increased emissions and delays, and does not yield improved runway throughput. Leveraging the extensive past research on airport departure management, this paper explores the environmental and safety benefits that improved surveillance technologies can bring in the context of gate- or spot-release strategies. The paper shows that improved surveillance technologies can yield a 4% to 6% reduction of aircraft on the taxiway, and therefore emissions, in addition to the savings currently observed by implementing threshold strategies under evaluation at Boston Logan Airport and other busy airports during congested periods. These calculated benefits contrast sharply with our previous work, which relied on simplified airport ramp areas with a single departure spot, and where fewer environmental and economic benefits of advanced surface surveillance systems could be established. Our work is illustrated by its application to New York LaGuardia and Seattle-Tacoma airports.

Submitted 14 June, 2013; originally announced June 2013.

Comments: Submitted to IEEE Transactions on Intelligent Transportation Systems. arXiv admin note: substantial text overlap with arXiv:1102.2673

arXiv:1301.3535 [pdf, other]

doi 10.2514/1.D0079

Airport Gate Scheduling for Passengers, Aircraft, and Operation

Authors: Sang Hyun Kim, Eric Feron, John-Paul Clarke, Aude Marzuoli, Daniel Delahaye

Abstract: Passengers' experience is becoming a key metric to evaluate the air transportation system's performance. Efficient and robust tools to handle airport operations are needed along with a better understanding of passengers' interests and concerns. Among various airport operations, this paper studies airport gate scheduling for improved passengers' experience. Three objectives accounting for passengers, aircraft, and operation are presented. Trade-offs between these objectives are analyzed, and a balancing objective function is proposed. The results show that the balanced objective can improve the efficiency of traffic flow in passenger terminals and on ramps, as well as the robustness of gate operations.

Submitted 15 January, 2013; originally announced January 2013.

Comments: This paper is submitted to the tenth USA/Europe ATM 2013 seminar

Showing 1–43 of 43 results for author: Kim, S H