Search | arXiv e-print repository

Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics

Authors: Seungbeen Lee, Seungwon Lim, Seungju Han, Giyeong Oh, Hyungjoo Chae, Jiwan Chung, Minju Kim, Beong-woo Kwak, Yeonsoo Lee, Dongha Lee, **young Yeo, Youngjae Yu

Abstract: The idea of personality in descriptive psychology, traditionally defined through observable behavior, has now been extended to Large Language Models (LLMs) to better understand their behavior. This raises a question: do LLMs exhibit distinct and consistent personality traits, similar to humans? Existing self-assessment personality tests, while applicable, lack the necessary validity and reliabilit… ▽ More The idea of personality in descriptive psychology, traditionally defined through observable behavior, has now been extended to Large Language Models (LLMs) to better understand their behavior. This raises a question: do LLMs exhibit distinct and consistent personality traits, similar to humans? Existing self-assessment personality tests, while applicable, lack the necessary validity and reliability for precise personality measurements. To address this, we introduce TRAIT, a new tool consisting of 8K multi-choice questions designed to assess the personality of LLMs with validity and reliability. TRAIT is built on the psychometrically validated human questionnaire, Big Five Inventory (BFI) and Short Dark Triad (SD-3), enhanced with the ATOMIC10X knowledge graph for testing personality in a variety of real scenarios. TRAIT overcomes the reliability and validity issues when measuring personality of LLM with self-assessment, showing the highest scores across three metrics: refusal rate, prompt sensitivity, and option order sensitivity. It reveals notable insights into personality of LLM: 1) LLMs exhibit distinct and consistent personality, which is highly influenced by their training data (i.e., data used for alignment tuning), and 2) current prompting techniques have limited effectiveness in eliciting certain traits, such as high psychopathy or low conscientiousness, suggesting the need for further research in this direction. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: Preprint; Under review

arXiv:2406.05761 [pdf, other]

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Authors: Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Gui** Son, Ye** Cho, Sheikh Shafayat, **heon Baek, Sue Hyun Park, Hyeonbin Hwang, **kyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang , et al. (7 additional authors not shown)

Abstract: As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on spec… ▽ More As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench. △ Less

Submitted 9 June, 2024; originally announced June 2024.

Comments: Work in Progress

arXiv:2404.18428 [pdf, other]

Geospatial Big Data: Survey and Challenges

Authors: Jiayang Wu, Wensheng Gan, Han-Chieh Chao, Philip S. Yu

Abstract: In recent years, geospatial big data (GBD) has obtained attention across various disciplines, categorized into big earth observation data and big human behavior data. Identifying geospatial patterns from GBD has been a vital research focus in the fields of urban management and environmental sustainability. This paper reviews the evolution of GBD mining and its integration with advanced artificial… ▽ More In recent years, geospatial big data (GBD) has obtained attention across various disciplines, categorized into big earth observation data and big human behavior data. Identifying geospatial patterns from GBD has been a vital research focus in the fields of urban management and environmental sustainability. This paper reviews the evolution of GBD mining and its integration with advanced artificial intelligence (AI) techniques. GBD consists of data generated by satellites, sensors, mobile devices, and geographical information systems, and we categorize geospatial data based on different perspectives. We outline the process of GBD mining and demonstrate how it can be incorporated into a unified framework. Additionally, we explore new technologies like large language models (LLM), the Metaverse, and knowledge graphs, and how they could make GBD even more useful. We also share examples of GBD hel** with city management and protecting the environment. Finally, we discuss the real challenges that come up when working with GBD, such as issues with data retrieval and security. Our goal is to give readers a clear view of where GBD mining stands today and where it might go next. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: IEEE JSTARS. 14 pages, 5 figures

arXiv:2404.02575 [pdf, other]

Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models

Authors: Hyungjoo Chae, Yeonghyeon Kim, Seungone Kim, Kai Tzu-iunn Ong, Beong-woo Kwak, Moohyeon Kim, Seonghwan Kim, Taeyoon Kwon, Jiwan Chung, Youngjae Yu, **young Yeo

Abstract: Algorithmic reasoning refers to the ability to understand the complex patterns behind the problem and decompose them into a sequence of reasoning steps towards the solution. Such nature of algorithmic reasoning makes it a challenge for large language models (LLMs), even though they have demonstrated promising performance in other reasoning tasks. Within this context, some recent studies use progra… ▽ More Algorithmic reasoning refers to the ability to understand the complex patterns behind the problem and decompose them into a sequence of reasoning steps towards the solution. Such nature of algorithmic reasoning makes it a challenge for large language models (LLMs), even though they have demonstrated promising performance in other reasoning tasks. Within this context, some recent studies use programming languages (e.g., Python) to express the necessary logic for solving a given instance/question (e.g., Program-of-Thought) as inspired by their strict and precise syntaxes. However, it is non-trivial to write an executable code that expresses the correct logic on the fly within a single inference call. Also, the code generated specifically for an instance cannot be reused for others, even if they are from the same task and might require identical logic to solve. This paper presents Think-and-Execute, a novel framework that decomposes the reasoning process of language models into two steps. (1) In Think, we discover a task-level logic that is shared across all instances for solving a given task and then express the logic with pseudocode; (2) In Execute, we further tailor the generated pseudocode to each instance and simulate the execution of the code. With extensive experiments on seven algorithmic reasoning tasks, we demonstrate the effectiveness of Think-and-Execute. Our approach better improves LMs' reasoning compared to several strong baselines performing instance-specific reasoning (e.g., CoT and PoT), suggesting the helpfulness of discovering task-level logic. Also, we show that compared to natural language, pseudocode can better guide the reasoning of LMs, even though they are trained to follow natural language instructions. △ Less

Submitted 3 April, 2024; originally announced April 2024.

Comments: 38 pages, 4 figures

arXiv:2404.02135 [pdf]

Enhancing Ship Classification in Optical Satellite Imagery: Integrating Convolutional Block Attention Module with ResNet for Improved Performance

Authors: Ryan Donghan Kwon, Gangjoo Robin Nam, Jisoo Tak, Junseob Shin, Hyerin Cha, Yeom Hyeok, Seung Won Lee

Abstract: This study presents an advanced Convolutional Neural Network (CNN) architecture for ship classification from optical satellite imagery, significantly enhancing performance through the integration of the Convolutional Block Attention Module (CBAM) and additional architectural innovations. Building upon the foundational ResNet50 model, we first incorporated a standard CBAM to direct the model's focu… ▽ More This study presents an advanced Convolutional Neural Network (CNN) architecture for ship classification from optical satellite imagery, significantly enhancing performance through the integration of the Convolutional Block Attention Module (CBAM) and additional architectural innovations. Building upon the foundational ResNet50 model, we first incorporated a standard CBAM to direct the model's focus towards more informative features, achieving an accuracy of 87% compared to the baseline ResNet50's 85%. Further augmentations involved multi-scale feature integration, depthwise separable convolutions, and dilated convolutions, culminating in the Enhanced ResNet Model with Improved CBAM. This model demonstrated a remarkable accuracy of 95%, with precision, recall, and f1-scores all witnessing substantial improvements across various ship classes. The bulk carrier and oil tanker classes, in particular, showcased nearly perfect precision and recall rates, underscoring the model's enhanced capability in accurately identifying and classifying ships. Attention heatmap analyses further validated the improved model's efficacy, revealing a more focused attention on relevant ship features, regardless of background complexities. These findings underscore the potential of integrating attention mechanisms and architectural innovations in CNNs for high-resolution satellite imagery classification. The study navigates through the challenges of class imbalance and computational costs, proposing future directions towards scalability and adaptability in new or rare ship type recognition. This research lays a groundwork for the application of advanced deep learning techniques in the domain of remote sensing, offering insights into scalable and efficient satellite image classification. △ Less

Submitted 8 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

arXiv:2403.02966 [pdf, other]

Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question Answering

Authors: Sungho Ko, Hyun** Cho, Hyungjoo Chae, **young Yeo, Dongha Lee

Abstract: Recent studies have investigated utilizing Knowledge Graphs (KGs) to enhance Quesetion Answering (QA) performance of Large Language Models (LLMs), yet structured KG verbalization remains challengin. Existing methods, such as triple-form or free-form textual conversion of triple-form facts, encounter several issues. These include reduced evidence density due to duplicated entities or relationships,… ▽ More Recent studies have investigated utilizing Knowledge Graphs (KGs) to enhance Quesetion Answering (QA) performance of Large Language Models (LLMs), yet structured KG verbalization remains challengin. Existing methods, such as triple-form or free-form textual conversion of triple-form facts, encounter several issues. These include reduced evidence density due to duplicated entities or relationships, and reduced evidence clarity due to an inability to emphasize crucial evidence. To address these issues, we propose EFSum, an Evidence-focused Fact Summarization framework for enhanced QA with knowledge-augmented LLMs. We optimize an open-source LLM as a fact summarizer through distillation and preference alignment. Our extensive experiments show that EFSum improves LLM's zero-shot QA performance, and it is possible to ensure both the helpfulness and faithfulness of the summary. △ Less

Submitted 19 June, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

arXiv:2402.19479 [pdf, other]

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov

Abstract: The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, a… ▽ More The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, and showing multiple actions. Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks. △ Less

Submitted 29 February, 2024; originally announced February 2024.

Comments: CVPR 2024. Project Page: https://snap-research.github.io/Panda-70M

arXiv:2402.18374 [pdf, other]

VerifiNER: Verification-augmented NER via Knowledge-grounded Reasoning with Large Language Models

Authors: Seoyeon Kim, Kwangwook Seo, Hyungjoo Chae, **young Yeo, Dongha Lee

Abstract: Recent approaches in domain-specific named entity recognition (NER), such as biomedical NER, have shown remarkable advances. However, they still lack of faithfulness, producing erroneous predictions. We assume that knowledge of entities can be useful in verifying the correctness of the predictions. Despite the usefulness of knowledge, resolving such errors with knowledge is nontrivial, since the k… ▽ More Recent approaches in domain-specific named entity recognition (NER), such as biomedical NER, have shown remarkable advances. However, they still lack of faithfulness, producing erroneous predictions. We assume that knowledge of entities can be useful in verifying the correctness of the predictions. Despite the usefulness of knowledge, resolving such errors with knowledge is nontrivial, since the knowledge itself does not directly indicate the ground-truth label. To this end, we propose VerifiNER, a post-hoc verification framework that identifies errors from existing NER methods using knowledge and revises them into more faithful predictions. Our framework leverages the reasoning abilities of large language models to adequately ground on knowledge and the contextual information in the verification process. We validate effectiveness of VerifiNER through extensive experiments on biomedical datasets. The results suggest that VerifiNER can successfully verify errors from existing models as a model-agnostic approach. Further analyses on out-of-domain and low-resource settings show the usefulness of VerifiNER on real-world applications. △ Less

Submitted 8 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: Accepted to ACL 2024

arXiv:2402.10636 [pdf, other]

PEGASUS: Personalized Generative 3D Avatars with Composable Attributes

Authors: Hyunsoo Cha, Byungjun Kim, Hanbyul Joo

Abstract: We present PEGASUS, a method for constructing a personalized generative 3D face avatar from monocular video sources. Our generative 3D avatar enables disentangled controls to selectively alter the facial attributes (e.g., hair or nose) while preserving the identity. Our approach consists of two stages: synthetic database generation and constructing a personalized generative avatar. We generate a s… ▽ More We present PEGASUS, a method for constructing a personalized generative 3D face avatar from monocular video sources. Our generative 3D avatar enables disentangled controls to selectively alter the facial attributes (e.g., hair or nose) while preserving the identity. Our approach consists of two stages: synthetic database generation and constructing a personalized generative avatar. We generate a synthetic video collection of the target identity with varying facial attributes, where the videos are synthesized by borrowing the attributes from monocular videos of diverse identities. Then, we build a person-specific generative 3D avatar that can modify its attributes continuously while preserving its identity. Through extensive experiments, we demonstrate that our method of generating a synthetic database and creating a 3D generative avatar is the most effective in preserving identity while achieving high realism. Subsequently, we introduce a zero-shot approach to achieve the same goal of generative modeling more efficiently by leveraging a previously constructed personalized generative model. △ Less

Submitted 2 April, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

Comments: Accepted at CVPR 2024, Project Page: https://snuvclab.github.io/pegasus/

arXiv:2402.00137 [pdf, other]

Multimodal Neurodegenerative Disease Subty** Explained by ChatGPT

Authors: Diego Machado Reyes, Hanqing Chao, Juergen Hahn, Li Shen, **kun Yan

Abstract: Alzheimer's disease (AD) is the most prevalent neurodegenerative disease; yet its currently available treatments are limited to stop** disease progression. Moreover, effectiveness of these treatments is not guaranteed due to the heterogenetiy of the disease. Therefore, it is essential to be able to identify the disease subtypes at a very early stage. Current data driven approaches are able to cl… ▽ More Alzheimer's disease (AD) is the most prevalent neurodegenerative disease; yet its currently available treatments are limited to stop** disease progression. Moreover, effectiveness of these treatments is not guaranteed due to the heterogenetiy of the disease. Therefore, it is essential to be able to identify the disease subtypes at a very early stage. Current data driven approaches are able to classify the subtypes at later stages of AD or related disorders, but struggle when predicting at the asymptomatic or prodromal stage. Moreover, most existing models either lack explainability behind the classification or only use a single modality for the assessment, limiting scope of its analysis. Thus, we propose a multimodal framework that uses early-stage indicators such as imaging, genetics and clinical assessments to classify AD patients into subtypes at early stages. Similarly, we build prompts and use large language models, such as ChatGPT, to interpret the findings of our model. In our framework, we propose a tri-modal co-attention mechanism (Tri-COAT) to explicitly learn the cross-modal feature associations. Our proposed model outperforms baseline models and provides insight into key cross-modal feature associations supported by known biological mechanisms. △ Less

Submitted 31 January, 2024; originally announced February 2024.

arXiv:2401.17594 [pdf, other]

5G NR Positioning Enhancements in 3GPP Release-18

Authors: Hyun-Su Cha, Gilsoo Lee, Amitava Ghosh, Matthew Baker, Sean Kelley, Juergen Hofmann

Abstract: New radio (NR) positioning in the Third Generation Partnership Project (3GPP) Release 18 (Rel-18) enables 5G-advanced networks to achieve ultra-high accuracy positioning without dependence on global navigation satellite systems (GNSS) with key enablers such as the carrier phase positioning technique, standardized for the first time in a cellular communications standard and setting a new baseline f… ▽ More New radio (NR) positioning in the Third Generation Partnership Project (3GPP) Release 18 (Rel-18) enables 5G-advanced networks to achieve ultra-high accuracy positioning without dependence on global navigation satellite systems (GNSS) with key enablers such as the carrier phase positioning technique, standardized for the first time in a cellular communications standard and setting a new baseline for future generations. In addition, Rel-18 NR supports positioning functionalities for reduced capability (RedCap) user equipment and bandwidth aggregation for positioning measurements. Moreover, the low power solutions are designed for low power high accuracy positioning use cases. Lastly, sidelink-based positioning is introduced in Rel-18. This article constitutes a comprehensive treatment of the Rel-18 NR positioning enhancements crucial for the development of next-generation networks. △ Less

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2311.07215 [pdf, other]

Coffee: Boost Your Code LLMs by Fixing Bugs with Feedback

Authors: Seungjun Moon, Hyungjoo Chae, Yongho Song, Taeyoon Kwon, Dong** Kang, Kai Tzu-iunn Ong, Seung-won Hwang, **young Yeo

Abstract: Code editing is an essential step towards reliable program synthesis to automatically correct critical errors generated from code LLMs. Recent studies have demonstrated that closed-source LLMs (i.e., ChatGPT and GPT-4) are capable of generating corrective feedback to edit erroneous inputs. However, it remains challenging for open-source code LLMs to generate feedback for code editing, since these… ▽ More Code editing is an essential step towards reliable program synthesis to automatically correct critical errors generated from code LLMs. Recent studies have demonstrated that closed-source LLMs (i.e., ChatGPT and GPT-4) are capable of generating corrective feedback to edit erroneous inputs. However, it remains challenging for open-source code LLMs to generate feedback for code editing, since these models tend to adhere to the superficial formats of feedback and provide feedback with misleading information. Hence, the focus of our work is to leverage open-source code LLMs to generate helpful feedback with correct guidance for code editing. To this end, we present Coffee, a collected dataset specifically designed for code fixing with feedback. Using this dataset, we construct CoffeePots, a framework for COde Fixing with FEEdback via Preference-Optimized Tuning and Selection. The proposed framework aims to automatically generate helpful feedback for code editing while minimizing the potential risk of superficial feedback. The combination of Coffee and CoffeePots marks a significant advancement, achieving state-of-the-art performance on HumanEvalFix benchmark. Codes and model checkpoints are publicly available at https://github.com/Lune-Blue/COFFEE. △ Less

Submitted 23 February, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: Work in progress

arXiv:2310.09343 [pdf, other]

Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents

Authors: Hyungjoo Chae, Yongho Song, Kai Tzu-iunn Ong, Taeyoon Kwon, Min** Kim, Youngjae Yu, Dongha Lee, Dongyeop Kang, **young Yeo

Abstract: Human-like chatbots necessitate the use of commonsense reasoning in order to effectively comprehend and respond to implicit information present within conversations. Achieving such coherence and informativeness in responses, however, is a non-trivial task. Even for large language models (LLMs), the task of identifying and aggregating key evidence within a single hop presents a substantial challeng… ▽ More Human-like chatbots necessitate the use of commonsense reasoning in order to effectively comprehend and respond to implicit information present within conversations. Achieving such coherence and informativeness in responses, however, is a non-trivial task. Even for large language models (LLMs), the task of identifying and aggregating key evidence within a single hop presents a substantial challenge. This complexity arises because such evidence is scattered across multiple turns in a conversation, thus necessitating integration over multiple hops. Hence, our focus is to facilitate such multi-hop reasoning over a dialogue context, namely dialogue chain-of-thought (CoT) reasoning. To this end, we propose a knowledge distillation framework that leverages LLMs as unreliable teachers and selectively distills consistent and helpful rationales via alignment filters. We further present DOCTOR, a DialOgue Chain-of-ThOught Reasoner that provides reliable CoT rationales for response generation. We conduct extensive experiments to show that enhancing dialogue agents with high-quality rationales from DOCTOR significantly improves the quality of their responses. △ Less

Submitted 22 October, 2023; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: 25 pages, 8 figures, Accepted to EMNLP 2023

arXiv:2309.12314 [pdf, other]

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

Authors: Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi, Chen, Xinggang Wang, Hongyang Chao, Han Hu

Abstract: In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic teachers' behavior of learning cross-modal feature alignment i… ▽ More In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic teachers' behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance. While aiming for comparable performance, distillation with weight inheritance can speed up the training by 1.4 - 7.8 $\times$ compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9% parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at https://aka.ms/tinyclip. △ Less

Submitted 21 September, 2023; originally announced September 2023.

Comments: Accepted By ICCV 2023

arXiv:2309.01207 [pdf, other]

Spectral Adversarial MixUp for Few-Shot Unsupervised Domain Adaptation

Authors: Jia** Zhang, Hanqing Chao, Amit Dhurandhar, Pin-Yu Chen, Ali Tajer, Yangyang Xu, **kun Yan

Abstract: Domain shift is a common problem in clinical applications, where the training images (source domain) and the test images (target domain) are under different distributions. Unsupervised Domain Adaptation (UDA) techniques have been proposed to adapt models trained in the source domain to the target domain. However, those methods require a large number of images from the target domain for model train… ▽ More Domain shift is a common problem in clinical applications, where the training images (source domain) and the test images (target domain) are under different distributions. Unsupervised Domain Adaptation (UDA) techniques have been proposed to adapt models trained in the source domain to the target domain. However, those methods require a large number of images from the target domain for model training. In this paper, we propose a novel method for Few-Shot Unsupervised Domain Adaptation (FSUDA), where only a limited number of unlabeled target domain samples are available for training. To accomplish this challenging task, first, a spectral sensitivity map is introduced to characterize the generalization weaknesses of models in the frequency domain. We then developed a Sensitivity-guided Spectral Adversarial MixUp (SAMix) method to generate target-style images to effectively suppresses the model sensitivity, which leads to improved model generalizability in the target domain. We demonstrated the proposed method and rigorously evaluated its performance on multiple tasks using several public datasets. △ Less

Submitted 3 September, 2023; originally announced September 2023.

Comments: Accepted by MICCAI 2023

arXiv:2306.12978 [pdf, other]

Rate-Splitting Multiple Access for 6G Networks: Ten Promising Scenarios and Applications

Authors: Jeonghun Park, Byungju Lee, **seok Choi, Hoon Lee, Namyoon Lee, Seok-Hwan Park, Kyoung-Jae Lee, Junil Choi, Sung Ho Chae, Sang-Woon Jeon, Kyung Sup Kwak, Bruno Clerckx, Wonjae Shin

Abstract: In the upcoming 6G era, multiple access (MA) will play an essential role in achieving high throughput performances required in a wide range of wireless applications. Since MA and interference management are closely related issues, the conventional MA techniques are limited in that they cannot provide near-optimal performance in universal interference regimes. Recently, rate-splitting multiple acce… ▽ More In the upcoming 6G era, multiple access (MA) will play an essential role in achieving high throughput performances required in a wide range of wireless applications. Since MA and interference management are closely related issues, the conventional MA techniques are limited in that they cannot provide near-optimal performance in universal interference regimes. Recently, rate-splitting multiple access (RSMA) has been gaining much attention. RSMA splits an individual message into two parts: a common part, decodable by every user, and a private part, decodable only by the intended user. Each user first decodes the common message and then decodes its private message by applying successive interference cancellation (SIC). By doing so, RSMA not only embraces the existing MA techniques as special cases but also provides significant performance gains by efficiently mitigating inter-user interference in a broad range of interference regimes. In this article, we first present the theoretical foundation of RSMA. Subsequently, we put forth four key benefits of RSMA: spectral efficiency, robustness, scalability, and flexibility. Upon this, we describe how RSMA can enable ten promising scenarios and applications along with future research directions to pave the way for 6G. △ Less

Submitted 22 June, 2023; originally announced June 2023.

Comments: 17 pages, 6 figures, submitted to IEEE Network Magazine

arXiv:2306.11731 [pdf, other]

Learning Profitable NFT Image Diffusions via Multiple Visual-Policy Guided Reinforcement Learning

Authors: Huiguo He, Tianfu Wang, Huan Yang, Jianlong Fu, Nicholas **g Yuan, Jian Yin, Hongyang Chao, Qi Zhang

Abstract: We study the task of generating profitable Non-Fungible Token (NFT) images from user-input texts. Recent advances in diffusion models have shown great potential for image generation. However, existing works can fall short in generating visually-pleasing and highly-profitable NFT images, mainly due to the lack of 1) plentiful and fine-grained visual attribute prompts for an NFT image, and 2) effect… ▽ More We study the task of generating profitable Non-Fungible Token (NFT) images from user-input texts. Recent advances in diffusion models have shown great potential for image generation. However, existing works can fall short in generating visually-pleasing and highly-profitable NFT images, mainly due to the lack of 1) plentiful and fine-grained visual attribute prompts for an NFT image, and 2) effective optimization metrics for generating high-quality NFT images. To solve these challenges, we propose a Diffusion-based generation framework with Multiple Visual-Policies as rewards (i.e., Diffusion-MVP) for NFT images. The proposed framework consists of a large language model (LLM), a diffusion-based image generator, and a series of visual rewards by design. First, the LLM enhances a basic human input (such as "panda") by generating more comprehensive NFT-style prompts that include specific visual attributes, such as "panda with Ninja style and green background." Second, the diffusion-based image generator is fine-tuned using a large-scale NFT dataset to capture fine-grained image styles and accessory compositions of popular NFT elements. Third, we further propose to utilize multiple visual-policies as optimization goals, including visual rarity levels, visual aesthetic scores, and CLIP-based text-image relevances. This design ensures that our proposed Diffusion-MVP is capable of minting NFT images with high visual quality and market value. To facilitate this research, we have collected the largest publicly available NFT image dataset to date, consisting of 1.5 million high-quality images with corresponding texts and market values. Extensive experiments including objective evaluations and user studies demonstrate that our framework can generate NFT images showing more visually engaging elements and higher market value, compared with SOTA approaches. △ Less

Submitted 17 August, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: Accepted by ACM-MM 2023

arXiv:2303.11763 [pdf, other]

Reconfigurable Intelligent Surface Aided Hybrid Beamforming: Optimal Placement and Beamforming Design

Authors: Najam Us Saqib, Shumei Hou, Sung Ho Chae, Sang-Woon Jeon

Abstract: We consider reconfigurable intelligent surface (RIS) aided sixth-generation (6G) terahertz (THz) communications for indoor environment in which a base station (BS) wishes to send independent messages to its serving users with the help of multiple RISs. For indoor environment, various obstacles such as pillars, walls, and other objects can result in no line-of-sight signal path between the BS and a… ▽ More We consider reconfigurable intelligent surface (RIS) aided sixth-generation (6G) terahertz (THz) communications for indoor environment in which a base station (BS) wishes to send independent messages to its serving users with the help of multiple RISs. For indoor environment, various obstacles such as pillars, walls, and other objects can result in no line-of-sight signal path between the BS and a user, which can significantly degrade performance. To overcome such limitation of indoor THz communication, we firstly optimize the placement of RISs to maximize the coverage area. Under the optimized RIS placement, we propose 3D hybrid beamforming at the BS and phase adjustment at RISs, which are jointly performed at the BS and RISs via codebook-based 3D beam scanning with low complexity. Numerical simulations demonstrate that the proposed scheme significantly improves the average sum rate compared to the cases of no RIS and randomly deployed RISs. It is further shown that the proposed codebook-based 3D beam scanning efficiently aligns analog beams between BS--user links or BS--RIS--user links and, as a consequence, achieves the average sum rate close to that of coherent beam alignment requiring global channel state information. △ Less

Submitted 21 March, 2023; originally announced March 2023.

Comments: This manuscript contains 18 pages and 9 figures

arXiv:2303.09597 [pdf, other]

Residual Physics Learning and System Identification for Sim-to-real Transfer of Policies on Buoyancy Assisted Legged Robots

Authors: Nitish Sontakke, Hosik Chae, Sangjoon Lee, Tianle Huang, Dennis W. Hong, Sehoon Ha

Abstract: The light and soft characteristics of Buoyancy Assisted Lightweight Legged Unit (BALLU) robots have a great potential to provide intrinsically safe interactions in environments involving humans, unlike many heavy and rigid robots. However, their unique and sensitive dynamics impose challenges to obtaining robust control policies in the real world. In this work, we demonstrate robust sim-to-real tr… ▽ More The light and soft characteristics of Buoyancy Assisted Lightweight Legged Unit (BALLU) robots have a great potential to provide intrinsically safe interactions in environments involving humans, unlike many heavy and rigid robots. However, their unique and sensitive dynamics impose challenges to obtaining robust control policies in the real world. In this work, we demonstrate robust sim-to-real transfer of control policies on the BALLU robots via system identification and our novel residual physics learning method, Environment Mimic (EnvMimic). First, we model the nonlinear dynamics of the actuators by collecting hardware data and optimizing the simulation parameters. Rather than relying on standard supervised learning formulations, we utilize deep reinforcement learning to train an external force policy to match real-world trajectories, which enables us to model residual physics with greater fidelity. We analyze the improved simulation fidelity by comparing the simulation trajectories against the real-world ones. We finally demonstrate that the improved simulator allows us to learn better walking and turning policies that can be successfully deployed on the hardware of BALLU. △ Less

Submitted 16 March, 2023; originally announced March 2023.

arXiv:2303.04508 [pdf, other]

InFusionSurf: Refining Neural RGB-D Surface Reconstruction Using Per-Frame Intrinsic Refinement and TSDF Fusion Prior Learning

Authors: Seunghwan Lee, Gwanmo Park, Hyewon Son, Jiwon Ryu, Han Joo Chae

Abstract: We introduce InFusionSurf, a novel approach to enhance the fidelity of neural radiance field (NeRF) frameworks for 3D surface reconstruction using RGB-D video frames. Building upon previous methods that have employed feature encoding to improve optimization speed, we further improve the reconstruction quality with minimal impact on optimization time by refining depth information. Our per-frame int… ▽ More We introduce InFusionSurf, a novel approach to enhance the fidelity of neural radiance field (NeRF) frameworks for 3D surface reconstruction using RGB-D video frames. Building upon previous methods that have employed feature encoding to improve optimization speed, we further improve the reconstruction quality with minimal impact on optimization time by refining depth information. Our per-frame intrinsic refinement scheme addresses frame-specific blurs caused by camera motion in each depth frame. Furthermore, InFusionSurf utilizes a classical real-time 3D surface reconstruction method, the truncated signed distance field (TSDF) Fusion, as prior knowledge to pretrain the feature grid to support reconstruction details while accelerating the training. The quantitative and qualitative experiments comparing the performances of InFusionSurf against prior work indicate that our method is capable of accurately reconstructing a scene without sacrificing optimization speed. We also demonstrate the effectiveness of our per-frame intrinsic refinement and TSDF Fusion prior learning techniques via an ablation study. △ Less

Submitted 3 September, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

arXiv:2303.03628 [pdf, other]

CoTEVer: Chain of Thought Prompting Annotation Toolkit for Explanation Verification

Authors: Seungone Kim, Se June Joo, Yul Jang, Hyungjoo Chae, **young Yeo

Abstract: Chain-of-thought (CoT) prompting enables large language models (LLMs) to solve complex reasoning tasks by generating an explanation before the final prediction. Despite it's promising ability, a critical downside of CoT prompting is that the performance is greatly affected by the factuality of the generated explanation. To improve the correctness of the explanations, fine-tuning language models wi… ▽ More Chain-of-thought (CoT) prompting enables large language models (LLMs) to solve complex reasoning tasks by generating an explanation before the final prediction. Despite it's promising ability, a critical downside of CoT prompting is that the performance is greatly affected by the factuality of the generated explanation. To improve the correctness of the explanations, fine-tuning language models with explanation data is needed. However, there exists only a few datasets that can be used for such approaches, and no data collection tool for building them. Thus, we introduce CoTEVer, a tool-kit for annotating the factual correctness of generated explanations and collecting revision data of wrong explanations. Furthermore, we suggest several use cases where the data collected with CoTEVer can be utilized for enhancing the faithfulness of explanations. Our toolkit is publicly available at https://github.com/SeungoneKim/CoTEVer. △ Less

Submitted 6 March, 2023; originally announced March 2023.

Comments: Accepted at EACL 2023 Demo

arXiv:2302.12623 [pdf, other]

TUTORING: Instruction-Grounded Conversational Agent for Language Learners

Authors: Hyungjoo Chae, Min** Kim, Chaehyeong Kim, Wonseok Jeong, Hyejoong Kim, Junmyung Lee, **young Yeo

Abstract: In this paper, we propose Tutoring bot, a generative chatbot trained on a large scale of tutor-student conversations for English-language learning. To mimic a human tutor's behavior in language education, the tutor bot leverages diverse educational instructions and grounds to each instruction as additional input context for the tutor response generation. As a single instruction generally involves… ▽ More In this paper, we propose Tutoring bot, a generative chatbot trained on a large scale of tutor-student conversations for English-language learning. To mimic a human tutor's behavior in language education, the tutor bot leverages diverse educational instructions and grounds to each instruction as additional input context for the tutor response generation. As a single instruction generally involves multiple dialogue turns to give the student sufficient speaking practice, the tutor bot is required to monitor and capture when the current instruction should be kept or switched to the next instruction. For that, the tutor bot is learned to not only generate responses but also infer its teaching action and progress on the current conversation simultaneously by a multi-task learning scheme. Our Tutoring bot is deployed under a non-commercial use license at https://tutoringai.com. △ Less

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2212.07462 [pdf, other]

Harmonic (Quantum) Neural Networks

Authors: Atiyo Ghosh, Antonio A. Gentile, Mario Dagrada, Chul Lee, Seong-Hyok Kim, Hyukgeun Cha, Yunjun Choi, Brad Kim, Jeong-Il Kye, Vincent E. Elfving

Abstract: Harmonic functions are abundant in nature, appearing in limiting cases of Maxwell's, Navier-Stokes equations, the heat and the wave equation. Consequently, there are many applications of harmonic functions from industrial process optimisation to robotic path planning and the calculation of first exit times of random walks. Despite their ubiquity and relevance, there have been few attempts to incor… ▽ More Harmonic functions are abundant in nature, appearing in limiting cases of Maxwell's, Navier-Stokes equations, the heat and the wave equation. Consequently, there are many applications of harmonic functions from industrial process optimisation to robotic path planning and the calculation of first exit times of random walks. Despite their ubiquity and relevance, there have been few attempts to incorporate inductive biases towards harmonic functions in machine learning contexts. In this work, we demonstrate effective means of representing harmonic functions in neural networks and extend such results also to quantum neural networks to demonstrate the generality of our approach. We benchmark our approaches against (quantum) physics-informed neural networks, where we show favourable performance. △ Less

Submitted 13 August, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

Comments: 12 pages (main), 7 pages (supplementary), 7 figures

Journal ref: PMLR 202:11340-11359, 2023

arXiv:2212.03099 [pdf, other]

Semantic-Conditional Diffusion Networks for Image Captioning

Authors: Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Jianlin Feng, Hongyang Chao, Tao Mei

Abstract: Recent advances on text-to-image generation have witnessed the rise of diffusion models which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words and meanwhile pursue complex visual-language alignment in image captioning. In this paper, we break the deeply rooted conventions in learning Transformer… ▽ More Recent advances on text-to-image generation have witnessed the rise of diffusion models which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words and meanwhile pursue complex visual-language alignment in image captioning. In this paper, we break the deeply rooted conventions in learning Transformer-based encoder-decoder, and propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net). Technically, for each input image, we first search the semantically relevant sentences via cross-modal retrieval model to convey the comprehensive semantic information. The rich semantics are further regarded as semantic prior to trigger the learning of Diffusion Transformer, which produces the output sentence in a diffusion process. In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visional-language alignment and linguistical coherence in a cascaded manner. Furthermore, to stabilize the diffusion process, a new self-critical sequence training strategy is designed to guide the learning of SCD-Net with the knowledge of a standard autoregressive Transformer model. Extensive experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task. Source code is available at \url{https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet}. △ Less

Submitted 6 December, 2022; originally announced December 2022.

Comments: Source code is available at \url{https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet}

arXiv:2212.00850 [pdf, other]

When Neural Networks Fail to Generalize? A Model Sensitivity Perspective

Authors: Jia** Zhang, Hanqing Chao, Amit Dhurandhar, Pin-Yu Chen, Ali Tajer, Yangyang Xu, **kun Yan

Abstract: Domain generalization (DG) aims to train a model to perform well in unseen domains under different distributions. This paper considers a more realistic yet more challenging scenario,namely Single Domain Generalization (Single-DG), where only a single source domain is available for training. To tackle this challenge, we first try to understand when neural networks fail to generalize? We empirically… ▽ More Domain generalization (DG) aims to train a model to perform well in unseen domains under different distributions. This paper considers a more realistic yet more challenging scenario,namely Single Domain Generalization (Single-DG), where only a single source domain is available for training. To tackle this challenge, we first try to understand when neural networks fail to generalize? We empirically ascertain a property of a model that correlates strongly with its generalization that we coin as "model sensitivity". Based on our analysis, we propose a novel strategy of Spectral Adversarial Data Augmentation (SADA) to generate augmented images targeted at the highly sensitive frequencies. Models trained with these hard-to-learn samples can effectively suppress the sensitivity in the frequency space, which leads to improved generalization performance. Extensive experiments on multiple public datasets demonstrate the superiority of our approach, which surpasses the state-of-the-art single-DG methods. △ Less

Submitted 1 December, 2022; originally announced December 2022.

Comments: Accepted by AAAI 2023

arXiv:2211.15588 [pdf, other]

Internet of Behaviors: A Survey

Authors: Jiayi Sun, Wensheng Gan, Han-Chieh Chao, Philip S. Yu, Wei** Ding

Abstract: The Internet of Behavior is a research theme that aims to analyze human behavior data on the Internet from the perspective of behavioral psychology, obtain insights about human behavior, and better understand the intention behind the behavior. In this way, the Internet of Behavior can predict human behavioral trends in the future and even change human behavior, which can provide more convenience f… ▽ More The Internet of Behavior is a research theme that aims to analyze human behavior data on the Internet from the perspective of behavioral psychology, obtain insights about human behavior, and better understand the intention behind the behavior. In this way, the Internet of Behavior can predict human behavioral trends in the future and even change human behavior, which can provide more convenience for human life. With the increasing prosperity of the Internet of Things, more and more behavior-related data is collected on the Internet by connected devices such as sensors. People and behavior are connected through the extension of the Internet of Things -- the Internet of Behavior. At present, the Internet of Behavior has gradually been applied to our lives, but it is still in its early stages, and many opportunities and challenges are emerging. This paper provides an in-depth overview of the fundamental aspects of the Internet of Behavior: (1) We introduce the development process and research status of the Internet of Behavior from the perspective of the Internet of Things. (2) We propose the characteristics of the Internet of Behavior and define its development direction in terms of three aspects: real-time, autonomy, and reliability. (3) We provide a comprehensive summary of the current applications of the Internet of Behavior, including specific discussions in five scenarios that give an overview of the application status of the Internet of Behavior. (4) We discuss the challenges of the Internet of Behavior's development and its future directions, which hopefully will bring some progress to the Internet of Behavior. To the best of our knowledge, this is the first survey paper on the Internet of Behavior. We hope that this in-depth review can provide some useful directions for more productive research in related fields. △ Less

Submitted 28 November, 2022; originally announced November 2022.

Comments: Preprint. 9 figures, 1 table

arXiv:2211.14951 [pdf, other]

Metaverse in Education: Vision, Opportunities, and Challenges

Authors: Hong Lin, Shicheng Wan, Wensheng Gan, Jiahui Chen, Han-Chieh Chao

Abstract: Traditional education has been updated with the development of information technology in human history. Within big data and cyber-physical systems, the Metaverse has generated strong interest in various applications (e.g., entertainment, business, and cultural travel) over the last decade. As a novel social work idea, the Metaverse consists of many kinds of technologies, e.g., big data, interactio… ▽ More Traditional education has been updated with the development of information technology in human history. Within big data and cyber-physical systems, the Metaverse has generated strong interest in various applications (e.g., entertainment, business, and cultural travel) over the last decade. As a novel social work idea, the Metaverse consists of many kinds of technologies, e.g., big data, interaction, artificial intelligence, game design, Internet computing, Internet of Things, and blockchain. It is foreseeable that the usage of Metaverse will contribute to educational development. However, the architectures of the Metaverse in education are not yet mature enough. There are many questions we should address for the Metaverse in education. To this end, this paper aims to provide a systematic literature review of Metaverse in education. This paper is a comprehensive survey of the Metaverse in education, with a focus on current technologies, challenges, opportunities, and future directions. First, we present a brief overview of the Metaverse in education, as well as the motivation behind its integration. Then, we survey some important characteristics for the Metaverse in education, including the personal teaching environment and the personal learning environment. Next, we envisage what variations of this combination will bring to education in the future and discuss their strengths and weaknesses. We also review the state-of-the-art case studies (including technical companies and educational institutions) for Metaverse in education. Finally, we point out several challenges and issues in this promising area. △ Less

Submitted 27 November, 2022; originally announced November 2022.

Comments: IEEE BigData 2022. 10 pages, 5 figures, 3 tables

arXiv:2210.11640 [pdf, other]

Not All Asians are the Same: A Disaggregated Approach to Identifying Anti-Asian Racism in Social Media

Authors: Fan Wu, Sanyam Lakhanpal, Qian Li, Kook** Lee, Doowon Kim, Heewon Chae, Hazel K. Kwon

Abstract: Recent policy initiatives have acknowledged the importance of disaggregating data pertaining to diverse Asian ethnic communities to gain a more comprehensive understanding of their current status and to improve their overall well-being. However, research on anti-Asian racism has thus far fallen short of properly incorporating data disaggregation practices. Our study addresses this gap by collectin… ▽ More Recent policy initiatives have acknowledged the importance of disaggregating data pertaining to diverse Asian ethnic communities to gain a more comprehensive understanding of their current status and to improve their overall well-being. However, research on anti-Asian racism has thus far fallen short of properly incorporating data disaggregation practices. Our study addresses this gap by collecting 12-month-long data from X (formerly known as Twitter) that contain diverse sub-ethnic group representations within Asian communities. In this dataset, we break down anti-Asian toxic messages based on both temporal and ethnic factors and conduct a series of comparative analyses of toxic messages, targeting different ethnic groups. Using temporal persistence analysis, $n$-gram-based correspondence analysis, and topic modeling, this study provides compelling evidence that anti-Asian messages comprise various distinctive narratives. Certain messages targeting sub-ethnic Asian groups entail different topics that distinguish them from those targeting Asians in a generic manner or those aimed at major ethnic groups, such as Chinese and Indian. By introducing several techniques that facilitate comparisons of online anti-Asian hate towards diverse ethnic communities, this study highlights the importance of taking a nuanced and disaggregated approach for understanding racial hatred to formulate effective mitigation strategies. △ Less

Submitted 12 February, 2024; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: Accepted at theWebConf 2024 (formerly, WWW)

arXiv:2210.07990 [pdf, other]

Metaverse: Survey, Applications, Security, and Opportunities

Authors: Jiayi Sun, Wensheng Gan, Han-Chieh Chao, Philip S. Yu

Abstract: As a fusion of various emerging digital technologies, the Metaverse aims to build a virtual shared digital space. It is closely related to extended reality, digital twin, blockchain, and other technologies. Its goal is to build a digital space based on the real world, form a virtual economic system, and expand the space of human activities, which injects new vitality into the social, economic, and… ▽ More As a fusion of various emerging digital technologies, the Metaverse aims to build a virtual shared digital space. It is closely related to extended reality, digital twin, blockchain, and other technologies. Its goal is to build a digital space based on the real world, form a virtual economic system, and expand the space of human activities, which injects new vitality into the social, economic, and other fields. In this article, we make the following contributions. We first introduce the basic concepts such as the development process, definition, and characteristics of the Metaverse. After that, we analyze the overall framework and supporting technologies of the Metaverse and summarize the status of common fields. Finally, we focus on the security and privacy issues that exist in the Metaverse, and give corresponding solutions or directions. We also discuss the challenges and possible research directions of the Metaverse. We believe this comprehensive review can provide an overview of the Metaverse and research directions for some multidisciplinary studies. △ Less

Submitted 14 October, 2022; originally announced October 2022.

Comments: Preprint. 5 figures, 4 tables

arXiv:2209.12807 [pdf, other]

Out-of-Distribution Detection with Hilbert-Schmidt Independence Optimization

Authors: **gyang Lin, Yu Wang, Qi Cai, Yingwei Pan, Ting Yao, Hongyang Chao, Tao Mei

Abstract: Outlier detection tasks have been playing a critical role in AI safety. There has been a great challenge to deal with this task. Observations show that deep neural network classifiers usually tend to incorrectly classify out-of-distribution (OOD) inputs into in-distribution classes with high confidence. Existing works attempt to solve the problem by explicitly imposing uncertainty on classifiers w… ▽ More Outlier detection tasks have been playing a critical role in AI safety. There has been a great challenge to deal with this task. Observations show that deep neural network classifiers usually tend to incorrectly classify out-of-distribution (OOD) inputs into in-distribution classes with high confidence. Existing works attempt to solve the problem by explicitly imposing uncertainty on classifiers when OOD inputs are exposed to the classifier during training. In this paper, we propose an alternative probabilistic paradigm that is both practically useful and theoretically viable for the OOD detection tasks. Particularly, we impose statistical independence between inlier and outlier data during training, in order to ensure that inlier data reveals little information about OOD data to the deep estimator during training. Specifically, we estimate the statistical dependence between inlier and outlier data through the Hilbert-Schmidt Independence Criterion (HSIC), and we penalize such metric during training. We also associate our approach with a novel statistical test during the inference time coupled with our principled motivation. Empirical results show that our method is effective and robust for OOD detection on various benchmarks. In comparison to SOTA models, our approach achieves significant improvement regarding FPR95, AUROC, and AUPR metrics. Code is available: \url{https://github.com/jylins/hood}. △ Less

Submitted 26 September, 2022; originally announced September 2022.

Comments: Source code is available at \url{https://github.com/jylins/hood}

arXiv:2209.07764 [pdf, other]

DS-K3DOM: 3-D Dynamic Occupancy Map** with Kernel Inference and Dempster-Shafer Evidential Theory

Authors: Juyeop Han, Youngjae Min, Hyeok-Joo Chae, Byeong-Min Jeong, Han-Lim Choi

Abstract: Occupancy map** has been widely utilized to represent the surroundings for autonomous robots to perform tasks such as navigation and manipulation. While occupancy map** in 2-D environments has been well-studied, there have been few approaches suitable for 3-D dynamic occupancy map** which is essential for aerial robots. This paper presents a novel 3-D dynamic occupancy map** algorithm call… ▽ More Occupancy map** has been widely utilized to represent the surroundings for autonomous robots to perform tasks such as navigation and manipulation. While occupancy map** in 2-D environments has been well-studied, there have been few approaches suitable for 3-D dynamic occupancy map** which is essential for aerial robots. This paper presents a novel 3-D dynamic occupancy map** algorithm called DS-K3DOM. We first establish a Bayesian method to sequentially update occupancy maps for a stream of measurements based on the random finite set theory. Then, we approximate it with particles in the Dempster-Shafer domain to enable real-time computation. Moreover, the algorithm applies kernel-based inference with Dirichlet basic belief assignment to enable dense map** from sparse measurements. The efficacy of the proposed algorithm is demonstrated through simulations and real experiments. △ Less

Submitted 24 February, 2023; v1 submitted 16 September, 2022; originally announced September 2022.

Comments: 7 pages, 3 figures, Accepted to ICRA 2023

MSC Class: J.2

arXiv:2209.00945 [pdf, other]

IMG2IMU: Translating Knowledge from Large-Scale Images to IMU Sensing Applications

Authors: Hyungjun Yoon, Hyeongheon Cha, Hoang C. Nguyen, Taesik Gong, Sung-Ju Lee

Abstract: Pre-training representations acquired via self-supervised learning could achieve high accuracy on even tasks with small training data. Unlike in vision and natural language processing domains, pre-training for IMU-based applications is challenging, as there are few public datasets with sufficient size and diversity to learn generalizable representations. To overcome this problem, we propose IMG2IM… ▽ More Pre-training representations acquired via self-supervised learning could achieve high accuracy on even tasks with small training data. Unlike in vision and natural language processing domains, pre-training for IMU-based applications is challenging, as there are few public datasets with sufficient size and diversity to learn generalizable representations. To overcome this problem, we propose IMG2IMU that adapts pre-trained representation from large-scale images to diverse IMU sensing tasks. We convert the sensor data into visually interpretable spectrograms for the model to utilize the knowledge gained from vision. We further present a sensor-aware pre-training method for images that enables models to acquire particularly impactful knowledge for IMU sensing applications. This involves using contrastive learning on our augmentation set customized for the properties of sensor data. Our evaluation with four different IMU sensing tasks shows that IMG2IMU outperforms the baselines pre-trained on sensor data by an average of 9.6%p F1-score, illustrating that vision knowledge can be usefully incorporated into IMU sensing applications where only limited training data is available. △ Less

Submitted 29 February, 2024; v1 submitted 2 September, 2022; originally announced September 2022.

Comments: 12 pages

MSC Class: 68T20

arXiv:2209.00930 [pdf, other]

Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization

Authors: Seungone Kim, Se June Joo, Hyungjoo Chae, Chaehyeong Kim, Seung-won Hwang, **young Yeo

Abstract: In this paper, we propose to leverage the unique characteristics of dialogues sharing commonsense knowledge across participants, to resolve the difficulties in summarizing them. We present SICK, a framework that uses commonsense inferences as additional context. Compared to previous work that solely relies on the input dialogue, SICK uses an external knowledge model to generate a rich set of commo… ▽ More In this paper, we propose to leverage the unique characteristics of dialogues sharing commonsense knowledge across participants, to resolve the difficulties in summarizing them. We present SICK, a framework that uses commonsense inferences as additional context. Compared to previous work that solely relies on the input dialogue, SICK uses an external knowledge model to generate a rich set of commonsense inferences and selects the most probable one with a similarity-based selection method. Built upon SICK, SICK++ utilizes commonsense as supervision, where the task of generating commonsense inferences is added upon summarizing the dialogue in a multi-task learning setting. Experimental results show that with injected commonsense knowledge, our framework generates more informative and consistent summaries than existing methods. △ Less

Submitted 2 September, 2022; originally announced September 2022.

Comments: Accepted at COLING 2022

arXiv:2207.06633 [pdf, other]

Toward cm-Level Accuracy: Carrier Phase Positioning for IIoT in 5G-Advanced NR Networks

Authors: Abdurrahman Fouda, Ryan Keating, Hyun-Su Cha

Abstract: High-precision positioning accuracy is among the key features of the future fifth-generation (5G-advanced) cellular networks to enable a wide variety of commercial, critical, and consumer use cases. 5G new radio (NR) systems have relied on (1) cellular temporal/angular-based positioning methods to provide the indoor environments with a moderate positioning accuracy that is well below the positioni… ▽ More High-precision positioning accuracy is among the key features of the future fifth-generation (5G-advanced) cellular networks to enable a wide variety of commercial, critical, and consumer use cases. 5G new radio (NR) systems have relied on (1) cellular temporal/angular-based positioning methods to provide the indoor environments with a moderate positioning accuracy that is well below the positioning requirements of these future use cases and (2) highly precise satellite carrier phase/code-based positioning methods for the outdoor deployments that are limited by the availability of the satellite coverage. This paper defines the relevant standard mechanisms and algorithms to use the carrier phase cellular-based measurements as a potential solution to achieve a high-precision positioning estimation accuracy in 5G-advanced NR networks. The presented positioning technique is evaluated using high-fidelity system-level simulations for indoor factory (InF) deployment scenarios. The numerical results demonstrate that the presented technique can significantly improve the positioning accuracy compared with the state-of-the-art NR positioning methods. Our findings in this paper also show that the carrier phase method not only provides an indoor complement to the outdoor satellite positioning but also provides an outdoor alternative to the high-precision satellite methods. △ Less

Submitted 13 July, 2022; originally announced July 2022.

Comments: in Proc. IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (IEEE PIMRC), Sep. 2022

arXiv:2207.05231 [pdf, other]

Regression Metric Loss: Learning a Semantic Representation Space for Medical Images

Authors: Hanqing Chao, Jia** Zhang, **kun Yan

Abstract: Regression plays an essential role in many medical imaging applications for estimating various clinical risk or measurement scores. While training strategies and loss functions have been studied for the deep neural networks in medical image classification tasks, options for regression tasks are very limited. One of the key challenges is that the high-dimensional feature representation learned by e… ▽ More Regression plays an essential role in many medical imaging applications for estimating various clinical risk or measurement scores. While training strategies and loss functions have been studied for the deep neural networks in medical image classification tasks, options for regression tasks are very limited. One of the key challenges is that the high-dimensional feature representation learned by existing popular loss functions like Mean Squared Error or L1 loss is hard to interpret. In this paper, we propose a novel Regression Metric Loss (RM-Loss), which endows the representation space with the semantic meaning of the label space by finding a representation manifold that is isometric to the label space. Experiments on two regression tasks, i.e. coronary artery calcium score estimation and bone age assessment, show that RM-Loss is superior to the existing popular regression losses on both performance and interpretability. Code is available at https://github.com/DIAL-RPI/Regression-Metric-Loss. △ Less

Submitted 11 July, 2022; originally announced July 2022.

Comments: Accepted by MICCAI2022

arXiv:2204.11068 [pdf, ps, other]

Integer Forcing Interference Management for the MIMO Interference Channel

Authors: Sung Ho Chae, Sang-Woon Jeon

Abstract: A new interference management scheme based on integer forcing (IF) receivers is studied for the two-user multiple-input and multiple-output (MIMO) interference channel. The proposed scheme employs a message splitting method that divides each data stream into common and private sub-streams, in which the private stream is recovered by the dedicated receiver only while the common stream is required t… ▽ More A new interference management scheme based on integer forcing (IF) receivers is studied for the two-user multiple-input and multiple-output (MIMO) interference channel. The proposed scheme employs a message splitting method that divides each data stream into common and private sub-streams, in which the private stream is recovered by the dedicated receiver only while the common stream is required to be recovered by both receivers. Specifically, to enable IF sum decoding at the receiver side, all streams are encoded using the same lattice code. Additionally, the number of common and private streams of each user is carefully determined by considering the number of antennas at transmitters and receivers, the channel matrices, and the effective signal-to-noise ratio (SNR) at each receiver to maximize the achievable rate. Furthermore, we consider various assumptions of channel state information at the transmitter side (CSIT) and propose low-complexity linear transmit beamforming suitable for each CSIT assumption. The achievable sum rate and rate region are analytically derived and extensively evaluated by simulation for various environments, demonstrating that the proposed interference management scheme strictly outperforms the previous benchmark schemes in a wide range of channel parameters due to the gain from IF sum decoding. △ Less

Submitted 23 April, 2022; originally announced April 2022.

Comments: Submitted to IEEE Trans. Wireless Commun. (in revision), 31 pages, 6 figures

arXiv:2203.03814 [pdf, other]

doi 10.1109/CVPR52688.2022.00257

Generating 3D Bio-Printable Patches Using Wound Segmentation and Reconstruction to Treat Diabetic Foot Ulcers

Authors: Han Joo Chae, Seunghwan Lee, Hyewon Son, Seungyeob Han, Taebin Lim

Abstract: We introduce AiD Regen, a novel system that generates 3D wound models combining 2D semantic segmentation with 3D reconstruction so that they can be printed via 3D bio-printers during the surgery to treat diabetic foot ulcers (DFUs). AiD Regen seamlessly binds the full pipeline, which includes RGB-D image capturing, semantic segmentation, boundary-guided point-cloud processing, 3D model reconstruct… ▽ More We introduce AiD Regen, a novel system that generates 3D wound models combining 2D semantic segmentation with 3D reconstruction so that they can be printed via 3D bio-printers during the surgery to treat diabetic foot ulcers (DFUs). AiD Regen seamlessly binds the full pipeline, which includes RGB-D image capturing, semantic segmentation, boundary-guided point-cloud processing, 3D model reconstruction, and 3D printable G-code generation, into a single system that can be used out of the box. We developed a multi-stage data preprocessing method to handle small and unbalanced DFU image datasets. AiD Regen's human-in-the-loop machine learning interface enables clinicians to not only create 3D regenerative patches with just a few touch interactions but also customize and confirm wound boundaries. As evidenced by our experiments, our model outperforms prior wound segmentation models and our reconstruction algorithm is capable of generating 3D wound models with compelling accuracy. We further conducted a case study on a real DFU patient and demonstrated the effectiveness of AiD Regen in treating DFU wounds. △ Less

Submitted 7 March, 2022; originally announced March 2022.

Comments: Accepted to CVPR 2022

arXiv:2112.07515 [pdf, other]

CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

Authors: Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Hongyang Chao, Tao Mei

Abstract: BERT-type structure has led to the revolution of vision-language pre-training and the achievement of state-of-the-art results on numerous vision-language downstream tasks. Existing solutions dominantly capitalize on the multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that… ▽ More BERT-type structure has led to the revolution of vision-language pre-training and the achievement of state-of-the-art results on numerous vision-language downstream tasks. Existing solutions dominantly capitalize on the multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs would inevitably introduce noise for cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as the noisy augmentation of primary unmasked ones, CoCo strengthens video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. Our CoCo proxy objective can be further integrated into any BERT-type encoder-decoder structure for video-language pre-training, named as Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure. △ Less

Submitted 14 December, 2021; originally announced December 2021.

Comments: ACM Multimedia 2021

arXiv:2112.07513 [pdf, other]

CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning

Authors: **gyang Lin, Yingwei Pan, Rongfeng Lai, Xuehang Yang, Hongyang Chao, Ting Yao

Abstract: Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision. Nevertheless, owing to the extremely varied aspect ratios and scales of text instances in real scenes, most conventional text detectors suffer from the sub-text problem that only localizes the fragments of text instance (i.e., sub-texts). In this work, we quantitatively analyze the sub-text probl… ▽ More Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision. Nevertheless, owing to the extremely varied aspect ratios and scales of text instances in real scenes, most conventional text detectors suffer from the sub-text problem that only localizes the fragments of text instance (i.e., sub-texts). In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, COntrastive RElation (CORE) module, to mitigate that issue. CORE first leverages a vanilla relation block to model the relations among all text proposals (sub-texts of multiple text instances) and further enhances relational reasoning via instance-level sub-text discrimination in a contrastive manner. Such way naturally learns instance-aware representations of text proposals and thus facilitates scene text detection. We integrate the CORE module into a two-stage text detector of Mask R-CNN and devise our text detector CORE-Text. Extensive experiments on four benchmarks demonstrate the superiority of CORE-Text. Code is available: \url{https://github.com/jylins/CORE-Text}. △ Less

Submitted 14 December, 2021; originally announced December 2021.

Comments: ICME 2021 (Oral); Code is publicly available at: https://github.com/jylins/CORE-Text

arXiv:2111.14725 [pdf, other]

Searching the Search Space of Vision Transformer

Authors: Minghao Chen, Kan Wu, Bolin Ni, Houwen Peng, Bei Liu, Jianlong Fu, Hongyang Chao, Haibin Ling

Abstract: Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection, and thus been attracting fast-growing efforts on manually designing more effective architectures. In this paper, we propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space. The central idea is to g… ▽ More Vision Transformer has shown great visual representation power in substantial vision tasks such as recognition and detection, and thus been attracting fast-growing efforts on manually designing more effective architectures. In this paper, we propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space. The central idea is to gradually evolve different search dimensions guided by their E-T Error computed using a weight-sharing supernet. Moreover, we provide design guidelines of general vision transformers with extensive analysis according to the space searching process, which could promote the understanding of vision transformer. Remarkably, the searched models, named S3 (short for Searching the Search Space), from the searched space achieve superior performance to recently proposed models, such as Swin, DeiT and ViT, when evaluated on ImageNet. The effectiveness of S3 is also illustrated on object detection, semantic segmentation and visual question answering, demonstrating its generality to downstream vision and vision-language tasks. Code and models will be available at https://github.com/microsoft/Cream. △ Less

Submitted 29 November, 2021; originally announced November 2021.

Comments: Accepted to NIPS 2021

arXiv:2111.03481 [pdf, other]

Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers

Authors: Yanhong Zeng, Huan Yang, Hongyang Chao, Jianbo Wang, Jianlong Fu

Abstract: We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem. Different from existing paradigms that directly synthesize a full image from a single input (e.g., a latent code), the new formulation enables a flexible local manipulation for different image regions, which makes it possible to learn content-aware and fine-grained style control for… ▽ More We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem. Different from existing paradigms that directly synthesize a full image from a single input (e.g., a latent code), the new formulation enables a flexible local manipulation for different image regions, which makes it possible to learn content-aware and fine-grained style control for image synthesis. Specifically, it takes as input a sequence of latent tokens to predict the visual tokens for synthesizing an image. Under this perspective, we propose a token-based generator (i.e.,TokenGAN). Particularly, the TokenGAN inputs two semantically different visual tokens, i.e., the learned constant content tokens and the style tokens from the latent space. Given a sequence of style tokens, the TokenGAN is able to control the image synthesis by assigning the styles to the content tokens by attention mechanism with a Transformer. We conduct extensive experiments and show that the proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks, including FFHQ and LSUN CHURCH with different resolutions. In particular, the generator is able to synthesize high-fidelity images with 1024x1024 size, dispensing with convolutions entirely. △ Less

Submitted 18 December, 2021; v1 submitted 5 November, 2021; originally announced November 2021.

Comments: NeurIPS 2021

arXiv:2108.04456 [pdf, other]

doi 10.1109/TIP.2021.3096067

Reference-based Defect Detection Network

Authors: Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao

Abstract: The defect detection task can be regarded as a realistic scenario of object detection in the computer vision field and it is widely used in the industrial field. Directly applying vanilla object detector to defect detection task can achieve promising results, while there still exists challenging issues that have not been solved. The first issue is the texture shift which means a trained defect det… ▽ More The defect detection task can be regarded as a realistic scenario of object detection in the computer vision field and it is widely used in the industrial field. Directly applying vanilla object detector to defect detection task can achieve promising results, while there still exists challenging issues that have not been solved. The first issue is the texture shift which means a trained defect detector model will be easily affected by unseen texture, and the second issue is partial visual confusion which indicates that a partial defect box is visually similar with a complete box. To tackle these two problems, we propose a Reference-based Defect Detection Network (RDDN). Specifically, we introduce template reference and context reference to against those two problems, respectively. Template reference can reduce the texture shift from image, feature or region levels, and encourage the detectors to focus more on the defective area as a result. We can use either well-aligned template images or the outputs of a pseudo template generator as template references in this work, and they are jointly trained with detectors by the supervision of normal samples. To solve the partial visual confusion issue, we propose to leverage the carried context information of context reference, which is the concentric bigger box of each region proposal, to perform more accurate region classification and regression. Experiments on two defect detection datasets demonstrate the effectiveness of our proposed approach. △ Less

Submitted 10 August, 2021; originally announced August 2021.

Journal ref: IEEE Transactions on Image Processing, vol. 30, pp. 6637-6647, 2021

arXiv:2108.02696 [pdf, other]

A Low Rank Promoting Prior for Unsupervised Contrastive Learning

Authors: Yu Wang, **gyang Lin, Qi Cai, Yingwei Pan, Ting Yao, Hongyang Chao, Tao Mei

Abstract: Unsupervised learning is just at a tip** point where it could really take off. Among these approaches, contrastive learning has seen tremendous progress and led to state-of-the-art performance. In this paper, we construct a novel probabilistic graphical model that effectively incorporates the low rank promoting prior into the framework of contrastive learning, referred to as LORAC. In contrast t… ▽ More Unsupervised learning is just at a tip** point where it could really take off. Among these approaches, contrastive learning has seen tremendous progress and led to state-of-the-art performance. In this paper, we construct a novel probabilistic graphical model that effectively incorporates the low rank promoting prior into the framework of contrastive learning, referred to as LORAC. In contrast to the existing conventional self-supervised approaches that only considers independent learning, our hypothesis explicitly requires that all the samples belonging to the same instance class lie on the same subspace with small dimension. This heuristic poses particular joint learning constraints to reduce the degree of freedom of the problem during the search of the optimal network parameterization. Most importantly, we argue that the low rank prior employed here is not unique, and many different priors can be invoked in a similar probabilistic way, corresponding to different hypotheses about underlying truth behind the contrastive features. Empirical evidences show that the proposed algorithm clearly surpasses the state-of-the-art approaches on multiple benchmarks, including image classification, object detection, instance segmentation and keypoint detection. △ Less

Submitted 5 August, 2021; originally announced August 2021.

arXiv:2107.14222 [pdf, other]

Rethinking and Improving Relative Position Encoding for Vision Transformer

Authors: Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, Hongyang Chao

Abstract: Relative position encoding (RPE) is important for transformer to capture sequence ordering of input tokens. General efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work equally well as absolute position? In order to clarify this, we first review existi… ▽ More Relative position encoding (RPE) is important for transformer to capture sequence ordering of input tokens. General efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work equally well as absolute position? In order to clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in self-attention mechanism. The proposed iRPE methods are simple and lightweight. They can be easily plugged into transformer blocks. Experiments demonstrate that solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE. △ Less

Submitted 29 July, 2021; originally announced July 2021.

Comments: Accepted by ICCV 2021

arXiv:2107.04548 [pdf, other]

Cross-modal Attention for MRI and Ultrasound Volume Registration

Authors: Xinrui Song, Hengtao Guo, Xuanang Xu, Hanqing Chao, Sheng Xu, Baris Turkbey, Bradford J. Wood, Ge Wang, **kun Yan

Abstract: Prostate cancer biopsy benefits from accurate fusion of transrectal ultrasound (TRUS) and magnetic resonance (MR) images. In the past few years, convolutional neural networks (CNNs) have been proved powerful in extracting image features crucial for image registration. However, challenging applications and recent advances in computer vision suggest that CNNs are quite limited in its ability to unde… ▽ More Prostate cancer biopsy benefits from accurate fusion of transrectal ultrasound (TRUS) and magnetic resonance (MR) images. In the past few years, convolutional neural networks (CNNs) have been proved powerful in extracting image features crucial for image registration. However, challenging applications and recent advances in computer vision suggest that CNNs are quite limited in its ability to understand spatial correspondence between features, a task in which the self-attention mechanism excels. This paper aims to develop a self-attention mechanism specifically for cross-modal image registration. Our proposed cross-modal attention block effectively maps each of the features in one volume to all features in the corresponding volume. Our experimental results demonstrate that a CNN network designed with the cross-modal attention block embedded outperforms an advanced CNN network 10 times of its size. We also incorporated visualization techniques to improve the interpretability of our network. The source code of our work is available at https://github.com/DIAL-RPI/Attention-Reg . △ Less

Submitted 11 July, 2021; v1 submitted 9 July, 2021; originally announced July 2021.

Comments: This paper has been accepted by MICCAI 2021

arXiv:2106.14413 [pdf, other]

Co$^2$L: Contrastive Continual Learning

Authors: Hyuntak Cha, Jaeho Lee, **woo Shin

Abstract: Recent breakthroughs in self-supervised learning show that such algorithms learn visual representations that can be transferred better to unseen tasks than joint-training methods relying on task-specific supervision. In this paper, we found that the similar holds in the continual learning con-text: contrastively learned representations are more robust against the catastrophic forgetting than joint… ▽ More Recent breakthroughs in self-supervised learning show that such algorithms learn visual representations that can be transferred better to unseen tasks than joint-training methods relying on task-specific supervision. In this paper, we found that the similar holds in the continual learning con-text: contrastively learned representations are more robust against the catastrophic forgetting than jointly trained representations. Based on this novel observation, we propose a rehearsal-based continual learning algorithm that focuses on continually learning and maintaining transferable representations. More specifically, the proposed scheme (1) learns representations using the contrastive learning objective, and (2) preserves learned representations using a self-supervised distillation step. We conduct extensive experimental validations under popular benchmark image classification datasets, where our method sets the new state-of-the-art performance. △ Less

Submitted 28 June, 2021; originally announced June 2021.

Comments: 14 pages, 5 figures

arXiv:2105.09937 [pdf, other]

AnaXNet: Anatomy Aware Multi-label Finding Classification in Chest X-ray

Authors: Nkechinyere N. Agu, Joy T. Wu, Hanqing Chao, Ismini Lourentzou, Arjun Sharma, Mehdi Moradi, **kun Yan, James Hendler

Abstract: Radiologists usually observe anatomical regions of chest X-ray images as well as the overall image before making a decision. However, most existing deep learning models only look at the entire X-ray image for classification, failing to utilize important anatomical information. In this paper, we propose a novel multi-label chest X-ray classification model that accurately classifies the image findin… ▽ More Radiologists usually observe anatomical regions of chest X-ray images as well as the overall image before making a decision. However, most existing deep learning models only look at the entire X-ray image for classification, failing to utilize important anatomical information. In this paper, we propose a novel multi-label chest X-ray classification model that accurately classifies the image finding and also localizes the findings to their correct anatomical regions. Specifically, our model consists of two modules, the detection module and the anatomical dependency module. The latter utilizes graph convolutional networks, which enable our model to learn not only the label dependency but also the relationship between the anatomical regions in the chest X-ray. We further utilize a method to efficiently create an adjacency matrix for the anatomical regions using the correlation of the label across the different regions. Detailed experiments and analysis of our results show the effectiveness of our method when compared to the current state-of-the-art multi-label chest X-ray image classification methods while also providing accurate location information. △ Less

Submitted 20 May, 2021; originally announced May 2021.

Comments: Accepted to MICCAI 2021

arXiv:2104.01762 [pdf, other]

doi 10.1007/978-981-10-8530-7_10

3D Human Body Resha** with Anthropometric Modeling

Authors: Yanhong Zeng, Jianlong Fu, Hongyang Chao

Abstract: Resha** accurate and realistic 3D human bodies from anthropometric parameters (e.g., height, chest size, etc.) poses a fundamental challenge for person identification, online shop** and virtual reality. Existing approaches for creating such 3D shapes often suffer from complex measurement by range cameras or high-end scanners, which either involve heavy expense cost or result in low quality. Ho… ▽ More Resha** accurate and realistic 3D human bodies from anthropometric parameters (e.g., height, chest size, etc.) poses a fundamental challenge for person identification, online shop** and virtual reality. Existing approaches for creating such 3D shapes often suffer from complex measurement by range cameras or high-end scanners, which either involve heavy expense cost or result in low quality. However, these high-quality equipments limit existing approaches in real applications, because the equipments are not easily accessible for common users. In this paper, we have designed a 3D human body resha** system by proposing a novel feature-selection-based local map** technique, which enables automatic anthropometric parameter modeling for each body facet. Note that the proposed approach can leverage limited anthropometric parameters (i.e., 3-5 measurements) as input, which avoids complex measurement, and thus better user-friendly experience can be achieved in real scenarios. Specifically, the proposed resha** model consists of three steps. First, we calculate full-body anthropometric parameters from limited user inputs by imputation technique, and thus essential anthropometric parameters for 3D body resha** can be obtained. Second, we select the most relevant anthropometric parameters for each facet by adopting relevance masks, which are learned offline by the proposed local map** technique. Third, we generate the 3D body meshes by map** matrices, which are learned by linear regression from the selected parameters to mesh-based body representation. We conduct experiments by anthropomorphic evaluation and a user study from 68 volunteers. Experiments show the superior results of the proposed system in terms of mean reconstruction error against the state-of-the-art approaches. △ Less

Submitted 5 April, 2021; originally announced April 2021.

Comments: ICIMCS 2017(oral). The final publication is available at Springer via https://doi.org/10.1007/978-981-10-8530-7_10

Journal ref: In International Conference on Internet Multimedia Computing and Service (pp. 96-107). Springer, Singapore (2017)

arXiv:2104.01431 [pdf, other]

Aggregated Contextual Transformations for High-Resolution Image Inpainting

Authors: Yanhong Zeng, Jianlong Fu, Hongyang Chao, Baining Guo

Abstract: State-of-the-art image inpainting approaches can suffer from generating distorted structures and blurry textures in high-resolution images (e.g., 512x512). The challenges mainly drive from (1) image content reasoning from distant contexts, and (2) fine-grained texture synthesis for a large missing region. To overcome these two challenges, we propose an enhanced GAN-based model, named Aggregated CO… ▽ More State-of-the-art image inpainting approaches can suffer from generating distorted structures and blurry textures in high-resolution images (e.g., 512x512). The challenges mainly drive from (1) image content reasoning from distant contexts, and (2) fine-grained texture synthesis for a large missing region. To overcome these two challenges, we propose an enhanced GAN-based model, named Aggregated COntextual-Transformation GAN (AOT-GAN), for high-resolution image inpainting. Specifically, to enhance context reasoning, we construct the generator of AOT-GAN by stacking multiple layers of a proposed AOT block. The AOT blocks aggregate contextual transformations from various receptive fields, allowing to capture both informative distant image contexts and rich patterns of interest for context reasoning. For improving texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task. Such a training objective forces the discriminator to distinguish the detailed appearances of real and synthesized patches, and in turn, facilitates the generator to synthesize clear textures. Extensive comparisons on Places2, the most challenging benchmark with 1.8 million high-resolution images of 365 complex scenes, show that our model outperforms the state-of-the-art by a significant margin in terms of FID with 38.60% relative improvement. A user study including more than 30 subjects further validates the superiority of AOT-GAN. We further evaluate the proposed AOT-GAN in practical applications, e.g., logo removal, face editing, and object removal. Results show that our model achieves promising completions in the real world. We release code and models in https://github.com/researchmm/AOT-GAN-for-Inpainting. △ Less

Submitted 3 April, 2021; originally announced April 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2103.16519 [pdf, other]

Explainable Fuzzy Utility Mining on Sequences

Authors: Wensheng Gan, Zilin Du, Wei** Ding, Chunkai Zhang, Han-Chieh Chao

Abstract: Fuzzy systems have good modeling capabilities in several data science scenarios, and can provide human-explainable intelligence models with explainability and interpretability. In contrast to transaction data, which have been extensively studied, sequence data are more common in real-life applications. To obtain a human-explainable data intelligence model for decision making, in this study, we inv… ▽ More Fuzzy systems have good modeling capabilities in several data science scenarios, and can provide human-explainable intelligence models with explainability and interpretability. In contrast to transaction data, which have been extensively studied, sequence data are more common in real-life applications. To obtain a human-explainable data intelligence model for decision making, in this study, we investigate explainable fuzzy-theoretic utility mining on multi-sequences. Meanwhile, a more normative formulation of the problem of fuzzy utility mining on sequences is formulated. By exploring fuzzy set theory for utility mining, we propose a novel method termed pattern growth fuzzy utility mining (PGFUM) for mining fuzzy high-utility sequences with linguistic meaning. In the case of sequence data, PGFUM reflects the fuzzy quantity and utility regions of sequences. To improve the efficiency and feasibility of PGFUM, we develop two compressed data structures with explainable fuzziness. Furthermore, one existing and two new upper bounds on the explainable fuzzy utility of candidates are adopted in three proposed pruning strategies to substantially reduce the search space and thus expedite the mining process. Finally, the proposed PGFUM algorithm is compared with PFUS, which is the only currently available method for the same task, through extensive experimental evaluation. It is demonstrated that PGFUM achieves not only human-explainable mining results that contain the original nature of revealable intelligibility, but also high efficiency in terms of runtime and memory cost. △ Less

Submitted 30 March, 2021; originally announced March 2021.

Comments: Preprint. 13 figures, 5 tables

Showing 1–50 of 142 results for author: Chae, H