Search | arXiv e-print repository

Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning

Authors: Cilin Yan, Haochen Wang, Xiaolong Jiang, Yao Hu, Xu Tang, Guoliang Kang, Efstratios Gavves

Abstract: Contrastive Vision-Language Pre-training (CLIP) demonstrates impressive zero-shot capability. The key to improving the adaptation of CLIP to downstream tasks with few exemplars lies in how to effectively model and transfer the useful knowledge embedded in CLIP. Previous work typically mines the knowledge based on the limited visual samples and close-set semantics (i.e., within the target category set of the downstream task). However, the aligned CLIP image/text encoders contain abundant relationships between visual features and almost infinite open semantics, which may benefit few-shot learning but remain unexplored. In this paper, we propose to mine open semantics as anchors to perform a relation transition from image-anchor relationship to image-target relationship to make predictions. Specifically, we adopt a transformer module which takes the visual feature as "Query", the text features of the anchors as "Key" and the similarity matrix between the text features of anchor and target classes as "Value". In this way, the output of such a transformer module represents the relationship between the image and target categories, i.e., the classification predictions. To avoid manually selecting the open semantics, we make the [CLASS] token of input text embedding learnable. We conduct extensive experiments on eleven representative classification datasets. The results show that our method performs favorably against previous state-of-the-art methods in few-shot classification settings.

Submitted 28 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.
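
The Query/Key/Value arrangement described in the abstract above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, cosine-normalized features, and a hypothetical `temperature` parameter. The function name `relation_transition` and all shapes are illustrative choices.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def relation_transition(image_feat, anchor_text, target_text, temperature=0.01):
    """Hypothetical single-head sketch (assumption, not the paper's module).

    image_feat:  (B, D) visual features        -> "Query"
    anchor_text: (A, D) anchor text features   -> "Key"
    target_text: (C, D) target-class text features
    Returns (B, C) image-target scores.
    """
    q = l2_normalize(image_feat)
    k = l2_normalize(anchor_text)
    t = l2_normalize(target_text)
    # Image-anchor relationship: attention weights over the anchors.
    attn = softmax(q @ k.T / temperature, axis=-1)  # (B, A)
    # "Value": similarity matrix between anchor and target text features.
    value = k @ t.T                                 # (A, C)
    # Transition from image-anchor to image-target relationship.
    return attn @ value                             # (B, C)
```

Because each row of the attention weights sums to one, every output score is a convex combination of anchor-target cosine similarities and therefore stays in [-1, 1].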

arXiv:2405.20233 [pdf, other]

Grokfast: Accelerated Grokking by Amplifying Slow Gradients

Authors: Jaerin Lee, Bong Gyun Kang, Kihoon Kim, Kyoung Mu Lee

Abstract: One puzzling artifact in machine learning dubbed grokking is where delayed generalization is achieved tenfolds of iterations after near perfect overfitting to the training data. Focusing on the long delay itself on behalf of machine learning practitioners, our goal is to accelerate generalization of a model under grokking phenomenon. By regarding a series of gradients of a parameter over training iterations as a random signal over time, we can spectrally decompose the parameter trajectories under gradient descent into two components: the fast-varying, overfitting-yielding component and the slow-varying, generalization-inducing component. This analysis allows us to accelerate the grokking phenomenon more than $\times 50$ with only a few lines of code that amplifies the slow-varying components of gradients. The experiments show that our algorithm applies to diverse tasks involving images, languages, and graphs, enabling practical availability of this peculiar artifact of sudden generalization. Our code is available at https://github.com/ironjr/grokfast.

Submitted 5 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: 17 pages, 13 figures. Typo fixed. Project page: https://jaerinlee.com/research/grokfast
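
One way to amplify the slow-varying component of a gradient sequence, as the abstract above describes, is to low-pass filter the gradients with an exponential moving average and add the filtered signal back. The sketch below is an illustration under that assumption, not the authors' released code. The function name `amplify_slow_gradient` and the default values of `alpha` and `lamb` are hypothetical.

```python
import numpy as np


def amplify_slow_gradient(grad, ema, alpha=0.98, lamb=2.0):
    """Return (filtered_grad, new_ema).

    grad:  current gradient of one parameter (NumPy array)
    ema:   running low-pass estimate from the previous step, or None
    alpha: EMA momentum (assumed value); closer to 1 keeps slower components
    lamb:  amplification factor for the slow component (assumed value)
    """
    # Low-pass filter: exponential moving average of past gradients.
    ema = grad.copy() if ema is None else alpha * ema + (1.0 - alpha) * grad
    # Add the amplified slow component back onto the raw gradient.
    return grad + lamb * ema, ema


# Usage sketch: plain gradient descent on f(w) = 0.5 * ||w||^2.
w = np.array([1.0, -2.0])
ema = None
for _ in range(5):
    grad = w  # gradient of 0.5 * ||w||^2
    filtered, ema = amplify_slow_gradient(grad, ema)
    w = w - 0.1 * filtered
```

In a real training loop the same filter would be applied per parameter between the backward pass and the optimizer step.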

arXiv:2404.16685 [pdf, other]

Multi-scale HSV Color Feature Embedding for High-fidelity NIR-to-RGB Spectrum Translation

Authors: Huiyu Zhai, Mo Chen, Xingxing Yang, Gusheng Kang

Abstract: The NIR-to-RGB spectral domain translation is a formidable task due to the inherent spectral mapping ambiguities within NIR inputs and RGB outputs. Thus, existing methods fail to reconcile the tension between maintaining texture detail fidelity and achieving diverse color variations. In this paper, we propose a Multi-scale HSV Color Feature Embedding Network (MCFNet) that decomposes the mapping process into three sub-tasks, including NIR texture maintenance, coarse geometry reconstruction, and RGB color prediction. Thus, we propose three key modules for each corresponding sub-task: the Texture Preserving Block (TPB), the HSV Color Feature Embedding Module (HSV-CFEM), and the Geometry Reconstruction Module (GRM). These modules contribute to our MCFNet methodically tackling spectral translation through a series of escalating resolutions, progressively enriching images with color and texture fidelity in a scale-coherent fashion. The proposed MCFNet demonstrates substantial performance gains over the NIR image colorization task. Code is released at: https://github.com/AlexYangxx/MCFNet.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.15190 [pdf, other]

Socratic Planner: Inquiry-Based Zero-Shot Planning for Embodied Instruction Following

Authors: Suyeon Shin, Su** jeon, Junghyun Kim, Gi-Cheon Kang, Byoung-Tak Zhang

Abstract: Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments. One of the primary challenges in EIF is compositional task planning, which is often addressed with supervised or in-context learning with labeled data. To this end, we introduce the Socratic Planner, the first zero-shot planning method that infers without the need for any training data. Socratic Planner first decomposes the instructions into substructural information of the task through self-questioning and answering, translating it into a high-level plan, i.e., a sequence of subgoals. Subgoals are executed sequentially, with our visually grounded re-planning mechanism adjusting plans dynamically through dense visual feedback. We also introduce an evaluation metric of high-level plans, RelaxedHLP, for a more comprehensive evaluation. Experiments demonstrate the effectiveness of the Socratic Planner, achieving competitive performance on both zero-shot and few-shot task planning in the ALFRED benchmark, particularly excelling in tasks requiring higher-dimensional inference. Additionally, precise adjustments in the plan were achieved by incorporating environmental visual information.

Submitted 21 April, 2024; originally announced April 2024.

Comments: 14 pages, 6 figures

MSC Class: 68T01 (Primary) 68T40; 68T50; 68T45 (Secondary)

arXiv:2404.04913 [pdf, other]

CodecNeRF: Toward Fast Encoding and Decoding, Compact, and High-quality Novel-view Synthesis

Authors: Gyeong** Kang, Younggeun Lee, Seungjun Oh, Eunbyung Park

Abstract: Neural Radiance Fields (NeRF) have achieved huge success in effectively capturing and representing 3D objects and scenes. However, several factors have impeded its further proliferation as next-generation 3D media. To establish a ubiquitous presence in everyday media formats, such as images and videos, it is imperative to devise a solution that effectively fulfills three key objectives: fast encoding and decoding time, compact model sizes, and high-quality renderings. Despite significant advancements, a comprehensive algorithm that adequately addresses all objectives has yet to be fully realized. In this work, we present CodecNeRF, a neural codec for NeRF representations, consisting of a novel encoder and decoder architecture that can generate a NeRF representation in a single forward pass. Furthermore, inspired by the recent parameter-efficient finetuning approaches, we develop a novel finetuning method to efficiently adapt the generated NeRF representations to a new test instance, leading to high-quality image renderings and compact code sizes. The proposed CodecNeRF, a newly suggested encoding-decoding-finetuning pipeline for NeRF, achieved unprecedented compression performance of more than 150x and 20x reduction in encoding time while maintaining (or improving) the image quality on widely used 3D object datasets, such as ShapeNet and Objaverse.

Submitted 28 May, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

Comments: Project page: https://gynjn.github.io/Codec-NeRF/

arXiv:2404.00021 [pdf, other]

Evaluatology: The Science and Engineering of Evaluation

Authors: Jianfeng Zhan, Lei Wang, Wanling Gao, Hongxiao Li, Chenxi Wang, Yunyou Huang, Yatao Li, Zhengxin Yang, Guoxin Kang, Chunjie Luo, Hainan Ye, Shaopeng Dai, Zhifei Zhang

Abstract: Evaluation is a crucial aspect of human existence and plays a vital role in various fields. However, it is often approached in an empirical and ad-hoc manner, lacking consensus on universal concepts, terminologies, theories, and methodologies. This lack of agreement has significant repercussions. This article aims to formally introduce the discipline of evaluatology, which encompasses the science and engineering of evaluation. We propose a universal framework for evaluation, encompassing concepts, terminologies, theories, and methodologies that can be applied across various disciplines. Our research reveals that the essence of evaluation lies in conducting experiments that intentionally apply a well-defined evaluation condition to diverse subjects and infer the impact of different subjects by measuring and/or testing. Derived from the essence of evaluation, we propose five axioms focusing on key aspects of evaluation outcomes as the foundational evaluation theory. These axioms serve as the bedrock upon which we build universal evaluation theories and methodologies. When evaluating a single subject, it is crucial to create evaluation conditions with different levels of equivalency. By applying these conditions to diverse subjects, we can establish reference evaluation models. These models allow us to alter a single independent variable at a time while keeping all other variables as controls. When evaluating complex scenarios, the key lies in establishing a series of evaluation models that maintain transitivity.
Building upon the science of evaluation, we propose a formal definition of a benchmark as a simplified and sampled evaluation condition that guarantees different levels of equivalency. This concept serves as the cornerstone for a universal benchmark-based engineering approach to evaluation across various disciplines, which we refer to as benchmarkology.

Submitted 19 March, 2024; originally announced April 2024.

Comments: 29 pages, 16 figures, and 2 tables

arXiv:2403.15049 [pdf, other]

Continual Vision-and-Language Navigation

Authors: Seongjun Jeong, Gi-Cheon Kang, Seongho Choi, Joochan Kim, Byoung-Tak Zhang

Abstract: Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and the visual information they observe. Existing methods for training VLN agents presuppose fixed datasets, leading to a significant limitation: the introduction of new environments necessitates retraining with previously encountered environments to preserve their knowledge. This makes it difficult to train VLN agents that operate in the ever-changing real world. To address this limitation, we present the Continual Vision-and-Language Navigation (CVLN) paradigm, designed to evaluate agents trained through a continual learning process. For the training and evaluation of CVLN agents, we re-arrange existing VLN datasets to propose two datasets: CVLN-I, focused on navigation via initial-instruction interpretation, and CVLN-D, aimed at navigation through dialogue with other agents. Furthermore, we propose two novel rehearsal-based methods for CVLN, Perplexity Replay (PerpR) and Episodic Self-Replay (ESR). PerpR prioritizes replaying challenging episodes based on action perplexity, while ESR replays previously predicted action logits to preserve learned behaviors. We demonstrate the effectiveness of the proposed methods on CVLN through extensive experiments.

Submitted 22 March, 2024; originally announced March 2024.
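
The idea of prioritizing replay by action perplexity, mentioned in the abstract above, can be sketched as follows. The abstract does not give the exact formula or buffer policy, so this assumes the standard definition of perplexity (the exponential of the mean negative log-probability of the actions taken) and a simple top-k selection. The function names and the episode dictionary are hypothetical.

```python
import math


def action_perplexity(action_probs):
    """Perplexity of one episode, given the probability the agent
    assigned to each ground-truth action (assumed definition)."""
    nll = -sum(math.log(p) for p in action_probs) / len(action_probs)
    return math.exp(nll)


def select_replay(episodes, k):
    """Keep the k episodes with the highest action perplexity,
    i.e. the ones the agent found most challenging."""
    ranked = sorted(
        episodes.items(),
        key=lambda item: action_perplexity(item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical episodes: name -> per-step probability of the correct action.
episodes = {
    "easy": [0.9, 0.9],
    "mid": [0.5, 0.5],
    "hard": [0.2, 0.1],
}
replay_buffer = select_replay(episodes, k=2)
```

With these toy numbers the two lowest-confidence episodes are retained for replay.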

arXiv:2401.16808 [pdf, other]

Encoding Temporal Statistical-space Priors via Augmented Representation

Authors: Insu Choi, Woosung Koh, Gimin Kang, Yuntae Jang, Woo Chang Kim

Abstract: Modeling time series data remains a pervasive issue as the temporal dimension is inherent to numerous domains. Despite significant strides in time series forecasting, high noise-to-signal ratio, non-normality, non-stationarity, and lack of data continue challenging practitioners. In response, we leverage a simple representation augmentation technique to overcome these challenges. Our augmented representation acts as a statistical-space prior encoded at each time step. In response, we name our method Statistical-space Augmented Representation (SSAR). The underlying high-dimensional data-generating process inspires our representation augmentation. We rigorously examine the empirical generalization performance on two data sets with two downstream temporal learning algorithms. Our approach significantly beats all five up-to-date baselines. Moreover, the highly modular nature of our approach can easily be applied to various settings. Lastly, fully-fledged theoretical perspectives are available throughout the writing for a clear and rigorous understanding.

Submitted 3 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: pre-print

arXiv:2312.14611 [pdf, other]

Tuning-Free Inversion-Enhanced Control for Consistent Image Editing

Authors: Xiaoyue Duan, Shuhao Cui, Guoliang Kang, Baochang Zhang, Zhengcong Fei, Mingyuan Fan, Junshi Huang

Abstract: Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) to the main objects in the input image without changing their identity or attributes. To guarantee consistent attributes, some existing methods fine-tune the entire model or the textual embedding for structural consistency, but they are time-consuming and fail to perform non-rigid edits. Other works are tuning-free, but their performances are weakened by the quality of Denoising Diffusion Implicit Model (DDIM) reconstruction, which often fails in real-world scenarios. In this paper, we present a novel approach called Tuning-free Inversion-enhanced Control (TIC), which directly correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction. Specifically, our method effectively obtains inversion features from the key and value features in the self-attention layers, and enhances the sampling process by these inversion features, thus achieving accurate reconstruction and content-consistent editing. To extend the applicability of our method to general editing scenarios, we also propose a mask-guided attention concatenation strategy that combines contents from both the inversion and the naive DDIM editing processes. Experiments show that the proposed method outperforms previous works in reconstruction and consistent editing, and produces impressive results in various settings.

Submitted 22 December, 2023; originally announced December 2023.

arXiv:2311.13326 [pdf, other]

Curriculum Learning and Imitation Learning for Model-free Control on Financial Time-series

Authors: Woosung Koh, Insu Choi, Yuntae Jang, Gimin Kang, Woo Chang Kim

Abstract: Curriculum learning and imitation learning have been leveraged extensively in the robotics domain. However, minimal research has been done on leveraging these ideas on control tasks over highly stochastic time-series data. Here, we theoretically and empirically explore these approaches in a representative control task over complex time-series data. We implement the fundamental ideas of curriculum learning via data augmentation, while imitation learning is implemented via policy distillation from an oracle. Our findings reveal that curriculum learning should be considered a novel direction in improving control-task performance over complex time-series. Our ample random-seed out-sample empirics and ablation studies are highly encouraging for curriculum learning for time-series control. These findings are especially encouraging as we tune all overlapping hyperparameters on the baseline -- giving an advantage to the baseline. On the other hand, we find that imitation learning should be used with caution.

Submitted 12 January, 2024; v1 submitted 22 November, 2023; originally announced November 2023.

Comments: AAAI 2024 AI4TS Workshop Oral

arXiv:2311.00353 [pdf, other]

LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation

Authors: Yuxiang Bao, Di Qiu, Guoliang Kang, Baochang Zhang, Bo **, Kaiye Wang, Pengfei Yan

Abstract: Leveraging the generative ability of image diffusion models offers great potential for zero-shot video-to-video translation. The key lies in how to maintain temporal consistency across generated video frames by image diffusion models. Previous methods typically adopt cross-frame attention, i.e., sharing the key and value tokens across attentions of different frames, to encourage the temporal consistency. However, in those works, temporal inconsistency issue may not be thoroughly solved, rendering the fidelity of generated videos limited. In this paper, we find the bottleneck lies in the unconstrained query tokens and propose a new zero-shot video-to-video translation framework, named LatentWarp. Our approach is simple: to constrain the query tokens to be temporally consistent, we further incorporate a warping operation in the latent space to constrain the query tokens. Specifically, based on the optical flow obtained from the original video, we warp the generated latent features of last frame to align with the current frame during the denoising process. As a result, the corresponding regions across the adjacent frames can share closely-related query tokens and attention outputs, which can further improve latent-level consistency to enhance visual temporal coherence of generated videos.
Extensive experiment results demonstrate the superiority of LatentWarp in achieving video-to-video translation with temporal coherence.

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.19202 [pdf]

Improved Motor Imagery Classification Using Adaptive Spatial Filters Based on Particle Swarm Optimization Algorithm

Authors: Xiong Xiong, Ying Wang, Tianyuan Song, **guo Huang, Guixia Kang

Abstract: As a typical self-paced brain-computer interface (BCI) system, the motor imagery (MI) BCI has been widely applied in fields such as robot control, stroke rehabilitation, and assistance for patients with stroke or spinal cord injury. Many studies have focused on the traditional spatial filters obtained through the common spatial pattern (CSP) method. However, the CSP method can only obtain fixed spatial filters for specific input signals. Besides, CSP method only focuses on the variance difference of two types of electroencephalogram (EEG) signals, so the decoding ability of EEG signals is limited. To obtain more effective spatial filters for better extraction of spatial features that can improve classification to MI-EEG, this paper proposes an adaptive spatial filter solving method based on particle swarm optimization algorithm (PSO). A training and testing framework based on filter bank and spatial filters (FBCSP-ASP) is designed for MI EEG signal classification. Comparative experiments are conducted on two public datasets (2a and 2b) from BCI competition IV, which show the outstanding average recognition accuracy of FBCSP-ASP. The proposed method has achieved significant performance improvement on MI-BCI. The classification accuracy of the proposed method has reached 74.61% and 81.19% on datasets 2a and 2b, respectively. Compared with the baseline algorithm (FBCSP), the proposed algorithm improves 11.44% and 7.11% on two datasets respectively.
Furthermore, the analysis based on mutual information, t-SNE and Shapley values further proves that ASP features have excellent decoding ability for MI-EEG signals, and explains the improvement of classification performance by the introduction of ASP features.

Submitted 29 October, 2023; originally announced October 2023.

Comments: 25 pages, 8 figures

arXiv:2310.19198 [pdf]

Enhancing Motor Imagery Decoding in Brain Computer Interfaces using Riemann Tangent Space Mapping and Cross Frequency Coupling

Authors: Xiong Xiong, Li Su, **guo Huang, Guixia Kang

Abstract: Objective: Motor Imagery (MI) serves as a crucial experimental paradigm within the realm of Brain Computer Interfaces (BCIs), aiming to decode motor intentions from electroencephalogram (EEG) signals. Method: Drawing inspiration from Riemannian geometry and Cross-Frequency Coupling (CFC), this paper introduces a novel approach termed Riemann Tangent Space Mapping using Dichotomous Filter Bank with Convolutional Neural Network (DFBRTS) to enhance the representation quality and decoding capability pertaining to MI features. DFBRTS first initiates the process by meticulously filtering EEG signals through a Dichotomous Filter Bank, structured in the fashion of a complete binary tree. Subsequently, it employs Riemann Tangent Space Mapping to extract salient EEG signal features within each sub-band. Finally, a lightweight convolutional neural network is employed for further feature extraction and classification, operating under the joint supervision of cross-entropy and center loss. To validate the efficacy, extensive experiments were conducted using DFBRTS on two well-established benchmark datasets: the BCI competition IV 2a (BCIC-IV-2a) dataset and the OpenBMI dataset. The performance of DFBRTS was benchmarked against several state-of-the-art MI decoding methods, alongside other Riemannian geometry-based MI decoding approaches.
Results: DFBRTS significantly outperforms other MI decoding algorithms on both datasets, achieving a remarkable classification accuracy of 78.16% for four-class and 71.58% for two-class hold-out classification, as compared to the existing benchmarks.

Submitted 29 October, 2023; originally announced October 2023.

Comments: 22 pages, 7 figures

arXiv:2310.12547 [pdf, other]

PGA: Personalizing Grasping Agents with Single Human-Robot Interaction

Authors: Junghyun Kim, Gi-Cheon Kang, Jaein Kim, Seoyun Yang, Minjoon Jung, Byoung-Tak Zhang

Abstract: Language-Conditioned Robotic Grasping (LCRG) aims to develop robots that comprehend and grasp objects based on natural language instructions. While the ability to understand personal objects like my wallet facilitates more natural interaction with human users, current LCRG systems only allow generic language instructions, e.g., the black-colored wallet next to the laptop. To this end, we introduce a task scenario GraspMine alongside a novel dataset aimed at pinpointing and grasping personal objects given personal indicators via learning from a single human-robot interaction, rather than a large labeled dataset. Our proposed method, Personalized Grasping Agent (PGA), addresses GraspMine by leveraging the unlabeled image data of the user's environment, called Reminiscence. Specifically, PGA acquires personal object information by a user presenting a personal object with its associated indicator, followed by PGA inspecting the object by rotating it. Based on the acquired information, PGA pseudo-labels objects in the Reminiscence by our proposed label propagation algorithm. Harnessing the information acquired from the interactions and the pseudo-labeled objects in the Reminiscence, PGA adapts the object grounding model to grasp personal objects. This results in significant efficiency while previous LCRG systems rely on resource-intensive human annotations -- necessitating hundreds of labeled data to learn my wallet.
Moreover, PGA outperforms baseline methods across all metrics and even shows comparable performance compared to the fully-supervised method, which learns from 9k annotated data samples. We further validate PGA's real-world applicability by employing a physical robot to execute GraspMine. Code and data are publicly available at https://github.com/JHKim-snu/PGA.

Submitted 19 March, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

Comments: 8 pages, under review

arXiv:2309.07759 [pdf, other]

PROGrasp: Pragmatic Human-Robot Communication for Object Grasping

Authors: Gi-Cheon Kang, Junghyun Kim, Jaein Kim, Byoung-Tak Zhang

Abstract: Interactive Object Grasping (IOG) is the task of identifying and grasping the desired object via human-robot natural language interaction. Current IOG systems assume that a human user initially specifies the target object's category (e.g., bottle). Inspired by pragmatics, where humans often convey their intentions by relying on context to achieve goals, we introduce a new IOG task, Pragmatic-IOG, and the corresponding dataset, Intention-oriented Multi-modal Dialogue (IM-Dial). In our proposed task scenario, an intention-oriented utterance (e.g., "I am thirsty") is initially given to the robot. The robot should then identify the target object by interacting with a human user. Based on the task setup, we propose a new robotic system that can interpret the user's intention and pick up the target object, Pragmatic Object Grasping (PROGrasp). PROGrasp performs Pragmatic-IOG by incorporating modules for visual grounding, question asking, object grasping, and most importantly, answer interpretation for pragmatic inference. Experimental results show that PROGrasp is effective in offline (i.e., target object discovery) and online (i.e., IOG with a physical robot arm) settings. Code and data are available at https://github.com/gicheonkang/prograsp.

Submitted 5 April, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: ICRA 2024

arXiv:2308.16529 [pdf]

Developing Social Robots with Empathetic Non-Verbal Cues Using Large Language Models

Authors: Yoon Kyung Lee, Yoonwon Jung, Gyuyi Kang, Sowon Hahn

Abstract: We propose augmenting the empathetic capacities of social robots by integrating non-verbal cues. Our primary contribution is the design and labeling of four types of empathetic non-verbal cues, abbreviated as SAFE: Speech, Action (gesture), Facial expression, and Emotion, in a social robot. These cues are generated using a Large Language Model (LLM). We developed an LLM-based conversational system for the robot and assessed its alignment with social cues as defined by human counselors. Preliminary results show distinct patterns in the robot's responses, such as a preference for calm and positive social emotions like 'joy' and 'lively', and frequent nodding gestures. Despite these tendencies, our approach has led to the development of a social robot capable of context-aware and more authentic interactions. Our work lays the groundwork for future studies on human-robot interactions, emphasizing the essential role of both verbal and non-verbal cues in creating social and empathetic robots.

Submitted 31 August, 2023; originally announced August 2023.

Journal ref: In Proceedings of 2023 IEEE International Conference on Robot & Human Interactive Communication (RO-MAN)

arXiv:2307.05963 [pdf, other]

GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation

Authors: Junghyun Kim, Gi-Cheon Kang, Jaein Kim, Suyeon Shin, Byoung-Tak Zhang

Abstract: Language-Guided Robotic Manipulation (LGRM) is a challenging task as it requires a robot to understand human instructions to manipulate everyday objects. Recent approaches in LGRM rely on pre-trained Visual Grounding (VG) models to detect objects without adapting to manipulation environments. This results in a performance drop due to a substantial domain gap between the pre-training and real-world data. A straightforward solution is to collect additional training data, but the cost of human-annotation is extortionate. In this paper, we propose Grounding Vision to Ceaselessly Created Instructions (GVCCI), a lifelong learning framework for LGRM, which continuously learns VG without human supervision. GVCCI iteratively generates synthetic instruction via object detection and trains the VG model with the generated data. We validate our framework in offline and online settings across diverse environments on different VG models. Experimental results show that accumulating synthetic data from GVCCI leads to a steady improvement in VG by up to 56.7% and improves resultant LGRM by up to 29.4%. Furthermore, the qualitative analysis shows that the unadapted VG model often fails to find correct objects due to a strong bias learned from the pre-training data. Finally, we introduce a novel VG dataset for LGRM, consisting of nearly 252k triplets of image-object-instruction from diverse manipulation environments.

Submitted 12 July, 2023; originally announced July 2023.

Comments: Accepted at IROS2023

arXiv:2307.04422 [pdf, other]

A Versatile Door Opening System with Mobile Manipulator through Adaptive Position-Force Control and Reinforcement Learning

Authors: Gyuree Kang, Hyunki Seong, Daegyu Lee, D. Hyunchul Shim

Abstract: The ability of robots to navigate through doors is crucial for their effective operation in indoor environments. Consequently, extensive research has been conducted to develop robots capable of opening specific doors. However, the diverse combinations of door handles and opening directions necessitate a more versatile door opening system for robots to successfully operate in real-world environments. In this paper, we propose a mobile manipulator system that can autonomously open various doors without prior knowledge. By using convolutional neural networks, point cloud extraction techniques, and external force measurements during exploratory motion, we obtained information regarding handle types, poses, and door characteristics. Through two different approaches, adaptive position-force control and deep reinforcement learning, we successfully opened doors without precise trajectory or excessive external force. The adaptive position-force control method involves moving the end-effector in the direction of the door opening while responding compliantly to external forces, ensuring safety and manipulator workspace. Meanwhile, the deep reinforcement learning policy minimizes applied forces and eliminates unnecessary movements, enabling stable operation across doors with different poses and widths. The RL-based approach outperforms the adaptive position-force control method in terms of compensating for external forces, ensuring smooth motion, and achieving efficient speed. It reduces the maximum force required by 3.27 times and improves motion smoothness by 1.82 times. However, the non-learning-based adaptive position-force control method demonstrates more versatility in opening a wider range of doors, encompassing revolute doors with four distinct opening directions and varying widths.

Submitted 10 July, 2023; originally announced July 2023.

arXiv:2307.00965 [pdf, other]

OpenClinicalAI: An Open and Dynamic Model for Alzheimer's Disease Diagnosis

Authors: Yunyou Huang, Xiaoshuang Liang, Xiangjiang Lu, Xiuxia Miao, Jiyue Xie, Wenjing Liu, Fan Zhang, Guoxin Kang, Li Ma, Suqin Tang, Zhifei Zhang, Jianfeng Zhan

Abstract: Although Alzheimer's disease (AD) cannot be reversed or cured, timely diagnosis can significantly reduce the burden of treatment and care. Current research on AD diagnosis models usually regards the diagnosis task as a typical classification task with two primary assumptions: 1) All target categories are known a priori; 2) The diagnostic strategy for each patient is consistent, that is, the number and type of model input data for each patient are the same. However, real-world clinical settings are open, with complexity and uncertainty in terms of both subjects and the resources of the medical institutions. This means that diagnostic models may encounter unseen disease categories and need to dynamically develop diagnostic strategies based on the subject's specific circumstances and available medical resources. Thus, the AD diagnosis task is tangled and coupled with the diagnosis strategy formulation. To promote the application of diagnostic systems in real-world clinical settings, we propose OpenClinicalAI for direct AD diagnosis in complex and uncertain clinical settings. This is the first powerful end-to-end model to dynamically formulate diagnostic strategies and provide diagnostic results based on the subject's conditions and available medical resources. OpenClinicalAI combines reciprocally coupled deep multiaction reinforcement learning (DMARL) for diagnostic strategy formulation and multicenter meta-learning (MCML) for open-set recognition. The experimental results show that OpenClinicalAI achieves better performance and fewer clinical examinations than the state-of-the-art model. Our method provides an opportunity to embed the AD diagnostic system into the current health care system to cooperate with clinicians to improve current health care.

Submitted 3 July, 2023; originally announced July 2023.

Comments: Real-world clinical setting,Alzheimer's disease,diagnose,AI,deep learning. arXiv admin note: text overlap with arXiv:2109.04004

arXiv:2305.14773 [pdf, other]

doi 10.1109/ICRA48891.2023.10161518

Robust Imaging Sonar-based Place Recognition and Localization in Underwater Environments

Authors: Hogyun Kim, Gilhwan Kang, Seokhwan Jeong, Seungjun Ma, Younggun Cho

Abstract: Place recognition using SOund Navigation and Ranging (SONAR) images is an important task for simultaneous localization and mapping (SLAM) in underwater environments. This paper proposes a robust and efficient imaging SONAR based place recognition, SONAR context, and loop closure method. Unlike previous methods, our approach encodes geometric information based on the characteristics of raw SONAR measurements without prior knowledge or training. We also design a hierarchical searching procedure for fast retrieval of candidate SONAR frames and apply adaptive shifting and padding to achieve robust matching on rotation and translation changes. In addition, we can derive the initial pose through adaptive shifting and apply it to the iterative closest point (ICP) based loop closure factor. We evaluate the performance of SONAR context in the various underwater sequences such as simulated open water, real water tank, and real underwater environments. The proposed approach shows the robustness and improvements of place recognition on various datasets and evaluation metrics. Supplementary materials are available at https://github.com/sparolab/sonar_context.git.

Submitted 24 May, 2023; originally announced May 2023.

Comments: 7 pages, 8 figures

arXiv:2305.11488 [pdf, other]

AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning

Authors: Runqi Wang, Xiaoyue Duan, Guoliang Kang, Jianzhuang Liu, Shaohui Lin, Songcen Xu, Jinhu Lv, Baochang Zhang

Abstract: Continual learning aims to enable a model to incrementally learn knowledge from sequentially arrived data. Previous works adopt the conventional classification architecture, which consists of a feature extractor and a classifier. The feature extractor is shared across sequentially arrived tasks or classes, but one specific group of weights of the classifier corresponding to one new class should be incrementally expanded. Consequently, the parameters of a continual learner gradually increase. Moreover, as the classifier contains all historical arrived classes, a certain size of the memory is usually required to store rehearsal data to mitigate classifier bias and catastrophic forgetting. In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks. Specifically, AttriCLIP is built upon the pre-trained visual-language model CLIP. Its image encoder and text encoder are fixed to extract features from both images and text. Text consists of a category name and a fixed number of learnable parameters which are selected from our designed attribute word bank and serve as attributes. As we compute the visual and textual similarity for classification, AttriCLIP is a non-incremental learner. The attribute prompts, which encode the common knowledge useful for classification, can effectively mitigate the catastrophic forgetting and avoid constructing a replay memory. We evaluate our AttriCLIP and compare it with CLIP-based and previous state-of-the-art continual learning methods in realistic settings with domain-shift and long-sequence learning. The results show that our method performs favorably against previous state-of-the-arts. The implementation code is available at https://github.com/bhrqw/AttriCLIP.

Submitted 20 March, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

arXiv:2305.07945 [pdf, other]

Deep Learning-based Data-aided Activity Detection with Extraction Network in Grant-free Sparse Code Multiple Access Systems

Authors: Minsig Han, Ameha T. Abebe, Chung G. Kang

Abstract: This letter proposes a deep learning-based data-aided active user detection network (D-AUDN) for grant-free sparse code multiple access (SCMA) systems that leverages both SCMA codebook and Zadoff-Chu preamble for activity detection. Due to disparate data and preamble distribution as well as codebook collision, existing D-AUDNs experience performance degradation when multiple preambles are associated with each codebook. To address this, a user activity extraction network (UAEN) is integrated within the D-AUDN to extract a-priori activity information from the codebook, improving activity detection of the associated preambles. Additionally, efficient SCMA codebook design and Zadoff-Chu preamble association are considered to further enhance performance.

Submitted 19 May, 2023; v1 submitted 13 May, 2023; originally announced May 2023.
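
Note: the abstract above relies on Zadoff-Chu preambles. As a minimal sketch of what such a sequence looks like, the code below generates the standard odd-length Zadoff-Chu sequence and checks its cyclic autocorrelation. It is not taken from the paper: the function names, the root u = 5, and the length 13 are illustrative assumptions, and the paper's actual preamble design and association may differ.

```python
import cmath
from math import gcd, pi


def zadoff_chu(u: int, n_zc: int) -> list[complex]:
    """Root-u Zadoff-Chu sequence of odd length n_zc (textbook definition)."""
    if n_zc % 2 == 0:
        raise ValueError("this sketch only covers odd sequence lengths")
    if gcd(u, n_zc) != 1:
        raise ValueError("root u must be coprime with n_zc")
    return [cmath.exp(-1j * pi * u * n * (n + 1) / n_zc) for n in range(n_zc)]


def cyclic_autocorrelation(seq: list[complex], lag: int) -> complex:
    """Correlate seq with a cyclically shifted copy of itself."""
    n = len(seq)
    return sum(seq[k] * seq[(k + lag) % n].conjugate() for k in range(n))


# Hypothetical parameters chosen only for illustration.
seq = zadoff_chu(u=5, n_zc=13)
peak = abs(cyclic_autocorrelation(seq, 0))      # equals the sequence length
sidelobe = abs(cyclic_autocorrelation(seq, 3))  # numerically close to zero
```

Every sample has unit magnitude and the autocorrelation vanishes at non-zero lags, which is the property that generally makes these sequences attractive as preambles.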

arXiv:2304.11609 [pdf]

PiClick: Picking the desired mask from multiple candidates in click-based interactive segmentation

Authors: Cilin Yan, Haochen Wang, Jie Liu, Xiaolong Jiang, Yao Hu, Xu Tang, Guoliang Kang, Efstratios Gavves

Abstract: Click-based interactive segmentation aims to generate target masks via human clicking, which facilitates efficient pixel-level annotation and image editing. In such a task, target ambiguity remains a problem hindering the accuracy and efficiency of segmentation. That is, in scenes with rich context, one click may correspond to multiple potential targets, while most previous interactive segmentors only generate a single mask and fail to deal with target ambiguity. In this paper, we propose a novel interactive segmentation network named PiClick, to yield all potentially reasonable masks and suggest the most plausible one for the user. Specifically, PiClick utilizes a Transformer-based architecture to generate all potential target masks by mutually interactive mask queries. Moreover, a Target Reasoning module (TRM) is designed in PiClick to automatically suggest the user-desired mask from all candidates, relieving target ambiguity and extra-human efforts. Extensive experiments on 9 interactive segmentation datasets demonstrate PiClick performs favorably against previous state-of-the-arts considering the segmentation results. Moreover, we show that PiClick effectively reduces human efforts in annotating and picking the desired masks. To ease the usage and inspire future research, we release the source code of PiClick together with a plug-and-play annotation tool at https://github.com/cilinyan/PiClick.

Submitted 17 June, 2024; v1 submitted 23 April, 2023; originally announced April 2023.

arXiv:2303.05118 [pdf, other]

SLCA: Slow Learner with Classifier Alignment for Continual Learning on a Pre-trained Model

Authors: Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, Yunchao Wei

Abstract: The goal of continual learning is to improve the performance of recognition models in learning sequentially arrived data. Although most existing works are established on the premise of learning from scratch, growing efforts have been devoted to incorporating the benefits of pre-training. However, how to adaptively exploit the pre-trained knowledge for each incremental task while maintaining its generalizability remains an open question. In this work, we present an extensive analysis for continual learning on a pre-trained model (CLPM), and attribute the key challenge to a progressive overfitting problem. Observing that selectively reducing the learning rate can almost resolve this issue in the representation layer, we propose a simple but extremely effective approach named Slow Learner with Classifier Alignment (SLCA), which further improves the classification layer by modeling the class-wise distributions and aligning the classification layers in a post-hoc fashion. Across a variety of scenarios, our proposal provides substantial improvements for CLPM (e.g., up to 49.76%, 50.05%, 44.69% and 40.16% on Split CIFAR-100, Split ImageNet-R, Split CUB-200 and Split Cars-196, respectively), and thus outperforms state-of-the-art approaches by a large margin. Based on such a strong baseline, critical factors and promising directions are analyzed in-depth to facilitate subsequent research. Code has been made available at: https://github.com/GengDavid/SLCA.

Submitted 9 October, 2023; v1 submitted 9 March, 2023; originally announced March 2023.

Comments: ICCV 2023, code released

arXiv:2302.12954 [pdf, other]

WPC: Whole-picture Workload Characterization

Authors: Lei Wang, Kaiyong Yang, Chenxi Wang, Wanling Gao, Chunjie Luo, Fan Zhang, Zhongxin Ge, Li Zhang, Guoxin Kang, Jianfeng Zhan

Abstract: This article raises an important and challenging workload characterization issue: can we uncover each critical component across the stacks contributing what percentages to any specific bottleneck? The typical critical components include languages, programming frameworks, runtime environments, instruction set architectures (ISA), operating systems (OS), and microarchitecture. Tackling this issue could help propose a systematic methodology to guide the software and hardware co-design and critical component optimizations. We propose a whole-picture workload characterization (WPC) methodology to answer the above issue. In essence, WPC is an iterative ORFE loop consisting of four steps: Observation, Reference, Fusion, and Exploration. WPC observes different level data (observation), fuses and normalizes the performance data (fusion) with respect to the well-designed standard reference workloads suite (reference), and explores the software and hardware co-design space (exploration) to investigate the impacts of critical components across the stacks. We build and open-source the WPC tool. Our evaluations confirm WPC can quantitatively reveal the contributions of the language, framework, runtime environment, ISA, OS, and microarchitecture to the primary pipeline efficiency.

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2302.09927 [pdf, other]

NHtapDB: Native HTAP Databases

Authors: Guoxin Kang, Lei Wang, Simin Chen, Jianfeng Zhan

Abstract: Native database (1) provides a near-data machine learning framework to facilitate generating real-time business insight, and predefined change thresholds will trigger online training and deployment of new models, and (2) offers a mixed-format store to guarantee the performance of HTAP workloads, especially the hybrid workloads that consist of OLAP queries in-between online transactions. We make rigorous test plans for native database with an enhanced state-of-the-art HTAP benchmark.

Submitted 20 February, 2023; originally announced February 2023.

arXiv:2212.00721 [pdf, other]

High fusion computers: The IoTs, edges, data centers, and humans-in-the-loop as a computer

Authors: Wanling Gao, Lei Wang, Mingyu Chen, Jin Xiong, Chunjie Luo, Wenli Zhang, Yunyou Huang, Weiping Li, Guoxin Kang, Chen Zheng, Biwei Xie, Shaopeng Dai, Qian He, Hainan Ye, Yungang Bao, Jianfeng Zhan

Abstract: Emerging and future applications rely heavily upon systems consisting of Internet of Things (IoT), edges, data centers, and humans-in-the-loop. Significantly different from warehouse-scale computers that serve independent concurrent user requests, this new class of computer systems directly interacts with the physical world, considering humans an essential part and performing safety-critical and mission-critical operations; their computations have intertwined dependencies between not only adjacent execution loops but also actions or decisions triggered by IoTs, edge, datacenters, or humans-in-the-loop; the systems must first satisfy the accuracy metric in predicting, interpreting, or taking action before meeting the performance goal under different cases. This article argues we need a paradigm shift to reconstruct the IoTs, edges, data centers, and humans-in-the-loop as a computer rather than a distributed system. We coin a new term, high fusion computers (HFCs), to describe this class of systems. The fusion in the term has two implications: fusing IoTs, edges, data centers, and humans-in-the-loop as a computer, fusing the physical and digital worlds through HFC systems. HFC is a pivotal case of the open-source computer systems initiative. We laid out the challenges, plan, and call for uniting our community's wisdom and actions to address the HFC challenges. Everything, including the source code, will be publicly available from the project homepage: https://www.computercouncil.org/HFC/.

Submitted 18 November, 2022; originally announced December 2022.

Comments: This paper has been published in BenchCouncil Transactions on Benchmarks, Standards and Evaluations (TBench). Link: https://www.sciencedirect.com/science/article/pii/S277248592200062X

Journal ref: BenchCouncil Transactions on Benchmarks, Standards and Evaluations (2022)

arXiv:2211.15180 [pdf, other]

Rethinking the Number of Shots in Robust Model-Agnostic Meta-Learning

Authors: Xiaoyue Duan, Guoliang Kang, Runqi Wang, Shumin Han, Song Xue, Tian Wang, Baochang Zhang

Abstract: Robust Model-Agnostic Meta-Learning (MAML) is usually adopted to train a meta-model which may fast adapt to novel classes with only a few exemplars and meanwhile remain robust to adversarial attacks. The conventional solution for robust MAML is to introduce robustness-promoting regularization during meta-training stage. With such a regularization, previous robust MAML methods simply follow the typical MAML practice that the number of training shots should match with the number of test shots to achieve an optimal adaptation performance. However, although the robustness can be largely improved, previous methods sacrifice clean accuracy a lot. In this paper, we observe that introducing robustness-promoting regularization into MAML reduces the intrinsic dimension of clean sample features, which results in a lower capacity of clean representations. This may explain why the clean accuracy of previous robust MAML methods drops severely. Based on this observation, we propose a simple strategy, i.e., increasing the number of training shots, to mitigate the loss of intrinsic dimension caused by robustness-promoting regularization. Though simple, our method remarkably improves the clean accuracy of MAML without much loss of robustness, producing a robust yet accurate model. Extensive experiments demonstrate that our method outperforms prior arts in achieving a better trade-off between accuracy and robustness. Besides, we observe that our method is less sensitive to the number of fine-tuning steps during meta-training, which allows for a reduced number of fine-tuning steps to improve training efficiency.

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2210.17302 [pdf, other]

Design, Field Evaluation, and Traffic Analysis of a Competitive Autonomous Driving Model in a Congested Environment

Authors: Daegyu Lee, Hyunki Seong, Seungil Han, Gyuree Kang, D. Hyunchul Shim, Yoonjin Yoon

Abstract: Recently, numerous studies have investigated cooperative traffic systems using the communication among vehicle-to-everything (V2X). Unfortunately, when multiple autonomous vehicles are deployed while exposed to communication failure, there might be a conflict of ideal conditions between various autonomous vehicles leading to adversarial situation on the roads. In South Korea, virtual and real-world urban autonomous multi-vehicle races were held in March and November of 2021, respectively. During the competition, multiple vehicles were involved simultaneously, which required maneuvers such as overtaking low-speed vehicles, negotiating intersections, and obeying traffic laws. In this study, we introduce a fully autonomous driving software stack to deploy a competitive driving model, which enabled us to win the urban autonomous multi-vehicle races. We evaluate module-based systems such as navigation, perception, and planning in real and virtual environments. Additionally, an analysis of traffic is performed after collecting multiple vehicle position data over communication to gain additional insight into a multi-agent autonomous driving scenario. Finally, we propose a method for analyzing traffic in order to compare the spatial distribution of multiple autonomous vehicles. We study the similarity distribution between each team's driving log data to determine the impact of competitive autonomous driving on the traffic environment.

Submitted 6 November, 2022; v1 submitted 31 October, 2022; originally announced October 2022.

arXiv:2208.10725 [pdf, other]

DRL-based Distributed Resource Allocation for Edge Computing in Cell-Free Massive MIMO Network

Authors: Fitsum Debebe Tilahun, Ameha Tsegaye Abebe, Chung G. Kang

Abstract: In this paper, with the aim of addressing the stringent computing and quality-of-service (QoS) requirements of recently introduced advanced multimedia services, we consider a cell-free massive MIMO-enabled mobile edge network. In particular, benefited from the reliable cell-free links to offload intensive computation to the edge server, resource-constrained end-users can augment on-board (local) processing with edge computing. To this end, we formulate a joint communication and computing resource allocation (JCCRA) problem to minimize the total energy consumption of the users, while meeting the respective user-specific deadlines. To tackle the problem, we propose a fully distributed solution approach based on cooperative multi-agent reinforcement learning framework, wherein each user is implemented as a learning agent to make joint resource allocation relying on local information only. The simulation results demonstrate that the performance of the proposed distributed approach outperforms the heuristic baselines, converging to a centralized target benchmark, without resorting to large overhead. Moreover, we showed that the proposed algorithm has performed significantly better in cell-free system as compared with the cellular MEC systems, e.g., a small cell-based MEC system.

Submitted 23 August, 2022; originally announced August 2022.

Comments: 6 pages, 4 figures, conference. arXiv admin note: substantial text overlap with arXiv:2201.09057

arXiv:2208.08128 [pdf, other]

On the Performance of Deep Learning-based Data-aided Active User Detection for GF-SCMA System

Authors: Minsig Han, Ameha Tsegaye Abebe, Chung G. Kang

Abstract: The recent works on a deep learning (DL)-based joint design of preamble set for the transmitters and data-aided active user detection (AUD) in the receiver has demonstrated a significant performance improvement for grant-free sparse code multiple access (GF-SCMA) system. The autoencoder for the joint design can be trained only in a given environment, but in an actual situation where the operating environment is constantly changing, it is difficult to optimize the preamble set for every possible environment. Therefore, a conventional, yet general approach may implement the data-aided AUD while relying on the preamble set that is designed independently rather than the joint design. In this paper, the activity detection error rate (ADER) performance of the data-aided AUD subject to the two preamble designs, i.e., independently designed preamble and jointly designed preamble, were directly compared. Fortunately, it was found that the performance loss in the data-aided AUD induced by the independent preamble design is limited to only 1dB. Furthermore, such performance characteristics of jointly designed preamble set is interpreted through average cross-correlation among the preambles associated with the same codebook (CB) (average intra-CB cross-correlation) and average cross-correlation among preambles associated with the different CBs (average inter-CB cross-correlation).

Submitted 5 September, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

arXiv:2205.12502 [pdf, other]

The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training

Authors: Gi-Cheon Kang, Sungdong Kim, Jin-Hwa Kim, Donghyun Kwak, Byoung-Tak Zhang

Abstract: Visual dialog (VisDial) is a task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has trained the dialog agents solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), to leverage unlabeled images on the Web. Specifically, GST first retrieves in-domain images through out-of-distribution detection and generates synthetic dialogs regarding the images via multimodal conditional text generation. GST then trains a dialog agent on the synthetic and the original VisDial data. As a result, GST scales the amount of training data up to an order of magnitude that of VisDial (1.2M to 12.9M QA data). For robust training of the synthetic dialogs, we also propose perplexity-based data selection and multimodal consistency regularization. Evaluation on VisDial v1.0 and v0.9 datasets shows that GST achieves new state-of-the-art results on both datasets. We further observe the robustness of GST against both visual and textual adversarial attacks. Finally, GST yields strong performance gains in the low-data regime. Code is available at https://github.com/gicheonkang/gst-visdial.

Submitted 2 March, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: CVPR 2023
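
The abstract above mentions perplexity-based data selection without giving details. The following is a minimal sketch of the general idea only, not the authors' implementation: perplexity is computed as the exponential of the mean negative log-likelihood of a generated sequence, and synthetic samples above a cutoff are dropped. The function names, the `logprobs` field, and the cutoff `max_ppl` are all hypothetical, and the "keep low-perplexity samples" rule is an assumption.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_by_perplexity(samples, max_ppl):
    """Keep samples whose perplexity is at most max_ppl (assumed selection rule)."""
    return [s for s in samples if perplexity(s["logprobs"]) <= max_ppl]


# Hypothetical synthetic dialogs: "a" was generated confidently, "b" was not.
samples = [
    {"id": "a", "logprobs": [math.log(0.5)] * 4},   # perplexity about 2
    {"id": "b", "logprobs": [math.log(0.01)] * 4},  # perplexity about 100
]
kept = select_by_perplexity(samples, max_ppl=10.0)
kept_ids = [s["id"] for s in kept]
```

With the illustrative cutoff of 10, only the confidently generated sample survives the filter.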

arXiv:2205.10780 [pdf, other]

Data-aided Active User Detection with a User Activity Extraction Network for Grant-free SCMA Systems

Authors: Minsig Han, Ameha T. Abebe, Chung G. Kang

Abstract: In a grant-free sparse code multiple access (GF-SCMA) system, active user detection (AUD) is a major performance bottleneck as it involves a complex combinatorial problem, which makes joint design of contention resources for users and AUD at the receiver a crucial but challenging problem. To this end, we propose autoencoder (AE)-based joint optimization of both preamble generation networks (PGNs) on the encoder side and data-aided AUD on the decoder side. The core architecture of the proposed AE is a novel user activity extraction network (UAEN) in the decoder that extracts a priori user activity information from the SCMA codeword data for the data-aided AUD. An end-to-end training of the proposed AE enables joint optimization of the contention resources, i.e., preamble sequences, each associated with one of the codebooks, and extraction of user activity information from both preamble and SCMA-based data transmission. Furthermore, we propose a self-supervised pre-training scheme for the UAEN prior to the end-to-end training, to ensure the convergence of the UAEN, which lies deep inside the AE network. Simulation results demonstrate that the proposed AUD scheme achieves a 3 to 5 dB gain at a target activity detection error rate of $10^{-3}$ compared to the state-of-the-art DL-based AUD schemes.

Submitted 8 August, 2022; v1 submitted 22 May, 2022; originally announced May 2022.

arXiv:2203.16095 [pdf, other]

doi 10.1109/ICDE53745.2022.00182

OLxPBench: Real-time, Semantically Consistent, and Domain-specific are Essential in Benchmarking, Designing, and Implementing HTAP Systems

Authors: Guoxin Kang, Lei Wang, Wanling Gao, Fei Tang, Jianfeng Zhan

Abstract: As real-time analysis of new data becomes increasingly compelling, more organizations deploy Hybrid Transactional/Analytical Processing (HTAP) systems to support real-time queries on data recently generated by online transaction processing. This paper argues that real-time queries, semantically consistent schema, and domain-specific workloads are essential in benchmarking, designing, and implementing HTAP systems. However, most state-of-the-art and state-of-the-practice benchmarks ignore those critical factors. Hence, they are incommensurable and, at worst, misleading in benchmarking, designing, and implementing HTAP systems. This paper presents OLxPBench, a composite HTAP benchmark suite. OLxPBench proposes: (1) the abstraction of a hybrid transaction, performing a real-time query in-between an online transaction, to model a widely-observed behavior pattern -- making a quick decision while consulting real-time analysis; (2) a semantically consistent schema to express the relationships between OLTP and OLAP schema; (3) the combination of domain-specific and general benchmarks to characterize diverse application scenarios with varying resource demands. Our evaluations justify the three design decisions of OLxPBench and pinpoint the bottlenecks of two mainstream distributed HTAP DBMSs. International Open Benchmark Council (BenchCouncil) sets up the OLxPBench homepage at https://www.benchcouncil.org/olxpbench/. Its source code is available from https://github.com/BenchCouncil/olxpbench.git.

Submitted 5 April, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

Comments: Accepted to ICDE 2022. International Open Benchmark Council (BenchCouncil) sets up the OLxPBench homepage at https://www.benchcouncil.org/olxpbench/

arXiv:2201.09057 [pdf, other]

Multi-Agent Reinforcement Learning for Distributed Joint Communication and Computing Resource Allocation over Cell-Free Massive MIMO-enabled Mobile Edge Computing Network

Authors: Fitsum Debebe Tilahun, Ameha Tsegaye Abebe, Chung G. Kang

Abstract: To support the newly introduced multimedia services with ultra-low latency and extensive computation requirements, resource-constrained end user devices should utilize the ubiquitous computing resources available at network edge for augmenting on-board (local) processing with edge computing. In this regard, the capability of cell-free massive MIMO to provide reliable access links by guaranteeing uniform quality of service without cell edge can be exploited for seamless parallel processing. Taking this into account, we consider a cell-free massive MIMO-enabled mobile edge network to meet the stringent requirements of the advanced services. For the considered mobile edge network, we formulate a joint communication and computing resource allocation (JCCRA) problem with the objective of minimizing energy consumption of the users while meeting the tight delay constraints. We then propose a fully distributed cooperative solution approach based on multiagent deep deterministic policy gradient (MADDPG) algorithm. The simulation results demonstrate that the performance of the proposed distributed approach has converged to that of a centralized deep deterministic policy gradient (DDPG)-based target benchmark, while alleviating the large overhead associated with the latter. Furthermore, it has been shown that our approach significantly outperforms heuristic baselines in terms of energy efficiency, roughly up to 5 times less total energy consumption.

Submitted 1 July, 2023; v1 submitted 3 December, 2021; originally announced January 2022.

arXiv:2201.06618 [pdf, other]

VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer

Authors: Mengshu Sun, Haoyu Ma, Guoliang Kang, Yifan Jiang, Tianlong Chen, Xiaolong Ma, Zhangyang Wang, Yanzhi Wang

Abstract: The transformer architectures with attention mechanisms have obtained success in Natural Language Processing (NLP), and Vision Transformers (ViTs) have recently extended the application domains to various vision tasks. While achieving high performance, ViTs suffer from large model size and high computation complexity that hinder their deployment on edge devices. To achieve high throughput on hardware and preserve the model accuracy simultaneously, we propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized ViTs with binary weights and low-precision activations. Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations as well as the optimized parameter settings of the accelerator that fulfill the hardware requirements. The implementations are developed with Vivado High-Level Synthesis (HLS) on the Xilinx ZCU102 FPGA board, and the evaluation results with the DeiT-base model indicate that a frame rate requirement of 24 frames per second (FPS) is satisfied with 8-bit activation quantization, and a target of 30 FPS is met with 6-bit activation quantization. To the best of our knowledge, this is the first time quantization has been incorporated into ViT acceleration on FPGAs with the help of a fully automatic framework to guide the quantization strategy on the software side and the accelerator implementations on the hardware side given the target frame rate. Very small compilation time cost is incurred compared with quantization training, and the generated accelerators show the capability of achieving real-time execution for state-of-the-art ViT models on FPGAs.

Submitted 18 February, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

arXiv:2109.04004 [pdf, ps, other]

OpenClinicalAI: enabling AI to diagnose diseases in real-world clinical settings

Authors: Yunyou Huang, Nana Wang, Suqin Tang, Li Ma, Tianshu Hao, Zihan Jiang, Fan Zhang, Guoxin Kang, Xiuxia Miao, Xianglong Guan, Ruchang Zhang, Zhifei Zhang, Jianfeng Zhan

Abstract: This paper quantitatively reveals that state-of-the-art and state-of-the-practice AI systems only achieve acceptable performance under the stringent condition that all categories of subjects are known, which we call closed clinical settings, but fail to work in real-world clinical settings. Compared to the diagnosis task in the closed setting, real-world clinical settings pose severe challenges, and we must treat them differently. We build a clinical AI benchmark named Clinical AIBench to set up real-world clinical settings to facilitate research. We propose an open, dynamic machine learning framework and develop an AI system named OpenClinicalAI to diagnose diseases in real-world clinical settings. The first versions of Clinical AIBench and OpenClinicalAI target Alzheimer's disease. In the real-world clinical setting, OpenClinicalAI significantly outperforms the state-of-the-art AI system. In addition, OpenClinicalAI develops personalized diagnosis strategies to avoid unnecessary testing and seamlessly collaborates with clinicians. It is promising to be embedded in the current medical systems to improve medical services.

Submitted 8 September, 2021; originally announced September 2021.

arXiv:2106.10446 [pdf, other]

Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Authors: Ahjeong Seo, Gi-Cheon Kang, Joonhan Park, Byoung-Tak Zhang

Abstract: Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes each output of the motion module and the appearance module as input, and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN. The code is available at https://github.com/ahjeongseo/MASN-pytorch.

Submitted 19 June, 2021; originally announced June 2021.

Comments: ACL 2021

arXiv:2106.02320 [pdf, other]

Few-Shot Segmentation via Cycle-Consistent Transformer

Authors: Gengwei Zhang, Guoliang Kang, Yi Yang, Yunchao Wei

Abstract: Few-shot segmentation aims to train a segmentation model that can fast adapt to novel classes with few exemplars. The conventional training paradigm is to learn to make predictions on query images conditioned on the features from support images. Previous methods only utilized the semantic-level prototypes of support images as conditional information. These methods cannot utilize all pixel-wise support information for the query predictions, which is however critical for the segmentation task. In this paper, we focus on utilizing pixel-wise relationships between support and query images to facilitate the few-shot segmentation task. We design a novel Cycle-Consistent TRansformer (CyCTR) module to aggregate pixel-wise support features into query ones. CyCTR performs cross-attention between features from different images, i.e. support and query images. We observe that there may exist unexpected irrelevant pixel-level support features. Directly performing cross-attention may aggregate these features from support to query and bias the query features. Thus, we propose using a novel cycle-consistent attention mechanism to filter out possible harmful support features and encourage query features to attend to the most informative pixels from support images. Experiments on all few-shot segmentation benchmarks demonstrate that our proposed CyCTR leads to remarkable improvement compared to previous state-of-the-art methods. Specifically, on Pascal-$5^i$ and COCO-$20^i$ datasets, we achieve 67.5% and 45.6% mIoU for 5-shot segmentation, outperforming previous state-of-the-art methods by 5.6% and 7.1% respectively.

Submitted 7 March, 2022; v1 submitted 4 June, 2021; originally announced June 2021.

Comments: Advances in Neural Information Processing Systems (NeurIPS), 2021. Project: https://github.com/GengDavid/CyCTR

arXiv:2104.00818 [pdf, other]

Deep Learning-based Codebook Design for Code-domain Non-Orthogonal Multiple Access Approaching Single-User Bit Error Rate Performance

Authors: Minsig Han, Hanchang Seo, Ameha Tsegaye Abebe, Chung G. Kang

Abstract: A general form of codebook design for code-domain non-orthogonal multiple access (CD-NOMA) can be considered equivalent to an autoencoder (AE)-based constellation design for multi-user multidimensional modulation (MU-MDM). Due to a constrained design space for optimal constellation, e.g., fixed resource mapping and equal power allocation to all codebooks, however, existing AE architectures produce constellations with suboptimal bit-error-rate (BER) performance. Accordingly, we propose a new architecture for MU-MDM AE and underlying training methodology for joint optimization of resource mapping and a constellation design with bit-to-symbol mapping, aiming at approaching the BER performance of a single-user MDM (SU-MDM) AE model with the same spectral efficiency. The core design of the proposed AE architecture is dense resource mapping combined with the novel power allocation layer that normalizes the sum of user codebook power across the entire resources. This globalizes the domain of the constellation design by enabling flexible resource mapping and power allocation. Furthermore, it allows the AE-based training to approach a global optimal MU-MDM constellations for CD-NOMA. Extensive BER simulation results demonstrate that the proposed design outperforms the existing CD-NOMA designs while approaching the single-user BER performance achieved by the equivalent SU-MDM AE within 0.3 dB over the additive white Gaussian noise channel.

Submitted 10 October, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

arXiv:2103.11167 [pdf, other]

Multi-sequence Spreading Random Access (MSRA) for Compressive Sensing-based Grant-free Communication

Authors: Ameha Tsegaye Abebe, Chung G. Kang

Abstract: The performance of grant-free random access (GF-RA) is limited by the number of accessible random access resources (RRs) due to the absence of collision resolution. Compressive sensing (CS)-based RA schemes scale up the RRs at the expense of increased non-orthogonality among transmitted signals. This paper presents the design of multi-sequence spreading random access (MSRA) which employs multiple spreading sequences to spread the different symbols of a user as opposed to the conventional schemes in which a user employs the same spreading sequence for each symbol. We show that MSRA provides code diversity, enabling the multi-user detection (MUD) to be modeled into a well-conditioned multiple measurement vector (MMV) CS problem. The code diversity is quantified by the decrease in the average Babel mutual coherence among the spreading sequences. Moreover, we present a two-stage active user detection (AUD) scheme for both wideband and narrowband implementation. Our theoretical analysis shows that with MSRA activity misdetection falls exponentially while the size of GF-RA frame is increased. Finally, the simulation results show that about 82% increase in utilization of RRs, i.e., more active users, is supported by MSRA than the conventional schemes while achieving the RA failure rate lower bound set by random access collision.

Submitted 20 March, 2021; originally announced March 2021.

arXiv:2011.00147 [pdf, other]

Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation

Authors: Guoliang Kang, Yunchao Wei, Yi Yang, Yueting Zhuang, Alexander G. Hauptmann

Abstract: Domain adaptive semantic segmentation aims to train a model performing satisfactory pixel-level predictions on the target with only out-of-domain (source) annotations. The conventional solution to this task is to minimize the discrepancy between source and target to enable effective knowledge transfer. Previous domain discrepancy minimization methods are mainly based on the adversarial training. They tend to consider the domain discrepancy globally, which ignore the pixel-wise relationships and are less discriminative. In this paper, we propose to build the pixel-level cycle association between source and target pixel pairs and contrastively strengthen their connections to diminish the domain gap and make the features more discriminative. To the best of our knowledge, this is a new perspective for tackling such a challenging task. Experiment results on two representative domain adaptation benchmarks, i.e. GTAV $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Cityscapes, verify the effectiveness of our proposed method and demonstrate that our method performs favorably against previous state-of-the-arts. Our method can be trained end-to-end in one stage and introduces no additional parameters, which is expected to serve as a general framework and help ease future research in domain adaptive semantic segmentation. Code is available at https://github.com/kgl-prml/Pixel-Level-Cycle-Association.

Submitted 30 October, 2020; originally announced November 2020.

Comments: Accepted by NeurIPS 2020 (oral). Code: https://github.com/kgl-prml/Pixel-Level-Cycle-Association

arXiv:2008.06208 [pdf]

Adaptable Multi-Domain Language Model for Transformer ASR

Authors: Taewoo Lee, Min-Joong Lee, Tae Gyoon Kang, Seokyeoung Jung, Minseok Kwon, Yeona Hong, Jungin Lee, Kyoung-Gu Woo, Ho-Gyeong Kim, Jiseung Jeong, Jihyun Lee, Hosik Lee, Young Sang Choi

Abstract: We propose an adapter-based multi-domain Transformer-based language model (LM) for Transformer ASR. The model consists of a large common LM and small adapters. The model can perform multi-domain adaptation with only the small adapters and their related layers. The proposed model can reuse the fully fine-tuned LM, which is fine-tuned using all layers of an original model. The proposed LM can be expanded to new domains by adding about 2% of parameters for the first domain and 13% of parameters for each domain from the second onward. The proposed model is also effective in reducing the model maintenance cost because it is possible to omit the costly and time-consuming common LM pre-training process. Using the proposed adapter-based approach, we observed that a general LM with an adapter can outperform a dedicated music domain LM in terms of word error rate (WER).

Submitted 10 February, 2021; v1 submitted 14 August, 2020; originally announced August 2020.

Comments: This paper is accepted for presentation at IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP), 2021

arXiv:2005.02137 [pdf, other]

doi 10.1109/ICASSP40776.2020.9054655

Label Propagation Adaptive Resonance Theory for Semi-supervised Continuous Learning

Authors: Taehyeong Kim, Injune Hwang, Gi-Cheon Kang, Won-Seok Choi, Hyunseo Kim, Byoung-Tak Zhang

Abstract: Semi-supervised learning and continuous learning are fundamental paradigms for human-level intelligence. To deal with real-world problems where labels are rarely given and the opportunity to access the same data is limited, it is necessary to apply these two paradigms in a joined fashion. In this paper, we propose Label Propagation Adaptive Resonance Theory (LPART) for semi-supervised continuous learning. LPART uses an online label propagation mechanism to perform classification and gradually improves its accuracy as the observed data accumulates. We evaluated the proposed model on visual (MNIST, SVHN, CIFAR-10) and audio (NSynth) datasets by adjusting the ratio of the labeled and unlabeled data. The accuracies are much higher when both labeled and unlabeled data are used, demonstrating the significant advantage of LPART in environments where the data labels are scarce.

Submitted 16 April, 2020; originally announced May 2020.

Comments: 5 pages, 2 figures, 1 table, accepted in ICASSP 2020

arXiv:2004.14326 [pdf, other]

doi 10.21437/Interspeech.2020-1113

Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision

Authors: Soo-Whan Chung, Hong Goo Kang, Joon Son Chung

Abstract: The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.

Submitted 6 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

Comments: Under submission as a conference paper

arXiv:2004.06698 [pdf, other]

Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer

Authors: Gi-Cheon Kang, Junseok Park, Hwaran Lee, Byoung-Tak Zhang, Jin-Hwa Kim

Abstract: Visual dialog is a task of answering a sequence of questions grounded in an image using the previous dialog history as context. In this paper, we study how to address two fundamental challenges for this task: (1) reasoning over underlying semantic structures among dialog rounds and (2) identifying several appropriate answers to the given question. To address these challenges, we propose a Sparse Graph Learning (SGL) method to formulate visual dialog as a graph structure learning task. SGL infers inherently sparse dialog structures by incorporating binary and score edges and leveraging a new structural loss function. Next, we introduce a Knowledge Transfer (KT) method that extracts the answer predictions from the teacher model and uses them as pseudo labels. We propose KT to remedy the shortcomings of single ground-truth labels, which severely limit the ability of a model to obtain multiple reasonable answers. As a result, our proposed model significantly improves reasoning capability compared to baseline methods and outperforms the state-of-the-art approaches on the VisDial v1.0 dataset. The source code is available at https://github.com/gicheonkang/SGLKT-VisDial.

Submitted 30 August, 2021; v1 submitted 14 April, 2020; originally announced April 2020.

Comments: EMNLP 2021 Findings

arXiv:2002.00137 [pdf, other]

Training-free Monocular 3D Event Detection System for Traffic Surveillance

Authors: Lijun Yu, Peng Chen, Wenhe Liu, Guoliang Kang, Alexander G. Hauptmann

Abstract: We focus on the problem of detecting traffic events in a surveillance scenario, including the detection of both vehicle actions and traffic collisions. Existing event detection systems are mostly learning-based and have achieved convincing performance when a large amount of training data is available. However, in real-world scenarios, collecting sufficient labeled training data is expensive and sometimes impossible (e.g. for traffic collision detection). Moreover, the conventional 2D representation of surveillance views is easily affected by occlusions and different camera views in nature. To deal with the aforementioned problems, in this paper, we propose a training-free monocular 3D event detection system for traffic surveillance. Our system first projects the vehicles into the 3D Euclidean space and estimates their kinematic states. Then we develop multiple simple yet effective ways to identify the events based on the kinematic patterns, which need no further training. Consequently, our system is robust to occlusions and viewpoint changes. Extensive experiments report the superior result of our method on large-scale real-world surveillance datasets, which validates the effectiveness of our proposed system.

Submitted 31 January, 2020; originally announced February 2020.

Comments: To be published in 2019 IEEE International Conference on Big Data (Big Data), IEEE

arXiv:1909.08929 [pdf, other]

doi 10.13154/294-6675

Automobile Theft Detection by Clustering Owner Driver Data

Authors: Yong Goo Kang, Kyung Ho Park, Huy Kang Kim

Abstract: As automobiles become intelligent, automobile theft methods are evolving intelligently. Therefore, automobile theft detection has become a major research challenge. Data-mining, biometrics, and additional authentication methods have been proposed in previous studies to address automobile theft. Among these methods, data-mining can be used to analyze driving characteristics and identify a driver comprehensively. However, it requires a labeled driving dataset to achieve high accuracy. This makes it impractical for an actual automobile theft detection system, because real theft driving data cannot be collected in advance. Hence, we propose a method to detect an automobile theft attempt using only owner driving data. We cluster the key features of the owner driving data using the k-means algorithm. After reconstructing the driving data into one of these clusters, theft is detected using the error from the original driving data. To validate the proposed models, we tested them on our actual driving data and obtained 99% accuracy from the best model. This result demonstrates that our proposed method can detect vehicle theft by using only the car owner's driving data.

Submitted 19 September, 2019; originally announced September 2019.

Comments: 15 pages, 7 figures, 3 tables, In Proceedings of the 17th escar Europe 2019
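
The abstract above describes clustering owner driving data with k-means and detecting theft from the error between a sample and its cluster reconstruction. The following is a minimal sketch of that general idea, not the authors' implementation: the two-feature synthetic data, the choice of k = 2, the 99th-percentile threshold, and the function names `fit_kmeans` and `reconstruction_error` are all illustrative assumptions.

```python
import numpy as np

def fit_kmeans(data, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns centroids of shape (k, n_features)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples (fancy indexing returns a copy).
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (unchanged if empty).
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

def reconstruction_error(samples, centroids):
    """Distance from each sample to its nearest centroid."""
    dists = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1)

# Hypothetical owner data: two driving "modes" in a made-up 2-D feature space.
rng = np.random.default_rng(42)
owner = np.vstack([
    rng.normal([0.0, 0.0], 0.1, size=(100, 2)),
    rng.normal([1.0, 1.0], 0.1, size=(100, 2)),
])
centroids = fit_kmeans(owner, k=2)

# Assumed decision rule: flag samples whose error exceeds the 99th percentile
# of the owner's own errors. Only owner data is used to set the threshold.
threshold = np.percentile(reconstruction_error(owner, centroids), 99)

# Hypothetical non-owner data, far from both owner modes.
other = rng.normal([5.0, 5.0], 0.1, size=(20, 2))
flags = reconstruction_error(other, centroids) > threshold
```

Setting the threshold from owner data alone mirrors the abstract's constraint that theft data cannot be collected in advance; the percentile itself is a tunable assumption that trades false alarms against missed detections.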

arXiv:1908.01925 [pdf, other]

Attract or Distract: Exploit the Margin of Open Set

Authors: Qianyu Feng, Guoliang Kang, Hehe Fan, Yi Yang

Abstract: Open set domain adaptation aims to diminish the domain shift across domains with partially shared classes. There exist unknown target samples outside the knowledge of the source domain. Compared to the close set setting, how to separate the unknown (unshared) class from the known (shared) ones plays a key role. However, previous methods did not emphasize the semantic structure of the open set data, which may introduce bias into the domain alignment and confuse the classifier around the decision boundary. In this paper, we exploit the semantic structure of open set data from two aspects: 1) Semantic Categorical Alignment, which aims to achieve good separability of target known classes by categorically aligning the centroid of target with the source. 2) Semantic Contrastive Mapping, which aims to push the unknown class away from the decision boundary. Empirically, we demonstrate that our method performs favourably against the state-of-the-art methods on representative benchmarks, e.g. Digit datasets and Office-31 datasets.

Submitted 10 August, 2019; v1 submitted 5 August, 2019; originally announced August 2019.

Comments: Presented at ICCV 2019

arXiv:1908.00248 [pdf]

Achievable Degrees of Freedom for Closed-form Solution to Interference Alignment and Cancellation in Gaussian Interference Multiple Access Channel

Authors: Qu Xin, Chung G. Kang

Abstract: A combined technique of interference alignment (IA) and interference cancellation (IC), known as interference alignment and cancellation (IAC) scheme, has been proposed to improve the total achievable degrees of freedom (DoFs) over IA. Since it is NP-hard to solve the transceiver under a given tuple of DoFs or to maximize the total achievable DoFs in the general system configuration by IA (or IAC), the optimal transceiver cannot be obtained in polynomial time. Meanwhile, it has been known that a closed-form yet suboptimal transceiver can be designed for IAC by employing a symbol-to-symbol (STS) alignment structure. As its performance has not been known yet, we aim to derive the total DoFs that can be achieved by such suboptimal but closed-form IAC transceivers for Gaussian interference multiple access channels with K receivers and J users (transmitters), each with M antennas. Our analysis shows that the closed-form IAC transceivers under consideration can achieve a maximum total achievable DoFs of 2M, which turns out to be larger than those achieved in classical IA, e.g., 2MK/(K+1) DoFs by a specific configuration where each link has the same target DoFs. Moreover, considering the NP-hardness of deriving the maximum total achievable DoFs with the optimal IAC transceiver, its upper bound has been derived for comparison with the results of our closed-form IAC transceiver. Numerical results illustrate that its performance can be guaranteed within 20% of the upper bound when the number of multiple access channels is relatively small, e.g., K <= 4.

Submitted 1 August, 2019; originally announced August 2019.

Comments: 10 pages
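
The abstract above states that 2M total DoFs exceeds the 2MK/(K+1) DoFs of the classical IA configuration it cites. As a quick check of that comparison using only the two expressions quoted in the abstract (the instance K = 4 is chosen here purely for illustration):

```latex
\[
  2M - \frac{2MK}{K+1} \;=\; \frac{2M(K+1) - 2MK}{K+1} \;=\; \frac{2M}{K+1} \;>\; 0
  \qquad \text{for all } M > 0,\ K \ge 1 .
\]
\[
  \text{Example: } K = 4 \;\Rightarrow\; \frac{2MK}{K+1} = \frac{8M}{5} = 1.6\,M \;<\; 2M .
\]
```

The gap 2M/(K+1) shrinks as K grows, so by this arithmetic alone the advantage over the quoted IA figure is largest when K is small.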
