Search | arXiv e-print repository

Papez: Resource-Efficient Speech Separation with Auditory Working Memory

Authors: Hyunseok Oh, Juheon Yi, Youngki Lee

Abstract: Transformer-based models recently reached state-of-the-art single-channel speech separation accuracy; However, their extreme computational load makes it difficult to deploy them in resource-constrained mobile or IoT devices. We thus present Papez, a lightweight and computation-efficient single-channel speech separation model. Papez is based on three key techniques. We first replace the inter-chunk… ▽ More Transformer-based models recently reached state-of-the-art single-channel speech separation accuracy; However, their extreme computational load makes it difficult to deploy them in resource-constrained mobile or IoT devices. We thus present Papez, a lightweight and computation-efficient single-channel speech separation model. Papez is based on three key techniques. We first replace the inter-chunk Transformer with small-sized auditory working memory. Second, we adaptively prune the input tokens that do not need further processing. Finally, we reduce the number of parameters through the recurrent transformer. Our extensive evaluation shows that Papez achieves the best resource and accuracy tradeoffs with a large margin. We publicly share our source code at \texttt{https://github.com/snuhcs/Papez} △ Less

Submitted 30 June, 2024; originally announced July 2024.

Comments: 5 pages. Accepted by ICASSP 2023

arXiv:2406.16200 [pdf, other]

Towards unlocking the mystery of adversarial fragility of neural networks

Authors: **gchao Gao, Raghu Mudumbai, Xiaodong Wu, Jirong Yi, Catherine Xu, Hui Xie, Weiyu Xu

Abstract: In this paper, we study the adversarial robustness of deep neural networks for classification tasks. We look at the smallest magnitude of possible additive perturbations that can change the output of a classification algorithm. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural network for classification. In particular, our theoretical results show that neural ne… ▽ More In this paper, we study the adversarial robustness of deep neural networks for classification tasks. We look at the smallest magnitude of possible additive perturbations that can change the output of a classification algorithm. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural network for classification. In particular, our theoretical results show that neural network's adversarial robustness can degrade as the input dimension $d$ increases. Analytically we show that neural networks' adversarial robustness can be only $1/\sqrt{d}$ of the best possible adversarial robustness. Our matrix-theoretic explanation is consistent with an earlier information-theoretic feature-compression-based explanation for the adversarial fragility of neural networks. △ Less

Submitted 23 June, 2024; originally announced June 2024.

Comments: 21 pages

arXiv:2406.12260 [pdf, other]

Self-Supervised Time-Series Anomaly Detection Using Learnable Data Augmentation

Authors: Kuk** Choi, Jihun Yi, Jisoo Mok, Sungroh Yoon

Abstract: Continuous efforts are being made to advance anomaly detection in various manufacturing processes to increase the productivity and safety of industrial sites. Deep learning replaced rule-based methods and recently emerged as a promising method for anomaly detection in diverse industries. However, in the real world, the scarcity of abnormal data and difficulties in obtaining labeled data create lim… ▽ More Continuous efforts are being made to advance anomaly detection in various manufacturing processes to increase the productivity and safety of industrial sites. Deep learning replaced rule-based methods and recently emerged as a promising method for anomaly detection in diverse industries. However, in the real world, the scarcity of abnormal data and difficulties in obtaining labeled data create limitations in the training of detection models. In this study, we addressed these shortcomings by proposing a learnable data augmentation-based time-series anomaly detection (LATAD) technique that is trained in a self-supervised manner. LATAD extracts discriminative features from time-series data through contrastive learning. At the same time, learnable data augmentation produces challenging negative samples to enhance learning efficiency. We measured anomaly scores of the proposed technique based on latent feature similarities. As per the results, LATAD exhibited comparable or improved performance to the state-of-the-art anomaly detection assessments on several benchmark datasets and provided a gradient-based diagnosis technique to help identify root causes. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: 11 pages, 4 figures, IEEE Transactions on Emerging Topics in Computational Intelligence

arXiv:2406.09664 [pdf, other]

Frequency-mix Knowledge Distillation for Fake Speech Detection

Authors: Cunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv

Abstract: In the telephony scenarios, the fake speech detection (FSD) task to combat speech spoofing attacks is challenging. Data augmentation (DA) methods are considered effective means to address the FSD task in telephony scenarios, typically divided into time domain and frequency domain stages. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA… ▽ More In the telephony scenarios, the fake speech detection (FSD) task to combat speech spoofing attacks is challenging. Data augmentation (DA) methods are considered effective means to address the FSD task in telephony scenarios, typically divided into time domain and frequency domain stages. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA method, Frequency-mix (Freqmix), and introduce the Freqmix knowledge distillation (FKD) to enhance model information extraction and generalization abilities. Specifically, we use Freqmix-enhanced data as input for the teacher model, while the student model's input undergoes time-domain DA method. We use a multi-level feature distillation approach to restore information and improve the model's generalization capabilities. Our approach achieves state-of-the-art results on ASVspoof 2021 LA dataset, showing a 31\% improvement over baseline and performs competitively on ASVspoof 2021 DF dataset. △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.09133 [pdf]

RH-SQL: Refined Schema and Hardness Prompt for Text-to-SQL

Authors: Jiawen Yi, Guo Chen, Zixiang Shen

Abstract: Text-to-SQL is a technology that converts natural language queries into the structured query language SQL. A novel research approach that has recently gained attention focuses on methods based on the complexity of SQL queries, achieving notable performance improvements. However, existing methods entail significant storage and training costs, which hampers their practical application. To address th… ▽ More Text-to-SQL is a technology that converts natural language queries into the structured query language SQL. A novel research approach that has recently gained attention focuses on methods based on the complexity of SQL queries, achieving notable performance improvements. However, existing methods entail significant storage and training costs, which hampers their practical application. To address this issue, this paper introduces a method for Text-to-SQL based on Refined Schema and Hardness Prompt. By filtering out low-relevance schema information with a refined schema and identifying query hardness through a Language Model (LM) to form prompts, this method reduces storage and training costs while maintaining performance. It's worth mentioning that this method is applicable to any sequence-to-sequence (seq2seq) LM. Our experiments on the Spider dataset, specifically with large-scale LMs, achieved an exceptional Execution accuracy (EX) of 82.6%, demonstrating the effectiveness and greater suitability of our method for real-world applications. △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: 4 pages, 2 figures, 2024 6th International Conference on Electronic Engineering and Informatics (EEI 2024)

arXiv:2406.08863 [pdf, other]

Self-supervised Graph Neural Network for Mechanical CAD Retrieval

Authors: Yuhan Quan, Huan Zhao, **feng Yi, Yuqiang Chen

Abstract: CAD (Computer-Aided Design) plays a crucial role in mechanical industry, where large numbers of similar-shaped CAD parts are often created. Efficiently reusing these parts is key to reducing design and production costs for enterprises. Retrieval systems are vital for achieving CAD reuse, but the complex shapes of CAD models are difficult to accurately describe using text or keywords, making tradit… ▽ More CAD (Computer-Aided Design) plays a crucial role in mechanical industry, where large numbers of similar-shaped CAD parts are often created. Efficiently reusing these parts is key to reducing design and production costs for enterprises. Retrieval systems are vital for achieving CAD reuse, but the complex shapes of CAD models are difficult to accurately describe using text or keywords, making traditional retrieval methods ineffective. While existing representation learning approaches have been developed for CAD, manually labeling similar samples in these methods is expensive. Additionally, CAD models' unique parameterized data structure presents challenges for applying existing 3D shape representation learning techniques directly. In this work, we propose GC-CAD, a self-supervised contrastive graph neural network-based method for mechanical CAD retrieval that directly models parameterized CAD raw files. GC-CAD consists of two key modules: structure-aware representation learning and contrastive graph learning framework. The method leverages graph neural networks to extract both geometric and topological information from CAD models, generating feature representations. We then introduce a simple yet effective contrastive graph learning framework approach, enabling the model to train without manual labels and generate retrieval-ready representations. Experimental results on four datasets including human evaluation demonstrate that the proposed method achieves significant accuracy improvements and up to 100 times efficiency improvement over the baseline methods. △ Less

Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.06999 [pdf, other]

Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection

Authors: Junfei Yi, Jianxu Mao, Tengfei Liu, Mingjie Li, Hanyu Gu, Hui Zhang, Xiaojun Chang, Yaonan Wang

Abstract: Knowledge distillation (KD) is a widely adopted and effective method for compressing models in object detection tasks. Particularly, feature-based distillation methods have shown remarkable performance. Existing approaches often ignore the uncertainty in the teacher model's knowledge, which stems from data noise and imperfect training. This limits the student model's ability to learn latent knowle… ▽ More Knowledge distillation (KD) is a widely adopted and effective method for compressing models in object detection tasks. Particularly, feature-based distillation methods have shown remarkable performance. Existing approaches often ignore the uncertainty in the teacher model's knowledge, which stems from data noise and imperfect training. This limits the student model's ability to learn latent knowledge, as it may overly rely on the teacher's imperfect guidance. In this paper, we propose a novel feature-based distillation paradigm with knowledge uncertainty for object detection, termed "Uncertainty Estimation-Discriminative Knowledge Extraction-Knowledge Transfer (UET)", which can seamlessly integrate with existing distillation methods. By leveraging the Monte Carlo dropout technique, we introduce knowledge uncertainty into the training process of the student model, facilitating deeper exploration of latent knowledge. Our method performs effectively during the KD process without requiring intricate structures or extensive computational resources. Extensive experiments validate the effectiveness of our proposed approach across various distillation strategies, detectors, and backbone architectures. Specifically, following our proposed paradigm, the existing FGD method achieves state-of-the-art (SoTA) performance, with ResNet50-based GFL achieving 44.1% mAP on the COCO dataset, surpassing the baselines by 3.9%. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.06086 [pdf, other]

RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection

Authors: Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan

Abstract: Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepf… ▽ More Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use sinc Layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling problem and further capture long-range feature information. Moreover, we develop a bidirectional fusion module to integrate embeddings, enhancing audio context representation and combining short- and long-range information. The results show that our proposed RawBMamba achieves a 34.1\% improvement over Rawformer on ASVspoof2021 LA dataset, and demonstrates competitive performance on other datasets. △ Less

Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.04840 [pdf, other]

TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking

Authors: Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen

Abstract: Various threats posed by the progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches to this task involve adding watermarks to the audio separately after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these… ▽ More Various threats posed by the progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches to this task involve adding watermarks to the audio separately after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these problems, we propose TraceableSpeech, a novel TTS model that directly generates watermarked speech, improving watermark imperceptibility and speech quality. Furthermore, We design the frame-wise imprinting and extraction of watermarks, achieving higher robustness against resplicing attacks and temporal flexibility in operation. Experimental results show that TraceableSpeech outperforms the strong baseline where VALL-E or HiFicodec individually uses WavMark in watermark imperceptibility, speech quality and resilience against resplicing attacks. It also can apply to speech of various durations. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: acceped by interspeech 2024

arXiv:2406.03411 [pdf, other]

Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach

Authors: Saehyung Lee, Sangwon Yu, Junsung Park, Jihun Yi, Sungroh Yoon

Abstract: In this paper, we primarily address the issue of dialogue-form context query within the interactive text-to-image retrieval task. Our methodology, PlugIR, actively utilizes the general instruction-following capability of LLMs in two ways. First, by reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data, thereby enabling… ▽ More In this paper, we primarily address the issue of dialogue-form context query within the interactive text-to-image retrieval task. Our methodology, PlugIR, actively utilizes the general instruction-following capability of LLMs in two ways. First, by reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data, thereby enabling the use of any arbitrary black-box model. Second, we construct the LLM questioner to generate non-redundant questions about the attributes of the target image, based on the information of retrieval candidate images in the current context. This approach mitigates the issues of noisiness and redundancy in the generated questions. Beyond our methodology, we propose a novel evaluation metric, Best log Rank Integral (BRI), for a comprehensive assessment of the interactive retrieval system. PlugIR demonstrates superior performance compared to both zero-shot and fine-tuned baselines in various benchmarks. Additionally, the two methodologies comprising PlugIR can be flexibly applied together or separately in various situations. Our codes are available at https://github.com/Saehyung-Lee/PlugIR. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: To appear in ACL 2024 Main

arXiv:2405.08596

EVDA: Evolving Deepfake Audio Detection Continual Learning Benchmark

Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao

Abstract: The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning, which acts… ▽ More The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning, which acts as an effective tool for detecting newly emerged deepfake audio while maintaining performance on older types, lacks a well-constructed and user-friendly evaluation framework. To address this gap, we introduce EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA includes classic datasets from the Anti-Spoofing Voice series, Chinese fake audio detection series, and newly generated deepfake audio from models like GPT-4 and GPT-4o. It supports various continual learning techniques, such as Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), and recent methods like Regularized Adaptive Weight Modification (RAWM) and Radian Weight Modification (RWM). Additionally, EVDA facilitates the development of robust algorithms by providing an open interface for integrating new continual learning methods △ Less

Submitted 15 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

Comments: This paper need more modification

arXiv:2405.03917 [pdf, other]

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

Authors: Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava

Abstract: Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression,… ▽ More Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model quality. Furthermore, we demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit. △ Less

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2404.17113 [pdf, other]

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition

Authors: Zheng Lian, Haiyang Sun, Licai Sun, Zhuofan Wen, Siyuan Zhang, Shun Chen, Hao Gu, **ming Zhao, Ziyang Ma, Xie Chen, Jiangyan Yi, Rui Liu, Kele Xu, Bin Liu, Erik Cambria, Guoying Zhao, Björn W. Schuller, Jianhua Tao

Abstract: Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing dataset size and building more effective architectures. However, due to various reasons (such as complex environments and inaccurate annotations), current systems are hard to meet the demands of practical applications. Therefor… ▽ More Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing dataset size and building more effective architectures. However, due to various reasons (such as complex environments and inaccurate annotations), current systems are hard to meet the demands of practical applications. Therefore, we organize a series of challenges around emotion recognition to further promote the development of this area. Last year, we launched MER2023, focusing on three topics: multi-label learning, noise robustness, and semi-supervised learning. This year, we continue to organize MER2024. In addition to expanding the dataset size, we introduce a new track around open-vocabulary emotion recognition. The main consideration for this track is that existing datasets often fix the label space and use majority voting to enhance annotator consistency, but this process may limit the model's ability to describe subtle emotions. In this track, we encourage participants to generate any number of labels in any category, aiming to describe the emotional state as accurately as possible. Our baseline is based on MERTools and the code is available at: https://github.com/zeroQiaoba/MERTools/tree/master/MER2024. △ Less

Submitted 23 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.16346 [pdf, other]

Light-weight Retinal Layer Segmentation with Global Reasoning

Authors: Xiang He, Weiye Song, Yiming Wang, Fabio Poiesi, Ji Yi, Manishi Desai, Quanqing Xu, Kongzheng Yang, Yi Wan

Abstract: Automatic retinal layer segmentation with medical images, such as optical coherence tomography (OCT) images, serves as an important tool for diagnosing ophthalmic diseases. However, it is challenging to achieve accurate segmentation due to low contrast and blood flow noises presented in the images. In addition, the algorithm should be light-weight to be deployed for practical clinical applications… ▽ More Automatic retinal layer segmentation with medical images, such as optical coherence tomography (OCT) images, serves as an important tool for diagnosing ophthalmic diseases. However, it is challenging to achieve accurate segmentation due to low contrast and blood flow noises presented in the images. In addition, the algorithm should be light-weight to be deployed for practical clinical applications. Therefore, it is desired to design a light-weight network with high performance for retinal layer segmentation. In this paper, we propose LightReSeg for retinal layer segmentation which can be applied to OCT images. Specifically, our approach follows an encoder-decoder structure, where the encoder part employs multi-scale feature extraction and a Transformer block for fully exploiting the semantic information of feature maps at all scales and making the features have better global reasoning capabilities, while the decoder part, we design a multi-scale asymmetric attention (MAA) module for preserving the semantic information at each encoder scale. The experiments show that our approach achieves a better segmentation performance compared to the current state-of-the-art method TransUnet with 105.7M parameters on both our collected dataset and two other public datasets, with only 3.3M parameters. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: IEEE Transactions on Instrumentation & Measurement

arXiv:2404.14687 [pdf, other]

Pegasus-v1 Technical Report

Authors: Raehyuk Jung, Hyojun Go, Jaehyuk Yi, Jiho Jang, Daniel Kim, Jay Suh, Aiden Lee, Cooper Han, Jae Lee, Jeff Kim, **-Young Kim, Junwan Kim, Kyle Park, Lucas Lee, Mars Ha, Minjoon Seo, Abraham Jo, Ed Park, Hassan Kianinejad, SJ Kim, Tony Moon, Wade Jeong, Andrei Popescu, Esther Kim, EK Yoon , et al. (19 additional authors not shown)

Abstract: This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, to offer nuanced video content comprehension across various lengths. This technical report overviews Pegasus-1's archi… ▽ More This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, to offer nuanced video content comprehension across various lengths. This technical report overviews Pegasus-1's architecture, training strategies, and its performance in benchmarks on video conversation, zero-shot video question answering, and video summarization. We also explore qualitative characteristics of Pegasus-1 , demonstrating its capabilities as well as its limitations, in order to provide readers a balanced view of its current state and its future direction. △ Less

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.12455 [pdf, ps, other]

Contingency Model Predictive Control for Bipedal Locomotion on Moving Surfaces with a Linear Inverted Pendulum Model

Authors: Kuo Chen, Xinyan Huang, Xunjie Chen, **gang Yi

Abstract: Gait control of legged robotic walkers on dynamically moving surfaces (e.g., ships and vehicles) is challenging due to the limited balance control actuation and unknown surface motion. We present a contingent model predictive control (CMPC) for bipedal walker locomotion on moving surfaces with a linear inverted pendulum (LIP) model. The CMPC is a robust design that is built on regular model predic… ▽ More Gait control of legged robotic walkers on dynamically moving surfaces (e.g., ships and vehicles) is challenging due to the limited balance control actuation and unknown surface motion. We present a contingent model predictive control (CMPC) for bipedal walker locomotion on moving surfaces with a linear inverted pendulum (LIP) model. The CMPC is a robust design that is built on regular model predictive control (MPC) to incorporate the "worst case" predictive motion of the moving surface. Integrated with an LIP model and walking stability constraints, the CMPC framework generates a set of consistent control inputs considering to anticipated uncertainties of the surface motions. Simulation results and comparison with the regular MPC for bipedal walking are conducted and presented. The results confirm the feasibility and superior performance of the proposed CMPC design over the regular MPC under various motion profiles of moving surfaces. △ Less

Submitted 18 April, 2024; originally announced April 2024.

Comments: 2024 American Control Conference (ACC 2024)

arXiv:2404.05950 [pdf, other]

Efficient Multi-Task Reinforcement Learning via Task-Specific Action Correction

Authors: **yuan Feng, Min Chen, Zhiqiang Pu, Tenghai Qiu, Jianqiang Yi

Abstract: Multi-task reinforcement learning (MTRL) demonstrate potential for enhancing the generalization of a robot, enabling it to perform multiple tasks concurrently. However, the performance of MTRL may still be susceptible to conflicts between tasks and negative interference. To facilitate efficient MTRL, we propose Task-Specific Action Correction (TSAC), a general and complementary approach designed f… ▽ More Multi-task reinforcement learning (MTRL) demonstrate potential for enhancing the generalization of a robot, enabling it to perform multiple tasks concurrently. However, the performance of MTRL may still be susceptible to conflicts between tasks and negative interference. To facilitate efficient MTRL, we propose Task-Specific Action Correction (TSAC), a general and complementary approach designed for simultaneous learning of multiple tasks. TSAC decomposes policy learning into two separate policies: a shared policy (SP) and an action correction policy (ACP). To alleviate conflicts resulting from excessive focus on specific tasks' details in SP, ACP incorporates goal-oriented sparse rewards, enabling an agent to adopt a long-term perspective and achieve generalization across tasks. Additional rewards transform the original problem into a multi-objective MTRL problem. Furthermore, to convert the multi-objective MTRL into a single-objective formulation, TSAC assigns a virtual expected budget to the sparse rewards and employs Lagrangian method to transform a constrained single-objective optimization into an unconstrained one. Experimental evaluations conducted on Meta-World's MT10 and MT50 benchmarks demonstrate that TSAC outperforms existing state-of-the-art methods, achieving significant improvements in both sample efficiency and effective action execution. △ Less

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.00667 [pdf, other]

Weakly-Supervised Cross-Domain Segmentation of Electron Microscopy with Sparse Point Annotation

Authors: Dafei Qiu, Shan Xiong, Jia** Yi, Jialin Peng

Abstract: Accurate segmentation of organelle instances from electron microscopy (EM) images plays an essential role in many neuroscience researches. However, practical scenarios usually suffer from high annotation costs, label scarcity, and large domain diversity. While unsupervised domain adaptation (UDA) that assumes no annotation effort on the target data is promising to alleviate these challenges, its p… ▽ More Accurate segmentation of organelle instances from electron microscopy (EM) images plays an essential role in many neuroscience researches. However, practical scenarios usually suffer from high annotation costs, label scarcity, and large domain diversity. While unsupervised domain adaptation (UDA) that assumes no annotation effort on the target data is promising to alleviate these challenges, its performance on complicated segmentation tasks is still far from practical usage. To address these issues, we investigate a highly annotation-efficient weak supervision, which assumes only sparse center-points on a small subset of object instances in the target training images. To achieve accurate segmentation with partial point annotations, we introduce instance counting and center detection as auxiliary tasks and design a multitask learning framework to leverage correlations among the counting, detection, and segmentation, which are all tasks with partial or no supervision. Building upon the different domain-invariances of the three tasks, we enforce counting estimation with a novel soft consistency loss as a global prior for center detection, which further guides the per-pixel segmentation. To further compensate for annotation sparsity, we develop a cross-position cut-and-paste for label augmentation and an entropy-based pseudo-label selection. The experimental results highlight that, by simply using extremely weak annotation, e.g., 15\% sparse points, for model training, the proposed model is capable of significantly outperforming UDA methods and produces comparable performance as the supervised counterpart. The high robustness of our model shown in the validations and the low requirement of expert knowledge for sparse point annotation further improve the potential application value of our model. △ Less

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2403.18057 [pdf, other]

Prioritized League Reinforcement Learning for Large-Scale Heterogeneous Multiagent Systems

Authors: Qingxu Fu, Zhiqiang Pu, Min Chen, Tenghai Qiu, Jianqiang Yi

Abstract: Large-scale heterogeneous multiagent systems feature various realistic factors in the real world, such as agents with diverse abilities and overall system cost. In comparison to homogeneous systems, heterogeneous systems offer significant practical advantages. Nonetheless, they also present challenges for multiagent reinforcement learning, including addressing the non-stationary problem and managi… ▽ More Large-scale heterogeneous multiagent systems feature various realistic factors in the real world, such as agents with diverse abilities and overall system cost. In comparison to homogeneous systems, heterogeneous systems offer significant practical advantages. Nonetheless, they also present challenges for multiagent reinforcement learning, including addressing the non-stationary problem and managing an imbalanced number of agents with different types. We propose a Prioritized Heterogeneous League Reinforcement Learning (PHLRL) method to address large-scale heterogeneous cooperation problems. PHLRL maintains a record of various policies that agents have explored during their training and establishes a heterogeneous league consisting of diverse policies to aid in future policy optimization. Furthermore, we design a prioritized policy gradient approach to compensate for the gap caused by differences in the number of different types of agents. Next, we use Unreal Engine to design a large-scale heterogeneous cooperation benchmark named Large-Scale Multiagent Operation (LSMO), which is a complex two-team competition scenario that requires collaboration from both ground and airborne agents. We use experiments to show that PHLRL outperforms state-of-the-art methods, including QTRAN and QPLEX in LSMO. △ Less

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.18056 [pdf, other]

Self-Clustering Hierarchical Multi-Agent Reinforcement Learning with Extensible Cooperation Graph

Authors: Qingxu Fu, Tenghai Qiu, Jianqiang Yi, Zhiqiang Pu, Xiaolin Ai

Abstract: Multi-Agent Reinforcement Learning (MARL) has been successful in solving many cooperative challenges. However, classic non-hierarchical MARL algorithms still cannot address various complex multi-agent problems that require hierarchical cooperative behaviors. The cooperative knowledge and policies learned in non-hierarchical algorithms are implicit and not interpretable, thereby restricting the int… ▽ More Multi-Agent Reinforcement Learning (MARL) has been successful in solving many cooperative challenges. However, classic non-hierarchical MARL algorithms still cannot address various complex multi-agent problems that require hierarchical cooperative behaviors. The cooperative knowledge and policies learned in non-hierarchical algorithms are implicit and not interpretable, thereby restricting the integration of existing knowledge. This paper proposes a novel hierarchical MARL model called Hierarchical Cooperation Graph Learning (HCGL) for solving general multi-agent problems. HCGL has three components: a dynamic Extensible Cooperation Graph (ECG) for achieving self-clustering cooperation; a group of graph operators for adjusting the topology of ECG; and an MARL optimizer for training these graph operators. HCGL's key distinction from other MARL models is that the behaviors of agents are guided by the topology of ECG instead of policy neural networks. ECG is a three-layer graph consisting of an agent node layer, a cluster node layer, and a target node layer. To manipulate the ECG topology in response to changing environmental conditions, four graph operators are trained to adjust the edge connections of ECG dynamically. The hierarchical feature of ECG provides a unique approach to merge primitive actions (actions executed by the agents) and cooperative actions (actions executed by the clusters) into a unified action space, allowing us to integrate fundamental cooperative knowledge into an extensible interface. In our experiments, the HCGL model has shown outstanding performance in multi-agent benchmarks with sparse rewards. We also verify that HCGL can easily be transferred to large-scale scenarios with high zero-shot transfer success rates. △ Less

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.17863 [pdf, other]

An AI-Native Runtime for Multi-Wearable Environments

Authors: Chulhong Min, Utku Günay Acer, SiYoung Jang, Sangwon Choi, Diana A. Vasile, Taesik Gong, Juheon Yi, Fahim Kawsar

Abstract: The miniaturization of AI accelerators is paving the way for next-generation wearable applications within wearable technologies. We introduce Mojito, an AI-native runtime with advanced MLOps designed to facilitate the development and deployment of these applications on wearable devices. It emphasizes the necessity of dynamic orchestration of distributed resources equipped with ultra-low-power AI a… ▽ More The miniaturization of AI accelerators is paving the way for next-generation wearable applications within wearable technologies. We introduce Mojito, an AI-native runtime with advanced MLOps designed to facilitate the development and deployment of these applications on wearable devices. It emphasizes the necessity of dynamic orchestration of distributed resources equipped with ultra-low-power AI accelerators to overcome challenges associated with unpredictable runtime environments. Through its innovative approaches, Mojito demonstrates how future wearable technologies can evolve to be more autonomous. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: 7 pages, 4 figures

arXiv:2403.07598 [pdf, other]

Mondrian: On-Device High-Performance Video Analytics with Compressive Packed Inference

Authors: Changmin Jeon, Seonjun Kim, Juheon Yi, Youngki Lee

Abstract: In this paper, we present Mondrian, an edge system that enables high-performance object detection on high-resolution video streams. Many lightweight models and system optimization techniques have been proposed for resource-constrained devices, but they do not fully utilize the potential of the accelerators over dynamic, high-resolution videos. To enable such capability, we devise a novel Compressi… ▽ More In this paper, we present Mondrian, an edge system that enables high-performance object detection on high-resolution video streams. Many lightweight models and system optimization techniques have been proposed for resource-constrained devices, but they do not fully utilize the potential of the accelerators over dynamic, high-resolution videos. To enable such capability, we devise a novel Compressive Packed Inference to minimize per-pixel processing costs by selectively determining the necessary pixels to process and combining them to maximize processing parallelism. In particular, our system quickly extracts ROIs and dynamically shrinks them, reflecting the effect of the fast-changing characteristics of objects and scenes. It then intelligently combines such scaled ROIs into large canvases to maximize the utilization of inference accelerators such as GPU. Evaluation across various datasets, models, and devices shows Mondrian outperforms state-of-the-art baselines (e.g., input rescaling, ROI extractions, ROI extractions+batching) by 15.0-19.7% higher accuracy, leading to $\times$6.65 higher throughput than frame-wise inference for processing various 1080p video streams. We will release the code after the paper review. △ Less

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.06072

Channel Estimation Considerate Precoder Design for Multi-user Massive MIMO-OFDM Systems: The Concept and Fast Algorithms

Authors: Liu Junkai, Jiang Yi

Abstract: The sixth-generation (6G) communication networks target peak data rates exceeding 1Tbps, necessitating base stations (BS) to support up to 100 simultaneous data streams. However, sparse pilot allocation to accommodate such streams poses challenges for users' channel estimation. This paper presents Channel Estimation Considerate Precoding (CECP), where BS precoders prioritize facilitating channel e… ▽ More The sixth-generation (6G) communication networks target peak data rates exceeding 1Tbps, necessitating base stations (BS) to support up to 100 simultaneous data streams. However, sparse pilot allocation to accommodate such streams poses challenges for users' channel estimation. This paper presents Channel Estimation Considerate Precoding (CECP), where BS precoders prioritize facilitating channel estimation alongside maximizing transmission rate. To address the computational complexity of 6G large-scale multi-input multi-output (MIMO) systems, we propose a computationally-efficient space-time block diagonal channel shortening (ST-BDCS) precoding scheme. By leveraging the sparse Toeplitz property of orthogonal frequency division multiplexing (OFDM) channels, this time-domain precoding design effectively mitigates multi-user interference in the downlink and shortens the effective channel's temporal length. Consequently, users can estimate the channels using sparse pilots. To enable fast implementation, we develop a generalized complex-valued Toeplitz matrix QR decomposition algorithm applicable to various space-time signal processing problems. Simulation results demonstrate that the ST-BDCS precoding method approximates the rate performance of conventional subcarrier-by-subcarrier precoding schemes. However, it offers the advantages of easier channel estimation for users and significantly reduced computational complexity for the BS. △ Less

Submitted 7 April, 2024; v1 submitted 9 March, 2024; originally announced March 2024.

Comments: The work is supported by HUAWEI cooperation, which is related to the current HUAWEI project. HUAWEI cooperation requires to withdraw the paper

arXiv:2403.05125 [pdf, other]

Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis

Authors: Muxi Chen, Yi Liu, Jian Yi, Changran Xu, Qiuxia Lai, Hongliang Wang, Tsung-Yi Ho, Qiang Xu

Abstract: In this paper, we present an empirical study introducing a nuanced evaluation framework for text-to-image (T2I) generative models, applied to human image synthesis. Our framework categorizes evaluations into two distinct groups: first, focusing on image qualities such as aesthetics and realism, and second, examining text conditions through concept coverage and fairness. We introduce an innovative… ▽ More In this paper, we present an empirical study introducing a nuanced evaluation framework for text-to-image (T2I) generative models, applied to human image synthesis. Our framework categorizes evaluations into two distinct groups: first, focusing on image qualities such as aesthetics and realism, and second, examining text conditions through concept coverage and fairness. We introduce an innovative aesthetic score prediction model that assesses the visual appeal of generated images and unveils the first dataset marked with low-quality regions in generated human images to facilitate automatic defect detection. Our exploration into concept coverage probes the model's effectiveness in interpreting and rendering text-based concepts accurately, while our analysis of fairness reveals biases in model outputs, with an emphasis on gender, race, and age. While our study is grounded in human imagery, this dual-faceted approach is designed with the flexibility to be applicable to other forms of image generation, enhancing our understanding of generative models and paving the way to the next generation of more sophisticated, contextually aware, and ethically attuned generative models. We will release our code, the data used for evaluating generative models and the dataset annotated with defective areas soon. △ Less

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.03460 [pdf, ps, other]

Foot Shape-Dependent Resistive Force Model for Bipedal Walkers on Granular Terrains

Authors: Xunjie Chen, Aditya Anikode, **gang Yi, Tao Liu

Abstract: Legged robots have demonstrated high efficiency and effectiveness in unstructured and dynamic environments. However, it is still challenging for legged robots to achieve rapid and efficient locomotion on deformable, yielding substrates, such as granular terrains. We present an enhanced resistive force model for bipedal walkers on soft granular terrains by introducing effective intrusion depth corr… ▽ More Legged robots have demonstrated high efficiency and effectiveness in unstructured and dynamic environments. However, it is still challenging for legged robots to achieve rapid and efficient locomotion on deformable, yielding substrates, such as granular terrains. We present an enhanced resistive force model for bipedal walkers on soft granular terrains by introducing effective intrusion depth correction. The enhanced force model captures fundamental kinetic results considering the robot foot shape, walking gait speed variation, and energy expense. The model is validated by extensive foot intrusion experiments with a bipedal robot. The results confirm the model accuracy on the given type of granular terrains. The model can be further integrated with the motion control of bipedal robotic walkers. △ Less

Submitted 5 March, 2024; originally announced March 2024.

Comments: ICRA 2024

arXiv:2403.03408 [pdf, other]

Scene Depth Estimation from Traditional Oriental Landscape Paintings

Authors: Sungho Kang, YeongHyeon Park, Hyunkyu Park, Juneho Yi

Abstract: Scene depth estimation from paintings can streamline the process of 3D sculpture creation so that visually impaired people appreciate the paintings with tactile sense. However, measuring depth of oriental landscape painting images is extremely challenging due to its unique method of depicting depth and poor preservation. To address the problem of scene depth estimation from oriental landscape pain… ▽ More Scene depth estimation from paintings can streamline the process of 3D sculpture creation so that visually impaired people appreciate the paintings with tactile sense. However, measuring depth of oriental landscape painting images is extremely challenging due to its unique method of depicting depth and poor preservation. To address the problem of scene depth estimation from oriental landscape painting images, we propose a novel framework that consists of two-step Image-to-Image translation method with CLIP-based image matching at the front end to predict the real scene image that best matches with the given oriental landscape painting image. Then, we employ a pre-trained SOTA depth estimation model for the generated real scene image. In the first step, CycleGAN converts an oriental landscape painting image into a pseudo-real scene image. We utilize CLIP to semantically match landscape photo images with an oriental landscape painting image for training CycleGAN in an unsupervised manner. Then, the pseudo-real scene image and oriental landscape painting image are fed into DiffuseIT to predict a final real scene image in the second step. Finally, we measure depth of the generated real scene image using a pre-trained depth estimation model such as MiDaS. Experimental results show that our approach performs well enough to predict real scene images corresponding to oriental landscape painting images. To the best of our knowledge, this is the first study to measure the depth of oriental landscape painting images. Our research potentially assists visually impaired people in experiencing paintings in diverse ways. We will release our code and resulting dataset. △ Less

Submitted 6 March, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

arXiv:2403.03105 [pdf, ps, other]

Biomechanical Comparison of Human Walking Locomotion on Solid Ground and Sand

Authors: Chunchu Zhu, Xunjie Chen, **gang Yi

Abstract: Current studies on human locomotion focus mainly on solid ground walking conditions. In this paper, we present a biomechanic comparison of human walking locomotion on solid ground and sand. A novel dataset containing 3-dimensional motion and biomechanical data from 20 able-bodied adults for locomotion on solid ground and sand is collected. We present the data collection methods and report the sens… ▽ More Current studies on human locomotion focus mainly on solid ground walking conditions. In this paper, we present a biomechanic comparison of human walking locomotion on solid ground and sand. A novel dataset containing 3-dimensional motion and biomechanical data from 20 able-bodied adults for locomotion on solid ground and sand is collected. We present the data collection methods and report the sensor data along with the kinematic and kinetic profiles of joint biomechanics. A comprehensive analysis of human gait and joint stiffness profiles is presented. The kinematic and kinetic analysis reveals that human walking locomotion on sand shows different ground reaction forces and joint torque profiles, compared with those patterns from walking on solid ground. These gait differences reflect that humans adopt motion control strategies for yielding terrain conditions such as sand. The dataset also provides a source of locomotion data for researchers to study human activity recognition and assistive devices for walking on different terrains. △ Less

Submitted 28 June, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: 10 pages, 10 figures, submitted to the Journal of Biomechanical Engineering

arXiv:2403.02617 [pdf, ps, other]

A Reduced-Order Resistive Force Model for Robotic Foot-Mud Interactions

Authors: Xunjie Chen, **gang Yi, Jerry Shan

Abstract: Legged robots are well-suited for broad exploration tasks in complex environments with yielding terrain. Understanding robotic foot-terrain interactions is critical for safe locomotion and walking efficiency for legged robots. This paper presents a reduced-order resistive-force model for robotic-foot/mud interactions. We focus on vertical robot locomotion on mud and propose a visco-elasto-plastic… ▽ More Legged robots are well-suited for broad exploration tasks in complex environments with yielding terrain. Understanding robotic foot-terrain interactions is critical for safe locomotion and walking efficiency for legged robots. This paper presents a reduced-order resistive-force model for robotic-foot/mud interactions. We focus on vertical robot locomotion on mud and propose a visco-elasto-plastic analog to model the foot/mud interaction forces. Dynamic behaviors such as mud visco-elasticity, withdrawing cohesive suction, and yielding are explicitly discussed with the proposed model. Besides comparing with dry/wet granular materials, mud intrusion experiments are conducted to validate the force model. The dependency of the model parameter on water content and foot velocity is also studied to reveal in-depth model properties under various conditions. The proposed force model potentially provides an enabling tool for legged robot locomotion and control on muddy terrain. △ Less

Submitted 4 March, 2024; originally announced March 2024.

Comments: AIM 2024

arXiv:2403.01273 [pdf, other]

NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention

Authors: Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava

Abstract: Large language model inference on Central Processing Units (CPU) is challenging due to the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention computations. In this paper, we argue that there is a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability of CPUs t… ▽ More Large language model inference on Central Processing Units (CPU) is challenging due to the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention computations. In this paper, we argue that there is a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability of CPUs to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention achieves the computation of attention scores using repeated fast accesses to SIMD registers despite their highly limited sizes. Moreover, NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well, and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2$\times$ at 16k context length. Our results are reproducible at https://github.com/tonyzhang617/nomad-dist. △ Less

Submitted 2 March, 2024; originally announced March 2024.

arXiv:2403.00613 [pdf, other]

doi 10.1109/TRO.2024.3392077

Complete and Near-Optimal Robotic Crack Coverage and Filling in Civil Infrastructure

Authors: Vishnu Veeraraghavan, Kyle Hunte, **gang Yi, Kaiyan Yu

Abstract: We present a simultaneous sensor-based inspection and footprint coverage (SIFC) planning and control design with applications to autonomous robotic crack map** and filling. The main challenge of the SIFC problem lies in the coupling of complete sensing (for map**) and robotic footprint (for filling) coverage tasks. Initially, we assume known target information (e.g., crack) and employ classic… ▽ More We present a simultaneous sensor-based inspection and footprint coverage (SIFC) planning and control design with applications to autonomous robotic crack map** and filling. The main challenge of the SIFC problem lies in the coupling of complete sensing (for map**) and robotic footprint (for filling) coverage tasks. Initially, we assume known target information (e.g., crack) and employ classic cell decomposition methods to achieve complete sensing coverage of the workspace and complete robotic footprint coverage using the least-cost route. Subsequently, we generalize the algorithm to handle unknown target information, allowing the robot to scan and incrementally construct the target graph online while conducting robotic footprint coverage. The online polynomial-time SIFC planning algorithm minimizes the total robot traveling distance, guarantees complete sensing coverage of the entire workspace, and achieves near-optimal robotic footprint coverage, as demonstrated through empirical experiments. For the demonstrated application, we design coordinated nozzle motion control with the planned robot trajectory to efficiently fill all cracks within the robot's footprint. Experimental results are presented to illustrate the algorithm's design, performance, and comparisons. The SIFC algorithm offers a high-efficiency motion planning solution for various robotic applications requiring simultaneous sensing and actuation coverage. △ Less

Submitted 1 March, 2024; originally announced March 2024.

Journal ref: in IEEE Transactions on Robotics, vol. 40, pp. 2850-2867, 2024

arXiv:2402.12765 [pdf, other]

GOOD: Towards Domain Generalized Orientated Object Detection

Authors: Qi Bi, Beichen Zhou, **gjun Yi, Wei Ji, Haolan Zhan, Gui-Song Xia

Abstract: Oriented object detection has been rapidly developed in the past few years, but most of these methods assume the training and testing images are under the same statistical distribution, which is far from reality. In this paper, we propose the task of domain generalized oriented object detection, which intends to explore the generalization of oriented object detectors on arbitrary unseen target dom… ▽ More Oriented object detection has been rapidly developed in the past few years, but most of these methods assume the training and testing images are under the same statistical distribution, which is far from reality. In this paper, we propose the task of domain generalized oriented object detection, which intends to explore the generalization of oriented object detectors on arbitrary unseen target domains. Learning domain generalized oriented object detectors is particularly challenging, as the cross-domain style variation not only negatively impacts the content representation, but also leads to unreliable orientation predictions. To address these challenges, we propose a generalized oriented object detector (GOOD). After style hallucination by the emerging contrastive language-image pre-training (CLIP), it consists of two key components, namely, rotation-aware content consistency learning (RAC) and style consistency learning (SEC). The proposed RAC allows the oriented object detector to learn stable orientation representation from style-diversified samples. The proposed SEC further stabilizes the generalization ability of content representation from different image styles. Extensive experiments on multiple cross-domain settings show the state-of-the-art performance of GOOD. Source code will be publicly available. △ Less

Submitted 20 February, 2024; originally announced February 2024.

Comments: 8 pages, 6 figures

arXiv:2402.10055 [pdf]

Robust semi-automatic vessel tracing in the human retinal image by an instance segmentation neural network

Authors: Siyi Chen, Amir H. Kashani, Ji Yi

Abstract: The morphology and hierarchy of the vascular systems are essential for perfusion in supporting metabolism. In human retina, one of the most energy-demanding organs, retinal circulation nourishes the entire inner retina by an intricate vasculature emerging and remerging at the optic nerve head (ONH). Thus, tracing the vascular branching from ONH through the vascular tree can illustrate vascular hie… ▽ More The morphology and hierarchy of the vascular systems are essential for perfusion in supporting metabolism. In human retina, one of the most energy-demanding organs, retinal circulation nourishes the entire inner retina by an intricate vasculature emerging and remerging at the optic nerve head (ONH). Thus, tracing the vascular branching from ONH through the vascular tree can illustrate vascular hierarchy and allow detailed morphological quantification, and yet remains a challenging task. Here, we presented a novel approach for a robust semi-automatic vessel tracing algorithm on human fundus images by an instance segmentation neural network (InSegNN). Distinct from semantic segmentation, InSegNN separates and labels different vascular trees individually and therefore enable tracing each tree throughout its branching. We have built-in three strategies to improve robustness and accuracy with temporal learning, spatial multi-sampling, and dynamic probability map. We achieved 83% specificity, and 50% improvement in Symmetric Best Dice (SBD) compared to literature, and outperformed baseline U-net. We have demonstrated tracing individual vessel trees from fundus images, and simultaneously retain the vessel hierarchy information. InSegNN paves a way for any subsequent morphological analysis of vascular morphology in relation to retinal diseases. △ Less

Submitted 15 February, 2024; originally announced February 2024.

arXiv:2401.14132 [pdf, other]

Enabling Cross-Camera Collaboration for Video Analytics on Distributed Smart Cameras

Authors: Chulhong Min, Juheon Yi, Utku Gunay Acer, Fahim Kawsar

Abstract: Overlap** cameras offer exciting opportunities to view a scene from different angles, allowing for more advanced, comprehensive and robust analysis. However, existing visual analytics systems for multi-camera streams are mostly limited to (i) per-camera processing and aggregation and (ii) workload-agnostic centralized processing architectures. In this paper, we present Argus, a distributed video… ▽ More Overlap** cameras offer exciting opportunities to view a scene from different angles, allowing for more advanced, comprehensive and robust analysis. However, existing visual analytics systems for multi-camera streams are mostly limited to (i) per-camera processing and aggregation and (ii) workload-agnostic centralized processing architectures. In this paper, we present Argus, a distributed video analytics system with cross-camera collaboration on smart cameras. We identify multi-camera, multi-target tracking as the primary task of multi-camera video analytics and develop a novel technique that avoids redundant, processing-heavy identification tasks by leveraging object-wise spatio-temporal association in the overlap** fields of view across multiple cameras. We further develop a set of techniques to perform these operations across distributed cameras without cloud support at low latency by (i) dynamically ordering the camera and object inspection sequence and (ii) flexibly distributing the workload across smart cameras, taking into account network transmission and heterogeneous computational capacities. Evaluation of three real-world overlap** camera datasets with two Nvidia Jetson devices shows that Argus reduces the number of object identifications and end-to-end latency by up to 7.13x and 2.19x (4.86x and 1.60x compared to the state-of-the-art), while achieving comparable tracking quality. △ Less

Submitted 26 January, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: 18 pages, under review

arXiv:2401.11257 [pdf, other]

Measuring Policy Distance for Multi-Agent Reinforcement Learning

Authors: Tianyi Hu, Zhiqiang Pu, Xiaolin Ai, Tenghai Qiu, Jianqiang Yi

Abstract: Diversity plays a crucial role in improving the performance of multi-agent reinforcement learning (MARL). Currently, many diversity-based methods have been developed to overcome the drawbacks of excessive parameter sharing in traditional MARL. However, there remains a lack of a general metric to quantify policy differences among agents. Such a metric would not only facilitate the evaluation of the… ▽ More Diversity plays a crucial role in improving the performance of multi-agent reinforcement learning (MARL). Currently, many diversity-based methods have been developed to overcome the drawbacks of excessive parameter sharing in traditional MARL. However, there remains a lack of a general metric to quantify policy differences among agents. Such a metric would not only facilitate the evaluation of the diversity evolution in multi-agent systems, but also provide guidance for the design of diversity-based MARL algorithms. In this paper, we propose the multi-agent policy distance (MAPD), a general tool for measuring policy differences in MARL. By learning the conditional representations of agents' decisions, MAPD can computes the policy distance between any pair of agents. Furthermore, we extend MAPD to a customizable version, which can quantify differences among agent policies on specified aspects. Based on the online deployment of MAPD, we design a multi-agent dynamic parameter sharing (MADPS) algorithm as an example of the MAPD's applications. Extensive experiments demonstrate that our method is effective in measuring differences in agent policies and specific behavioral tendencies. Moreover, in comparison to other methods of parameter sharing, MADPS exhibits superior performance. △ Less

Submitted 28 January, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

Comments: 9 pages, 6 figures

arXiv:2401.08860 [pdf, other]

Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization

Authors: Qi Bi, Wei Ji, **gjun Yi, Haolan Zhan, Gui-Song Xia

Abstract: High-quality annotation of fine-grained visual categories demands great expert knowledge, which is taxing and time consuming. Alternatively, learning fine-grained visual representation from enormous unlabeled images (e.g., species, brands) by self-supervised learning becomes a feasible solution. However, recent researches find that existing self-supervised learning methods are less qualified to re… ▽ More High-quality annotation of fine-grained visual categories demands great expert knowledge, which is taxing and time consuming. Alternatively, learning fine-grained visual representation from enormous unlabeled images (e.g., species, brands) by self-supervised learning becomes a feasible solution. However, recent researches find that existing self-supervised learning methods are less qualified to represent fine-grained categories. The bottleneck lies in that the pre-text representation is built from every patch-wise embedding, while fine-grained categories are only determined by several key patches of an image. In this paper, we propose a Cross-level Multi-instance Distillation (CMD) framework to tackle the challenge. Our key idea is to consider the importance of each image patch in determining the fine-grained pre-text representation by multiple instance learning. To comprehensively learn the relation between informative patches and fine-grained semantics, the multi-instance knowledge distillation is implemented on both the region/image crop pairs from the teacher and student net, and the region-image crops inside the teacher / student net, which we term as intra-level multi-instance distillation and inter-level multi-instance distillation. Extensive experiments on CUB-200-2011, Stanford Cars and FGVC Aircraft show that the proposed method outperforms the contemporary method by upto 10.14% and existing state-of-the-art self-supervised learning approaches by upto 19.78% on both top-1 accuracy and Rank-1 retrieval metric. △ Less

Submitted 26 February, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: work in progress

arXiv:2401.03650 [pdf, other]

DDD: A Perceptually Superior Low-Response-Time DNN-based Declipper

Authors: Jayeon Yi, Junghyun Koo, Kyogu Lee

Abstract: Clip** is a common nonlinear distortion that occurs whenever the input or output of an audio system exceeds the supported range. This phenomenon undermines not only the perception of speech quality but also downstream processes utilizing the disrupted signal. Therefore, a real-time-capable, robust, and low-response-time method for speech declip** (SD) is desired. In this work, we introduce DDD… ▽ More Clip** is a common nonlinear distortion that occurs whenever the input or output of an audio system exceeds the supported range. This phenomenon undermines not only the perception of speech quality but also downstream processes utilizing the disrupted signal. Therefore, a real-time-capable, robust, and low-response-time method for speech declip** (SD) is desired. In this work, we introduce DDD (Demucs-Discriminator-Declipper), a real-time-capable speech-declip** deep neural network (DNN) that requires less response time by design. We first observe that a previously untested real-time-capable DNN model, Demucs, exhibits a reasonable declip** performance. Then we utilize adversarial learning objectives to increase the perceptual quality of output speech without additional inference overhead. Subjective evaluations on harshly clipped speech shows that DDD outperforms the baselines by a wide margin in terms of speech quality. We perform detailed waveform and spectral analyses to gain an insight into the output behavior of DDD in comparison to the baselines. Finally, our streaming simulations also show that DDD is capable of sub-decisecond mean response times, outperforming the state-of-the-art DNN approach by a factor of six. △ Less

Submitted 7 January, 2024; originally announced January 2024.

Comments: To appear, ICASSP 2024. Demo samples at https://stet-stet.github.io/DDD, repo at https://github.com/stet-stet/DDD

arXiv:2401.03488 [pdf, other]

Data-Driven Subsampling in the Presence of an Adversarial Actor

Authors: Abu Shafin Mohammad Mahdee Jameel, Ahmed P. Mohamed, **ho Yi, Aly El Gamal, Akshay Malhotra

Abstract: Deep learning based automatic modulation classification (AMC) has received significant attention owing to its potential applications in both military and civilian use cases. Recently, data-driven subsampling techniques have been utilized to overcome the challenges associated with computational complexity and training time for AMC. Beyond these direct advantages of data-driven subsampling, these me… ▽ More Deep learning based automatic modulation classification (AMC) has received significant attention owing to its potential applications in both military and civilian use cases. Recently, data-driven subsampling techniques have been utilized to overcome the challenges associated with computational complexity and training time for AMC. Beyond these direct advantages of data-driven subsampling, these methods also have regularizing properties that may improve the adversarial robustness of the modulation classifier. In this paper, we investigate the effects of an adversarial attack on an AMC system that employs deep learning models both for AMC and for subsampling. Our analysis shows that subsampling itself is an effective deterrent to adversarial attacks. We also uncover the most efficient subsampling strategy when an adversarial attack on both the classifier and the subsampler is anticipated. △ Less

Submitted 7 January, 2024; originally announced January 2024.

Comments: Accepted for publication at ICMLCN 2024

arXiv:2312.14197 [pdf, other]

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Authors: **gwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu

Abstract: The integration of large language models (LLMs) with external content has enabled more up-to-date and wide-ranging applications of LLMs, such as Microsoft Copilot. However, this integration has also exposed LLMs to the risk of indirect prompt injection attacks, where an attacker can embed malicious instructions within external content, compromising LLM output and causing responses to deviate from… ▽ More The integration of large language models (LLMs) with external content has enabled more up-to-date and wide-ranging applications of LLMs, such as Microsoft Copilot. However, this integration has also exposed LLMs to the risk of indirect prompt injection attacks, where an attacker can embed malicious instructions within external content, compromising LLM output and causing responses to deviate from user expectations. To investigate this important but underexplored issue, we introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks. Based on the evaluation, our work makes a key analysis of the underlying reason for the success of the attack, namely the inability of LLMs to distinguish between instructions and external content and the absence of LLMs' awareness to not execute instructions within external content. Building upon this analysis, we develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training accordingly. Experimental results demonstrate that black-box defenses are highly effective in mitigating these attacks, while the white-box defense reduces the attack success rate to near-zero levels. Overall, our work systematically investigates indirect prompt injection attacks by introducing a benchmark, analyzing the underlying reason for the success of the attack, and develo** an initial set of defenses. △ Less

Submitted 8 March, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

arXiv:2312.10155 [pdf, ps, other]

Gaussian Process-Based Learning Control of Underactuated Balance Robots with an External and Internal Convertible Modeling Structure

Authors: Feng Han, **gang Yi

Abstract: External and internal convertible (EIC) form-based motion control is one of the effective designs of simultaneously trajectory tracking and balance for underactuated balance robots. Under certain conditions, the EIC-based control design however leads to uncontrolled robot motion. We present a Gaussian process (GP)-based data-driven learning control for underactuated balance robots with the EIC mod… ▽ More External and internal convertible (EIC) form-based motion control is one of the effective designs of simultaneously trajectory tracking and balance for underactuated balance robots. Under certain conditions, the EIC-based control design however leads to uncontrolled robot motion. We present a Gaussian process (GP)-based data-driven learning control for underactuated balance robots with the EIC modeling structure. Two GP-based learning controllers are presented by using the EIC structure property. The partial EIC (PEIC)-based control design partitions the robotic dynamics into a fully actuated subsystem and one reduced-order underactuated system. The null-space EIC (NEIC)-based control compensates for the uncontrolled motion in a subspace, while the other closed-loop dynamics are not affected. Under the PEIC- and NEIC-based, the tracking and balance tasks are guaranteed and convergence rate and bounded errors are achieved without causing any uncontrolled motion by the original EIC-based control. We validate the results and demonstrate the GP-based learning control design performance using two inverted pendulum platforms. △ Less

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2312.09651 [pdf, other]

What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection

Authors: Xiaohui Zhang, Jiangyan Yi, Chenglong Wang, Chuyuan Zhang, Siding Zeng, Jianhua Tao

Abstract: The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the… ▽ More The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the emergent effective approaches is continual learning. In this paper, we propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection. The fundamental concept underlying RWM involves categorizing all classes into two groups: those with compact feature distributions across tasks, such as genuine audio, and those with more spread-out distributions, like various types of fake audio. These distinctions are quantified by means of the in-class cosine distance, which subsequently serves as the basis for RWM to introduce a trainable gradient modification direction for distinct data types. Experimental evaluations against mainstream continual learning methods reveal the superiority of RWM in terms of knowledge acquisition and mitigating forgetting in audio deepfake detection. Furthermore, RWM's applicability extends beyond audio deepfake detection, demonstrating its potential significance in diverse machine learning domains such as image recognition. △ Less

Submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted by the main track The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

arXiv:2312.06632 [pdf, other]

Control Risk for Potential Misuse of Artificial Intelligence in Science

Authors: Jiyan He, Weitao Feng, Yaosen Min, **gwei Yi, Kunsheng Tang, Shuai Li, Jie Zhang, Kejiang Chen, Wenbo Zhou, Xing Xie, Weiming Zhang, Nenghai Yu, Shuxin Zheng

Abstract: The expanding application of Artificial Intelligence (AI) in scientific fields presents unprecedented opportunities for discovery and innovation. However, this growth is not without risks. AI models in science, if misused, can amplify risks like creation of harmful substances, or circumvention of established regulations. In this study, we aim to raise awareness of the dangers of AI misuse in scien… ▽ More The expanding application of Artificial Intelligence (AI) in scientific fields presents unprecedented opportunities for discovery and innovation. However, this growth is not without risks. AI models in science, if misused, can amplify risks like creation of harmful substances, or circumvention of established regulations. In this study, we aim to raise awareness of the dangers of AI misuse in science, and call for responsible AI development and use in this domain. We first itemize the risks posed by AI in scientific contexts, then demonstrate the risks by highlighting real-world examples of misuse in chemical science. These instances underscore the need for effective risk management strategies. In response, we propose a system called SciGuard to control misuse risks for AI models in science. We also propose a red-teaming benchmark SciMT-Safety to assess the safety of different systems. Our proposed SciGuard shows the least harmful impact in the assessment without compromising performance in benign tests. Finally, we highlight the need for a multidisciplinary and collaborative effort to ensure the safe and ethical use of AI models in science. We hope that our study can spark productive discussions on using AI ethically in science among researchers, practitioners, policymakers, and the public, to maximize benefits and minimize the risks of misuse. △ Less

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.06172 [pdf, other]

Decoupling SQL Query Hardness Parsing for Text-to-SQL

Authors: Jiawen Yi, Guo Chen

Abstract: The fundamental goal of the Text-to-SQL task is to translate natural language question into SQL query. Current research primarily emphasizes the information coupling between natural language questions and schemas, and significant progress has been made in this area. The natural language questions as the primary task requirements source determines the hardness of correspond SQL queries, the correla… ▽ More The fundamental goal of the Text-to-SQL task is to translate natural language question into SQL query. Current research primarily emphasizes the information coupling between natural language questions and schemas, and significant progress has been made in this area. The natural language questions as the primary task requirements source determines the hardness of correspond SQL queries, the correlation between the two always be ignored. However, when the correlation between questions and queries was decoupled, it may simplify the task. In this paper, we introduce an innovative framework for Text-to-SQL based on decoupling SQL query hardness parsing. This framework decouples the Text-to-SQL task based on query hardness by analyzing questions and schemas, simplifying the multi-hardness task into a single-hardness challenge. This greatly reduces the parsing pressure on the language model. We evaluate our proposed framework and achieve a new state-of-the-art performance of fine-turning methods on Spider dev. △ Less

Submitted 29 December, 2023; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: 10 pages, 3 figures

arXiv:2311.13687 [pdf, other]

Beat-Aligned Spectrogram-to-Sequence Generation of Rhythm-Game Charts

Authors: Jayeon Yi, Sungho Lee, Kyogu Lee

Abstract: In the heart of "rhythm games" - games where players must perform actions in sync with a piece of music - are "charts", the directives to be given to players. We newly formulate chart generation as a sequence generation task and train a Transformer using a large dataset. We also introduce tempo-informed preprocessing and training procedures, some of which are suggested to be integral for a success… ▽ More In the heart of "rhythm games" - games where players must perform actions in sync with a piece of music - are "charts", the directives to be given to players. We newly formulate chart generation as a sequence generation task and train a Transformer using a large dataset. We also introduce tempo-informed preprocessing and training procedures, some of which are suggested to be integral for a successful training. Our model is found to outperform the baselines on a large dataset, and is also found to benefit from pretraining and finetuning. △ Less

Submitted 22 November, 2023; originally announced November 2023.

Comments: ISMIR 2023 LBD. Demo videos and code at stet-stet.github.io/goct

arXiv:2311.07613 [pdf]

A Physics-informed Machine Learning-based Control Method for Nonlinear Dynamic Systems with Highly Noisy Measurements

Authors: Mason Ma, Jiajie Wu, Chase Post, Tony Shi, **gang Yi, Tony Schmitz, Hong Wang

Abstract: This study presents a physics-informed machine learning-based control method for nonlinear dynamic systems with highly noisy measurements. Existing data-driven control methods that use machine learning for system identification cannot effectively cope with highly noisy measurements, resulting in unstable control performance. To address this challenge, the present study extends current physics-info… ▽ More This study presents a physics-informed machine learning-based control method for nonlinear dynamic systems with highly noisy measurements. Existing data-driven control methods that use machine learning for system identification cannot effectively cope with highly noisy measurements, resulting in unstable control performance. To address this challenge, the present study extends current physics-informed machine learning capabilities for modeling nonlinear dynamics with control and integrates them into a model predictive control framework. To demonstrate the capability of the proposed method we test and validate with two noisy nonlinear dynamic systems: the chaotic Lorenz 3 system, and turning machine tool. Analysis of the results illustrate that the proposed method outperforms state-of-the-art benchmarks as measured by both modeling accuracy and control performance for nonlinear dynamic systems under high-noise conditions. △ Less

Submitted 11 November, 2023; originally announced November 2023.

arXiv:2311.03965 [pdf, other]

Fast Sun-aligned Outdoor Scene Relighting based on TensoRF

Authors: Yeon** Chang, Yearim Kim, Seunghyeon Seo, Jung Yi, Nojun Kwak

Abstract: In this work, we introduce our method of outdoor scene relighting for Neural Radiance Fields (NeRF) named Sun-aligned Relighting TensoRF (SR-TensoRF). SR-TensoRF offers a lightweight and rapid pipeline aligned with the sun, thereby achieving a simplified workflow that eliminates the need for environment maps. Our sun-alignment strategy is motivated by the insight that shadows, unlike viewpoint-dep… ▽ More In this work, we introduce our method of outdoor scene relighting for Neural Radiance Fields (NeRF) named Sun-aligned Relighting TensoRF (SR-TensoRF). SR-TensoRF offers a lightweight and rapid pipeline aligned with the sun, thereby achieving a simplified workflow that eliminates the need for environment maps. Our sun-alignment strategy is motivated by the insight that shadows, unlike viewpoint-dependent albedo, are determined by light direction. We directly use the sun direction as an input during shadow generation, simplifying the requirements of the inference process significantly. Moreover, SR-TensoRF leverages the training efficiency of TensoRF by incorporating our proposed cubemap concept, resulting in notable acceleration in both training and rendering processes compared to existing methods. △ Less

Submitted 7 November, 2023; originally announced November 2023.

Comments: WACV 2024

arXiv:2311.02536 [pdf, other]

Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models

Authors: **gru Yi, Burak Uzkent, Oana Ignat, Zili Li, Amanmeet Garg, Xiang Yu, Linda Liu

Abstract: Grounding-based vision and language models have been successfully applied to low-level vision tasks, aiming to precisely locate objects referred in captions. The effectiveness of grounding representation learning heavily relies on the scale of the training dataset. Despite being a useful data enrichment strategy, data augmentation has received minimal attention in existing vision and language task… ▽ More Grounding-based vision and language models have been successfully applied to low-level vision tasks, aiming to precisely locate objects referred in captions. The effectiveness of grounding representation learning heavily relies on the scale of the training dataset. Despite being a useful data enrichment strategy, data augmentation has received minimal attention in existing vision and language tasks as augmentation for image-caption pairs is non-trivial. In this study, we propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations. Specifically, we apply text-conditioned color jittering and horizontal flip** to ensure semantic consistency between images and captions. To guarantee image-caption correspondence in the training samples, we modify the captions according to pre-defined keywords when applying horizontal flip**. Additionally, inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation. While we demonstrate our data augmentation method with MDETR framework, the proposed approach is applicable to common grounding-based vision and language tasks with other frameworks. Finally, we show that image encoder pretrained on large-scale image and language datasets (such as CLIP) can further improve the results. Through extensive experiments on three commonly applied datasets: Flickr30k, referring expressions and GQA, our method demonstrates advanced performance over the state-of-the-arts with various metrics. Code can be found in https://github.com/amzn/augment-the-pairs-wacv2024. △ Less

Submitted 4 November, 2023; originally announced November 2023.

Comments: Accepted to WACV2024

arXiv:2310.19468 [pdf, other]

Regret-Minimization Algorithms for Multi-Agent Cooperative Learning Systems

Authors: Jialin Yi

Abstract: A Multi-Agent Cooperative Learning (MACL) system is an artificial intelligence (AI) system where multiple learning agents work together to complete a common task. Recent empirical success of MACL systems in various domains (e.g. traffic control, cloud computing, robotics) has sparked active research into the design and analysis of MACL systems for sequential decision making problems. One important… ▽ More A Multi-Agent Cooperative Learning (MACL) system is an artificial intelligence (AI) system where multiple learning agents work together to complete a common task. Recent empirical success of MACL systems in various domains (e.g. traffic control, cloud computing, robotics) has sparked active research into the design and analysis of MACL systems for sequential decision making problems. One important metric of the learning algorithm for decision making problems is its regret, i.e. the difference between the highest achievable reward and the actual reward that the algorithm gains. The design and development of a MACL system with low-regret learning algorithms can create huge economic values. In this thesis, I analyze MACL systems for different sequential decision making problems. Concretely, the Chapter 3 and 4 investigate the cooperative multi-agent multi-armed bandit problems, with full-information or bandit feedback, in which multiple learning agents can exchange their information through a communication network and the agents can only observe the rewards of the actions they choose. Chapter 5 considers the communication-regret trade-off for online convex optimization in the distributed setting. Chapter 6 discusses how to form high-productive teams for agents based on their unknown but fixed types using adaptive incremental matchings. For the above problems, I present the regret lower bounds for feasible learning algorithms and provide the efficient algorithms to achieve this bound. The regret bounds I present in Chapter 3, 4 and 5 quantify how the regret depends on the connectivity of the communication network and the communication delay, thus giving useful guidance on design of the communication protocol in MACL systems △ Less

Submitted 30 October, 2023; originally announced October 2023.

Comments: Thesis submitted to London School of Economics and Political Science for PhD in Statistics

arXiv:2310.18701 [pdf, other]

Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards

Authors: Bo Xue, Yimu Wang, Yuanyu Wan, **feng Yi, Lijun Zhang

Abstract: This paper investigates the problem of generalized linear bandits with heavy-tailed rewards, whose $(1+ε)$-th moment is bounded for some $ε\in (0,1]$. Although there exist methods for generalized linear bandits, most of them focus on bounded or sub-Gaussian rewards and are not well-suited for many real-world scenarios, such as financial markets and web-advertising. To address this issue, we propos… ▽ More This paper investigates the problem of generalized linear bandits with heavy-tailed rewards, whose $(1+ε)$-th moment is bounded for some $ε\in (0,1]$. Although there exist methods for generalized linear bandits, most of them focus on bounded or sub-Gaussian rewards and are not well-suited for many real-world scenarios, such as financial markets and web-advertising. To address this issue, we propose two novel algorithms based on truncation and mean of medians. These algorithms achieve an almost optimal regret bound of $\widetilde{O}(dT^{\frac{1}{1+ε}})$, where $d$ is the dimension of contextual information and $T$ is the time horizon. Our truncation-based algorithm supports online learning, distinguishing it from existing truncation-based approaches. Additionally, our mean-of-medians-based algorithm requires only $O(\log T)$ rewards and one estimator per epoch, making it more practical. Moreover, our algorithms improve the regret bounds by a logarithmic factor compared to existing algorithms when $ε=1$. Numerical experimental results confirm the merits of our algorithms. △ Less

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2310.18303 [pdf, other]

Socially Cognizant Robotics for a Technology Enhanced Society

Authors: Kristin J. Dana, Clinton Andrews, Kostas Bekris, Jacob Feldman, Matthew Stone, Pernille Hemmer, Aaron Mazzeo, Hal Salzman, **gang Yi

Abstract: Emerging applications of robotics, and concerns about their impact, require the research community to put human-centric objectives front-and-center. To meet this challenge, we advocate an interdisciplinary approach, socially cognizant robotics, which synthesizes technical and social science methods. We argue that this approach follows from the need to empower stakeholder participation (from synchr… ▽ More Emerging applications of robotics, and concerns about their impact, require the research community to put human-centric objectives front-and-center. To meet this challenge, we advocate an interdisciplinary approach, socially cognizant robotics, which synthesizes technical and social science methods. We argue that this approach follows from the need to empower stakeholder participation (from synchronous human feedback to asynchronous societal assessment) in sha** AI-driven robot behavior at all levels, and leads to a range of novel research perspectives and problems both for improving robots' interactions with individuals and impacts on society. Drawing on these arguments, we develop best practices for socially cognizant robot design that balance traditional technology-based metrics (e.g. efficiency, precision and accuracy) with critically important, albeit challenging to measure, human and society-based metrics. △ Less

Submitted 27 October, 2023; originally announced October 2023.

arXiv:2310.11665 [pdf, other]

Forward Kinematics of Object Transporting by a Multi-Robot System with a Deformable Sheet

Authors: Jiawei Hu, Wenhang Liu, **gang Yi, Zhenhua Xiong

Abstract: We present object handling and transporting by a multi-robot team with a deformable sheet as a carrier. Due to the deformability of the sheet and the high dimension of the whole system, it is challenging to clearly describe all the possible positions of the object on the sheet for a given formation of the multi-robot system. A complete forward kinematics (FK) method is proposed for object handling… ▽ More We present object handling and transporting by a multi-robot team with a deformable sheet as a carrier. Due to the deformability of the sheet and the high dimension of the whole system, it is challenging to clearly describe all the possible positions of the object on the sheet for a given formation of the multi-robot system. A complete forward kinematics (FK) method is proposed for object handling by an $N$-mobile robot team with a deformable sheet. Based on the virtual variable cables model (VVCM), a constrained quadratic problem (CQP) is formulated by combining the geometric constraints and minimum potential energy conditions of the system. Analytical solutions to the CQP are presented and then further verified with the force closure condition. We present an FK algorithm based on the FK method to obtain all possible solutions with the given initial sheet shape and the robot team formation. We demonstrate the effectiveness, completeness, and efficiency of the FK algorithm with experimental results and case study examples. △ Less

Submitted 23 January, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: 8 pages, 6 figures, has been submitted to IEEE Robotics and Automation Letters

Showing 1–50 of 275 results for author: Yi, J