Active Inference as a Model of Agency

Authors: Lancelot Da Costa, Samuel Tenka, Dominic Zhao, Noor Sajid

Abstract: Is there a canonical way to think of agency beyond reward maximisation? In this paper, we show that any type of behaviour complying with physically sound assumptions about how macroscopic biological agents interact with the world canonically integrates exploration and exploitation in the sense of minimising risk and ambiguity about states of the world. This description, known as active inference, refines the free energy principle, a popular descriptive framework for action and perception originating in neuroscience. Active inference provides a normative Bayesian framework to simulate and model agency that is widely used in behavioural neuroscience, reinforcement learning (RL) and robotics. The usefulness of active inference for RL is three-fold. \emph{a}) Active inference provides a principled solution to the exploration-exploitation dilemma that usefully simulates biological agency. \emph{b}) It provides an explainable recipe to simulate behaviour, whence behaviour follows as an explainable mixture of exploration and exploitation under a generative world model, and all differences in behaviour are explicit in differences in world model. \emph{c}) This framework is universal in the sense that it is theoretically possible to rewrite any RL algorithm conforming to the descriptive assumptions of active inference as an active inference algorithm. Thus, active inference can be used as a tool to uncover and compare the commitments and assumptions of more specific models of agency.

Submitted 23 January, 2024; originally announced January 2024.

Comments: Accepted in RLDM2022 for the workshop 'RL as a model of agency'

arXiv:2401.11997 [pdf, other]

PAC.V. The Roles of Mass and Environment in the Quenching of Galaxies

Authors: Yun Zheng, Kun Xu, Y. P. Jing, Donghai Zhao, Hongyu Gao, Xiaolin Luo, Jianxin Han, Yu Yu, Ming Li

Abstract: The roles that mass and environment play in galaxy quenching are still under debate. Leveraging the Photometric objects Around Cosmic webs (PAC) method, we analyze the excess surface distribution $\bar{n}_2w_{\rm{p}}(r_{\rm{p}})$ of photometric galaxies of different colors (rest-frame $u-r$) within the stellar mass range of $10^{9.0}M_{\odot}\sim10^{11.0}M_{\odot}$ around spectroscopic massive central galaxies ($10^{10.9}\sim10^{11.7}M_{\odot}$) in the redshift interval $0<z_s<0.7$, utilizing data from the Hyper Suprime-Cam Subaru Strategic Program and the spectroscopic samples of the Sloan Digital Sky Survey (i.e. Main, LOWZ and CMASS samples). We find that both mass and environment quenching contribute to the evolution of companion galaxies. To isolate the environment effect, we quantify the quenched fraction excess (QFE) of companion galaxies encircling massive central galaxies within $0.01h^{-1}{\rm{Mpc}}<r_{\rm{p}}<20h^{-1}\rm{Mpc}$, representing the surplus quenched fraction relative to the average. We find that the high density halo environment affects the star formation quenching up to about three times the virial radius, and this effect becomes stronger at lower redshift. We also find that even after being scaled by the virial radius, the environment quenching efficiency is higher for more massive halos or for companion galaxies of higher stellar mass, though the trends are quite weak. We present a fitting formula that comprehensively captures the QFE across central and companion stellar mass bins, halo-centric distance bins, and redshift bins, offering a valuable tool for constraining galaxy formation models. Furthermore, we have made a quantitative comparison with Illustris-TNG that underscores some important differences, particularly in the excessive quenching of low-mass companion galaxies ($<10^{9.5}M_{\odot}$) by TNG.

Submitted 22 January, 2024; originally announced January 2024.

Comments: 23 pages, 14 figures. Submitted to ApJ. Comments welcome :-)

arXiv:2401.11687 [pdf, other]

TIM: An Efficient Temporal Interaction Module for Spiking Transformer

Authors: Sicheng Shen, Dongcheng Zhao, Guobin Shen, Yi Zeng

Abstract: Spiking Neural Networks (SNNs), as the third generation of neural networks, have gained prominence for their biological plausibility and computational efficiency, especially in processing diverse datasets. The integration of attention mechanisms, inspired by advancements in neural network architectures, has led to the development of Spiking Transformers. These have shown promise in enhancing SNNs' capabilities, particularly in the realms of both static and neuromorphic datasets. Despite their progress, a discernible gap exists in these systems, specifically in the Spiking Self Attention (SSA) mechanism's effectiveness in leveraging the temporal processing potential of SNNs. To address this, we introduce the Temporal Interaction Module (TIM), a novel, convolution-based enhancement designed to augment the temporal data processing abilities within SNN architectures. TIM's integration into existing SNN frameworks is seamless and efficient, requiring minimal additional parameters while significantly boosting their temporal information handling capabilities. Through rigorous experimentation, TIM has demonstrated its effectiveness in exploiting temporal information, leading to state-of-the-art performance across various neuromorphic datasets. The code is available at https://github.com/BrainCog-X/Brain-Cog/tree/main/examples/TIM.

Submitted 9 May, 2024; v1 submitted 21 January, 2024; originally announced January 2024.

Comments: Accepted by the 33rd International Joint Conference on Artificial Intelligence(IJCAI 2024)

arXiv:2401.10450 [pdf, other]

Observation of tunable topological polaritons in a cavity waveguide

Authors: Dong Zhao, Ziyao Wang, Linyun Yang, Yuxin Zhong, Xiang Xi, Zhenxiao Zhu, Maohua Gong, Qingan Tu, Yan Meng, Bei Yan, Ce Shang, Zhen Gao

Abstract: Topological polaritons characterized by light-matter interactions have become a pivotal platform in exploring new topological phases of matter. Recent theoretical advances unveiled a novel mechanism for tuning topological phases of polaritons by modifying the surrounding photonic environment (light-matter interactions) without altering the lattice structure. Here, by embedding a dimerized chain of microwave helical resonators (electric dipole emitters) in a metallic cavity waveguide, we report the pioneering observation of tunable topological phases of polaritons by varying the cavity width which governs the surrounding photonic environment and the strength of light-matter interactions. Moreover, we experimentally identified a new type of topological phase transition which includes three non-coincident critical points in the parameter space: the closure of the polaritonic bandgap, the transition of the Zak phase, and the hybridization of the topological edge states with the bulk states. These results reveal some remarkable and uncharted properties of topological matter when strongly coupled to light and provide an innovative design principle for tunable topological photonic devices.

Submitted 18 January, 2024; originally announced January 2024.

Comments: 6 pages, 4 figures

arXiv:2401.08819 [pdf, other]

Learning from Sparse Offline Datasets via Conservative Density Estimation

Authors: Zhepeng Cen, Zuxin Liu, Zitong Wang, Yihang Yao, Henry Lam, Ding Zhao

Abstract: Offline reinforcement learning (RL) offers a promising direction for learning policies from pre-collected datasets without requiring further interactions with the environment. However, existing methods struggle to handle out-of-distribution (OOD) extrapolation errors, especially in sparse reward or scarce data settings. In this paper, we propose a novel training algorithm called Conservative Density Estimation (CDE), which addresses this challenge by explicitly imposing constraints on the state-action occupancy stationary distribution. CDE overcomes the limitations of existing approaches, such as the stationary distribution correction method, by addressing the support mismatch issue in marginal importance sampling. Our method achieves state-of-the-art performance on the D4RL benchmark. Notably, CDE consistently outperforms baselines in challenging tasks with sparse rewards or insufficient data, demonstrating the advantages of our approach in addressing the extrapolation error problem in offline RL.

Submitted 11 March, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: ICLR 2024

arXiv:2401.07159 [pdf, other]

Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

Authors: Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, Zhihao Jia

Abstract: Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks. Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, which only updates a small number of trainable parameters, or attempt to reduce the memory footprint during the training phase of the finetuning. Typically, the memory footprint during finetuning stems from three contributors: model weights, optimizer states, and intermediate activations. However, existing works still require considerable memory and none can simultaneously mitigate memory footprint for all three sources. In this paper, we present Quantized Side Tuning (QST), which enables memory-efficient and fast finetuning of LLMs by operating through a dual-stage process. First, QST quantizes an LLM's model weights into 4-bit to reduce the memory footprint of the LLM's original weights; QST also introduces a side network separated from the LLM, which utilizes the hidden states of the LLM to make task-specific predictions. Using a separate side network avoids performing backpropagation through the LLM, thus reducing the memory requirement of the intermediate activations. Furthermore, QST leverages several low-rank adaptors and gradient-free downsample modules to significantly reduce the trainable parameters, so as to save the memory footprint of the optimizer states. Experiments show that QST can reduce the total memory footprint by up to 2.3 $\times$ and speed up the finetuning process by up to 3 $\times$ while achieving competent performance compared with the state-of-the-art. When it comes to full finetuning, QST can reduce the total memory footprint up to 7 $\times$.

Submitted 13 January, 2024; originally announced January 2024.

ACM Class: I.2.7

arXiv:2401.05709 [pdf, other]

Probability-based Distance Estimation Model for 3D DV-Hop Localization in WSNs

Authors: Penghong Wang, Hao Wang, Wenrui Li, Xiaopeng Fan, Debin Zhao

Abstract: Localization is one of the pivotal issues in wireless sensor network applications. In 3D localization studies, most algorithms focus on enhancing the location prediction process but lack a theoretical derivation of the detection distance of an anchor node at varying hops, which creates a localization performance bottleneck. To address this issue, we propose a probability-based average distance estimation (PADE) model that utilizes the probability distribution of node distances detected by an anchor node. The aim is to mathematically derive the average distances of nodes detected by an anchor node at different hops. First, we develop a probability-based maximum distance estimation (PMDE) model to calculate the upper bound of the distance detected by an anchor node. Then, we present the PADE model, which relies on the upper bound of the distance obtained by the PMDE model. Finally, the obtained average distance is used to construct a distance loss function, and it is embedded with the traditional distance loss function into a multi-objective genetic algorithm to predict the locations of unknown nodes. The experimental results demonstrate that the proposed method achieves state-of-the-art performance in random and multimodal distributed sensor networks. The average localization accuracy is improved by 3.49%-12.66% and 3.99%-22.34%, respectively.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2401.03901 [pdf, other]

STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

Authors: Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao

Abstract: Recently we have witnessed the rapid development of video question answering models. However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos. To tackle this problem we propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering. STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks, and a set of lightweight neural modules to complete each of these sub-tasks. Though neural module networks are already widely studied on image-text tasks, applying them to videos is a non-trivial task, as reasoning on videos requires different abilities. In this paper, we define a set of basic video-text sub-tasks for video question answering and design a set of lightweight modules to complete them. Different from most prior works, modules of STAIR return intermediate outputs specific to their intentions instead of always returning attention maps, which makes it easier to interpret and collaborate with pre-trained models. We also introduce intermediate supervision to make these intermediate outputs more accurate. We conduct extensive experiments on several video question answering datasets under various settings to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available. Code: https://github.com/yellow-binary-tree/STAIR

Submitted 8 January, 2024; originally announced January 2024.

Comments: To appear in AAAI 2024

arXiv:2401.03141 [pdf, other]

Estimating the Lateral Motion States of an Underwater Robot by Propeller Wake Sensing Using an Artificial Lateral Line

Authors: Jun Wang, Dexin Zhao, Youxi Zhao, Feitian Zhang, Tongsheng Shen

Abstract: An artificial lateral line (ALL) is a bioinspired flow sensing system of an underwater robot that consists of distributed flow sensors. The ALL has achieved great success in sensing the motion states of bioinspired underwater robots, e.g., robotic fish, that are driven by body undulation and/or tail flapping. However, the ALL has not been systematically tested and studied in the sensing of underwater robots driven by rotating propellers due to the highly dynamic and complex flow field therein. This paper makes a bold hypothesis that the distributed flow measurements sampled from the propeller wake flow, although infeasible to represent the entire flow dynamics, provide sufficient information for estimating the lateral motion states of the leader underwater robot. An experimental testbed is constructed to investigate the feasibility of such a state estimator which comprises a cylindrical ALL sensory system, a rotating leader propeller, and a water tank with a planar sliding guide. Specifically, a hybrid network that consists of a one-dimensional convolution network (1DCNN) and a bidirectional long short-term memory network (BiLSTM) is designed to extract the spatiotemporal features of the time series of distributed pressure measurements. A multi-output deep learning network is adopted to estimate the lateral motion states of the leader propeller. In addition, the state estimator is optimized using the whale optimization algorithm (WOA) considering the comprehensive estimation performance. Extensive experiments are conducted, the results of which validate the proposed data-driven algorithm in estimating the motion states of the leader underwater robot by propeller wake sensing.

Submitted 6 January, 2024; originally announced January 2024.

Comments: 10 pages, 8 figures

arXiv:2401.02673 [pdf, other]

A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model

Authors: Dongdi Zhao, Jianbo Ma, Lu Lu, Jinke Li, Xuan Ji, Lei Zhu, Fuming Fang, Ming Liu, Feijun Jiang

Abstract: Far-field speech recognition is a challenging task that conventionally uses signal processing beamforming to attack noise and interference problems. However, the performance is usually limited by heavy reliance on environmental assumptions. In this paper, we propose a unified multichannel far-field speech recognition system that combines neural beamforming with a transformer-based Listen, Attend and Spell (LAS) speech recognition system, which extends the end-to-end speech recognition system further to include speech enhancement. This framework is then jointly trained to optimize the final objective of interest. Specifically, factored complex linear projection (fCLP) has been adopted to form the neural beamforming. Several pooling strategies to combine look directions are then compared in order to find the optimal approach. Moreover, information about the source direction is also integrated in the beamforming to explore the usefulness of source direction as a prior, which is usually available, especially in multi-modality scenarios. Experiments on different microphone array geometries are conducted to evaluate the robustness against spacing variance of the microphone array. Large in-house databases are used to evaluate the effectiveness of the proposed framework, and the proposed method achieves a 19.26% improvement when compared with a strong baseline.

Submitted 5 January, 2024; originally announced January 2024.

arXiv:2401.00565 [pdf, other]

doi 10.3847/1538-4357/ad3b96

Photometric Objects Around Cosmic Webs (PAC). VI. High Satellite Fraction of Quasars

Authors: Shanquan Gui, Kun Xu, Y. P. Jing, Donghai Zhao, Hongyu Gao

Abstract: The Photometric objects Around Cosmic webs (PAC) approach developed in Xu et al. (2022b) has the advantage of making full use of spectroscopic and deeper photometric surveys. With the merits of PAC, the excess surface density $\bar{n}_2w_{\rm{p}}$ of neighboring galaxies can be measured down to stellar mass $10^{10.80}\,M_{\odot}$ around quasars at redshift $0.8<z_{\rm{s}}<1.0$, with the data from the Sloan Digital Sky Survey IV (SDSS-IV) extended Baryon Oscillation Spectroscopic Survey (eBOSS) and the Dark Energy Spectroscopic Instrument (DESI) Legacy Imaging Surveys. We find that $\bar{n}_2w_{\rm{p}}$ generally increases quite steeply with the decrease of the separation. Using subhalo abundance matching method, we can accurately model the $\bar{n}_2w_{\rm{p}}$ both on small and large scales. We show that the steep increase of the $\bar{n}_2w_{\rm{p}}$ towards the quasars requires that a large fraction $f_{\mathrm{sate}}=0.29_{-0.06}^{+0.05}$ of quasars should be satellites in massive halos, and find that this fraction measurement is insensitive to the assumptions of our modeling. This high satellite fraction indicates that the subhalos have nearly the same probability to host quasars as the halos for the same (infall) halo mass, and the large scale environment has negligible effect on the quasar activity. We show that even with this high satellite fraction, each massive halo on average does not host more than one satellite quasar due to the sparsity of quasars.

Submitted 15 May, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

Comments: 15 pages, 11 figures, 2 tables, accepted for publication in the Astrophysical Journal

Journal ref: The Astrophysical Journal, 967:17 (13pp), 2024 May 20

arXiv:2401.00124 [pdf, other]

Generative AI-driven Semantic Communication Networks: Architecture, Technologies and Applications

Authors: Chengsi Liang, Hongyang Du, Yao Sun, Dusit Niyato, Jiawen Kang, Dezong Zhao, Muhammad Ali Imran

Abstract: Generative artificial intelligence (GAI) has emerged as a rapidly burgeoning field demonstrating significant potential in creating diverse contents intelligently and automatically. To support such artificial intelligence-generated content (AIGC) services, future communication systems should fulfill much more stringent requirements (including data rate, throughput, latency, etc.) with limited yet precious spectrum resources. To tackle this challenge, semantic communication (SemCom), dramatically reducing resource consumption via extracting and transmitting semantics, has been deemed as a revolutionary communication scheme. The advanced GAI algorithms facilitate SemCom on sophisticated intelligence for model training, knowledge base construction and channel adaption. Furthermore, GAI algorithms also play an important role in the management of SemCom networks. In this survey, we first overview the basics of GAI and SemCom as well as the synergies of the two technologies. Especially, the GAI-driven SemCom framework is presented, where many GAI models for information creation, SemCom-enabled information transmission and information effectiveness for AIGC are discussed separately. We then delve into the GAI-driven SemCom network management involving with novel management layers, knowledge management, and resource allocation. Finally, we envision several promising use cases, i.e., autonomous driving, smart city, and the Metaverse for a more comprehensive exploration.

Submitted 7 January, 2024; v1 submitted 29 December, 2023; originally announced January 2024.

arXiv:2312.17493 [pdf, other]

Differentially Private Low-Rank Adaptation of Large Language Model Using Federated Learning

Authors: Xiao-Yang Liu, Rongyi Zhu, Daochen Zha, Jiechao Gao, Shan Zhong, Matt White, Meikang Qiu

Abstract: The surge in interest and application of large language models (LLMs) has sparked a drive to fine-tune these models to suit specific applications, such as finance and medical science. However, concerns regarding data privacy have emerged, especially when multiple stakeholders aim to collaboratively enhance LLMs using sensitive data. In this scenario, federated learning becomes a natural choice, allowing decentralized fine-tuning without exposing raw data to central servers. Motivated by this, we investigate how data privacy can be ensured in LLM fine-tuning through practical federated learning approaches, enabling secure contributions from multiple parties to enhance LLMs. Yet, challenges arise: 1) despite avoiding raw data exposure, there is a risk of inferring sensitive information from model outputs, and 2) federated learning for LLMs incurs notable communication overhead. To address these challenges, this article introduces DP-LoRA, a novel federated learning algorithm tailored for LLMs. DP-LoRA preserves data privacy by employing a Gaussian mechanism that adds noise in weight updates, maintaining individual data privacy while facilitating collaborative model training. Moreover, DP-LoRA optimizes communication efficiency via low-rank adaptation, minimizing the transmission of updated weights during distributed training. The experimental results across medical, financial, and general datasets using various LLMs demonstrate that DP-LoRA effectively ensures strict privacy constraints while minimizing communication overhead.

Submitted 2 June, 2024; v1 submitted 29 December, 2023; originally announced December 2023.

Comments: 21 pages, 1 figure, 19 tables

arXiv:2312.16352 [pdf, ps, other]

Smuche: Scalar-Multiplicative Caching in Homomorphic Encryption

Authors: Dongfang Zhao

Abstract: Addressing the challenge of balancing security and efficiency when deploying machine learning systems in untrusted environments, such as federated learning, remains a critical concern. A promising strategy to tackle this issue involves optimizing the performance of fully homomorphic encryption (HE). Recent research highlights the efficacy of advanced caching techniques, such as Rache, in significantly enhancing the performance of HE schemes without compromising security. However, Rache is constrained by an inherent limitation: its performance overhead is heavily influenced by the characteristics of plaintext models, specifically exhibiting a caching time complexity of $\mathcal{O}(N)$, where $N$ represents the number of cached pivots based on specific radixes. This caching overhead becomes impractical for handling large-scale data. In this study, we introduce a novel \textit{constant-time} caching technique that is independent of any parameters. The core concept involves applying scalar multiplication to a single cached ciphertext, followed by the introduction of a completely new and constant-time randomness. Leveraging the inherent characteristics of constant-time construction, we coin the term ``Smuche'' for this innovative caching technique, which stands for Scalar-multiplicative Caching of Homomorphic Encryption. We implemented Smuche from scratch and conducted comparative evaluations against two baseline schemes, Rache and CKKS. Our experimental results underscore the effectiveness of Smuche in addressing the identified limitations and optimizing the performance of homomorphic encryption in practical scenarios.

Submitted 26 December, 2023; originally announced December 2023.

arXiv:2312.15127 [pdf, other]

Gradient Shaping for Multi-Constraint Safe Reinforcement Learning

Authors: Yihang Yao, Zuxin Liu, Zhepeng Cen, Peide Huang, Tingnan Zhang, Wenhao Yu, Ding Zhao

Abstract: Online safe reinforcement learning (RL) involves training a policy that maximizes task efficiency while satisfying constraints via interacting with the environments. In this paper, our focus lies in addressing the complex challenges associated with solving multi-constraint (MC) safe RL problems. We approach the safe RL problem from the perspective of Multi-Objective Optimization (MOO) and propose a unified framework designed for MC safe RL algorithms. This framework highlights the manipulation of gradients derived from constraints. Leveraging insights from this framework and recognizing the significance of \textit{redundant} and \textit{conflicting} constraint conditions, we introduce the Gradient Shaping (GradS) method for general Lagrangian-based safe RL algorithms to improve the training efficiency in terms of both reward and constraint satisfaction. Our extensive experimentation demonstrates the effectiveness of our proposed method in encouraging exploration and learning a policy that improves both safety and reward performance across various challenging MC safe RL tasks as well as good scalability to the number of constraints.

Submitted 22 December, 2023; originally announced December 2023.

arXiv:2312.13303 [pdf, other]

RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios

Authors: Wenhao Ding, Yulong Cao, Ding Zhao, Chaowei Xiao, Marco Pavone

Abstract: Simulation plays a crucial role in the development of autonomous vehicles (AVs) due to the potential risks associated with real-world testing. Although significant progress has been made in the visual aspects of simulators, generating complex behavior among agents remains a formidable challenge. It is not only imperative to ensure realism in the scenarios generated but also essential to incorporate preferences and conditions to facilitate controllable generation for AV training and evaluation. Traditional methods, mainly relying on memorizing the distribution of training datasets, often fall short in generating unseen scenarios. Inspired by the success of retrieval augmented generation in large language models, we present RealGen, a novel retrieval-based in-context learning framework for traffic scenario generation. RealGen synthesizes new scenarios by combining behaviors from multiple retrieved examples in a gradient-free way, which may originate from templates or tagged scenarios. This in-context learning framework endows versatile generative capabilities, including the ability to edit scenarios, compose various behaviors, and produce critical scenarios. Evaluations show that RealGen offers considerable flexibility and controllability, marking a new direction in the field of controllable traffic scenario generation. Check our project website for more information: https://realgen.github.io.

Submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.11945 [pdf, other]

Multi-Granularity Information Interaction Framework for Incomplete Utterance Rewriting

Authors: Haowei Du, Dinghao Zhang, Chen Li, Yang Li, Dongyan Zhao

Abstract: Recent approaches in Incomplete Utterance Rewriting (IUR) fail to capture the source of important words, which is crucial to edit the incomplete utterance, and introduce words from irrelevant utterances. We propose a novel and effective multi-task information interaction framework including context selection, edit matrix construction, and relevance merging to capture the multi-granularity of semantic information. Benefiting from fetching the relevant utterance and figuring out the important words, our approach outperforms existing state-of-the-art models on two benchmark datasets Restoration-200K and CANAND in this field. Code will be provided on \url{https://github.com/yanmenxue/QR}.

Submitted 8 January, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: Findings of EMNLP2023 (short)

arXiv:2312.11922 [pdf, other]

Relation-Aware Question Answering for Heterogeneous Knowledge Graphs

Authors: Haowei Du, Quzhe Huang, Chen Li, Chen Zhang, Yang Li, Dongyan Zhao

Abstract: Multi-hop Knowledge Base Question Answering (KBQA) aims to find the answer entity in a knowledge graph (KG), which requires multiple steps of reasoning. Existing retrieval-based approaches solve this task by concentrating on the specific relation at different hops and predicting the intermediate entity within the reasoning path. During the reasoning process of these methods, the representations of relations are fixed but the initial relation representation may not be optimal. We claim they fail to utilize information from head-tail entities and the semantic connection between relations to enhance the current relation representation, which undermines the ability to capture information of relations in KGs. To address this issue, we construct a \textbf{dual relation graph} where each node denotes a relation in the original KG (\textbf{primal entity graph}) and edges are constructed between relations sharing the same head or tail entities. Then we iteratively do primal entity graph reasoning, dual relation graph information propagation, and interaction between these two graphs. In this way, the interaction between entity and relation is enhanced, and we derive better entity and relation representations. Experiments on two public datasets, WebQSP and CWQ, show that our approach achieves a significant performance gain over the prior state-of-the-art. Our code is available on \url{https://github.com/yanmenxue/RAH-KBQA}.

Submitted 19 December, 2023; originally announced December 2023.

Comments: Findings of EMNLP2023 (Long)

arXiv:2312.10825 [pdf, other]

Latent Space Editing in Transformer-Based Flow Matching

Authors: Vincent Tao Hu, David W Zhang, Pascal Mettes, Meng Tang, Deli Zhao, Cees G. M. Snoek

Abstract: This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but their latent structure and editing ability are as of yet unknown. Hence, we adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, all while being highly effective at editing images while preserving the essence of the original content. Our code will be publicly available at https://taohu.me/lfm/

Submitted 17 December, 2023; originally announced December 2023.

Comments: AAAI 2024 with Appendix

arXiv:2312.10515 [pdf, other]

doi 10.1109/TGRS.2023.3343453

PETDet: Proposal Enhancement for Two-Stage Fine-Grained Object Detection

Authors: Wentao Li, Danpei Zhao, Bo Yuan, Yue Gao, Zhenwei Shi

Abstract: Fine-grained object detection (FGOD) extends object detection with the capability of fine-grained recognition. In recent two-stage FGOD methods, the region proposal serves as a crucial link between detection and fine-grained recognition. However, current methods overlook that some proposal-related procedures inherited from general detection are not equally suitable for FGOD, limiting the multi-task learning from generation, representation, to utilization. In this paper, we present PETDet (Proposal Enhancement for Two-stage fine-grained object detection) to better handle the sub-tasks in two-stage FGOD methods. Firstly, an anchor-free Quality Oriented Proposal Network (QOPN) is proposed with dynamic label assignment and attention-based decomposition to generate high-quality oriented proposals. Additionally, we present a Bilinear Channel Fusion Network (BCFN) to extract independent and discriminative features of the proposals. Furthermore, we design a novel Adaptive Recognition Loss (ARL) which offers guidance for the R-CNN head to focus on high-quality proposals. Extensive experiments validate the effectiveness of PETDet. Quantitative analysis reveals that PETDet with ResNet50 reaches state-of-the-art performance on various FGOD datasets, including FAIR1M-v1.0 (42.96 AP), FAIR1M-v2.0 (48.81 AP), MAR20 (85.91 AP) and ShipRSImageNet (74.90 AP). The proposed method also achieves superior compatibility between accuracy and inference speed. Our code and models will be released at https://github.com/canoe-Z/PETDet.

Submitted 16 December, 2023; originally announced December 2023.

Comments: IEEE TGRS 2023

arXiv:2312.09785 [pdf, other]

RJUA-QA: A Comprehensive QA Dataset for Urology

Authors: Shiwei Lyu, Chenfei Chi, Hongbo Cai, Lei Shi, Xiaoyan Yang, Lei Liu, Xiang Chen, Deng Zhao, Zhiqiang Zhang, Xianguo Lyu, Ming Zhang, Fangzhou Li, Xiaowei Ma, Yue Shen, **jie Gu, Wei Xue, Yiran Huang

Abstract: We introduce RJUA-QA, a novel medical dataset for question answering (QA) and reasoning with clinical evidence, contributing to bridge the gap between general large language models (LLMs) and medical-specific LLM applications. RJUA-QA is derived from realistic clinical scenarios and aims to facilitate LLMs in generating reliable diagnosis and advice. The dataset contains 2,132 curated Question-Context-Answer pairs, corresponding to about 25,000 diagnostic records and clinical cases. The dataset covers 67 common urological disease categories, where the disease coverage exceeds 97.6\% of the population seeking medical services in urology. Each data instance in RJUA-QA comprises: (1) a question mirroring a real patient's inquiry about clinical symptoms and medical conditions, (2) a context including comprehensive expert knowledge, serving as a reference for medical examination and diagnosis, (3) a doctor response offering the diagnostic conclusion and suggested examination guidance, (4) a diagnosed clinical disease as the recommended diagnostic outcome, and (5) clinical advice providing recommendations for medical examination. RJUA-QA is the first medical QA dataset for clinical reasoning over the patient inquiries, where expert-level knowledge and experience are required for yielding diagnostic conclusions and medical examination advice. A comprehensive evaluation is conducted to evaluate the performance of both medical-specific and general LLMs on the RJUA-QA dataset. Our data are publicly available at \url{https://github.com/alipay/RJU_Ant_QA}.

Submitted 7 January, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: An initial version

arXiv:2312.07625 [pdf, other]

Astrocyte-Enabled Advancements in Spiking Neural Networks for Large Language Modeling

Authors: Guobin Shen, Dongcheng Zhao, Yiting Dong, Yang Li, **dong Li, Kang Sun, Yi Zeng

Abstract: Within the complex neuroarchitecture of the brain, astrocytes play crucial roles in development, structure, and metabolism. These cells regulate neural activity through tripartite synapses, directly impacting cognitive processes such as learning and memory. Despite the growing recognition of astrocytes' significance, traditional Spiking Neural Network (SNN) models remain predominantly neuron-centric, overlooking the profound influence of astrocytes on neural dynamics. Inspired by these biological insights, we have developed an Astrocyte-Modulated Spiking Unit (AM-SU), an innovative framework that integrates neuron-astrocyte interactions into the computational paradigm, demonstrating wide applicability across various hardware platforms. Our Astrocyte-Modulated Spiking Neural Network (AstroSNN) exhibits exceptional performance in tasks involving memory retention and natural language generation, particularly in handling long-term dependencies and complex linguistic structures. The design of AstroSNN not only enhances its biological authenticity but also introduces novel computational dynamics, enabling more effective processing of complex temporal dependencies. Furthermore, AstroSNN shows low latency, high throughput, and reduced memory usage in practical applications, making it highly suitable for resource-constrained environments. By successfully integrating astrocytic dynamics into intelligent neural networks, our work narrows the gap between biological plausibility and neural modeling, laying the groundwork for future biologically-inspired neural computing research that includes both neurons and astrocytes.

Submitted 25 December, 2023; v1 submitted 12 December, 2023; originally announced December 2023.

arXiv:2312.06964 [pdf, other]

Ground Calibration Result of the Lobster Eye Imager for Astronomy

Authors: Huaqing Cheng, Zhixing Ling, Chen Zhang, Xiao** Sun, Shengli Sun, Yuan Liu, Yanfeng Dai, Zhenqing Jia, Haiwu Pan, Wenxin Wang, Donghua Zhao, Yifan Chen, Zhiwei Cheng, Wei Fu, Yixiao Han, Junfei Li, Zhengda Li, Xiaohao Ma, Yulong Xue, Ailiang Yan, Qiang Zhang, Yusa Wang, Xiongtao Yang, Zijian Zhao, Weimin Yuan

Abstract: We report on results of the on-ground X-ray calibration of the Lobster Eye Imager for Astronomy (LEIA), an experimental space wide-field (18.6*18.6 square degrees) X-ray telescope built from novel lobster eye micro-pore optics. LEIA was successfully launched on July 27, 2022 onboard the SATech-01 satellite. To achieve full characterisation of its performance before launch, a series of tests and calibrations have been carried out at different levels of devices, assemblies and the complete module. In this paper, we present the results of the end-to-end calibration campaign of the complete module carried out at the 100-m X-ray Test Facility at IHEP. The PSF, effective area and energy response of the detectors were measured in a wide range of incident directions at several X-ray line energies. The distributions of the PSF and effective areas are roughly uniform across the FoV, in large agreement with the prediction of lobster-eye optics. The mild variations and deviations from the prediction of idealized lobster-eye optics can be understood to be caused by the imperfect shapes and alignment of the micro-pores as well as the obscuration by the supporting frames, which can be well reproduced by MC simulations. The spatial resolution of LEIA defined by the FWHM of the focal spot ranges from 4-8 arcmin with a median of 5.7. The measured effective areas are in range of 2-3 $cm^2$ at ~1.25 keV across the entire FoV, and its dependence on photon energy is in large agreement with simulations. The gains of the CMOS sensors are in range of 6.5-6.9 eV/DN, and the energy resolutions in the range of ~120-140 eV at 1.25 keV and ~170-190 eV at 4.5 keV. These results have been ingested into the calibration database and applied to the analysis of the scientific data acquired by LEIA. This work paves the way for the calibration of the Wide-field X-Ray Telescope modules of the Einstein Probe mission.

Submitted 11 December, 2023; originally announced December 2023.

Comments: 24 pages, 13 figures. Submitted to Experimental Astronomy

arXiv:2312.06331 [pdf, other]

Semantic Connectivity-Driven Pseudo-labeling for Cross-domain Segmentation

Authors: Dong Zhao, Ruizhi Yang, Shuang Wang, Qi Zang, Yang Hu, Licheng Jiao, Nicu Sebe, Zhun Zhong

Abstract: Presently, self-training stands as a prevailing approach in cross-domain semantic segmentation, enhancing model efficacy by training with pixels assigned with reliable pseudo-labels. However, we find two critical limitations in this paradigm. (1) The majority of reliable pixels exhibit a speckle-shaped pattern and are primarily located in the central semantic region. This presents challenges for the model in accurately learning semantics. (2) Category noise in speckle pixels is difficult to locate and correct, leading to error accumulation in self-training. To address these limitations, we propose a novel approach called Semantic Connectivity-driven pseudo-labeling (SeCo). This approach formulates pseudo-labels at the connectivity level and thus can facilitate learning structured and low-noise semantics. Specifically, SeCo comprises two key components: Pixel Semantic Aggregation (PSA) and Semantic Connectivity Correction (SCC). Initially, PSA divides semantics into 'stuff' and 'things' categories and aggregates speckled pseudo-labels into semantic connectivity through efficient interaction with the Segment Anything Model (SAM). This enables us not only to obtain accurate boundaries but also simplifies noise localization. Subsequently, SCC introduces a simple connectivity classification task, which enables locating and correcting connectivity noise with the guidance of loss distribution. Extensive experiments demonstrate that SeCo can be flexibly applied to various cross-domain semantic segmentation tasks, including traditional unsupervised, source-free, and black-box domain adaptation, significantly improving the performance of existing state-of-the-art methods. The code is available at https://github.com/DZhaoXd/SeCo.

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.06185 [pdf, other]

KnowGPT: Knowledge Graph based Prompting for Large Language Models

Authors: Qinggang Zhang, Junnan Dong, Hao Chen, Daochen Zha, Zailiang Yu, Xiao Huang

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in many real-world applications. Nonetheless, LLMs are often criticized for their tendency to produce hallucinations, wherein the models fabricate incorrect statements on tasks beyond their knowledge and perception. To alleviate this issue, researchers have explored leveraging the factual knowledge in knowledge graphs (KGs) to ground the LLM's responses in established facts and principles. However, most state-of-the-art LLMs are closed-source, making it challenging to develop a prompting framework that can efficiently and effectively integrate KGs into LLMs with hard prompts only. Generally, existing KG-enhanced LLMs usually suffer from three critical issues, including huge search space, high API costs, and laborious prompt engineering, that impede their widespread application in practice. To this end, we introduce a novel Knowledge Graph based PrompTing framework, namely KnowGPT, to enhance LLMs with domain knowledge. KnowGPT contains a knowledge extraction module to extract the most informative knowledge from KGs, and a context-aware prompt construction module to automatically convert extracted knowledge into effective prompts. Experiments on three benchmarks demonstrate that KnowGPT significantly outperforms all competitors. Notably, KnowGPT achieves a 92.6% accuracy on OpenbookQA leaderboard, comparable to human-level performance.

Submitted 4 June, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.03868 [pdf, other]

Uncertainty-Informed Renewable Energy Scheduling: A Scalable Bilevel Framework

Authors: Dongwei Zhao, Vladimir Dvorkin, Stefanos Delikaraoglou, Alberto J. Lamadrid L., Audun Botterud

Abstract: This work proposes an uncertainty-informed bid adjustment framework for integrating variable renewable energy sources (VRES) into electricity markets. This framework adopts a bilevel model to compute the optimal VRES day-ahead bids. It aims to minimize the expected system cost across day-ahead and real-time stages and approximate the cost efficiency of the stochastic market design. However, solving the bilevel optimization problem is computationally challenging for large-scale systems. To overcome this challenge, we introduce a novel technique based on strong duality and McCormick envelopes, which relaxes the problem to a linear program, enabling large-scale applications. The proposed bilevel framework is applied to the 1576-bus NYISO system and benchmarked against a myopic strategy, where the VRES bid is the mean value of the probabilistic power forecast. Results demonstrate that, under high VRES penetration levels (e.g., 40%), our framework can significantly reduce system costs and market-price volatility, by optimizing VRES quantities efficiently in the day-ahead market. Furthermore, we find that when transmission capacity increases, the proposed bilevel model will still reduce the system cost, whereas the myopic strategy may incur a much higher cost due to over-scheduling of VRES in the day-ahead market and the lack of flexible conventional generators in real time.

Submitted 6 December, 2023; originally announced December 2023.

Comments: IEEE Transactions on Energy Markets, Policy, and Regulation

arXiv:2312.00956 [pdf, other]

A Cyclic Small Phase Theorem

Authors: Chao Chen, Wei Chen, Di Zhao, Jianqi Chen, Li Qiu

Abstract: This paper introduces a brand-new phase definition called the segmental phase for multi-input multi-output linear time-invariant systems. The underpinning of the definition lies in the matrix segmental phase which, as its name implies, is graphically based on the smallest circular segment covering the matrix normalized numerical range in the unit disk. The matrix segmental phase has the crucial product eigen-phase bound, which makes it stand out from several existing phase notions in the literature. The proposed bound paves the way for stability analysis of a cyclic feedback system consisting of multiple subsystems. A cyclic small phase theorem is then established as our main result, which requires the loop system phase to lie between $-π$ and $π$. The proposed theorem complements a cyclic version of the celebrated small gain theorem. In addition, a generalization of the proposed theorem is made via the use of angular scaling techniques for reducing conservatism.

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2311.18829 [pdf, other]

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation

Authors: Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, **gxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Chuanxin Tang, Xiaoyan Sun, Chong Luo, Baining Guo

Abstract: We present MicroCinema, a straightforward yet effective framework for high-quality and coherent text-to-video generation. Unlike existing approaches that align text prompts with video directly, MicroCinema introduces a Divide-and-Conquer strategy which divides the text-to-video into a two-stage process: text-to-image generation and image\&text-to-video generation. This strategy offers two significant advantages. a) It allows us to take full advantage of the recent advances in text-to-image models, such as Stable Diffusion, Midjourney, and DALLE, to generate photorealistic and highly detailed images. b) Leveraging the generated image, the model can allocate less focus to fine-grained appearance details, prioritizing the efficient learning of motion dynamics. To implement this strategy effectively, we introduce two core designs. First, we propose the Appearance Injection Network, enhancing the preservation of the appearance of the given image. Second, we introduce the Appearance Noise Prior, a novel mechanism aimed at maintaining the capabilities of pre-trained 2D diffusion models. These design elements empower MicroCinema to generate high-quality videos with precise motion, guided by the provided text prompts. Extensive experiments demonstrate the superiority of the proposed framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT. See https://wangyanhui666.github.io/MicroCinema.github.io/ for video samples.

Submitted 29 December, 2023; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: Project page: https://wangyanhui666.github.io/MicroCinema.github.io/

arXiv:2311.18166 [pdf, other]

A-Scan2BIM: Assistive Scan to Building Information Modeling

Authors: Weilian Song, Jieliang Luo, Dale Zhao, Yan Fu, Chin-Yi Cheng, Yasutaka Furukawa

Abstract: This paper proposes an assistive system for architects that converts a large-scale point cloud into a standardized digital representation of a building for Building Information Modeling (BIM) applications. The process is known as Scan-to-BIM, which requires many hours of manual work even for a single building floor by a professional architect. Given its challenging nature, the paper focuses on helping architects on the Scan-to-BIM process, instead of replacing them. Concretely, we propose an assistive Scan-to-BIM system that takes the raw sensor data and edit history (including the current BIM model), then auto-regressively predicts a sequence of model editing operations as APIs of a professional BIM software (i.e., Autodesk Revit). The paper also presents the first building-scale Scan2BIM dataset that contains a sequence of model editing operations as the APIs of Autodesk Revit. The dataset contains 89 hours of Scan2BIM modeling processes by professional architects over 16 scenes, spanning over 35,000 m^2. We report our system's reconstruction quality with standard metrics, and we introduce a novel metric that measures how natural the order of reconstructed operations is. A simple modification to the reconstruction module helps improve performance, and our method is far superior to two other baselines in the order metric. We will release data, code, and models at a-scan2bim.github.io.

Submitted 29 November, 2023; originally announced November 2023.

Comments: BMVC 2023, order evaluation updated after fixing evaluation bug

arXiv:2311.15649 [pdf, other]

RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks

Authors: Yaran Chen, Wenbo Cui, Yuanwen Chen, Mining Tan, Xinyao Zhang, Dongbin Zhao, He Wang

Abstract: Robotic agents must master common sense and long-term sequential decisions to solve daily tasks through natural language instruction. The developments in Large Language Models (LLMs) in natural language processing have inspired efforts to use LLMs in complex robot planning. Despite LLMs' great generalization and comprehension of instruction tasks, LLMs-generated task plans sometimes lack feasibility and correctness. To address the problem, we propose a RoboGPT agent\footnote{our code and dataset will be released soon} for making embodied long-term decisions for daily tasks, with two modules: 1) LLMs-based planning with re-plan to break the task into multiple sub-goals; 2) RoboSkill individually designed for sub-goals to learn better navigation and manipulation skills. The LLMs-based planning is enhanced with a new robotic dataset and re-plan, called RoboGPT. The new robotic dataset of 67k daily instruction tasks is gathered for fine-tuning the Llama model and obtaining RoboGPT. RoboGPT planner with strong generalization can plan hundreds of daily instruction tasks. Additionally, a low-computational Re-Plan module is designed to allow plans to flexibly adapt to the environment, thereby addressing the nomenclature diversity challenge. The proposed RoboGPT agent outperforms SOTA methods on the ALFRED daily tasks. Moreover, RoboGPT planner exceeds SOTA LLM-based planners like ChatGPT in task-planning rationality for hundreds of unseen daily tasks, and even other domain tasks, while keeping the large model's original broad application and generality.

Submitted 30 June, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.15542 [pdf]

Arbitrary Engineering of Spatial Caustics with 3D-printed Metasurfaces

Authors: Xiaoyan Zhou, Hongtao Wang, Shuxi Liu, Hao Wang, John You En Chan, Cheng-Feng Pan, Daomu Zhao, Joel K. W. Yang, Cheng-Wei Qiu

Abstract: Caustics occur in diverse physical systems, spanning the nano-scale in electron microscopy to astronomical-scale in gravitational lensing. As envelopes of rays, optical caustics result in sharp edges or extended networks. Caustics in structured light, characterized by complex-amplitude distributions, have innovated numerous applications including particle manipulation, high-resolution imaging techniques, and optical communication. However, these applications have encountered limitations due to a major challenge in engineering caustic fields with customizable propagation trajectories and in-plane intensity profiles. Here, we introduce the compensation phase via 3D-printed metasurfaces to shape caustic fields with curved trajectories in free space. The in-plane caustic patterns can be preserved or morphed from one structure to another during propagation. Large-scale fabrication of these metasurfaces is enabled by the fast-prototyping and cost-effective two-photon polymerization lithography. Our optical elements with the ultra-thin profile and sub-millimeter extension offer a compact solution to generating caustic structured light for beam shaping, high-resolution microscopy, and light-matter-interaction studies.

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.14992 [pdf, ps, other]

Model-free Reinforcement Learning for ${H_{2}/H_{\infty}}$ Control of Stochastic Discrete-time Systems

Authors: Xiushan Jiang, Li Wang, Dongya Zhao, Ling Shi

Abstract: This paper proposes a reinforcement learning (RL) algorithm for the infinite-horizon $\rm {H_{2}/H_{\infty}}$ problem in a class of stochastic discrete-time systems, rather than using a set of coupled generalized algebraic Riccati equations (GAREs). The algorithm is able to learn the optimal control policy for the system even when its parameters are unknown. Additionally, the paper explores the effect of detection noise as well as the convergence of the algorithm, and shows that the control policy is admissible after a finite number of iterations. The algorithm is also able to handle multi-objective control problems within stochastic fields. Finally, the algorithm is applied to the F-16 aircraft autopilot with multiplicative noise.

Submitted 25 November, 2023; originally announced November 2023.

arXiv:2311.12626 [pdf]

Acoustic Vortex in Waveguide with Chiral Gradient Sawtooth Metasurface

Authors: Zeliang Song, Shuhuan Xie, Yong Li, Hua Ding, Feiyan Cai, Yugui Peng, Xuefeng Zhu, Degang Zhao

Abstract: Acoustic vortex states with spiral phase dislocation, which can carry orbital angular momentum (OAM), have attracted considerable research interest in recent years. The mainstream methods of generating acoustic vortices are based on the Huygens-Fresnel principle to modulate the wavefront to create spatial spiral phase dislocation. In this work, we propose an entirely new scenario to generate acoustic vortices in a waveguide with a chiral gradient sawtooth metasurface. The physical mechanism of our method is to lift the degenerate dipole eigenmodes through the scattering effect of the chiral surface structure, and then the superposition of them will generate both and order vortices in place. Compared to the existing methods of acoustic vortex production, our design has many merits: it is easy to manufacture and control, the working frequency is broadband, and the sign of the vortex order can be readily flipped. Both the full-wave simulations and experimental measurements validate the existence of the acoustic vortices. The torque effect of the acoustic vortices is also successfully demonstrated by rotating a foam disk as a practical application. Our work opens up a new route for generating acoustic vortices and could have potential significance in microfluidics, acoustic tweezers and ultrasonic communication.

Submitted 14 January, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

arXiv:2311.12292 [pdf, other]

Mapping "Brain Coral" Regions on Mars using Deep Learning

Authors: Kyle A. Pearson, Eldar Noe, Daniel Zhao, Alphan Altinok, Alex Morgan

Abstract: One of the main objectives of the Mars Exploration Program is to search for evidence of past or current life on the planet. To achieve this, Mars exploration has been focusing on regions that may have liquid or frozen water. A set of critical areas may have seen cycles of ice thawing in the relatively recent past in response to periodic changes in the obliquity of Mars. In this work, we use convolutional neural networks to detect surface regions containing "Brain Coral" terrain, a landform on Mars whose similarity in morphology and scale to sorted stone circles on Earth suggests that it may have formed as a consequence of freeze/thaw cycles. We use large images (~100-1000 megapixels) from the Mars Reconnaissance Orbiter to search for these landforms at resolutions close to a few tens of centimeters per pixel (~25-50 cm). Over 52,000 images (~28 TB) were searched (~5% of the Martian surface) where we found detections in over 200 images. To expedite the processing we leverage a classifier network (prior to segmentation) in the Fourier domain that can take advantage of JPEG compression by leveraging blocks of coefficients from a discrete cosine transform in lieu of decoding the entire image at the full spatial resolution. The hybrid pipeline approach maintains ~93% accuracy while cutting down on ~95% of the total processing time compared to running the segmentation network at the full resolution on every image.
The timely processing of big data sets helps inform mission operations, geologic surveys to prioritize candidate landing sites, avoid hazardous areas, or map the spatial extent of certain terrain. The segmentation masks and source code are available on Github for the community to explore and build upon.

Submitted 20 November, 2023; originally announced November 2023.

Comments: Submitted for publication, seeking comments from the community. Code available: https://github.com/pearsonkyle/Mars-Brain-Coral-Network

arXiv:2311.10802 [pdf, other]

Is Conventional SNN Really Efficient? A Perspective from Network Quantization

Authors: Guobin Shen, Dongcheng Zhao, Tenglong Li, **dong Li, Yi Zeng

Abstract: Spiking Neural Networks (SNNs) have been widely praised for their high energy efficiency and immense potential. However, comprehensive research that critically contrasts and correlates SNNs with quantized Artificial Neural Networks (ANNs) remains scant, often leading to skewed comparisons lacking fairness towards ANNs. This paper introduces a unified perspective, illustrating that the time steps in SNNs and quantized bit-widths of activation values present analogous representations. Building on this, we present a more pragmatic and rational approach to estimating the energy consumption of SNNs. Diverging from the conventional Synaptic Operations (SynOps), we champion the "Bit Budget" concept. This notion permits an intricate discourse on strategically allocating computational and storage resources between weights, activation values, and temporal steps under stringent hardware constraints. Guided by the Bit Budget paradigm, we discern that pivoting efforts towards spike patterns and weight quantization, rather than temporal attributes, elicits profound implications for model performance. Utilizing the Bit Budget for holistic design consideration of SNNs elevates model performance across diverse data types, encompassing static imagery and neuromorphic datasets. Our revelations bridge the theoretical chasm between SNNs and quantized ANNs and illuminate a pragmatic trajectory for future endeavors in energy-efficient neural computations.

Submitted 17 November, 2023; originally announced November 2023.

arXiv:2311.10747 [pdf, other]

doi 10.1109/LRA.2024.3379805

Safety-aware Causal Representation for Trustworthy Offline Reinforcement Learning in Autonomous Driving

Authors: Haohong Lin, Wenhao Ding, Zuxin Liu, Yaru Niu, Jiacheng Zhu, Yuming Niu, Ding Zhao

Abstract: In the domain of autonomous driving, the offline Reinforcement Learning (RL) approaches exhibit notable efficacy in addressing sequential decision-making problems from offline datasets. However, maintaining safety in diverse safety-critical scenarios remains a significant challenge due to long-tailed and unforeseen scenarios absent from offline datasets. In this paper, we introduce the saFety-aware strUctured Scenario representatION (FUSION), a pioneering representation learning method in offline RL to facilitate the learning of a generalizable end-to-end driving policy by leveraging structured scenario information. FUSION capitalizes on the causal relationships between the decomposed reward, cost, state, and action space, constructing a framework for structured sequential reasoning in dynamic traffic environments. We conduct extensive evaluations in two typical real-world settings of the distribution shift in autonomous vehicles, demonstrating the good balance between safety cost and utility reward compared to the current state-of-the-art safe RL and IL baselines. Empirical evidence in various driving scenarios attests that FUSION significantly enhances the safety and generalizability of autonomous driving agents, even in the face of challenging and unseen environments. Furthermore, our ablation studies reveal noticeable improvements in the integration of causal representation into the offline safe RL algorithm. Our code implementation is available at: https://sites.google.com/view/safe-fusion/.

Submitted 12 March, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

arXiv:2311.08911 [pdf, other]

Connection Incentives in Cost Sharing Mechanisms with Budgets

Authors: Tianyi Zhang, Dengji Zhao, Junyu Zhang, Sizhe Gu

Abstract: In a cost sharing problem on a weighted undirected graph, all other nodes want to connect to the source node for some service. Each edge has a cost denoted by a weight and all the connected nodes should share the total cost for the connectivity. The goal of the existing solutions (e.g. folk solution and cycle-complete solution) is to design cost sharing rules with nice properties, e.g. budget balance and cost monotonicity. However, they did not consider the case in which each non-source node has a budget, which is the maximum it can pay for its cost share, and may cut its adjacent edges to reduce its cost share. In this paper, we design two cost sharing mechanisms taking into account the nodes' budgets and incentivizing all nodes to report all their adjacent edges so that we can minimize the total cost for the connectivity.

Submitted 15 November, 2023; originally announced November 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2201.05976

arXiv:2311.08903 [pdf, other]

Cost Sharing under Private Costs and Connection Control on Directed Acyclic Graphs

Authors: Tianyi Zhang, Dengji Zhao, Junyu Zhang, Sizhe Gu

Abstract: We consider a cost sharing problem on a weighted directed acyclic graph (DAG) with a source node to which all the other nodes want to connect. The cost (weight) of each edge is private information reported by multiple contractors, and among them, only one contractor is selected as the builder. All the nodes except for the source need to share the total cost of the used edges. However, they may block others' connections to the source by strategically cutting their outgoing edges to reduce their cost share, which may increase the total cost of connectivity. To minimize the total cost of connectivity, we design a cost sharing mechanism to incentivize each node to offer all its outgoing edges and each contractor to report all the edges' weights truthfully, and show the properties of the proposed mechanism. In addition, our mechanism outperforms the two benchmark mechanisms.

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.07491 [pdf, other]

A Step Closer to Comprehensive Answers: Constrained Multi-Stage Question Decomposition with Large Language Models

Authors: He**g Cao, Zhenwei An, Jiazhan Feng, Kun Xu, Liwei Chen, Dongyan Zhao

Abstract: While large language models exhibit remarkable performance in the Question Answering task, they are susceptible to hallucinations. Challenges arise when these models grapple with understanding multi-hop relations in complex questions or lack the necessary knowledge for a comprehensive response. To address this issue, we introduce the "Decompose-and-Query" framework (D&Q). This framework guides the model to think and utilize external knowledge similar to ReAct, while also restricting its thinking to reliable information, effectively mitigating the risk of hallucinations. Experiments confirm the effectiveness of D&Q: On our ChitChatQA dataset, D&Q does not lose to ChatGPT in 67% of cases; on the HotPotQA question-only setting, D&Q achieved an F1 score of 59.6%. Our code is available at https://github.com/alkaidpku/DQ-ToolQA.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.06158 [pdf, other]

Language Models can be Logical Solvers

Authors: Jiazhan Feng, Ruochen Xu, Junheng Hao, Hiteshi Sharma, Yelong Shen, Dongyan Zhao, Weizhu Chen

Abstract: Logical reasoning is a fundamental aspect of human intelligence and a key component of tasks like problem-solving and decision-making. Recent advancements have enabled Large Language Models (LLMs) to potentially exhibit reasoning capabilities, but complex logical reasoning remains a challenge. The state-of-the-art, solver-augmented language models, use LLMs to parse natural language logical questions into symbolic representations first and then adopt external logical solvers to take in the symbolic representations and output the answers. Despite their impressive performance, any parsing errors will inevitably result in the failure of the execution of the external logical solver and no answer to the logical questions. In this paper, we introduce LoGiPT, a novel language model that directly emulates the reasoning processes of logical solvers and bypasses the parsing errors by learning to strictly adhere to solver syntax and grammar. LoGiPT is fine-tuned on a newly constructed instruction-tuning dataset derived from revealing and refining the invisible reasoning process of deductive solvers. Experimental results on two public deductive reasoning datasets demonstrate that LoGiPT outperforms state-of-the-art solver-augmented LMs and few-shot prompting methods on competitive LLMs like ChatGPT or GPT-4.

Submitted 10 November, 2023; originally announced November 2023.

Comments: Preprint

arXiv:2311.04145 [pdf, other]

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Authors: Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, **gren Zhou

Abstract: Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at https://i2vgen-xl.github.io.

Submitted 7 November, 2023; originally announced November 2023.

Comments: Project page: https://i2vgen-xl.github.io

arXiv:2311.01767 [pdf, other]

PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

Authors: Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, Nan Duan

Abstract: Recent evaluations of Large Language Models (LLMs) have centered around testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, the evaluation of LLMs utilizing complex tools to finish multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence, thus it supports various LLM-generated API sequences. We measure 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy. We find three main error causes in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at https://github.com/gydpku/PPTC.

Submitted 7 November, 2023; v1 submitted 3 November, 2023; originally announced November 2023.

Comments: LLM evaluation, PPT task completion

arXiv:2311.00426 [pdf, other]

Enhanced Generalization through Prioritization and Diversity in Self-Imitation Reinforcement Learning over Procedural Environments with Sparse Rewards

Authors: Alain Andres, Daochen Zha, Javier Del Ser

Abstract: Exploration poses a fundamental challenge in Reinforcement Learning (RL) with sparse rewards, limiting an agent's ability to learn optimal decision-making due to a lack of informative feedback signals. Self-Imitation Learning (self-IL) has emerged as a promising approach for exploration, leveraging a replay buffer to store and reproduce successful behaviors. However, traditional self-IL methods, which rely on high-return transitions and assume singleton environments, face challenges in generalization, especially in procedurally-generated (PCG) environments. Therefore, new self-IL methods have been proposed to rank which experiences to persist, but they replay transitions uniformly regardless of their significance, and do not address the diversity of the stored demonstrations. In this work, we propose tailored self-IL sampling strategies by prioritizing transitions in different ways and extending prioritization techniques to PCG environments. We also address diversity loss through modifications to counteract the impact of generalization requirements and bias introduced by prioritization techniques. Our experimental analysis, conducted over three PCG sparse reward environments, including MiniGrid and ProcGen, highlights the benefits of our proposed modifications, achieving a new state-of-the-art performance in the MiniGrid-MultiRoom-N12-S10 environment.

Submitted 1 November, 2023; originally announced November 2023.

Comments: 7 pages, 5 figures

arXiv:2311.00134 [pdf, other]

Joint Depth Prediction and Semantic Segmentation with Multi-View SAM

Authors: Mykhailo Shvets, Dongxu Zhao, Marc Niethammer, Roni Sengupta, Alexander C. Berg

Abstract: Multi-task approaches to joint depth and segmentation prediction are well-studied for monocular images. Yet, predictions from a single-view are inherently limited, while multiple views are available in many robotics applications. On the other end of the spectrum, video-based and full 3D methods require numerous frames to perform reconstruction and segmentation. With this work we propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from rich semantic features of the Segment Anything Model (SAM). This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder. We report the mutual benefit that both tasks enjoy in our quantitative and qualitative studies on the ScanNet dataset. Our approach consistently outperforms single-task MVS and segmentation models, along with multi-task monocular methods.

Submitted 31 October, 2023; originally announced November 2023.

Comments: To appear in the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision

arXiv:2310.20669 [pdf, other]

Modeling multi-legged robot locomotion with slipping and its experimental validation

Authors: Ziyou Wu, Dan Zhao, Shai Revzen

Abstract: Multi-legged robots with six or more legs are not in common use, despite designs with superior stability, maneuverability, and a low number of actuators being available for over 20 years. This may be in part due to the difficulty in modeling multi-legged motion with slipping and producing reliable predictions of body velocity. Here we present a detailed measurement of the foot contact forces in a hexapedal robot with multiple sliding contacts, and provide an algorithm for predicting these contact forces and the body velocity. The algorithm relies on the recently published observation that even while slipping, multi-legged robots are principally kinematic, and employs a friction law ansatz that allows us to compute the shape-change to body-velocity connection and the foot contact forces. This results in the ability to simulate motion plans for a large number of potentially slipping legs. In homogeneous environments, this can run in (parallel) logarithmic time of the planning horizon.

Submitted 3 January, 2024; v1 submitted 31 October, 2023; originally announced October 2023.

arXiv:2310.20198 [pdf, ps, other]

Structured Two-Stage True-Time-Delay Array Codebook Design for Multi-User Data Communication

Authors: Aditya Wadaskar, Ding Zhao, Ibrahim Pehlivan, Danijela Cabric

Abstract: Wideband millimeter-wave and terahertz (THz) systems can facilitate simultaneous data communication with multiple spatially separated users. It is desirable to orthogonalize users across sub-bands by deploying frequency-dependent beams with a sub-band-specific spatial response. True-Time-Delay (TTD) antenna arrays are a promising wideband architecture to implement sub-band-specific dispersion of beams across space using a single radio frequency (RF) chain. This paper proposes a structured design of analog TTD codebooks to generate beams that exhibit quantized sub-band-to-angle mapping. We introduce a structured Staircase TTD codebook and analyze the frequency-spatial behaviour of the resulting beam patterns. We develop the closed-form two-stage design of the proposed codebook to achieve the desired sub-band-specific beams and evaluate their performance in multi-user communication networks.

Submitted 15 November, 2023; v1 submitted 31 October, 2023; originally announced October 2023.

arXiv:2310.19859 [pdf, other]

Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone

Authors: Zeyinzi Jiang, Chaojie Mao, Ziyuan Huang, Ao Ma, Yiliang Lv, Yujun Shen, Deli Zhao, **gren Zhou

Abstract: Parameter-efficient tuning has become a trend in transferring large-scale foundation models to downstream applications. Existing methods typically embed some light-weight tuners into the backbone, where both the design and the learning of the tuners are highly dependent on the base model. This work offers a new tuning paradigm, dubbed Res-Tuning, which intentionally unbinds tuners from the backbone. With both theoretical and empirical evidence, we show that popular tuning approaches have their equivalent counterparts under our unbinding formulation, and hence can be integrated into our framework effortlessly. Thanks to the structural disentanglement, we manage to free the design of tuners from the network architecture, facilitating flexible combination of various tuning strategies. We further propose a memory-efficient variant of Res-Tuning, where the bypass (i.e., formed by a sequence of tuners) is effectively detached from the main branch, such that the gradients are back-propagated only to the tuners but not to the backbone. Such a detachment also allows one-time backbone forward for multi-task inference. Extensive experiments on both discriminative and generative tasks demonstrate the superiority of our method over existing alternatives from the perspectives of efficacy and efficiency. Project page: https://res-tuning.github.io/.

Submitted 30 October, 2023; originally announced October 2023.

Comments: Accepted to NeurIPS 2023

arXiv:2310.19572 [pdf, other]

Improving Input-label Mapping with Demonstration Replay for In-context Learning

Authors: Zhuocheng Gong, Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan

Abstract: In-context learning (ICL) is an emerging capability of large autoregressive language models where a few input-label demonstrations are appended to the input to enhance the model's understanding of downstream NLP tasks, without directly adjusting the model parameters. The effectiveness of ICL can be attributed to the strong language modeling capabilities of large language models (LLMs), which enable them to learn the mapping between input and labels based on in-context demonstrations. Despite achieving promising results, the causal nature of language modeling in ICL restricts the attention to be backward only, i.e., a token only attends to its previous tokens, failing to capture the full input-label information and limiting the model's performance. In this paper, we propose a novel ICL method called Repeated Demonstration with Sliding Causal Attention (RdSca). Specifically, we duplicate later demonstrations and concatenate them to the front, allowing the model to 'observe' the later information even under the causal restriction. Besides, we introduce sliding causal attention, which customizes causal attention to avoid information leakage. Experimental results show that our method significantly improves the input-label mapping in ICL demonstrations. We also conduct an in-depth analysis of how to customize the causal attention without training, which has been an unexplored area in previous research.

Submitted 30 October, 2023; originally announced October 2023.

arXiv:2310.19070 [pdf, other]

Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection

Authors: Yuanze Li, Haolin Wang, Shihao Yuan, Ming Liu, Debin Zhao, Yiwen Guo, Chen Xu, Guangming Shi, Wangmeng Zuo

Abstract: Existing industrial anomaly detection (IAD) methods predict anomaly scores for both anomaly detection and localization. However, they struggle to support multi-turn dialog and to give detailed descriptions of anomaly regions, e.g., color, shape, and categories of industrial anomalies. Recently, large multimodal (i.e., vision and language) models (LMMs) have shown eminent perception abilities on multiple vision tasks such as image captioning, visual understanding, visual reasoning, etc., making them a competitive choice for more comprehensible anomaly detection. However, the knowledge about anomaly detection is absent in existing general LMMs, while training a specific LMM for anomaly detection requires a tremendous amount of annotated data and massive computation resources. In this paper, we propose a novel large multi-modal model by applying vision experts for industrial anomaly detection (dubbed Myriad), which leads to definite anomaly detection and high-quality anomaly description. Specifically, we adopt MiniGPT-4 as the base LMM and design an Expert Perception module to embed the prior knowledge from vision experts as tokens which are intelligible to Large Language Models (LLMs). To compensate for the errors and confusions of vision experts, we introduce a domain adapter to bridge the visual representation gaps between generic and industrial images. Furthermore, we propose a Vision Expert Instructor, which enables the Q-Former to generate IAD domain vision-language tokens according to the vision expert prior. Extensive experiments on MVTec-AD and VisA benchmarks demonstrate that our proposed method not only performs favorably against state-of-the-art methods under the 1-class and few-shot settings, but also provides definite anomaly prediction along with detailed descriptions in the IAD domain.

Submitted 31 October, 2023; v1 submitted 29 October, 2023; originally announced October 2023.

Comments: 8 pages, 7 figures

arXiv:2310.18257 [pdf, other]

MIM-GAN-based Anomaly Detection for Multivariate Time Series Data

Authors: Shan Lu, Zhicheng Dong, Donghong Cai, Fang Fang, Dongcai Zhao

Abstract: The loss function of a generative adversarial network (GAN) is an important factor that affects the quality and diversity of the generated samples for anomaly detection. In this paper, we propose an unsupervised multiple time series anomaly detection algorithm based on the GAN with message importance measure (MIM-GAN). In particular, the time series data is divided into subsequences using a sliding window. Then a generator and a discriminator designed based on the Long Short-Term Memory (LSTM) are employed to capture the temporal correlations of the time series data. To avoid the local optimal solution of the loss function and the model collapse, we introduce an exponential information measure into the loss function of the GAN. Additionally, a discriminant reconstruction score consisting of discrimination and reconstruction loss is taken into account. The global optimal solution for the loss function is derived and the model collapse is proved to be avoided in our proposed MIM-GAN-based anomaly detection algorithm. Experimental results show that the proposed MIM-GAN-based anomaly detection algorithm has superior performance in terms of precision, recall, and F1 score.

Submitted 25 October, 2023; originally announced October 2023.

Comments: 7 pages, 6 figures
