
UniWorld: Autonomous Driving Pre-training via World Models

Authors: Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, Bin Dai

Abstract: In this paper, we draw inspiration from Alberto Elfes' pioneering work in 1989, where he introduced the concept of the occupancy grid as World Models for robots. We imbue the robot with a spatial-temporal world model, termed UniWorld, to perceive its surroundings and predict the future behavior of other participants. UniWorld involves initially predicting 4D geometric occupancy as the World Models for foundational stage and subsequently fine-tuning on downstream tasks. UniWorld can estimate missing information concerning the world state and predict plausible future states of the world. Besides, UniWorld's pre-training process is label-free, enabling the utilization of massive amounts of image-LiDAR pairs to build a Foundational Model. The proposed unified pre-training framework demonstrates promising results in key tasks such as motion prediction, multi-camera 3D object detection, and surrounding semantic scene completion. When compared to monocular pre-training methods on the nuScenes dataset, UniWorld shows a significant improvement of about 1.5% in IoU for motion prediction, 2.0% in mAP and 2.0% in NDS for multi-camera 3D object detection, as well as a 3% increase in mIoU for surrounding semantic scene completion. By adopting our unified pre-training method, a 25% reduction in 3D training annotation costs can be achieved, offering significant practical value for the implementation of real-world autonomous driving. Codes are publicly available at https://github.com/chaytonmin/UniWorld.

Submitted 14 August, 2023; originally announced August 2023.

Comments: 8 pages, 5 figures. arXiv admin note: substantial text overlap with arXiv:2305.18829

arXiv:2308.06933 [pdf, other]

Radiomics-Informed Deep Learning for Classification of Atrial Fibrillation Sub-Types from Left-Atrium CT Volumes

Authors: Weihang Dai, Xiaomeng Li, Taihui Yu, Di Zhao, Jun Shen, Kwang-Ting Cheng

Abstract: Atrial Fibrillation (AF) is characterized by rapid, irregular heartbeats, and can lead to fatal complications such as heart failure. The disease is divided into two sub-types based on severity, which can be automatically classified through CT volumes for disease screening of severe cases. However, existing classification approaches rely on generic radiomic features that may not be optimal for the task, whilst deep learning methods tend to over-fit to the high-dimensional volume inputs. In this work, we propose a novel radiomics-informed deep-learning method, RIDL, that combines the advantages of deep learning and radiomic approaches to improve AF sub-type classification. Unlike existing hybrid techniques that mostly rely on naïve feature concatenation, we observe that radiomic feature selection methods can serve as an information prior, and propose supplementing low-level deep neural network (DNN) features with locally computed radiomic features. This reduces DNN over-fitting and allows local variations between radiomic features to be better captured. Furthermore, we ensure complementary information is learned by deep and radiomic features by designing a novel feature de-correlation loss. Combined, our method addresses the limitations of deep learning and radiomic approaches and outperforms state-of-the-art radiomic, deep learning, and hybrid approaches, achieving 86.9% AUC for the AF sub-type classification task. Code is available at https://github.com/xmed-lab/RIDL.

Submitted 14 August, 2023; originally announced August 2023.

Comments: Accepted by MICCAI23

arXiv:2308.06891 [pdf]

Viia-hand: a Reach-and-grasp Restoration System Integrating Voice interaction, Computer vision and Auditory feedback for Blind Amputees

Authors: Chunhao Peng, Dapeng Yang, Ming Cheng, Jinghui Dai, Deyu Zhao, Li Jiang

Abstract: Visual feedback plays a crucial role in the process of amputation patients completing grasping in the field of prosthesis control. However, for blind and visually impaired (BVI) amputees, the loss of both visual and grasping abilities makes the "easy" reach-and-grasp task a feasible challenge. In this paper, we propose a novel multi-sensory prosthesis system helping BVI amputees with sensing, navigation and grasp operations. It combines modules of voice interaction, environmental perception, grasp guidance, collaborative control, and auditory/tactile feedback. In particular, the voice interaction module receives user instructions and invokes other functional modules according to the instructions. The environmental perception and grasp guidance module obtains environmental information through computer vision, and feeds the information back to the user through auditory feedback modules (voice prompts and spatial sound sources) and tactile feedback modules (vibration stimulation). The prosthesis collaborative control module obtains the context information of the grasp guidance process and completes the collaborative control of grasp gestures and wrist angles of the prosthesis in conjunction with the user's control intention in order to achieve stable grasp of various objects. This paper details a prototyping design (named viia-hand) and presents its preliminary experimental verification on healthy subjects completing specific reach-and-grasp tasks. Our results showed that, with the help of our new design, the subjects were able to achieve a precise reach and reliable grasp of the target objects in a relatively cluttered environment. Additionally, the system is extremely user-friendly, as users can quickly adapt to it with minimal training.

Submitted 13 August, 2023; originally announced August 2023.

arXiv:2308.04648 [pdf, ps, other]

Communication-Efficient Search under Fully Homomorphic Encryption for Federated Machine Learning

Authors: Dongfang Zhao

Abstract: Homomorphic encryption (HE) has found extensive utilization in federated learning (FL) systems, capitalizing on its dual advantages: (i) ensuring the confidentiality of shared models contributed by participating entities, and (ii) enabling algebraic operations directly on ciphertexts representing encrypted models. Particularly, the approximate fully homomorphic encryption (FHE) scheme, known as CKKS, has emerged as the de facto encryption scheme, notably supporting decimal numbers. While recent research predominantly focuses on enhancing CKKS's encryption rate and evaluation speed in the context of FL, the search operation has been relatively disregarded due to the tendency of some applications to discard intermediate encrypted models. Yet, emerging studies emphasize the importance of managing and searching intermediate models for specific applications like large-scale scientific computing, necessitating robust data provenance and auditing support. To address this, our paper introduces an innovative approach that efficiently searches for a target encrypted value, incurring only a logarithmic number of network interactions. The proposed method capitalizes on CKKS's additive and multiplicative properties on encrypted models, propagating equality comparisons between values through a balanced binary tree structure to ultimately reach a single aggregate. A comprehensive analysis of the proposed algorithm underscores its potential to significantly broaden FL's applicability and impact.

Submitted 8 August, 2023; originally announced August 2023.
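
Illustrative note (not part of the arXiv listing): the abstract of arXiv:2308.04648 above describes propagating equality comparisons through a balanced binary tree so that a search needs only a logarithmic number of rounds. The Python sketch below shows that general idea on plaintext integers only. It is an assumption about how such a reduction could look, not the paper's algorithm: the function name `tree_contains` is hypothetical, the difference-then-multiply encoding of equality is chosen here for illustration, and no encryption is performed. Under real CKKS the values are approximate, so the final zero test would need a tolerance.

```python
def tree_contains(values, target):
    """Plaintext stand-in for a tree-structured equality search.

    Returns (found, rounds): whether `target` occurs in `values`, and how
    many pairwise-combination levels were needed to reach one aggregate.
    """
    if not values:
        return False, 0
    # Leaf level: a difference is exactly zero where a value equals the target.
    # (Assumption: under HE this would be a ciphertext subtraction.)
    level = [v - target for v in values]
    rounds = 0
    while len(level) > 1:
        nxt = []
        # Combine neighbours pairwise; a product is zero iff either factor is.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] * level[i + 1])
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an odd element is carried up unchanged
        level = nxt
        rounds += 1
    # The single aggregate is zero iff at least one leaf matched.
    return level[0] == 0, rounds


if __name__ == "__main__":
    print(tree_contains([3, 7, 11, 20, 42, 5, 9, 8], 42))
    print(tree_contains([3, 7, 11], 4))
```

The number of levels grows as the ceiling of log2 of the input length, which is the property the abstract highlights; everything else in the sketch is a simplification.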

arXiv:2308.00918 [pdf, other]

A Novel Cross-Perturbation for Single Domain Generalization

Authors: Dongjia Zhao, Lei Qi, Xiao Shi, Yinghuan Shi, Xin Geng

Abstract: Single domain generalization aims to enhance the ability of the model to generalize to unknown domains when trained on a single source domain. However, the limited diversity in the training data hampers the learning of domain-invariant features, resulting in compromised generalization performance. To address this, data perturbation (augmentation) has emerged as a crucial method to increase data diversity. Nevertheless, existing perturbation methods often focus on either image-level or feature-level perturbations independently, neglecting their synergistic effects. To overcome these limitations, we propose CPerb, a simple yet effective cross-perturbation method. Specifically, CPerb utilizes both horizontal and vertical operations. Horizontally, it applies image-level and feature-level perturbations to enhance the diversity of the training data, mitigating the issue of limited diversity in single-source domains. Vertically, it introduces multi-route perturbation to learn domain-invariant features from different perspectives of samples with the same semantic category, thereby enhancing the generalization capability of the model. Additionally, we propose MixPatch, a novel feature-level perturbation method that exploits local image style information to further diversify the training data. Extensive experiments on various benchmark datasets validate the effectiveness of our method.

Submitted 7 June, 2024; v1 submitted 1 August, 2023; originally announced August 2023.

Comments: Accepted by IEEE TCSVT

arXiv:2307.15061 [pdf, other]

The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation

Authors: Lingdong Kong, Yaru Niu, Shaoyuan Xie, Hanjiang Hu, Lai Xing Ng, Benoit R. Cottereau, Ding Zhao, Liangjun Zhang, Hesheng Wang, Wei Tsang Ooi, Ruijie Zhu, Ziyang Song, Li Liu, Tianzhu Zhang, Jun Yu, Mohan Jing, Pengwei Li, Xiaohua Qi, Cheng Jin, Yingfeng Chen, Jie Hou, Jie Zhang, Zhen Kan, Qiang Ling, Liang Peng, et al. (18 additional authors not shown)

Abstract: Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, suffer inevitably from real-world corruptions and perturbations and struggle to provide reliable depth predictions in such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. This challenge was developed based on the newly established KITTI-C and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis on robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions have appeared, with novel designs covering the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses along with insightful observations are drawn to better understand the rationale behind each design. We hope this challenge could lay a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website.

Submitted 27 July, 2023; originally announced July 2023.

Comments: Technical Report; 65 pages, 34 figures, 24 tables; Code at https://github.com/ldkong1205/RoboDepth

arXiv:2307.15049 [pdf, other]

Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models

Authors: Kecheng Zheng, Wei Wu, Ruili Feng, Kai Zhu, Jiawei Liu, Deli Zhao, Zheng-Jun Zha, Wei Chen, Yujun Shen

Abstract: Prompt tuning and adapter tuning have shown great potential in transferring pre-trained vision-language models (VLMs) to various downstream tasks. In this work, we design a new type of tuning method, termed as regularized mask tuning, which masks the network parameters through a learnable selection. Inspired by neural pathways, we argue that the knowledge required by a downstream task already exists in the pre-trained weights but just gets concealed in the upstream pre-training stage. To bring the useful knowledge back into light, we first identify a set of parameters that are important to a given downstream task, then attach a binary mask to each parameter, and finally optimize these masks on the downstream data with the parameters frozen. When updating the mask, we introduce a novel gradient dropout strategy to regularize the parameter selection, in order to prevent the model from forgetting old knowledge and overfitting the downstream data. Experimental results on 11 datasets demonstrate the consistent superiority of our method over previous alternatives. It is noteworthy that we manage to deliver 18.73% performance improvement compared to the zero-shot CLIP via masking an average of only 2.56% parameters. Furthermore, our method is synergistic with most existing parameter-efficient tuning methods and can boost the performance on top of them. Project page can be found here (https://wuw2019.github.io/R-AMT/).

Submitted 6 August, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

Comments: Accepted at ICCV 2023
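
Illustrative note (not part of the arXiv listing): the abstract of arXiv:2307.15049 above describes attaching a binary mask to each parameter and optimizing the masks while the pre-trained parameters stay frozen, with a gradient dropout step as regularization. The Python sketch below is a toy, assumption-laden rendering of that idea for a single linear unit. The names `binarize`, `predict`, and `mask_tuning_step` are hypothetical, the real-valued scores with a straight-through gradient are a common technique swapped in here rather than the paper's confirmed method, and the uniform random gradient dropout is only a stand-in for the paper's strategy, whose details the abstract does not give.

```python
import random


def binarize(scores, threshold=0.0):
    """Turn real-valued mask scores into a 0/1 mask (hypothetical rule)."""
    return [1.0 if s >= threshold else 0.0 for s in scores]


def predict(weights, mask, x):
    """Linear prediction using only the unmasked frozen weights."""
    return sum(w * m * xi for w, m, xi in zip(weights, mask, x))


def mask_tuning_step(weights, scores, x, y, lr=0.1, drop_prob=0.5, rng=random):
    """One update of the mask scores; `weights` is never modified.

    Loss is 0.5 * (prediction - y) ** 2. The gradient with respect to each
    score is approximated by the gradient with respect to its mask entry
    (straight-through estimator, an assumption of this sketch).
    """
    mask = binarize(scores)
    err = predict(weights, mask, x) - y
    new_scores = []
    for s, w, xi in zip(scores, weights, x):
        grad = err * w * xi
        # Assumed stand-in for gradient dropout: zero some gradients at random.
        if rng.random() < drop_prob:
            grad = 0.0
        new_scores.append(s - lr * grad)
    return new_scores


if __name__ == "__main__":
    frozen = [2.0, -1.0]
    scores = [1.0, 1.0]
    for _ in range(20):
        scores = mask_tuning_step(frozen, scores, [1.0, 1.0], 2.0, drop_prob=0.0)
    print(binarize(scores), predict(frozen, binarize(scores), [1.0, 1.0]))
```

In this toy run the mask on the second weight is eventually switched off, which is the sense in which selection alone, without changing any weight value, can fit the downstream target.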

arXiv:2307.14778 [pdf, other]

MATNilm: Multi-appliance-task Non-intrusive Load Monitoring with Limited Labeled Data

Authors: Jing Xiong, Tianqi Hong, Dongbo Zhao, Yu Zhang

Abstract: Non-intrusive load monitoring (NILM) identifies the status and power consumption of various household appliances by disaggregating the total power usage signal of an entire house. Efficient and accurate load monitoring facilitates user profile establishment, intelligent household energy management, and peak load shifting. This is beneficial for both the end-users and utilities by improving the overall efficiency of a power distribution network. Existing approaches mainly focus on developing an individual model for each appliance. Those approaches typically rely on a large amount of household-labeled data which is hard to collect. In this paper, we propose a multi-appliance-task framework with a training-efficient sample augmentation (SA) scheme that boosts the disaggregation performance with limited labeled data. For each appliance, we develop a shared-hierarchical split structure for its regression and classification tasks. In addition, we also propose a two-dimensional attention mechanism in order to capture spatio-temporal correlations among all appliances. With only one-day training data and limited appliance operation profiles, the proposed SA algorithm can achieve comparable test performance to the case of training with the full dataset. Finally, simulation results show that our proposed approach features a significantly improved performance over many baseline models. The relative errors can be reduced by more than 50% on average. The codes of this work are available at https://github.com/jxiong22/MATNilm

Submitted 29 July, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

arXiv:2307.12533 [pdf, ps, other]

PUMA: Secure Inference of LLaMA-7B in Five Minutes

Authors: Ye Dong, Wen-jie Lu, Yancheng Zheng, Haoqi Wu, Derun Zhao, Jin Tan, Zhicong Huang, Cheng Hong, Tao Wei, Wenguang Chen

Abstract: With ChatGPT as a representative, tons of companies have begun to provide services based on large Transformer models. However, using such a service inevitably leaks users' prompts to the model provider. Previous studies have examined secure inference for Transformer models using secure multiparty computation (MPC), where model parameters and clients' prompts are kept secret. Despite this, these frameworks are still limited in terms of model performance, efficiency, and deployment. To address these limitations, we propose the PUMA framework to enable fast and secure Transformer model inference. Our framework designs high-quality approximations for expensive functions such as GeLU and softmax, and significantly reduces the cost of secure inference while preserving the model performance. Additionally, we design secure Embedding and LayerNorm procedures that faithfully implement the desired functionality without undermining the Transformer architecture. PUMA is about $2\times$ faster than the state-of-the-art framework MPCFORMER (ICLR 2023) and has similar accuracy to plaintext models without fine-tuning (which the previous works failed to achieve). PUMA can even evaluate LLaMA-7B in around 5 minutes to generate 1 token. To the best of our knowledge, this is the first time that a model with such a parameter size has been evaluated under MPC. PUMA has been open-sourced in the GitHub repository of SecretFlow-SPU.

Submitted 26 September, 2023; v1 submitted 24 July, 2023; originally announced July 2023.

arXiv:2307.12334 [pdf, other]

doi 10.3847/1538-4357/acf835

Toward a Physical Understanding of Galaxy-Halo Alignment

Authors: Kun Xu, Y. P. Jing, Donghai Zhao

Abstract: We investigate the alignment of galaxy and halo orientations using the TNG300-1 hydrodynamical simulation. Our analysis reveals that the distribution of the 2D misalignment angle $\theta_{\rm{2D}}$ can be well described by a truncated shifted exponential (TSE) distribution with only one free parameter across different redshifts and galaxy/halo properties. We demonstrate that the galaxy-ellipticity (GI) correlations of galaxies can be reproduced by perturbing halo orientations with the obtained $\theta_{\rm{2D}}$ distribution, with only a small bias ($<3^{\circ}$) possibly arising from unaccounted couplings between $\theta_{\rm{2D}}$ and other factors. We find that both the 2D and 3D misalignment angles $\theta_{\rm{2D}}$ and $\theta_{\rm{3D}}$ decrease with ex situ stellar mass fraction $F_{\rm{acc}}$, halo mass $M_{\rm{vir}}$ and stellar mass $M_{*}$, while increasing with disk-to-total stellar mass fraction $F_{\rm{disk}}$ and redshift. These dependences are in good agreement with our recent observational study based on the BOSS galaxy samples. Our results suggest that $F_{\rm{acc}}$ is a key factor in determining the galaxy-halo alignment. Grouping galaxies by $F_{\rm{acc}}$ nearly eliminates the dependence of $\theta_{\rm{3D}}$ on $M_{\rm{vir}}$ for all three principal axes, and also reduces the redshift dependence. For $\theta_{\rm{2D}}$, we find a more significant redshift dependence than for $\theta_{\rm{3D}}$ even after controlling for $F_{\rm{acc}}$, which may be attributed to the evolution of galaxy and halo shapes. Our findings present a valuable model for observational studies and enhance our understanding of galaxy-halo alignment.

Submitted 5 November, 2023; v1 submitted 23 July, 2023; originally announced July 2023.

Comments: 19 pages, 12 figures, Published in ApJ

Journal ref: The Astrophysical Journal, Volume 957, 2023, Number 1

arXiv:2307.10485 [pdf, other]

FinGPT: Democratizing Internet-scale Data for Financial Large Language Models

Authors: Xiao-Yang Liu, Guoxuan Wang, Hongyang Yang, Daochen Zha

Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like texts, which may potentially revolutionize the finance industry. However, existing LLMs often fall short in the financial field, which is mainly attributed to the disparities between general text data and financial text data. Unfortunately, there is only a limited number of financial text datasets available, and BloombergGPT, the first financial LLM (FinLLM), is closed-source (only the training logs were released). In light of this, we aim to democratize Internet-scale financial data for LLMs, which is an open challenge due to diverse data sources, low signal-to-noise ratio, and high time-validity. To address the challenges, we introduce an open-source and data-centric framework, Financial Generative Pre-trained Transformer (FinGPT), that automates the collection and curation of real-time financial data from 34 diverse sources on the Internet, providing researchers and practitioners with accessible and transparent resources to develop their FinLLMs. Additionally, we propose a simple yet effective strategy for fine-tuning FinLLM using the inherent feedback from the market, dubbed Reinforcement Learning with Stock Prices (RLSP). We also adopt the Low-rank Adaptation (LoRA, QLoRA) method that enables users to customize their own FinLLMs from general-purpose LLMs at a low cost. Finally, we showcase several FinGPT applications, including robo-advisor, sentiment analysis for algorithmic trading, and low-code development. FinGPT aims to democratize FinLLMs, stimulate innovation, and unlock new opportunities in open finance. The codes have been open-sourced.

Submitted 14 November, 2023; v1 submitted 19 July, 2023; originally announced July 2023.

Comments: 43 pages, 8 tables, and 2 figures

arXiv:2307.09823 [pdf, other]

Multi-modal Learning based Prediction for Disease

Authors: Yaran Chen, Xueyu Chen, Yu Han, Haoran Li, Dongbin Zhao, Jingzhong Li, Xu Wang

Abstract: Non-alcoholic fatty liver disease (NAFLD) is the most common cause of chronic liver disease, which can be predicted accurately to prevent advanced fibrosis and cirrhosis. However, a liver biopsy, the gold standard for NAFLD diagnosis, is invasive, expensive, and prone to sampling errors. Therefore, non-invasive studies are extremely promising, yet they are still in their infancy due to the lack of comprehensive research data and intelligent methods for multi-modal data. This paper proposes a NAFLD diagnosis system (DeepFLDDiag) combining a comprehensive clinical dataset (FLDData) and a multi-modal learning based NAFLD prediction method (DeepFLD). The dataset includes over 6000 participants' physical examinations, laboratory and imaging studies, extensive questionnaires, and facial images of some participants, which is comprehensive and valuable for clinical studies. From the dataset, we quantitatively analyze and select the clinical metadata that contribute most to NAFLD prediction. Furthermore, the proposed DeepFLD, a deep neural network model designed to predict NAFLD using multi-modal input, including metadata and facial images, outperforms the approach that only uses metadata. Satisfactory performance is also verified on other unseen datasets. Encouragingly, DeepFLD can achieve competitive results using only facial images as input rather than metadata, paving the way for a more robust and simpler non-invasive NAFLD diagnosis.

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2307.09481 [pdf, other]

AnyDoor: Zero-shot Object-level Image Customization

Authors: Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, Hengshuang Zhao

Abstract: This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations in a harmonious way. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain texture details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications, such as virtual try-on and object moving. Project page is https://damo-vilab.github.io/AnyDoor-Page/.

Submitted 7 May, 2024; v1 submitted 18 July, 2023; originally announced July 2023.

Comments: CVPR2024

arXiv:2307.08918 [pdf, other]

Measuring Scale-dependent Shape Anisotropy by Coarse-Graining: Application to Inhomogeneous Rayleigh-Taylor Turbulence

Authors: Dongxiao Zhao, Hussein Aluie

Abstract: We generalize the 'filtering spectrum' [1] to probe scales along different directions by spatial coarse-graining. This multi-dimensional filtering spectrum quantifies the spectral content of flows that are not necessarily homogeneous. From multi-dimensional spectral information, we propose a simple metric for shape anisotropy at various scales. The method is applied to simulations of 2D and 3D Rayleigh-Taylor (RT) turbulence, which is inhomogeneous and anisotropic. We show that 3D RT has clear shape anisotropy at large scales with approximately $4:3$ vertical to horizontal aspect ratio, but tends toward isotropy at small scales as expected [2,3,4]. In sharp contrast, we find that RT in 2D simulations, which are still the main modeling framework for many applications, is isotropic at large scales and its shape anisotropy increases at smaller scales where structures tend to be horizontally elongated. While this may be surprising, it is consistent with recent results in [5]; large-scale isotropy in 2D RT is due to the generation of a large-scale overturning circulation via an upscale cascade, while small scale anisotropy is due to the stable stratification resultant from such overturning and the inefficient mixing in 2D.

Submitted 18 October, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

arXiv:2307.07907 [pdf, other]

Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation

Authors: Wenhao Ding, Laixi Shi, Yuejie Chi, Ding Zhao

Abstract: Robustness has been extensively studied in reinforcement learning (RL) to handle various forms of uncertainty such as random perturbations, rare events, and malicious attacks. In this work, we consider one critical type of robustness against spurious correlation, where different portions of the state do not have correlations induced by unobserved confounders. These spurious correlations are ubiquitous in real-world tasks, for instance, a self-driving car usually observes heavy traffic in the daytime and light traffic at night due to unobservable human activity. A model that learns such useless or even harmful correlation could catastrophically fail when the confounder in the test case deviates from the training one. Although motivated, enabling robustness against spurious correlation poses significant challenges since the uncertainty set, shaped by the unobserved confounder and causal structure, is difficult to characterize and identify. Existing robust algorithms that assume simple and unstructured uncertainty sets are therefore inadequate to address this challenge. To solve this issue, we propose Robust State-Confounded Markov Decision Processes (RSC-MDPs) and theoretically demonstrate its superiority in avoiding learning spurious correlations compared with other robust RL counterparts. We also design an empirical algorithm to learn the robust optimal policy for RSC-MDPs, which outperforms all baselines in eight realistic self-driving and manipulation tasks.

Submitted 25 October, 2023; v1 submitted 15 July, 2023; originally announced July 2023.

Comments: Accepted to NeurIPS 2023

arXiv:2307.05689 [pdf, other]

Magnetar emergence in a peculiar gamma-ray burst from a compact star merger

Authors: H. Sun, C. -W. Wang, J. Yang, B. -B. Zhang, S. -L. Xiong, Y. -H. I. Yin, Y. Liu, Y. Li, W. -C. Xue, Z. Yan, C. Zhang, W. -J. Tan, H. -W. Pan, J. -C. Liu, H. -Q. Cheng, Y. -Q. Zhang, J. -W. Hu, C. Zheng, Z. -H. An, C. Cai, L. Hu, C. Jin, D. -Y. Li, X. -Q. Li, H. -Y. Liu, et al. (19 additional authors not shown)

Abstract: The central engine that powers gamma-ray bursts (GRBs), the most powerful explosions in the universe, is still not identified. Besides hyper-accreting black holes, rapidly spinning and highly magnetized neutron stars, known as millisecond magnetars, have been suggested to power both long and short GRBs. The presence of a magnetar engine following compact star mergers is of particular interest as it would provide essential constraints on the poorly understood equation of state for neutron stars. Indirect indications of a magnetar engine in these merger sources have been observed in the form of plateau features present in the X-ray afterglow light curves of some short GRBs. Additionally, some X-ray transients lacking gamma-ray bursts (GRB-less) have been identified as potential magnetar candidates originating from compact star mergers. Nevertheless, smoking gun evidence is still lacking for a magnetar engine in short GRBs, and the associated theoretical challenges have been addressed. Here we present a comprehensive analysis of the broad-band prompt emission data of a peculiar, very bright GRB 230307A. Despite its apparently long duration, the prompt emission and host galaxy properties point toward a compact star merger origin, being consistent with its association with a kilonova. More intriguingly, an extended X-ray emission component emerges as the $γ$-ray emission dies out, signifying the emergence of a magnetar central engine. We also identify an achromatic temporal break in the high-energy band during the prompt emission phase, which was never observed in previous bursts and reveals a narrow jet with half opening angle of approximately $3.4^\circ$.

Submitted 11 July, 2023; originally announced July 2023.

Comments: 44 pages, 10 figures, 5 tables

arXiv:2307.02869 [pdf, other]

MomentDiff: Generative Video Moment Retrieval from Random to Real

Authors: Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun Zheng, Deli Zhao, Yongdong Zhang

Abstract: Video moment retrieval pursues an efficient and generalized solution to identify the specific temporal segments within an untrimmed video that correspond to a given language description. To achieve this goal, we provide a generative diffusion-based framework called MomentDiff, which simulates a typical human retrieval process from random browsing to gradual localization. Specifically, we first diffuse the real span to random noise, and learn to denoise the random noise to the original span with the guidance of similarity between text and video. This allows the model to learn a mapping from arbitrary random locations to real moments, enabling the ability to locate segments from random initialization. Once trained, MomentDiff could sample random temporal segments as initial guesses and iteratively refine them to generate an accurate temporal boundary. Different from discriminative works (e.g., based on learnable proposals or queries), MomentDiff with random initialized spans could resist the temporal location biases from datasets. To evaluate the influence of the temporal location biases, we propose two anti-bias datasets with location distribution shifts, named Charades-STA-Len and Charades-STA-Mom. The experimental results demonstrate that our efficient framework consistently outperforms state-of-the-art methods on three public benchmarks, and exhibits better generalization and robustness on the proposed anti-bias datasets. The code, model, and anti-bias evaluation datasets are available at https://github.com/IMCCretrieval/MomentDiff.

Submitted 11 October, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

Comments: 19 pages, 6 figures

arXiv:2307.02127 [pdf, other]

Leveraging Denoised Abstract Meaning Representation for Grammatical Error Correction

Authors: Hejing Cao, Dongyan Zhao

Abstract: Grammatical Error Correction (GEC) is the task of correcting errorful sentences into grammatically correct, semantically consistent, and coherent sentences. Popular GEC models either use large-scale synthetic corpora or use a large number of human-designed rules. The former is costly to train, while the latter requires quite a lot of human expertise. In recent years, AMR, a semantic representation framework, has been widely used by many natural language tasks due to its completeness and flexibility. A non-negligible concern is that AMRs of grammatically incorrect sentences may not be exactly reliable. In this paper, we propose the AMR-GEC, a seq-to-seq model that incorporates denoised AMR as additional knowledge. Specifically, we design a semantic aggregated GEC model and explore denoising methods to make AMRs more reliable. Experiments on the BEA-2019 shared task and the CoNLL-2014 shared task have shown that AMR-GEC performs comparably to a set of strong baselines with a large number of synthetic data. Compared with the T5 model with synthetic data, AMR-GEC can reduce the training time by 32% while inference time is comparable. To the best of our knowledge, we are the first to incorporate AMR for grammatical error correction.

Submitted 5 July, 2023; originally announced July 2023.

Comments: 7 pages, 3 figures, Accepted by ACL findings 2023

arXiv:2307.02060 [pdf, other]

doi 10.1002/rob.22209

Traversability Analysis for Autonomous Driving in Complex Environment: A LiDAR-based Terrain Modeling Approach

Authors: Hanzhang Xue, Hao Fu, Liang Xiao, Yiming Fan, Dawei Zhao, Bin Dai

Abstract: For autonomous driving, traversability analysis is one of the most basic and essential tasks. In this paper, we propose a novel LiDAR-based terrain modeling approach, which could output stable, complete and accurate terrain models and traversability analysis results. As terrain is an inherent property of the environment that does not change with different view angles, our approach adopts a multi-frame information fusion strategy for terrain modeling. Specifically, a normal distributions transform mapping approach is adopted to accurately model the terrain by fusing information from consecutive LiDAR frames. Then the spatial-temporal Bayesian generalized kernel inference and bilateral filtering are utilized to promote the stability and completeness of the results while simultaneously retaining the sharp terrain edges. Based on the terrain modeling results, the traversability of each region is obtained by performing geometric connectivity analysis between neighboring terrain regions. Experimental results show that the proposed method could run in real-time and outperforms state-of-the-art approaches.

Submitted 5 July, 2023; originally announced July 2023.

Comments: accepted to Journal of Field Robotics

Journal ref: Journal of Field Robotics, 2023, 1-25

arXiv:2307.00731 [pdf, other]

doi 10.1088/1674-4527/ace179

Reciprocating Magnetic Fields in the Pulsar Wind Observed from the Black Widow Pulsar J1720-0534

Authors: Chen-Chen Miao, Victoria Blackmon, Wei-Wei Zhu, Dong-Zi Li, Mingyu Ge, Xiao-Peng You, Maura McLaughlin, Di Li, Na Wang, Pei Wang, Jia-Rui Niu, M. Cruces, Jian-Ping Yuan, Jun-Tao Bai, D. J. Champion, Yu-Tong Chen, Ming-Min Chi, P. C. C. Freire, Yi Feng, Zhen-Ye Gan, M. Kramer, Fei-Fei Kou, Yu-Xi Li, Xue-Li Miao, Ling-Qi Meng, et al. (19 additional authors not shown)

Abstract: We report the radio observations of the eclipsing black widow pulsar J1720-0534, a 3.26 ms pulsar in orbit with a low mass companion of mass 0.029 to 0.034 M$_{\odot}$. We obtain the phase-connected timing ephemeris and polarization profile of this millisecond pulsar (MSP) using the Five-hundred-meter Aperture Spherical Radio Telescope (FAST), the Green Bank Telescope (GBT), and the Parkes Telescope. For the first time from such a system, an oscillatory polarization angle change was observed from a particular eclipse egress with partial depolarization, indicating 10-milliGauss-level reciprocating magnetic fields oscillating in a length scale of 5000 km (assuming an orbital inclination angle of 90 degrees) outside the companion's magnetosphere. The dispersion measure variation observed during the ingresses and egresses shows the rapid rise of the electron density in the shock boundary between the companion's magnetosphere and the surrounding pulsar wind. We suggest that the observed oscillatory magnetic fields originate from the pulsar wind outside the companion's magnetosphere.

Submitted 28 August, 2023; v1 submitted 2 July, 2023; originally announced July 2023.

Comments: 15 pages, 8 figures, 1 table, accepted by RAA

arXiv:2306.15864 [pdf, other]

What Went Wrong? Closing the Sim-to-Real Gap via Differentiable Causal Discovery

Authors: Peide Huang, Xilun Zhang, Ziang Cao, Shiqi Liu, Mengdi Xu, Wenhao Ding, Jonathan Francis, Bingqing Chen, Ding Zhao

Abstract: Training control policies in simulation is more appealing than on real robots directly, as it allows for exploring diverse states in an efficient manner. Yet, robot simulators inevitably exhibit disparities from the real-world dynamics, yielding inaccuracies that manifest as the dynamical simulation-to-reality (sim-to-real) gap. Existing literature has proposed to close this gap by actively modifying specific simulator parameters to align the simulated data with real-world observations. However, the set of tunable parameters is usually manually selected to reduce the search space in a case-by-case manner, which is hard to scale up for complex systems and requires extensive domain knowledge. To address the scalability issue and automate the parameter-tuning process, we introduce COMPASS, which aligns the simulator with the real world by discovering the causal relationship between the environment parameters and the sim-to-real gap. Concretely, our method learns a differentiable mapping from the environment parameters to the differences between simulated and real-world robot-object trajectories. This mapping is governed by a simultaneously learned causal graph to help prune the search space of parameters, provide better interpretability, and improve generalization on unseen parameters. We perform experiments to achieve both sim-to-sim and sim-to-real transfer, and show that our method has significant improvements in trajectory alignment and task success rate over strong baselines in several challenging manipulation tasks.

Submitted 19 October, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

arXiv:2306.14436 [pdf, other]

Silca: Singular Caching of Homomorphic Encryption for Outsourced Databases in Cloud Computing

Authors: Dongfang Zhao

Abstract: Ensuring the confidentiality and privacy of sensitive information in cloud computing and outsourced databases is crucial. Homomorphic encryption (HE) offers a solution by enabling computations on encrypted data without decryption, allowing secure outsourcing while maintaining data confidentiality. However, HE faces performance challenges in query-intensive databases. To address this, we propose two novel optimizations, Silca and SilcaZ, tailored to outsourced databases in cloud computing. Silca utilizes a singular caching technique to reduce computational overhead, while SilcaZ leverages modular arithmetic operations to ensure the applicability of singular caching for intensive HE operations. We prove the semantic security of Silca and SilcaZ and implement them with CKKS and BGV in HElib as MySQL loadable functions. Extensive experiments with seven real-world datasets demonstrate their superior performance compared to existing HE schemes, bridging the gap between theoretical advancements and practical applications in applying HE schemes on outsourced databases in cloud computing.

Submitted 26 June, 2023; originally announced June 2023.

arXiv:2306.12619 [pdf, other]

Class-Incremental Learning based on Label Generation

Authors: Yijia Shao, Yiduo Guo, Dongyan Zhao, Bing Liu

Abstract: Despite the great success of pre-trained language models, it is still a challenge to use these models for continual learning, especially for the class-incremental learning (CIL) setting due to catastrophic forgetting (CF). This paper reports our finding that if we formulate CIL as a continual label generation problem, CF is drastically reduced and the generalizable representations of pre-trained models can be better retained. We thus propose a new CIL method (VAG) that also leverages the sparsity of vocabulary to focus the generation and creates pseudo-replay samples by using label semantics. Experimental results show that VAG outperforms baselines by a large margin.

Submitted 20 July, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

Comments: 12 pages, ACL 2023 Main Conference

arXiv:2306.11546 [pdf, other]

Bullying10K: A Large-Scale Neuromorphic Dataset towards Privacy-Preserving Bullying Recognition

Authors: Yiting Dong, Yang Li, Dongcheng Zhao, Guobin Shen, Yi Zeng

Abstract: The prevalence of violence in daily life poses significant threats to individuals' physical and mental well-being. Using surveillance cameras in public spaces has proven effective in proactively deterring and preventing such incidents. However, concerns regarding privacy invasion have emerged due to their widespread deployment. To address the problem, we leverage Dynamic Vision Sensors (DVS) cameras to detect violent incidents and preserve privacy, since they capture pixel brightness variations instead of static imagery. We introduce the Bullying10K dataset, encompassing various actions, complex movements, and occlusions from real-life scenarios. It provides three benchmarks for evaluating different tasks: action recognition, temporal action localization, and pose estimation. With 10,000 event segments, totaling 12 billion events and 255 GB of data, Bullying10K contributes significantly by balancing violence detection and personal privacy preservation. It also poses a challenge to the neuromorphic dataset. It will serve as a valuable resource for training and developing privacy-protecting video systems. The Bullying10K opens new possibilities for innovative approaches in these domains.

Submitted 23 October, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: Accepted at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

arXiv:2306.11251 [pdf, other]

Eliminating Lipschitz Singularities in Diffusion Models

Authors: Zhantao Yang, Ruili Feng, Han Zhang, Yujun Shen, Kai Zhu, Lianghua Huang, Yifei Zhang, Yu Liu, Deli Zhao, Jingren Zhou, Fan Cheng

Abstract: Diffusion models, which employ stochastic differential equations to sample images through integrals, have emerged as a dominant class of generative models. However, the rationality of the diffusion process itself receives limited attention, leaving the question of whether the problem is well-posed and well-conditioned. In this paper, we uncover a vexing propensity of diffusion models: they frequently exhibit an infinite Lipschitz constant near the zero point of timesteps. This poses a threat to the stability and accuracy of the diffusion process, which relies on integral operations. We provide a comprehensive evaluation of the issue from both theoretical and empirical perspectives. To address this challenge, we propose a novel approach, dubbed E-TSDM, which eliminates the Lipschitz singularity of the diffusion model near zero. Remarkably, our technique yields a substantial improvement in performance, e.g., on the high-resolution FFHQ dataset ($256\times256$). Moreover, as a byproduct of our method, we manage to achieve a dramatic reduction in the Fréchet Inception Distance of other acceleration methods relying on network Lipschitz, including DDIM and DPM-Solver, by over 33%. We conduct extensive experiments on diverse datasets to validate our theory and method. Our work not only advances the understanding of the general diffusion process, but also provides insights for the design of diffusion models.

Submitted 19 June, 2023; originally announced June 2023.

arXiv:2306.10280 [pdf, other]

OpenGSL: A Comprehensive Benchmark for Graph Structure Learning

Authors: Zhiyao Zhou, Sheng Zhou, Bochao Mao, Xuanyi Zhou, Jiawei Chen, Qiaoyu Tan, Daochen Zha, Yan Feng, Chun Chen, Can Wang

Abstract: Graph Neural Networks (GNNs) have emerged as the de facto standard for representation learning on graphs, owing to their ability to effectively integrate graph topology and node attributes. However, the inherent suboptimal nature of node connections, resulting from the complex and contingent formation process of graphs, presents significant challenges in modeling them effectively. To tackle this issue, Graph Structure Learning (GSL), a family of data-centric learning approaches, has garnered substantial attention in recent years. The core concept behind GSL is to jointly optimize the graph structure and the corresponding GNN models. Despite the proposal of numerous GSL methods, the progress in this field remains unclear due to inconsistent experimental protocols, including variations in datasets, data processing techniques, and splitting strategies. In this paper, we introduce OpenGSL, the first comprehensive benchmark for GSL, aimed at addressing this gap. OpenGSL enables a fair comparison among state-of-the-art GSL methods by evaluating them across various popular datasets using uniform data processing and splitting strategies. Through extensive experiments, we observe that existing GSL methods do not consistently outperform vanilla GNN counterparts. We also find that there is no significant correlation between the homophily of the learned structure and task performance, challenging the common belief. Moreover, we observe that the learned graph structure demonstrates a strong generalization ability across different GNN models, despite the high computational and space consumption. We hope that our open-sourced library will facilitate rapid and equitable evaluation and inspire further innovative research in this field. The code of the benchmark can be found at https://github.com/OpenGSL/OpenGSL.

Submitted 23 December, 2023; v1 submitted 17 June, 2023; originally announced June 2023.

Comments: 25 pages, 21 figures. Camera-ready version for NeurIPS Datasets and Benchmarks Track 2023

arXiv:2306.09303 [pdf, other]

Datasets and Benchmarks for Offline Safe Reinforcement Learning

Authors: Zuxin Liu, Zijian Guo, Haohong Lin, Yihang Yao, Jiacheng Zhu, Zhepeng Cen, Hanjiang Hu, Wenhao Yu, Tingnan Zhang, Jie Tan, Ding Zhao

Abstract: This paper presents a comprehensive benchmarking suite tailored to offline safe reinforcement learning (RL) challenges, aiming to foster progress in the development and evaluation of safe learning algorithms in both the training and deployment phases. Our benchmark suite contains three packages: 1) expertly crafted safe policies, 2) D4RL-styled datasets along with environment wrappers, and 3) high-quality offline safe RL baseline implementations. We feature a methodical data collection pipeline powered by advanced safe RL algorithms, which facilitates the generation of diverse datasets across 38 popular safe RL tasks, from robot control to autonomous driving. We further introduce an array of data post-processing filters, capable of modifying each dataset's diversity, thereby simulating various data collection conditions. Additionally, we provide elegant and extensible implementations of prevalent offline safe RL algorithms to accelerate research in this area. Through extensive experiments with over 50000 CPU and 800 GPU hours of computations, we evaluate and compare the performance of these baseline algorithms on the collected datasets, offering insights into their strengths, limitations, and potential areas of improvement. Our benchmarking framework serves as a valuable resource for researchers and practitioners, facilitating the development of more robust and reliable offline safe RL solutions in safety-critical applications. The benchmark website is available at www.offline-saferl.org.

Submitted 16 June, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: 22 pages, 13 figures, 7 tables

arXiv:2306.09273 [pdf, other]

Your Room is not Private: Gradient Inversion Attack on Reinforcement Learning

Authors: Miao Li, Wenhao Ding, Ding Zhao

Abstract: The prominence of embodied Artificial Intelligence (AI), which empowers robots to navigate, perceive, and engage within virtual environments, has attracted significant attention, owing to the remarkable advancements in computer vision and large language models. Privacy emerges as a pivotal concern within the realm of embodied AI, as the robot accesses substantial personal information. However, the issue of privacy leakage in embodied AI tasks, particularly in relation to reinforcement learning algorithms, has not received adequate consideration in research. This paper aims to address this gap by proposing an attack on the value-based algorithm and the gradient-based algorithm, utilizing gradient inversion to reconstruct states, actions, and supervision signals. The choice of using gradients for the attack is motivated by the fact that commonly employed federated learning techniques solely utilize gradients computed based on private user data to optimize models, without storing or transmitting the data to public servers. Nevertheless, these gradients contain sufficient information to potentially expose private data. To validate our approach, we conduct experiments on the AI2THOR simulator and evaluate our algorithm on active perception, a prevalent task in embodied AI. The experimental results demonstrate the effectiveness of our method in successfully reconstructing all information from the data across 120 room layouts.

Submitted 17 September, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: 7 pages, 4 figures, 2 tables

arXiv:2306.07707 [pdf, other]

Incentive-Compatible Selection for One or Two Influentials

Authors: Yuxin Zhao, Yao Zhang, Dengji Zhao

Abstract: Selecting influentials in networks against strategic manipulations has attracted many researchers' attention and it also has many practical applications. Here, we aim to select one or two influentials in terms of progeny (the influential power) and prevent agents from manipulating their edges (incentive compatibility). The existing studies mostly focused on selecting a single influential for this setting. Zhang et al. [2021] studied the problem of selecting one agent and proved an upper bound of 1/(1+ln2) to approximate the optimal selection. In this paper, we first design a mechanism to actually reach the bound. Then, we move this forward to choosing two agents and propose a mechanism to achieve an approximation ratio of (3+ln2)/(4(1+ln2)) (approx. 0.54).

Submitted 13 June, 2023; originally announced June 2023.

Comments: To appear at IJCAI 2023
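
Note on the ratios quoted in the abstract above: the two closed-form expressions can be evaluated numerically as a quick sanity check. The sketch below is only arithmetic on the formulas as printed in the abstract; the function names are hypothetical and it does not implement the selection mechanisms themselves.

```python
import math


def single_selection_bound() -> float:
    """Upper bound 1/(1 + ln 2) quoted for selecting one agent."""
    return 1.0 / (1.0 + math.log(2.0))


def two_selection_ratio() -> float:
    """Approximation ratio (3 + ln 2) / (4 (1 + ln 2)) quoted for two agents."""
    return (3.0 + math.log(2.0)) / (4.0 * (1.0 + math.log(2.0)))


if __name__ == "__main__":
    # Print both values rounded to four decimal places.
    print(f"one agent:  {single_selection_bound():.4f}")
    print(f"two agents: {two_selection_ratio():.4f}")
```

The second value rounds to 0.55 at two decimal places and truncates to 0.54, which matches the "approx. 0.54" given in the abstract.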

arXiv:2306.07239 [pdf, ps, other]

Nonparametric empirical Bayes biomarker imputation and estimation

Authors: Alton Barbehenn, Sihai Dave Zhao

Abstract: Biomarkers are often measured in bulk to diagnose patients, monitor patient conditions, and research novel drug pathways. The measurement of these biomarkers often suffers from detection limits that result in missing and untrustworthy measurements. Frequently, missing biomarkers are imputed so that down-stream analysis can be conducted with modern statistical methods that cannot normally handle data subject to informative censoring. This work develops an empirical Bayes $g$-modeling method for imputing and denoising biomarker measurements. We establish superior estimation properties compared to popular methods in simulations and demonstrate the utility of the estimated biomarker measurements for down-stream analysis.

Submitted 12 June, 2023; originally announced June 2023.

arXiv:2306.06317 [pdf, other]

The DESI One-Percent survey: constructing galaxy-halo connections for ELGs and LRGs using auto and cross correlations

Authors: Hongyu Gao, Y. P. Jing, Shanquan Gui, Kun Xu, Yun Zheng, Donghai Zhao, Jessica Nicole Aguilar, Steven Ahlen, David Brooks, Todd Claybaugh, Kyle Dawson, Axel de la Macorra, Peter Doel, Kevin Fanning, Jaime E. Forero-Romero, Satya Gontcho A Gontcho, Julien Guy, Klaus Honscheid, Robert Kehoe, Martin Landriau, Marc Manera, Aaron Meisner, Ramon Miquel, John Moustakas, Jeffrey A. Newman, et al. (9 additional authors not shown)

Abstract: In the current Dark Energy Spectroscopic Instrument (DESI) survey, emission line galaxies (ELGs) and luminous red galaxies (LRGs) are essential for mapping the dark matter distribution at $z \sim 1$. We measure the auto and cross correlation functions of ELGs and LRGs at $0.8<z\leq 1.0$ from the DESI One-Percent survey. Following Gao et al. (2022), we construct the galaxy-halo connections for ELGs and LRGs simultaneously. With the stellar-halo mass relation (SHMR) for the whole galaxy population (i.e. normal galaxies), LRGs can be selected directly by stellar mass, while ELGs can also be selected randomly based on the observed number density of each stellar mass, once the probability $P_{\mathrm{sat}}$ of a satellite galaxy becoming an ELG is determined. We demonstrate that the observed small scale clustering prefers a halo mass-dependent $P_{\mathrm{sat}}$ model rather than a constant. With this model, we can well reproduce the auto correlations of LRGs and the cross correlations between LRGs and ELGs at $r_{\mathrm{p}}>0.1$ $\mathrm{Mpc}\,h^{-1}$. We can also reproduce the auto correlations of ELGs at $r_{\mathrm{p}}>0.3$ $\mathrm{Mpc}\,h^{-1}$ ($s>1$ $\mathrm{Mpc}\,h^{-1}$) in real (redshift) space. Although our model has only seven parameters, we show that it can be extended to higher redshifts and reproduces the observed auto correlations of ELGs in the whole range of $0.8<z<1.6$, which enables us to generate a lightcone ELG mock for DESI. With the above model, we further derive halo occupation distributions (HODs) for ELGs which can be used to produce ELG mocks in coarse simulations without resolving subhalos.

Submitted 18 July, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: 27 pages, 16 figures, accepted by ApJ

arXiv:2306.05696 [pdf, other]

Embodied Executable Policy Learning with Language-based Scene Summarization

Authors: Jielin Qiu, Mengdi Xu, William Han, Seungwhan Moon, Ding Zhao

Abstract: Large language models (LLMs) have shown remarkable success in assisting robot learning tasks, e.g., complex household planning. However, the performance of pretrained LLMs heavily relies on domain-specific templated text data, which may be infeasible in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs lack the capability to evolve with non-expert interactions with environments. In this work, we introduce a novel learning paradigm that generates robots' executable actions in the form of text, derived solely from visual observations, using language-based summarization of these observations as the connecting bridge between both domains. Our proposed paradigm stands apart from previous works, which utilized either language instructions or a combination of language and visual data as inputs. Moreover, our method does not require oracle text summarization of the scene, eliminating the need for human involvement in the learning loop, which makes it more practical for real-world robot learning tasks. Our proposed paradigm consists of two modules: the SUM module, which interprets the environment using visual observations and produces a text summary of the scene, and the APM module, which generates executable action policies based on the natural language descriptions provided by the SUM module. We demonstrate that our proposed method can employ two fine-tuning strategies, including imitation learning and reinforcement learning approaches, to adapt to the target test tasks effectively.
We conduct extensive experiments involving various SUM/APM model selections, environments, and tasks across 7 house layouts in the VirtualHome environment. Our experimental results demonstrate that our method surpasses existing baselines, confirming the effectiveness of this novel learning paradigm.

Submitted 9 June, 2023; originally announced June 2023.

Comments: 15 pages. arXiv admin note: text overlap with arXiv:2107.06912 by other authors

arXiv:2306.04227 [pdf, other]

High-Performance Caching of Homomorphic Encryption for Cloud Databases

Authors: Dongfang Zhao

Abstract: While homomorphic encryption (HE) has garnered significant research interest in cloud-based outsourced databases due to its algebraic properties over ciphertexts, the computational overhead associated with HE has hindered its widespread adoption in production database systems. Recently, a caching technique called Radix-based additive caching of homomorphic encryption (Rache) was proposed in SIGMOD'23. The primary objective of this paper is to address the performance overhead resulting from the expensive randomization process in Rache. To achieve this, we propose a novel encryption algorithm called $ASEnc$, which replaces the computationally intensive full scan of radixes with the caching of a polynomial number of radix-powers during an offline stage. This design significantly reduces the performance impact caused by randomization. Furthermore, this paper aims to extend Rache's capabilities to support floating-point numbers. To accomplish this, we introduce a new encryption algorithm named $FSEnc$, leveraging efficient constant multiplication available in state-of-the-art fully homomorphic encryption (FHE) schemes. Notably, $FSEnc$ offers the flexibility to cache the coefficients instead of the radixes themselves, which may result in a large number of cached ciphertexts. However, we manage this efficiently by streaming the dynamically cached ciphertexts through a vector of circular buffers. We demonstrate that both encryption algorithms guarantee semantic security (IND-CPA).
To validate their performance, we implement both algorithms as loadable functions in MySQL 8.0 and deploy the system prototype on a 96-core server hosted in the Chameleon Cloud. Experimental results showcase that $ASEnc$ outperforms Rache by 2.3--3.3$\times$, while $FSEnc$ surpasses the state-of-the-art floating-point FHE CKKS by 1.8--5.6$\times$.

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2306.04216 [pdf, other]

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos

Authors: Jielin Qiu, Jiacheng Zhu, William Han, Aditesh Kumar, Karthik Mittal, Claire Jin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Ding Zhao, Bo Li, Lijuan Wang

Abstract: Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. Nonetheless, numerous limitations exist within existing public MSMO datasets, including insufficient maintenance, data inaccessibility, limited size, and the absence of proper categorization, which pose significant challenges. To address these challenges and provide a comprehensive dataset for this new direction, we have meticulously curated the MMSum dataset. Our new dataset features (1) Human-validated summaries for both video and textual content, providing superior human instruction and labels for multimodal learning. (2) Comprehensively and meticulously arranged categorization, spanning 17 principal categories and 170 subcategories to encapsulate a diverse array of real-world scenarios. (3) Benchmark tests performed on the proposed dataset to assess various tasks and methods, including video summarization, text summarization, and multimodal summarization. To champion accessibility and collaboration, we will release the MMSum dataset and the data collection tool as fully open-source resources, fostering transparency and accelerating future developments. Our project website can be found at https://mmsum-dataset.github.io/

Submitted 19 November, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: Project website: https://mmsum-dataset.github.io/

arXiv:2306.04170 [pdf, other]

From the One, Judge of the Whole: Typed Entailment Graph Construction with Predicate Generation

Authors: Zhibin Chen, Yansong Feng, Dongyan Zhao

Abstract: Entailment Graphs (EGs) have been constructed based on extracted corpora as a strong and explainable form to indicate context-independent entailment relations in natural languages. However, EGs built by previous methods often suffer from severe sparsity issues, due to the limited corpora available and the long-tail phenomenon of predicate distributions. In this paper, we propose a multi-stage method, Typed Predicate-Entailment Graph Generator (TP-EGG), to tackle this problem. Given several seed predicates, TP-EGG builds the graphs by generating new predicates and detecting entailment relations among them. The generative nature of TP-EGG helps us leverage the recent advances from large pretrained language models (PLMs), while avoiding the reliance on carefully prepared corpora. Experiments on benchmark datasets show that TP-EGG can generate high-quality and scale-controllable entailment graphs, achieving significant in-domain improvement over state-of-the-art EGs and boosting the performance of down-stream inference tasks.

Submitted 7 June, 2023; originally announced June 2023.

Comments: 9 pages, 3 figures, accepted to ACL 2023

arXiv:2306.02252 [pdf, other]

MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning

Authors: Jianghui Wang, Yuxuan Wang, Dongyan Zhao, Zilong Zheng

Abstract: We introduce MoviePuzzle, a novel challenge that targets visual narrative reasoning and holistic movie understanding. Despite the notable progress that has been witnessed in the realm of video understanding, most prior works fail to present tasks and models to address holistic video understanding and the innate visual narrative structures existing in long-form videos. To tackle this quandary, we put forth the MoviePuzzle task, which amplifies the temporal feature learning and structure learning of video models by reshuffling the shot, frame, and clip layers of movie segments in the presence of video-dialogue information. We start by establishing a carefully refined dataset based on MovieNet by dissecting movies into hierarchical layers and randomly permuting the orders. Besides benchmarking MoviePuzzle with prior art on movie understanding, we devise a Hierarchical Contrastive Movie Clustering (HCMC) model that considers the underlying structure and visual semantic orders for movie reordering. Specifically, through a pairwise and contrastive learning approach, we train models to predict the correct order of each layer. This equips them with the knack for deciphering the visual narrative structure of movies and handling the disorder lurking in video data. Experiments show that our approach outperforms existing state-of-the-art methods on the MoviePuzzle benchmark, underscoring its efficacy.

Submitted 14 June, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

arXiv:2306.02070 [pdf, ps, other]

Adaptive Approximation-Based Control for Nonlinear Systems: A Unified Solution with Accurate and Inaccurate Measurements

Authors: Dong Zhao

Abstract: A unified solution to adaptive approximation-based control for nonlinear systems with accurate and inaccurate state measurement is synthesized in this study. Starting from the standard adaptive approximation-based controller with accurate state measurement, its corresponding physical interpretation, stability conclusion, and learning ability are rigorously addressed when facing additive measurement inaccuracy, and explicit answers are obtained in the framework of both controller matching and system matching. Finally, it proves that, with a certain condition, the standard adaptive approximation-based controller works as a unified solution for the cases with accurate and inaccurate measurement, and the solution can be extended to the nonlinear system control problems with extra unknown dynamics or faults in actuator and/or process dynamics. A single-link robot arm example is used for the simulation demonstration of the unified solution.

Submitted 3 June, 2023; originally announced June 2023.

arXiv:2306.02020 [pdf, ps, other]

Replay Attack Detection Based on Parity Space Method for Cyber-Physical Systems

Authors: Dong Zhao, Yang Shi, Steven X. Ding, Yueyang Li, Fangzhou Fu

Abstract: The replay attack detection problem is studied from a new perspective based on the parity space method in this paper. The proposed detection methods have the ability to distinguish system fault and replay attack, handle both input and output data replay, maintain certain control performance, and can be implemented conveniently and efficiently. First, the replay attack effect on the residual is derived and analyzed. The residual change induced by replay attack is characterized explicitly, and detection performance analyses based on two different test statistics are given. Second, based on the replay attack effect characterization, targeted passive and active designs for detection performance enhancement are proposed. Regarding the passive design, four optimization schemes regarding different cost functions are proposed with optimal parity matrix solutions, and the unified solution to the passive optimization schemes is obtained; the active design is enabled by a marginally stable filter so as to enlarge the replay attack effect on the residual for detection. Simulations and comparison studies are given to show the effectiveness of the proposed methods.

Submitted 3 June, 2023; originally announced June 2023.

arXiv:2306.02018 [pdf, other]

VideoComposer: Compositional Video Synthesis with Motion Controllability

Authors: Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou

Abstract: The pursuit of controllability as a higher standard of visual content creation has yielded remarkable progress in customizable image synthesis. However, achieving controllable video synthesis remains challenging due to the large variation of temporal dynamics and the requirement of cross-frame temporal consistency. Based on the paradigm of compositional generation, this work presents VideoComposer that allows users to flexibly compose a video with textual conditions, spatial conditions, and more importantly temporal conditions. Specifically, considering the characteristic of video data, we introduce the motion vector from compressed videos as an explicit control signal to provide guidance regarding temporal dynamics. In addition, we develop a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs, with which the model could make better use of temporal conditions and hence achieve higher inter-frame consistency. Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns simultaneously within a synthesized video in various forms, such as text description, sketch sequence, reference video, or even simply hand-crafted motions. The code and models will be publicly available at https://videocomposer.github.io.

Submitted 5 June, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

Comments: The first four authors contributed equally. Project page: https://videocomposer.github.io

arXiv:2306.02016 [pdf, ps, other]

Converse negative imaginary theorems

Authors: Sei Zhen Khong, Di Zhao, Alexander Lanzon

Abstract: Converse negative imaginary theorems for linear time-invariant systems are derived. In particular, we provide necessary and sufficient conditions for a feedback system to be robustly stable against various types of negative imaginary (NI) uncertainty. Uncertainty classes of marginally stable NI systems and stable strictly NI systems with restrictions on their static or instantaneous gains are considered. It is shown that robust stability against the former class entails the strictly NI property, whereas the latter class entails the NI property. We also establish a non-existence result that no stable system can robustly stabilise all marginally stable NI uncertainty, thereby showing that the uncertainty class of NI systems is too large as far as robust feedback stability is concerned, thus justifying the consideration of subclasses of NI systems with constrained static or instantaneous gains.

Submitted 20 November, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

Comments: This paper has been submitted for possible publication at Automatica

arXiv:2306.00435 [pdf, other]

How Many Answers Should I Give? An Empirical Study of Multi-Answer Reading Comprehension

Authors: Chen Zhang, Jiuheng Lin, Xiao Liu, Yuxuan Lai, Yansong Feng, Dongyan Zhao

Abstract: The multi-answer phenomenon, where a question may have multiple answers scattered in the document, can be well handled by humans but remains challenging for machine reading comprehension (MRC) systems. Despite recent progress in multi-answer MRC, a systematic analysis of how this phenomenon arises and how to better address it is still lacking. In this work, we design a taxonomy to categorize commonly-seen multi-answer MRC instances, with which we inspect three multi-answer datasets and analyze where the multi-answer challenge comes from. We further analyze how well different paradigms of current multi-answer MRC models deal with different types of multi-answer instances. We find that some paradigms capture well the key information in the questions while others better model the relationship between questions and contexts. We thus explore strategies to make the best of the strengths of different paradigms. Experiments show that generation models can be a promising platform to incorporate different paradigms. Our annotations and code are released for further research.

Submitted 1 June, 2023; originally announced June 2023.

Comments: Findings of ACL 2023

arXiv:2306.00350 [pdf, other]

Score-Based Equilibrium Learning in Multi-Player Finite Games with Imperfect Information

Authors: Runyu Lu, Yuanheng Zhu, Dongbin Zhao

Abstract: Real-world games, which concern imperfect information, multiple players, and simultaneous moves, are less frequently discussed in the existing literature of game theory. While reinforcement learning (RL) provides a general framework to extend the game theoretical algorithms, the assumptions that guarantee their convergence towards Nash equilibria may no longer hold in real-world games. Starting from the definition of the Nash distribution, we construct a continuous-time dynamic named imperfect-information exponential-decay score-based learning (IESL) to find approximate Nash equilibria in games with the above-mentioned features. Theoretical analysis demonstrates that IESL yields equilibrium-approaching policies in imperfect information simultaneous games with the basic assumption of concavity. Experimental results show that IESL manages to find approximate Nash equilibria in four canonical poker scenarios and significantly outperforms three other representative algorithms in 3-player Leduc poker, manifesting its equilibrium-finding ability even in practical sequential games. Furthermore, related to the concept of game hypomonotonicity, a trade-off between the convergence of the IESL dynamic and the ultimate NashConv of the convergent policies is observed from the perspectives of both theory and experiment.

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2306.00342 [pdf, other]

Combining Explicit and Implicit Regularization for Efficient Learning in Deep Networks

Authors: Dan Zhao

Abstract: Works on implicit regularization have studied gradient trajectories during the optimization process to explain why deep networks favor certain kinds of solutions over others. In deep linear networks, it has been shown that gradient descent implicitly regularizes toward low-rank solutions on matrix completion/factorization tasks. Adding depth not only improves performance on these tasks but also acts as an accelerative pre-conditioning that further enhances this bias towards low-rankedness. Inspired by this, we propose an explicit penalty to mirror this implicit bias which only takes effect with certain adaptive gradient optimizers (e.g. Adam). This combination can enable a degenerate single-layer network to achieve low-rank approximations with generalization error comparable to deep linear networks, making depth no longer necessary for learning. The single-layer network also performs competitively or out-performs various approaches for matrix completion over a range of parameter and data regimes despite its simplicity. Together with an optimizer's inductive bias, our findings suggest that explicit regularization can play a role in designing different, desirable forms of regularization and that a more nuanced understanding of this interplay may be necessary.

Submitted 1 June, 2023; originally announced June 2023.

Journal ref: Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 3024--3038

arXiv:2306.00014 [pdf, other]

PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models

Authors: Zhuocheng Gong, Jiahao Liu, Qifan Wang, Yang Yang, Jingang Wang, Wei Wu, Yunsen Xian, Dongyan Zhao, Rui Yan

Abstract: While transformer-based pre-trained language models (PLMs) have dominated a number of NLP applications, these models are heavy to deploy and expensive to use. Therefore, effectively compressing large-scale PLMs becomes an increasingly important problem. Quantization, which represents high-precision tensors in a low-bit fixed-point format, is a viable solution. However, most existing quantization methods are task-specific, requiring customized training and quantization with a large number of trainable parameters on each individual task. Inspired by the observation that the over-parameterized nature of PLMs makes it possible to freeze most of the parameters during the fine-tuning stage, in this work, we propose a novel "quantize before fine-tuning" framework, PreQuant, that differs from both quantization-aware training and post-training quantization. PreQuant is compatible with various quantization strategies, with outlier-aware parameter-efficient fine-tuning incorporated to correct the induced quantization error. We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5. We also provide an empirical investigation into the workflow of PreQuant, which sheds light on its efficacy.

Submitted 30 May, 2023; originally announced June 2023.

Comments: Findings of ACL2023

arXiv:2305.19327 [pdf, other]

Cones 2: Customizable Image Synthesis with Multiple Subjects

Authors: Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, Yang Cao

Abstract: Synthesizing images with user-specified subjects has received growing attention due to its practical applications. Despite the recent success in single subject customization, existing algorithms suffer from high training cost and low success rate along with increased number of subjects. Towards controllable image synthesis with multiple subjects as the constraints, this work studies how to efficiently represent a particular subject as well as how to appropriately compose different subjects. We find that the text embedding regarding the subject token already serves as a simple yet effective representation that supports arbitrary combinations without any model tuning. Through learning a residual on top of the base embedding, we manage to robustly shift the raw subject to the customized subject given various text conditions. We then propose to employ layout, a very abstract and easy-to-obtain prior, as the spatial guidance for subject arrangement. By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image, significantly alleviating the interference across them. Both qualitative and quantitative experimental results demonstrate our superiority over state-of-the-art alternatives under a variety of settings for multi-subject customization.

Submitted 30 May, 2023; originally announced May 2023.

arXiv:2305.19213 [pdf, other]

The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code

Authors: Xiao Liu, Da Yin, Chen Zhang, Yansong Feng, Dongyan Zhao

Abstract: Causal reasoning, the ability to identify cause-and-effect relationships, is crucial in human thinking. Although large language models (LLMs) succeed in many NLP tasks, it is still challenging for them to conduct complex causal reasoning like abductive reasoning and counterfactual reasoning. Given the fact that programming code may express causal relations more often and explicitly with conditional statements like "if", we want to explore whether Code-LLMs acquire better causal reasoning abilities. Our experiments show that compared to text-only LLMs, Code-LLMs with code prompts are significantly better in causal reasoning. We further intervene on the prompts from different aspects, and discover that the programming structure is crucial in code prompt design, while Code-LLMs are robust towards format perturbations.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Findings of ACL 2023. Code and data are available at https://github.com/xxxiaol/magic-if

arXiv:2305.18829 [pdf, other]

UniScene: Multi-Camera Unified Pre-training via 3D Scene Reconstruction for Autonomous Driving

Authors: Chen Min, Liang Xiao, Dawei Zhao, Yiming Nie, Bin Dai

Abstract: Multi-camera 3D perception has emerged as a prominent research field in autonomous driving, offering a viable and cost-effective alternative to LiDAR-based solutions. The existing multi-camera algorithms primarily rely on monocular 2D pre-training. However, the monocular 2D pre-training overlooks the spatial and temporal correlations among the multi-camera system. To address this limitation, we propose the first multi-camera unified pre-training framework, called UniScene, which involves initially reconstructing the 3D scene as the foundational stage and subsequently fine-tuning the model on downstream tasks. Specifically, we employ Occupancy as the general representation for the 3D scene, enabling the model to grasp geometric priors of the surrounding world through pre-training. A significant benefit of UniScene is its capability to utilize a considerable volume of unlabeled image-LiDAR pairs for pre-training purposes. The proposed multi-camera unified pre-training framework demonstrates promising results in key tasks such as multi-camera 3D object detection and surrounding semantic scene completion. When compared to monocular pre-training methods on the nuScenes dataset, UniScene shows a significant improvement of about 2.0% in mAP and 2.0% in NDS for multi-camera 3D object detection, as well as a 3% increase in mIoU for surrounding semantic scene completion.
By adopting our unified pre-training method, a 25% reduction in 3D training annotation costs can be achieved, offering significant practical value for the implementation of real-world autonomous driving. Codes are publicly available at https://github.com/chaytonmin/UniScene.

Submitted 27 April, 2024; v1 submitted 30 May, 2023; originally announced May 2023.

Comments: Accepted by RAL2024

arXiv:2305.18760

Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training

Authors: Yuxuan Wang, Jianghui Wang, Dongyan Zhao, Zilong Zheng

Abstract: We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters. We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to the process of enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks, i.e., Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both modern Chinese understanding benchmark CLUE and ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task PolyMRC based on the collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks. Moreover, our approach yields significant boosting on few-shot setting of ancient Chinese understanding.

Submitted 30 May, 2023; originally announced May 2023.

Comments: To appear at ACL 2023 Findings

arXiv:2305.18756

VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions

Authors: Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao

Abstract: Video-grounded dialogue understanding is a challenging problem that requires machine to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues. Most existing benchmarks treat both modalities the same as a frame-independent visual understanding task, while neglecting the intrinsic attributes in multimodal dialogues, such as scene and topic transitions. In this paper, we present Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding: scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. Comprehensive experiments are performed on these benchmarks to demonstrate the importance of multimodal information and segments in video-grounded dialogue understanding and generation.

Submitted 30 May, 2023; originally announced May 2023.

Comments: To appear at ACL 2023

arXiv:2305.17607

More than Classification: A Unified Framework for Event Temporal Relation Extraction

Authors: Quzhe Huang, Yutong Hu, Shengqi Zhu, Yansong Feng, Chang Liu, Dongyan Zhao

Abstract: Event temporal relation extraction (ETRE) is usually formulated as a multi-label classification task, where each type of relation is simply treated as a one-hot label. This formulation ignores the meaning of relations and wipes out their intrinsic dependency. After examining the relation definitions in various ETRE tasks, we observe that all relations can be interpreted using the start and end time points of events. For example, relation Includes could be interpreted as event 1 starting no later than event 2 and ending no earlier than event 2. In this paper, we propose a unified event temporal relation extraction framework, which transforms temporal relations into logical expressions of time points and completes the ETRE by predicting the relations between certain time point pairs. Experiments on TB-Dense and MATRES show significant improvements over a strong baseline and outperform the state-of-the-art model by 0.3% on both datasets. By representing all relations in a unified framework, we can leverage the relations with sufficient data to assist the learning of other relations, thus achieving stable improvement in low-data scenarios. When the relation definitions are changed, our method can quickly adapt to the new ones by simply modifying the logic expressions that map time points to new event relations. The code is released at https://github.com/AndrewZhe/A-Unified-Framework-for-ETRE.

Submitted 27 May, 2023; originally announced May 2023.

Journal ref: ACL 2023 Main Conference

Showing 201–250 of 1,041 results for author: Zha, D